CN115798578A - Device and method for analyzing and detecting virus new epidemic variant strain - Google Patents

Info

Publication number
CN115798578A
Authority
CN
China
Prior art keywords
mutation
amino acid
sequence
core
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211554331.0A
Other languages
Chinese (zh)
Inventor
任洪广
胡明达
王辛
靳远
王博千
赵云祥
岳俊杰
梁龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
Academy of Military Medical Sciences AMMS of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Military Medical Sciences AMMS of PLA
Priority to CN202211554331.0A
Publication of CN115798578A
Legal status: Pending

Abstract

The invention discloses a device for detecting and analyzing new epidemic variant strains of a virus, comprising parameter acquisition equipment and a computer-readable carrier. Through mutation site analysis, core mutation statistics, high-frequency core mutation combination statistics, variant strain identification, and regression analysis of transmission trends, it detects new epidemic variant strains of the novel coronavirus quickly, efficiently, and automatically, and raises early warnings. The method can analyze large volumes of genome data, runs efficiently, detects new epidemic variant strains accurately, and helps track the variation trend and history of the virus quickly and comprehensively. The invention automatically and rapidly detects new epidemic variant strains of the novel coronavirus together with their core mutation characteristics, without manual intervention, and can provide technical support for real-time monitoring of epidemic development, variant strain identification, and vaccine and drug research and development.

Description

Device and method for analyzing and detecting virus new epidemic variant strain
Technical Field
The invention relates to the field of biotechnology, in particular to a method for automatically detecting and giving early warning of new epidemic variant strains of a virus based on the distribution trend of mutation combinations.
Background
Several variant strains of the novel coronavirus with enhanced transmissibility have emerged. The World Health Organization has named the four most threatening epidemic variant strains alpha, beta, gamma, and delta; of these, the delta variant appeared late, spread rapidly, and is highly harmful. Insufficient ability to recognize variant strains and give early warning causes epidemic prevention strategies to lag seriously behind the spread of the epidemic. Therefore, given how rapidly the virus mutates, automatically and rapidly detecting new epidemic variant strains of the novel coronavirus is an important problem.
At present, new epidemic variant strains are screened mainly by constructing a phylogenetic tree and then manually observing the growth of each branch to judge whether a new epidemic strain has appeared. Strains are given standardized designations according to their position in the tree; the delta variant of the novel coronavirus, for example, is designated B.1.617.2. This approach has several disadvantages. First, it requires constructing a phylogenetic tree, and the computing resources needed grow with the number of virus samples. Because more than 4 million novel coronavirus samples are available, the required computing resources and time are enormous, and conventional computers cannot meet the demand of constructing the tree. Second, it relies on manual observation; the current evolutionary tree of the novel coronavirus is extremely complex, with more than 4 million leaf nodes to inspect, so new epidemic variant strains are difficult to detect quickly by eye. Third, it does not directly reveal the mutation characteristics of new epidemic variant strains, so it cannot support rapid variant identification or vaccine design, which ultimately slows the implementation of epidemic prevention strategies. In summary, conventional methods for detecting new epidemic variant strains struggle to meet the needs of epidemic prevention and control while the novel coronavirus is spreading widely.
A method for automatically and rapidly detecting new epidemic variant strains of the novel coronavirus is therefore urgently needed to guide epidemic early warning and vaccine design.
Disclosure of Invention
It is an object of the present invention to provide an apparatus for detecting and analyzing a new epidemic variant of a virus.
The invention provides a device for detecting and analyzing new epidemic variant strains of a virus, comprising parameter acquisition equipment and a computer-readable carrier; the parameter acquisition equipment continuously acquires amino acid sequence data of the virus to be tested in a fixed region over a period of time;
the readable carrier comprises:
the data storage module, which stores the collected amino acid sequence data of the virus to be tested and the amino acid sequence of the reference strain; the sequence data of the virus to be tested may be collected directly or taken from an existing database;
the quality control module, which performs quality control and length screening on the sequences to be tested stored in the data storage module to obtain a data set to be tested, and transmits the data set to the statistical module;
the statistical module, which receives the data set to be tested from the quality control module, compares the sequences to be tested with the reference sequence, and counts mutation site information and mutation frequency to obtain an amino acid mutation frequency table;
the screening module, which receives the amino acid mutation frequency table from the statistical module and screens it further to obtain a key mutation set of core mutation sites and a high-frequency amino acid mutation set;
the identification module, which identifies the sequences of the data set to be tested against the key mutation set of core mutation sites to obtain a virus variant mutation feature set;
and the analysis module, which receives the virus variant mutation feature set from the identification module and performs transmission trend analysis and epidemic variant identification on it to obtain the epidemic variant strains.
The quality control module performs quality control on the sequences to be tested stored in the data storage module as follows: calculate the proportion P of abnormal characters in each sequence to be tested, and delete the sequence from the amino acid sequence data set if P > 0.1. The proportion of abnormal characters is calculated as:
$$P = \frac{\mu}{L}$$
where L is the total length of the sequence to be tested and μ is the number of abnormal characters in the sequence;
the quality control module performs length screening on the sequences to be tested stored in the data storage module as follows: calculate the length integrity LP of each sequence to be tested, and delete the sequence from the amino acid sequence data set if LP < 0.8. The length integrity is calculated as:
$$LP = \frac{L}{L_0}$$
where L is the total length of the sequence to be tested and $L_0$ is the total length of the reference strain sequence.
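The two screening rules can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the definition of an "abnormal character" as anything outside the 20 standard amino acid letters is an assumption, since the text does not enumerate abnormal characters.

```python
from typing import Dict

# Assumption: any character outside the 20 standard amino acid letters
# (for example X, * or -) counts as abnormal; the text does not list them.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def abnormal_ratio(seq: str) -> float:
    """P = mu / L: share of abnormal characters in one sequence."""
    if not seq:
        return 1.0  # treat an empty sequence as fully abnormal
    mu = sum(1 for ch in seq.upper() if ch not in STANDARD_AMINO_ACIDS)
    return mu / len(seq)


def length_integrity(seq: str, reference: str) -> float:
    """LP = L / L0: sequence length relative to the reference strain."""
    return len(seq) / len(reference)


def quality_filter(seqs: Dict[str, str], reference: str,
                   max_p: float = 0.1, min_lp: float = 0.8) -> Dict[str, str]:
    """Drop sequences with P > max_p or LP < min_lp, keep the rest."""
    return {
        name: seq
        for name, seq in seqs.items()
        if abnormal_ratio(seq) <= max_p
        and length_integrity(seq, reference) >= min_lp
    }
```

A sequence sitting exactly on a threshold (P = 0.1 or LP = 0.8) is kept, because the text deletes only when P > 0.1 or LP < 0.8.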
The amino acid mutation frequency table is obtained in the statistical module as follows:
(1) Using a multiple sequence alignment tool, align each sequence in the data set to be tested with the amino acid sequence of the reference strain, one by one, and calibrate the position of each amino acid site in the sequence to be tested;
(2) Based on the alignment, compare the amino acid sequence to be tested with the reference strain sequence position by position, record the amino acid mutations in the sequence to be tested, and generate its mutation information set:
$$\mathrm{MSet}[\mathrm{Seq}_m] = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}_{l_n}\}$$
where $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ represents an amino acid mutation, $l_1$ is the position of the mutation site, $\mathrm{Amino}_{l_1}$ is the amino acid code at position $l_1$ of the reference strain sequence, and $\mathrm{Mutation}_{l_1}$ is the amino acid code at position $l_1$ of the sequence to be tested; MSet is the mutation information dictionary that records the mutation information sets of all sequences, and $\mathrm{MSet}[\mathrm{Seq}_m]$ is the mutation information set of sequence $\mathrm{Seq}_m$;
(3) Analyze the amino acid mutation records of all sequences to be tested from step (2), count how often each amino acid mutation occurs across all sequences, and establish the amino acid mutation frequency table of the virus to be tested.
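Steps (2) and (3) can be sketched as follows, assuming the alignment of step (1) has already been produced by an external tool so that the reference and each sequence are equal-length strings with `-` marking gaps. The function names, the `ref_pos_mut` label format, and the choice to ignore gaps and `X` are assumptions made for illustration; the text does not say how insertions or deletions are recorded.

```python
from collections import Counter
from typing import Dict, Set


def mutation_set(aligned_ref: str, aligned_seq: str) -> Set[str]:
    """Compare one aligned sequence with the aligned reference, site by site."""
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")
    mutations = set()
    pos = 0  # 1-based position on the reference; gap columns are skipped
    for ref_aa, aa in zip(aligned_ref, aligned_seq):
        if ref_aa == "-":
            continue  # insertion relative to the reference, ignored here
        pos += 1
        # Assumption: gaps and unknown residues are not counted as mutations.
        if aa != ref_aa and aa not in "-X":
            mutations.add(f"{ref_aa}_{pos}_{aa}")
    return mutations


def build_mset(aligned_ref: str, aligned_seqs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Mutation information dictionary: sequence name -> mutation set."""
    return {name: mutation_set(aligned_ref, seq)
            for name, seq in aligned_seqs.items()}


def mutation_frequency_table(mset: Dict[str, Set[str]]) -> Counter:
    """Count in how many sequences each mutation occurs."""
    counts: Counter = Counter()
    for mutations in mset.values():
        counts.update(mutations)
    return counts
```

Storing each sequence's mutations as a set makes the later union and intersection steps direct set operations.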
The high-frequency amino acid mutation set is screened in the screening module as follows:
(1) An amino acid mutation is a high-frequency amino acid mutation when its number of occurrences exceeds one thousandth of the total number N of sequences in the data set to be tested;
(2) Judge every amino acid mutation in the amino acid mutation frequency table from the statistical module according to step (1), record all high-frequency amino acid mutations found, and establish the high-frequency amino acid mutation set HFMS:
$$\mathrm{HFMS} = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_m}\_l_m\_\mathrm{Mutation}_{l_m}\}$$
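As a minimal sketch of this screening rule, the frequency table can be treated as a mapping from mutation label to occurrence count. The function name and the example labels below are hypothetical.

```python
from typing import Mapping, Set


def high_frequency_mutations(freq_table: Mapping[str, int],
                             n_sequences: int,
                             threshold: float = 0.001) -> Set[str]:
    """Return the mutations whose share of sequences exceeds the threshold."""
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    return {
        mutation
        for mutation, count in freq_table.items()
        if count / n_sequences > threshold  # strictly greater, as in the text
    }
```

With N = 1000, a mutation seen exactly once sits on the boundary (0.001) and is excluded, while one seen twice is included.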
The key mutation set of core mutation sites is screened in the screening module as follows:
(1) Count the amino acid mutation frequency at each amino acid site:
$$AH_{l_n} = \{SP^{l_n}_{\mathrm{Amino}_1}, \ldots, SP^{l_n}_{\mathrm{Amino}_m}\}$$
$$SP^{l_n}_{\mathrm{Amino}_m} = \frac{S^{l_n}_{\mathrm{Amino}_m}}{N}$$
where $AH_{l_n}$ is the amino acid frequency set of amino acid site $l_n$, N is the total number of sequences in the data set to be tested, $S^{l_n}_{\mathrm{Amino}_m}$ is the number of occurrences of amino acid $\mathrm{Amino}_m$ at site $l_n$, and $SP^{l_n}_{\mathrm{Amino}_m}$ is the frequency of $\mathrm{Amino}_m$ at site $l_n$;
(2) Calculate the Shannon entropy of each amino acid site:
$$H_{l_n} = -\sum_{m} SP^{l_n}_{\mathrm{Amino}_m} \log SP^{l_n}_{\mathrm{Amino}_m}$$
If $H_{l_n} > 0.1$, record amino acid site $l_n$ as a core mutation site, and select the mutation with the highest mutation frequency at that site, $\mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}$, as the key mutation of the site;
(3) Following step (2), collect the key mutations of all core mutation sites to obtain the key mutation set KLMS of the core mutation sites:
$$\mathrm{KLMS} = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}^{max}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}\}$$
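The entropy screening can be sketched as below. Several points are assumptions rather than statements from the text: the logarithm is taken to base 2, every sequence is assumed to be already aligned to the reference with the same length, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def site_entropy(column: str) -> float:
    """Shannon entropy (base 2, an assumption) of one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def core_site_key_mutations(reference: str, sequences: List[str],
                            min_entropy: float = 0.1) -> Dict[int, str]:
    """Map each core mutation site (1-based) to its most frequent mutation."""
    klms: Dict[int, str] = {}
    for i, ref_aa in enumerate(reference):
        column = "".join(seq[i] for seq in sequences)
        if site_entropy(column) <= min_entropy:
            continue  # site is too conserved to be a core mutation site
        # Key mutation: the most frequent residue that differs from the reference.
        mutant_counts = Counter(aa for aa in column if aa != ref_aa)
        if mutant_counts:
            top_aa, _ = mutant_counts.most_common(1)[0]
            klms[i + 1] = f"{ref_aa}_{i + 1}_{top_aa}"
    return klms
```

The values of the returned dictionary form the set KLMS; keeping the site index as the key makes it easy to inspect which positions passed the entropy threshold.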
Each virus record in the data storage module comprises sampling time, sampling location, and sequence information.
The virus variant mutation feature set is obtained by identifying the core mutation combinations in the mutation information dictionary MSet, followed by transmission trend analysis and epidemic variant identification in the analysis module.
The core mutation combinations in the mutation information dictionary MSet are identified as follows:
(1) Establish the core mutation set: take the union of the high-frequency amino acid mutation set HFMS and the key mutation set KLMS of the core mutation sites to obtain the core mutation set KMS:
$$\mathrm{KMS} = \mathrm{HFMS} \cup \mathrm{KLMS}$$
(2) Identify the core mutation combinations in the mutation information dictionary MSet: intersect the core mutation set KMS with the mutation information set of each sequence in MSet to obtain the core mutation combination of each sequence:
$$\mathrm{KMSet}[\mathrm{Seq}_m] = \mathrm{KMS} \cap \mathrm{MSet}[\mathrm{Seq}_m]$$
$$\mathrm{KMSet} = \{\mathrm{KMSet}[\mathrm{Seq}_1], \ldots, \mathrm{KMSet}[\mathrm{Seq}_N]\}$$
where KMSet is the core mutation combination information dictionary that records the core mutation combination contained in each sequence, and $\mathrm{KMSet}[\mathrm{Seq}_m]$ is the core mutation combination contained in sequence $\mathrm{Seq}_m$;
(3) Perform high-frequency screening on the core mutation combination dictionary from step (2): when the number of occurrences of a core mutation combination $\mathrm{KMSet}[\mathrm{Seq}_m]$ in the dictionary exceeds one thousandth of the total number N of sequences in the data set to be tested, that combination is a high-frequency core mutation combination; the virus strains containing it are called variant strain $MC_m$, and the variant strain is recorded in the virus variant strain list MCL;
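The three steps map naturally onto set operations. The sketch below assumes MSet is a dictionary from sequence name to a set of mutation labels; the function names are hypothetical, and the threshold is exposed as a parameter so that the rule can be tried on small examples.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def core_mutation_combinations(mset: Dict[str, Set[str]],
                               hfms: Set[str],
                               klms: Set[str]) -> Dict[str, FrozenSet[str]]:
    """KMS = HFMS | KLMS, then intersect KMS with each sequence's mutations."""
    kms = hfms | klms
    # frozenset makes each combination hashable so identical ones can be counted
    return {name: frozenset(mutations & kms) for name, mutations in mset.items()}


def high_frequency_combinations(kmset: Dict[str, FrozenSet[str]],
                                threshold: float = 0.001) -> Dict[FrozenSet[str], int]:
    """Keep combinations shared by more than `threshold` of all sequences."""
    n = len(kmset)
    counts = Counter(kmset.values())
    return {combo: c for combo, c in counts.items() if c / n > threshold}
```

In this sketch a sequence with no core mutations yields the empty combination, which is counted like any other; whether such reference-like sequences should form a variant strain is not specified in the text.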
Transmission trend analysis and epidemic variant identification are performed in the analysis module as follows:
(1) Using the sampling time of each sequence, count the proportion of each variant strain during virus transmission:
$$SP^{MC}_n = \frac{S^{MC}_n}{AN_n}$$
where $MC \in \mathrm{MCL}$, n denotes the month, $S^{MC}_n$ is the number of sequences of variant strain MC collected in month n, $AN_n$ is the number of all sequences collected in month n, and $SP^{MC}_n$ is the proportion of sequences belonging to variant strain MC in month n;
(2) Using a linear regression tool, fit a transmission curve to the monthly proportions of each variant strain calculated in step (1):
$$Y = \beta_n X + \varepsilon$$
$$X = (x_1, \ldots, x_k)^T$$
$$Y = (SP^{MC}_{x_1}, \ldots, SP^{MC}_{x_k})^T$$
where X is the time variable matrix whose entries $x_1, \ldots, x_k$ are the months used for fitting, Y is the transmission proportion matrix of variant strain MC over those months, $\beta_n$ is the slope of the fitted transmission curve of variant strain MC in month n, and ε is the intercept of the fitted curve;
(3) Identify epidemic variant strains of the virus by applying a threshold test to the slope $\beta_n$ of the fitted transmission curve from step (2):
$$T = 3 \times \beta_n \times M$$
where $\beta_n$ is the slope of the fitted transmission curve of variant strain MC in month n, M is the total number of variant strains in the variant strain list MCL, and T is the transmission rise index of variant strain MC in month n. If T ≥ 1, variant strain MC is judged to be an epidemic variant strain in month n.
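A minimal sketch of the trend test follows, using an ordinary least-squares slope computed by hand in place of an unspecified regression tool. The function names are hypothetical, and proportions are assumed to be fractions between 0 and 1 rather than percentages, which the text does not state.

```python
from typing import Dict, Mapping, Sequence


def monthly_proportion(variant_counts: Mapping[int, int],
                       total_counts: Mapping[int, int]) -> Dict[int, float]:
    """Share of one variant among all sequences collected each month."""
    return {
        month: variant_counts.get(month, 0) / total
        for month, total in total_counts.items()
        if total > 0
    }


def fit_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y against x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def rising_index(slope: float, n_variants: int) -> float:
    """T = 3 * beta_n * M."""
    return 3 * slope * n_variants


def is_epidemic(months: Sequence[float], proportions: Sequence[float],
                n_variants: int) -> bool:
    """A variant is epidemic in the fitted month when T >= 1."""
    return rising_index(fit_slope(months, proportions), n_variants) >= 1
```

Because T scales with M, the slope needed to cross the threshold shrinks as the variant list grows: with 5 variants a rise of about 0.067 per month is enough, with 2 variants about 0.167 is required.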
The invention also provides a method for automatically detecting and giving early warning of new epidemic variant strains of the novel coronavirus based on the distribution trend of mutation combinations, comprising the following steps:
1) Obtain from the original database all sequences to be tested with complete genomes, together with the reference strain amino acid sequence;
2) Perform quality control and length screening on each genome sequence obtained in step 1) to obtain the data set to be tested, DataBase_1;
3) Perform mutation site statistics on the data set DataBase_1 obtained in step 2) to obtain an amino acid mutation frequency table and a mutation information dictionary;
4) Screen high-frequency amino acid mutations from the amino acid mutation frequency table obtained in step 3);
5) Screen core mutation sites from the amino acid mutation frequency table obtained in step 3);
6) Identify the core mutation combinations in the mutation information dictionary from step 3), name the virus strains containing each core mutation combination as a variant strain, and save all identified variant strains to obtain a novel coronavirus variant strain list.
The quality control and length screening in step 2) are performed as follows:
(1) Calculate the proportion of abnormal characters in the sequence. Let L be the total length of the sequence and μ the number of abnormal characters in it; then:
$$P = \frac{\mu}{L}$$
If P > 0.1, delete the sequence from the amino acid sequence data set;
(2) Calculate the length integrity of the sequence. Let L be the total length of the sequence and $L_0$ the length of the reference strain sequence; then:
$$LP = \frac{L}{L_0}$$
If LP < 0.8, delete the sequence from the amino acid sequence data set. The amino acid sequences remaining after screening form the data set to be tested, denoted DataBase_1.
The mutation site statistics in step 3) are performed as follows:
(1) Using a multiple sequence alignment tool, align the sequences in the data set DataBase_1 with the amino acid sequence of the reference strain and calibrate the position of each amino acid site in the sequence to be tested;
(2) Based on the alignment, compare the amino acid sequence to be tested with the reference strain sequence position by position, record the amino acid mutations in the sequence to be tested, and generate its mutation information set:
$$\mathrm{MSet}[\mathrm{Seq}_m] = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}_{l_n}\}$$
where $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ represents an amino acid mutation, $l_1$ is the position of the mutation site, $\mathrm{Amino}_{l_1}$ is the amino acid code at position $l_1$ of the reference strain sequence, and $\mathrm{Mutation}_{l_1}$ is the amino acid code at position $l_1$ of the sequence to be tested. For example, the amino acid mutation D614G indicates a mutation at position 614, where the reference sequence has D (aspartic acid) and the sequence to be tested has G (glycine). MSet is the mutation information dictionary that records the mutation information sets of all sequences, and $\mathrm{MSet}[\mathrm{Seq}_m]$ is the mutation information set of sequence $\mathrm{Seq}_m$;
(3) Count the frequency of the amino acid mutation records of all sequences to be tested from step (2) and establish the amino acid mutation frequency table of the target gene.
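The compact notation in the D614G example can be split into its three parts with a small parser. This is an illustrative helper only; the class and function names are hypothetical, and the pattern assumes single-letter amino acid codes on both sides of the position.

```python
import re
from typing import NamedTuple


class AminoAcidMutation(NamedTuple):
    reference: str  # amino acid code in the reference strain
    position: int   # 1-based site position
    mutant: str     # amino acid code in the sequence to be tested


# One letter, one or more digits, one letter, e.g. "D614G".
_MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label: str) -> AminoAcidMutation:
    """Split a label such as 'D614G' into reference, position and mutant."""
    match = _MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    ref, pos, mut = match.groups()
    return AminoAcidMutation(ref, int(pos), mut)
```

For example, `parse_mutation("D614G")` yields reference `D`, position `614`, and mutant `G`, matching the reading given in the text.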
In step 4), high-frequency amino acid mutations are screened from the amino acid mutation frequency table obtained in step 3) as follows:
(1) Let N be the total number of sequences in the data set DataBase_1 and $S_{l_1}$ the number of occurrences of the amino acid mutation $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$; then:
$$SP_{l_1} = \frac{S_{l_1}}{N}$$
If $SP_{l_1} > 0.001$, the amino acid mutation $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ is identified as a high-frequency amino acid mutation;
(2) Judge the amino acid mutations in the amino acid mutation frequency table from step 3) one by one according to step (1), record all high-frequency amino acid mutations found, and establish the high-frequency amino acid mutation set HFMS:
$$\mathrm{HFMS} = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_m}\_l_m\_\mathrm{Mutation}_{l_m}\}$$
In step 5), the core mutation sites are screened from the amino acid mutation frequency table of step 3) as follows:
(1) Count the amino acid mutation frequency at each amino acid site:
$$AH_{l_n} = \{SP^{l_n}_{\mathrm{Amino}_1}, \ldots, SP^{l_n}_{\mathrm{Amino}_m}\}$$
$$SP^{l_n}_{\mathrm{Amino}_m} = \frac{S^{l_n}_{\mathrm{Amino}_m}}{N}$$
where $AH_{l_n}$ is the amino acid frequency set of amino acid site $l_n$, N is the total number of sequences in the data set DataBase_1, $S^{l_n}_{\mathrm{Amino}_m}$ is the number of occurrences of amino acid $\mathrm{Amino}_m$ at site $l_n$, and $SP^{l_n}_{\mathrm{Amino}_m}$ is the frequency of $\mathrm{Amino}_m$ at site $l_n$;
(2) Calculate the Shannon entropy of each amino acid site:
$$H_{l_n} = -\sum_{m} SP^{l_n}_{\mathrm{Amino}_m} \log SP^{l_n}_{\mathrm{Amino}_m}$$
If $H_{l_n} > 0.1$, record amino acid site $l_n$ as a core mutation site, and select the mutation with the highest mutation frequency at that site, $\mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}$, as the key mutation of the site;
(3) Establish the key mutation set KLMS of the core mutation sites:
$$\mathrm{KLMS} = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}^{max}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}\}$$
In step 6), the core mutation combinations in the mutation information dictionary MSet of step 3) are identified as follows:
(1) Establish the core mutation set: take the union of the high-frequency amino acid mutation set HFMS obtained in step 4) and the key mutation set KLMS of the core mutation sites obtained in step 5) to obtain the core mutation set KMS:
$$\mathrm{KMS} = \mathrm{HFMS} \cup \mathrm{KLMS}$$
(2) Identify the core mutation combinations in the mutation information dictionary MSet: intersect the core mutation set KMS with the mutation information set of each sequence in MSet to obtain the core mutation combination $\mathrm{KMSet}[\mathrm{Seq}_m]$ of each sequence:
$$\mathrm{KMSet}[\mathrm{Seq}_m] = \mathrm{KMS} \cap \mathrm{MSet}[\mathrm{Seq}_m]$$
$$\mathrm{KMSet} = \{\mathrm{KMSet}[\mathrm{Seq}_1], \ldots, \mathrm{KMSet}[\mathrm{Seq}_N]\}$$
where KMSet is the core mutation combination information dictionary that records the core mutation combination contained in each sequence;
(3) Perform high-frequency screening on the core mutation combinations from step (2). Let $S_{\mathrm{KMSet}[\mathrm{Seq}_m]}$ be the number of core mutation combinations in KMSet identical to $\mathrm{KMSet}[\mathrm{Seq}_m]$, and N the total number of sequences in the data set DataBase_1; then:
$$SP_{\mathrm{KMSet}[\mathrm{Seq}_m]} = \frac{S_{\mathrm{KMSet}[\mathrm{Seq}_m]}}{N}$$
If $SP_{\mathrm{KMSet}[\mathrm{Seq}_m]} > 0.001$, the core mutation combination $\mathrm{KMSet}[\mathrm{Seq}_m]$ is identified as a high-frequency core mutation combination, all virus strains containing it are called variant strain $MC_m$, and all identified variant strains are saved to the novel coronavirus variant strain list MCL.
This method yields the complete variant strain list MCL, which provides data support for the subsequent analysis and detection of epidemic variants of the novel coronavirus and for vaccine design and epidemic control.
The method further comprises step 7): perform transmission trend analysis and epidemic variant identification on the novel coronavirus variant strain list obtained in step 6) as follows:
(1) Using the time and location annotations of each sequence, count the proportion of each variant strain during virus transmission:
$$SP^{MC}_n = \frac{S^{MC}_n}{AN_n}$$
where $MC \in \mathrm{MCL}$, n denotes the month, $S^{MC}_n$ is the number of sequences of variant strain MC collected in month n, $AN_n$ is the number of all sequences collected in month n, and $SP^{MC}_n$ is the proportion of sequences belonging to variant strain MC in month n;
(2) Using a linear regression tool, fit a transmission curve to the monthly proportions of each variant strain calculated in step (1):
$$Y = \beta_n X + \varepsilon$$
$$X = (x_1, \ldots, x_k)^T$$
$$Y = (SP^{MC}_{x_1}, \ldots, SP^{MC}_{x_k})^T$$
where X is the time variable matrix whose entries $x_1, \ldots, x_k$ are the months used for fitting, Y is the transmission proportion matrix of variant strain MC over those months, $\beta_n$ is the slope of the fitted transmission curve of variant strain MC in month n, and ε is the intercept of the fitted curve;
(3) Identify epidemic variant strains of the novel coronavirus by applying a threshold test to the slope $\beta_n$ of the fitted transmission curve from step (2):
$$T = 3 \times \beta_n \times M$$
where $\beta_n$ is the slope of the fitted transmission curve of variant strain MC in month n, M is the total number of variant strains in the variant strain list MCL, and T is the transmission rise index of variant strain MC in month n. If T ≥ 1, variant strain MC is judged to be an epidemic variant strain in month n;
(4) Collect the epidemic variant strains identified in step (3) and mark those that have not appeared before as new epidemic variant strains. This completes the automatic detection of new epidemic variant strains of the novel coronavirus.
The transmission trend analysis of variant strains comprises the following four steps:
1) Calculate the proportion of each variant strain in each month and plot the transmission curve.
2) Using linear regression, with the data of the three months before the current month as fitting data, calculate the rising slope of each variant strain's transmission curve for the current month.
3) Apply a threshold test to each variant strain's monthly rising slope, and define the variant strains exceeding the threshold as the epidemic strains of that month.
4) Record the epidemic strains of the current month, and mark those that have not appeared before as new epidemic variant strains of the novel coronavirus.
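The last of these steps, marking first-time epidemic strains, can be sketched as a scan over months in chronological order. The input is assumed to be the per-month set of epidemic strains already produced by the threshold test; the function name and the variant labels in the example are hypothetical.

```python
from typing import Dict, Set


def flag_new_epidemic(epidemic_by_month: Dict[int, Set[str]]) -> Dict[int, Set[str]]:
    """For each month, return the epidemic strains not seen in earlier months."""
    seen: Set[str] = set()
    new_by_month: Dict[int, Set[str]] = {}
    for month in sorted(epidemic_by_month):
        current = epidemic_by_month[month]
        new_by_month[month] = current - seen  # first appearance only
        seen |= current
    return new_by_month
```

For instance, if the epidemic strains are {A} in month 1, {A, B} in month 2, and {B, C} in month 3, the new epidemic strains are A, B, and C in months 1, 2, and 3 respectively.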
The virus genome data are large-scale novel coronavirus genome amino acid sequence data, as complete as possible.
A high-frequency amino acid mutation is an amino acid mutation in the novel coronavirus genome sequence whose mutation rate, obtained by statistical screening, exceeds one thousandth.
A core mutation site is an amino acid site in the novel coronavirus genome sequence whose Shannon entropy, obtained by calculation and screening, exceeds 0.1. The key mutation of a core mutation site is the amino acid mutation that occurs most often at that site.
Core mutations are the important, frequently occurring amino acid mutations in the novel coronavirus genome, obtained as the union of the high-frequency amino acid mutation set and the key mutation set of the core mutation sites.
A core mutation combination is a specific pattern of co-occurring core mutations in a novel coronavirus genome sequence, and it also serves as the mutation signature that describes a variant strain.
The application of the method in identifying and giving early warning of new epidemic variant strains of the novel coronavirus, or in preparing products for such identification and early warning, also falls within the protection scope of the invention.
Experiments show that, compared with the prior art, the invention has the following advantages: 1. For large-scale genome data, analyzing the distribution trend of core mutation site combinations in novel coronavirus sequences greatly speeds up the identification of new epidemic variant strains. 2. Analyzing core mutation combinations accurately locates the evolutionary characteristics of new epidemic variant strains, providing an effective data basis for variant strain identification, vaccine design, and drug research and development. 3. Using linear regression, prediction and early warning of the transmission trend of new epidemic variant strains can be completed more efficiently, providing a solid data basis for formulating epidemic prevention and control strategies. 4. The whole detection workflow based on the distribution trend of mutation combinations can run automatically, which makes it easier to build an efficient analysis and early-warning system.
In conclusion, the invention detects and gives early warning of new epidemic variant strains of the novel coronavirus quickly, efficiently, and automatically through mutation site analysis, core mutation statistics, high-frequency core mutation combination statistics, variant strain identification, and regression analysis of transmission trends. The method can analyze large volumes of genome data, runs efficiently, detects new epidemic variant strains accurately, and helps track the variation trend and history of the virus quickly and comprehensively. It addresses two weaknesses of the traditional approach: insufficient capacity to analyze and process large-scale genome data, and the low efficiency of identifying variant strains by constructing an evolutionary tree. Its main advantage is that it automatically and rapidly detects new epidemic variant strains of the novel coronavirus and their core mutation characteristics without manual intervention, and can thereby provide technical support for real-time monitoring of epidemic development, variant strain identification, and vaccine and drug research and development.
Drawings
FIG. 1 is a schematic flow chart of the method for automatically detecting new epidemic variant strains of the novel coronavirus based on the distribution trend of mutation combinations;
FIG. 2 is a statistical chart of amino acid mutation frequencies of novel coronavirus Spike proteins;
FIG. 3 is a Shannon entropy statistic chart of amino acid sites of novel coronavirus Spike protein;
FIG. 4 is a schematic flow chart of core mutation combinatorial screening and variant strain identification;
FIG. 5 is a graph showing the analysis of the distribution trend of the core mutation combinations;
FIG. 6 is a graph of regression analysis of the identification and propagation curves of newly circulating variants.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments, and the examples are given only for illustrating the present invention and not for limiting the scope of the present invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.
The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1
The invention provides a device for detecting and analyzing a virus new epidemic variant strain, which comprises parameter acquisition equipment and a computer readable carrier; the parameter acquisition equipment comprises equipment for acquiring or collecting the amino acid sequence of the virus to be detected, such as equipment for inputting the amino acid sequence of the virus to be detected or equipment for detecting the amino acid sequence of the virus to be detected;
the computer readable carrier comprises:
the data storage module is used for storing the collected amino acid sequences of the virus gene to be tested and the reference gene;
the quality control module is used for performing quality control and length screening on the gene to be tested stored in the data storage module to obtain a data set to be tested and transmitting the data set to the statistical module;
the statistic module is used for receiving the data set to be tested transmitted by the quality control module, combining the gene to be tested with the reference gene, and counting mutation site information and mutation frequency to obtain an amino acid mutation frequency table;
the screening module receives the amino acid mutation frequency table transmitted by the statistical module and further screens to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;
the identification module is used for identifying the gene sequence of the data set to be detected by combining the core mutation site key mutation set to obtain a virus variant mutation characteristic set;
and the analysis module receives the virus variant mutation characteristic set obtained by the identification module, performs propagation trend analysis on the virus variant mutation characteristic set, and identifies the epidemic variant strain to obtain a new epidemic variant strain.
The quality control module performs quality control on the gene to be tested stored in the data storage module as follows: the proportion of abnormal characters P is calculated for each sequence to be tested, and if P > 0.1 the sequence is deleted from the amino acid sequence data set. The proportion of abnormal characters is calculated as:
$$P = \frac{\mu}{L}$$
where L is the total length of the sequence to be tested and $\mu$ is the number of abnormal characters in the sequence;
the quality control module performs length screening on the gene to be tested stored in the data storage module as follows: the length integrity LP is calculated for each sequence to be tested, and if LP < 0.8 the sequence is deleted from the amino acid sequence data set. The length integrity is calculated as:
$$LP = \frac{L}{L_0}$$
where L is the total length of the sequence to be tested and $L_0$ is the total length of the reference strain sequence.
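The two screens above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and treating 'X' as the abnormal character is an assumption taken from Example 2 rather than from the module description itself.

```python
def passes_quality_control(seq, ref_length, abnormal_char="X",
                           max_abnormal=0.1, min_completeness=0.8):
    """Return True if a sequence survives both screens described above."""
    length = len(seq)
    if length == 0:
        return False
    abnormal_ratio = seq.count(abnormal_char) / length  # P = mu / L
    completeness = length / ref_length                   # LP = L / L0
    # A sequence is deleted when P > 0.1 or LP < 0.8, and kept otherwise.
    return abnormal_ratio <= max_abnormal and completeness >= min_completeness


def build_test_dataset(sequences, ref_length):
    """Filter a {name: sequence} mapping down to the data set to be tested."""
    return {name: seq for name, seq in sequences.items()
            if passes_quality_control(seq, ref_length)}
```

Sequences that sit exactly on a threshold (P = 0.1 or LP = 0.8) are kept here, because the text deletes only when P is greater than 0.1 or LP is less than 0.8.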
The method for acquiring the amino acid mutation frequency table in the statistical module comprises the following steps:
(1) Using a multiple sequence alignment tool, align the sequences in the data set to be tested, DataBase_1, with the amino acid sequence of the reference strain, and calibrate the position of each amino acid site in each sequence to be tested;
(2) Based on the alignment result, compare each amino acid sequence to be tested with the reference strain sequence position by position, record the amino acid mutations in the sequence to be tested, and generate its mutation information set:
$$MSet[Seq_m] = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \dots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}_{ln}\}$$
where $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ denotes one amino acid mutation, l1 is the position of the mutated amino acid site, $\mathrm{Amino}_{l1}$ is the amino acid code at position l1 of the reference strain sequence, $\mathrm{Mutation}_{l1}$ is the amino acid code at position l1 of the sequence to be tested, MSet is the mutation information dictionary that records the mutation information sets of all sequences, and $MSet[Seq_m]$ is the mutation information set of sequence $Seq_m$;
(3) Perform frequency statistics on the amino acid mutation records of all sequences to be tested from step (2), and establish the amino acid mutation frequency table of the target gene.
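Steps (2) and (3) could be sketched as below, assuming the alignment has already been produced by an external tool and each sequence is available as a gapped string of the same length as the gapped reference. The function names, the decision to skip insertion columns, and the decision to ignore 'X' residues are assumptions made for illustration, not details stated in the text.

```python
from collections import Counter


def mutation_set(ref_aligned, seq_aligned):
    """Compare an aligned sequence with the aligned reference, site by site.

    Positions are numbered along the reference, so gap columns in the
    reference (insertions in the query) are skipped for simplicity.
    """
    mutations = set()
    position = 0
    for ref_aa, seq_aa in zip(ref_aligned, seq_aligned):
        if ref_aa == "-":
            continue
        position += 1
        if seq_aa != ref_aa and seq_aa != "X":
            # Same "Amino_position_Mutation" naming as the mutation information set.
            mutations.add(f"{ref_aa}_{position}_{seq_aa}")
    return mutations


def mutation_frequency_table(ref_aligned, aligned_seqs):
    """Build the mutation dictionary MSet and the mutation frequency table."""
    mset = {name: mutation_set(ref_aligned, seq)
            for name, seq in aligned_seqs.items()}
    table = Counter(m for muts in mset.values() for m in muts)
    return mset, table
```

A deletion shows up as a mutation whose new residue is "-", which matches the deletion notation used later in Example 2.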
The method for screening the high-frequency amino acid mutation set in the screening module comprises the following steps:
(1) Let N be the total number of sequences in the data set to be tested, DataBase_1, and let $S_{l1}$ be the number of occurrences of the amino acid mutation $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$. Its frequency $SP_{l1}$ is calculated as:
$$SP_{l1} = \frac{S_{l1}}{N}$$
if $SP_{l1} > 0.001$, the amino acid mutation $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ is identified as a high-frequency amino acid mutation;
(2) Evaluate the amino acid mutations in the amino acid mutation frequency table of the statistical module one by one according to step (1), record all screened high-frequency amino acid mutations, and establish the high-frequency amino acid mutation set HFMS:
$$HFMS = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \dots, \mathrm{Amino}_{lm}\_lm\_\mathrm{Mutation}_{lm}\}$$
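The high-frequency screen amounts to a single pass over the frequency table. A minimal sketch, assuming the table maps each mutation name to its occurrence count (the function name is hypothetical):

```python
def high_frequency_mutations(frequency_table, total_sequences, threshold=0.001):
    """Return the set HFMS of mutations whose share of sequences exceeds the threshold."""
    return {mutation for mutation, count in frequency_table.items()
            if count / total_sequences > threshold}
```

With a data set of 1,000 sequences, for example, a mutation seen twice passes the 0.001 screen while a mutation seen once does not, because the comparison is strict.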
The method for screening the key mutation set of the core mutation sites in the screening module comprises the following steps:
(1) Count the amino acid frequencies at each amino acid site, calculated as:
$$AH_{ln} = \{P^{ln}_{\mathrm{Amino}\_1}, \dots, P^{ln}_{\mathrm{Amino}\_m}\}$$
$$P^{ln}_{\mathrm{Amino}\_m} = \frac{S^{ln}_{\mathrm{Amino}\_m}}{N}$$
where $AH_{ln}$ is the amino acid frequency set of the ln-th amino acid site, N is the total number of sequences in the data set to be tested, DataBase_1, $P^{ln}_{\mathrm{Amino}\_m}$ is the frequency of the amino acid Amino_m at the ln-th amino acid site, and $S^{ln}_{\mathrm{Amino}\_m}$ is the number of occurrences of Amino_m at the ln-th amino acid site;
(2) Calculate the Shannon entropy of each amino acid site:
$$H_{ln} = -\sum_{m} P^{ln}_{\mathrm{Amino}\_m} \log P^{ln}_{\mathrm{Amino}\_m}$$
if $H_{ln} > 0.1$, the amino acid site ln is recorded as a core mutation site, and the mutation with the highest mutation frequency at the ln-th amino acid site, $\mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{max}_{ln}$, is selected as the key mutation of that site;
(3) Establish the key mutation set KLMS of the core mutation sites:
$$KLMS = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}^{max}_{l1}, \dots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{max}_{ln}\}$$
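The entropy screen can be sketched as follows. The sketch assumes every sequence has already been projected onto reference coordinates (same length as the ungapped reference) and uses the natural logarithm; the text does not state the logarithm base, so that choice, like the function names, is an assumption.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy of the residues observed at one aligned site."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def key_mutations(ref, aligned_seqs, threshold=0.1):
    """Return the set KLMS: the most frequent mutation at each core mutation site."""
    klms = set()
    for i, ref_aa in enumerate(ref):
        column = [seq[i] for seq in aligned_seqs]
        if site_entropy(column) > threshold:
            # Entropy above zero guarantees at least one non-reference residue.
            mutants = Counter(aa for aa in column if aa != ref_aa)
            mutant, _ = mutants.most_common(1)[0]
            klms.add(f"{ref_aa}_{i + 1}_{mutant}")
    return klms
```

A fully conserved site has entropy 0 and is never selected. A site that is almost entirely mutated also has low entropy and can fall below the threshold even though the mutation itself is very frequent.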
The identification module identifies the gene sequences of the data set to be tested in combination with the core mutation site key mutation set, and obtains the virus variant mutation feature set, as follows:
the core mutation combinations in the mutation information dictionary MSet obtained in the statistical module are identified by the following steps:
(1) Establish the core mutation set: take the union of the high-frequency amino acid mutation set HFMS and the key mutation set KLMS of the core mutation sites, both obtained in the screening module, to obtain the core mutation set KMS:
$$KMS = HFMS \cup KLMS$$
(2) Identify the core mutation combinations in the mutation information dictionary MSet: intersect the core mutation set KMS with the mutation information set of each sequence in MSet to obtain the core mutation combination $KMC_{Seq_m}$ of each sequence:
$$KMC_{Seq_m} = KMS \cap MSet[Seq_m]$$
$$KMSet[Seq_m] = KMC_{Seq_m}$$
where KMSet is the core mutation combination information dictionary, which records the core mutation combination contained in each sequence;
(3) Perform high-frequency screening on the core mutation combinations from step (2): let $S_{KMC_{Seq_m}}$ be the number of core mutation combinations in KMSet that are identical to $KMC_{Seq_m}$, and let N be the total number of sequences in the data set to be tested, DataBase_1. The frequency is calculated as:
$$SP_{KMC_{Seq_m}} = \frac{S_{KMC_{Seq_m}}}{N}$$
if $SP_{KMC_{Seq_m}}$ exceeds the set threshold, the core mutation combination $KMC_{Seq_m}$ is identified as a high-frequency core mutation combination, all virus strains containing the core mutation combination $KMC_{Seq_m}$ are designated as variant strain $MC_m$, and all identified variant strains are stored in the novel coronavirus variant strain list MCL.
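The union, intersection, and counting steps map directly onto Python set operations. In this sketch the threshold is a required argument because its value is not recoverable from the text above, and dropping the empty combination (sequences with no core mutation) is an added assumption; the function name is hypothetical.

```python
from collections import Counter


def identify_variants(mset, hfms, klms, threshold):
    """Return the high-frequency core mutation combinations found in MSet.

    mset maps each sequence name to its set of mutations.
    """
    kms = hfms | klms                                   # KMS = HFMS ∪ KLMS
    kmset = {name: frozenset(muts & kms)                # KMS ∩ MSet[Seq_m]
             for name, muts in mset.items()}
    counts = Counter(kmset.values())                    # identical combinations
    total = len(mset)
    return [combo for combo, count in counts.items()
            if combo and count / total > threshold]
```

Using `frozenset` makes each combination hashable, so sequences are grouped by exact combination regardless of the order in which their mutations were recorded.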
The method for analyzing the propagation trend and identifying the epidemic variant strains in the analysis module comprises the following steps:
(1) According to the spatio-temporal annotation of the sequences, count the monthly proportion of each variant strain during virus propagation, calculated as:
$$P^{n}_{MC} = \frac{S^{n}_{MC}}{AN_n}$$
where MC ∈ MCL, n denotes the month, $P^{n}_{MC}$ is the proportion of variant strain MC among the sequences of month n, $AN_n$ is the number of all sequences collected in month n, and $S^{n}_{MC}$ is the number of sequences of variant strain MC in month n;
(2) Using a linear regression tool, fit a propagation curve to the monthly proportions of each variant strain calculated in step (1):
$$Y = \beta_n X + \varepsilon$$
[Equation images: element-wise definitions of the matrices X and Y]
where X is the time variable matrix, Y is the propagation proportion matrix of variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, and $\varepsilon$ is the intercept of the fitted curve;
(3) Identify the epidemic variant strains of the novel coronavirus by a threshold test on the slope $\beta_n$ of the propagation curve fitted in step (2), calculated as:
$$T = 3 \times \beta_n \times M$$
where $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, M is the total number of variant strains in the variant strain list MCL, and T is the propagation rise index of the epidemic trend of variant strain MC in month n; if T ≥ 1, variant strain MC is judged to be an epidemic variant strain in month n;
(4) Collect the epidemic variant strains identified in step (3) and mark those that have not appeared before as new epidemic variant strains, thereby realizing automatic detection of new epidemic variant strains of the novel coronavirus.
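The slope test can be sketched without any external library by computing the ordinary least-squares slope directly. This stands in for whatever regression tool is actually used, and the choice of which months feed each fit is left to the caller because the text does not specify the fitting window; all function names are hypothetical.

```python
def fit_slope(months, ratios):
    """Ordinary least-squares slope and intercept of ratio against month."""
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(ratios) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, ratios))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rise_index(slope, variant_count):
    """T = 3 * beta_n * M, the propagation rise index."""
    return 3 * slope * variant_count


def is_epidemic(months, ratios, variant_count):
    """Apply the T >= 1 test to one variant's monthly proportions."""
    slope, _ = fit_slope(months, ratios)
    return rise_index(slope, variant_count) >= 1
```

For instance, a variant whose share grows by 0.1 per month in a list of 16 variants gives T = 3 × 0.1 × 16 = 4.8 and is flagged, while a variant with a flat share gives T = 0 and is not.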
Example 2 identification of new circulating variants of the novel coronavirus based on the trend of distribution of the combination of mutations in the S protein of the novel coronavirus.
Based on the device described in Example 1, new epidemic variant strains of the novel coronavirus are identified from the distribution trend of mutation combinations in the S protein of the novel coronavirus, as shown in FIG. 1. First, quality control and length screening are performed on the amino acid sequence set of the novel coronavirus gene to be tested. Mutation site statistics are then computed on the screened data set, and the core mutations present in the virus are found by counting mutation frequencies and calculating Shannon entropy. Next, high-frequency core mutation combinations are screened according to the distribution of the core mutations in the sequences to be tested, and each variant strain is identified and named accordingly. The propagation trend of each variant strain is then analyzed by linear regression, and finally new epidemic variant strains of the novel coronavirus are detected automatically by screening on the rising slope of the propagation curve. The specific steps are as follows:
1. preparation of novel coronavirus Spike protein genome sample
The preparation of the genome sequence sample aiming at the novel coronavirus Spike protein comprises the following steps:
1. construction of novel coronavirus Spike protein genome database
Amino acid sequences of the novel coronavirus Spike protein from all complete, full-length genomes were obtained from public databases. As of September 1, 2021, about 3.2 million novel coronavirus Spike protein sequences had been collected from the databases.
2. Sequence standardization naming and reference strain sequence extraction
The amino acid sequences in the novel coronavirus S protein genome database obtained in step 1 are renamed one by one according to a standardized sequence naming rule.
The sequence naming rule is:
(> Gene name | sequence name | time information | sequence number | geographic information)
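A header that follows this five-field rule can be split back into its parts with a few lines of Python. This is a sketch only: the field keys and the function name are hypothetical, and the code assumes the "|" separator never occurs inside a field.

```python
def parse_header(header):
    """Split a '>gene|name|time|number|location' header into named fields."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {header!r}")
    keys = ("gene", "sequence_name", "time", "sequence_number", "location")
    return dict(zip(keys, fields))
```

The time and location fields recovered this way are what the later monthly proportion statistics rely on.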
According to the reference strain information recommended by the database, the sequence "> Spike | hCoV-19/Wuhan/Hu-1/2019 | (EPI_ISL_402125) | (China)" is extracted and used as the reference strain sequence of the novel coronavirus Spike protein genome database.
2. Quality control and length screening of novel coronavirus Spike protein genome samples
Quality control and length screening against the novel coronavirus Spike protein genome database included the following steps:
1. calculating abnormal character ratio of each sequence in database
The calculation formula is as follows:
$$P = \frac{\mu}{L}$$
where $\mu$ is the number of abnormal characters 'X' in each sequence and L is the total sequence length. If P > 0.1, the sequence is deleted.
2. Calculating the integrity of the genome sequence length
Let L be the total length of the sequence and $L_0$ the total length of the reference strain sequence. The calculation formula is as follows:
$$LP = \frac{L}{L_0}$$
where the total length of the reference sequence of the novel coronavirus Spike protein is $L_0 = 1274$. If LP < 0.8, the genome sequence is deleted from the genome sequence data set. After screening, about 2.46 million sequences of the novel coronavirus S protein genome database are retained and used as the novel coronavirus Spike protein genome data set to be tested, DataBase_1.
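As a quick worked check of the two screens with the reference length of 1274 stated above, the length condition becomes

$$LP < 0.8 \iff L < 0.8 \times 1274 = 1019.2,$$

so a sequence of 1019 residues or fewer is removed. For a full-length sequence of 1274 residues, the abnormal-character condition becomes

$$P > 0.1 \iff \mu > 0.1 \times 1274 = 127.4,$$

so such a sequence is removed only if it contains at least 128 'X' characters.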
3. Statistics of mutation sites of novel coronavirus Spike protein genomes
Mutation site statistics is carried out on the data set DataBase _1 to be detected of the novel coronavirus Spike protein genome.
The method comprises the following specific steps:
1. Using a multiple sequence alignment tool, align the sequences in the data set to be tested, DataBase_1, with the reference strain sequence, and calibrate the position of each amino acid site in each sequence to be tested.
2. Based on the alignment result, compare the amino acid at each site of the sequence to be tested with the amino acid at the corresponding site of the reference sequence, record the amino acid mutations in the sequence to be tested, and generate a mutation information set.
3. Perform frequency statistics on the amino acid mutations in the mutation information set, and establish the Spike protein amino acid mutation frequency table.
4. High-frequency amino acid mutation screening of novel coronavirus Spike protein
Analyzing the Spike protein amino acid mutation frequency table, and screening the amino acid mutation with high frequency. The method comprises the following specific steps:
1. Let $S_1$ be the number of occurrences of each amino acid mutation in the Spike protein amino acid mutation frequency table, and let N be the total number of sequences in the data set to be tested, DataBase_1 (about 2.46 million). The calculation formula is as follows:
$$SP_1 = \frac{S_1}{N}$$
All amino acid mutations with $SP_1 > 0.001$ are screened as high-frequency amino acid mutations.
2. All high-frequency amino acid mutations from step 1 are recorded to establish the high-frequency amino acid mutation set HFMS, which contains 151 high-frequency amino acid mutations. The experimental result is shown in FIG. 2, a statistical chart of the amino acid mutation frequencies of the novel coronavirus Spike protein, in which the abscissa is the name of the amino acid mutation and the ordinate is the amino acid mutation frequency. The 151 high-frequency amino acid mutations are L5F, S12F, S13I, L18F, T19R, T20N, T20I, R21I, T22I, P26S, A27S, T29I, H49Y, Q52R, L54F, A67V, I68-, I68V, H69-, V70-, V70I, G75V, G75R, D80A, D80G, D80Y, S94F, T95I, S98F, D138Y, D138H, G142D, V143-, V143F, Y144-, Y144V, Y145-, W152R, W152L, W152C, M153I, M153T, E154K, E154-, S155-, E156G, E156-, F157-, F157L, F157S, R158-, R158G, L176F, D178H, G181V, L189F, R190S, D215H, D215G, L216F, S221L, A222V, L241-, L242-, A243-, D253G, S254F, S255F, S256L, W258L, A262S, P272L, V367F, K417T, K417N, N439K, N440K, L452R, S477N, T478K, E484K, E484Q, F490S, S494P, N501T, N501Y, A520S, A522S, K558N, A570D, T572I, E583D, Q613H, D614G, V622F, S640F, A653V, H655Y, Q675R, Q675H, Q677P, Q677H, N679K, P681H, P681R, A688V, A701V, S704L, T716I, T732A, G769V, V772I, E780Q, T791I, D796H, P809S, P812L, A845S, T859I, T859N, F888L, A899S, D936Y, S939F, D950N, D950H, Q957R, S982A, A1020S, T1027I, Q1071L, Q1071H, K1073N, A1078S, V1104L, D1118H, G1124V, P1162S, D1163Y, G1167V, V1176F, K1191N, G1219C, G1219V, V1228L, M1229I, M1237I, S1252F, P1263L, V1264L. In each name, the number is the position in the sequence, the letter before the number is the original amino acid, the letter after the number is the mutated amino acid, and "-" indicates an amino acid deletion.
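The compact notation used in this list (original residue, position, then the mutated residue or "-" for a deletion) can be parsed with a short regular expression. The pattern and function name below are illustrative assumptions, not part of the described device.

```python
import re

# One uppercase residue, a position, then a residue or "-" for a deletion.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z-])$")


def parse_mutation(label):
    """Split a label such as 'D614G' or 'H69-' into (original, position, mutated)."""
    match = MUTATION_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    original, position, mutated = match.groups()
    return original, int(position), mutated
```

Parsing the labels this way makes it straightforward to sort mutations by position or to group several mutations that fall on the same site, such as T20N and T20I.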
5. Screening of core mutation sites of the novel coronavirus Spike protein
Analyzing the Spike protein amino acid mutation frequency table, and screening the core mutation sites in the Spike protein amino acid sequence. The method comprises the following specific steps:
1. Calculate the amino acid frequencies at each amino acid site. The calculation formula is as follows:
$$AH_{ln} = \{P^{ln}_{\mathrm{Amino}\_1}, \dots, P^{ln}_{\mathrm{Amino}\_m}\}$$
$$P^{ln}_{\mathrm{Amino}\_m} = \frac{S^{ln}_{\mathrm{Amino}\_m}}{N}$$
where $AH_{ln}$ is the amino acid frequency set of the ln-th amino acid site, N is the total number of sequences in the data set to be tested, DataBase_1, $P^{ln}_{\mathrm{Amino}\_m}$ is the frequency of the amino acid Amino_m at the ln-th amino acid site, and $S^{ln}_{\mathrm{Amino}\_m}$ is the number of occurrences of Amino_m at the ln-th amino acid site. (Amino_m is to be read as a whole: it denotes one of the different amino acid symbols appearing at the ln-th site, and m has no meaning of its own other than to distinguish the different amino acid symbols.)
2. Calculate the Shannon entropy of each amino acid site. The calculation formula is as follows:
$$H_{ln} = -\sum_{m} P^{ln}_{\mathrm{Amino}\_m} \log P^{ln}_{\mathrm{Amino}\_m}$$
If $H_{ln} > 0.1$, the amino acid site ln is recorded as a core mutation site, and the mutation with the highest mutation frequency at the ln-th amino acid site, $\mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}_{ln}$, is selected as the key mutation of that site.
3. The key mutations of the core mutation sites screened in step 2 are recorded to establish the set KLMS, which contains 33 key mutations. The experimental result is shown in FIG. 3, a Shannon entropy statistical chart of the amino acid sites of the novel coronavirus Spike protein, in which the abscissa is the amino acid mutation site and the ordinate is the Shannon entropy value. The 33 key mutations are L5F, L18F, T19R, T20N, P26S, I68-, H69-, V70-, T95I, D138Y, G142D, Y144-, W152C, E156-, F157-, R158G, R190S, A222V, K417T, L452R, S477N, T478K, E484K, N501Y, A570D, H655Y, P681H, T716I, D950N, S982A, T1027I, D1118H, V6F. In each name, the number is the position in the sequence, the letter before the number is the original amino acid, the letter after the number is the mutated amino acid, and "-" indicates an amino acid deletion.
6. Statistics and screening of novel coronavirus Spike protein core mutation combination
Comprehensively analyzing the amino acid mutation characteristics in the Spike protein genome, and screening the core mutation combination in the Spike protein by counting the occurrence frequency of different amino acid mutation combinations. The method comprises the following specific steps:
1. Take the union of the high-frequency amino acid mutation set HFMS and the key mutation set KLMS obtained in the fourth and fifth steps to establish the core mutation set KMS. KMS contains 151 core mutations.
2. Intersect the mutations in each amino acid sequence with the core mutation set KMS, and establish the core mutation combination information dictionary KMSet. KMSet records the combination of core mutations contained in each sequence.
3. Count the entries of the core mutation combination information dictionary KMSet and select the high-frequency core mutation combinations. The specific calculation formula is as follows:
$$SP_{KMC} = \frac{S_{KMC}}{N}$$
where $S_{KMC}$ is the number of occurrences of the core mutation combination and N is the total number of sequences. If $SP_{KMC}$ exceeds the set threshold, the core mutation combination is identified as a high-frequency core mutation combination.
4. The high-frequency core mutation combinations are recorded to establish the high-frequency core mutation combination information table. Each high-frequency core mutation combination is regarded as the gene mutation characteristic of one variant strain, and sequences containing the same high-frequency core mutation combination are regarded as the same variant strain. The high-frequency core mutation combination information table contains 16 high-frequency core mutation combinations. The experimental result is shown in FIG. 4, a frequency statistical chart of the high-frequency core mutation combinations, in which the abscissa is the mutation combination frequency and the ordinate is the name of the high-frequency core mutation combination. The 16 high-frequency core mutation combinations are {H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H}; {L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R}; {H69-, A570D, S982A, N501Y, T716I, D614G, D1118H, V70-, P681H}; {L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R}; {D253G, T95I, Q957R, L5F, D614G, S477N}; {A262S, P272L, A222V, D614G}; {G769V, W152L, D614G, E484K}; {H69-, A570D, S982A, N501Y, L5F, T716I, D614G, D1118H, Y144-, V70-, P681H}; {H69-, A570D, S982A, N501Y, K1191N, T716I, D614G, D1118H, Y144-, V70-, P681H}; {D253G, A701V, T95I, L5F, D614G, E484K}; {D614G, T732A, T478K, P681H}; {L452R, T95I, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R}; {L452R, S13I, W152C, D614G}; {D138Y, P26S, T20N, T1027I, K417T, R190S, N501Y, L18F, D614G, V1176F, H655Y, E484K}; {A222V, L18F, D614G}; {H69-, A570D, S982A, N501Y, T716I, D614G, D1118H, Y144-, V70-, P681H}. The experimental result shows that the device identifies 16 virus variant strains by calculation, and each variant strain is named after its corresponding high-frequency core mutation combination.
7. High-frequency core mutation combination distribution trend analysis and recognition early warning of new popular variant strains
The distribution trend of the high-frequency core mutation combinations obtained in the sixth step is analyzed, and the new epidemic variant strains are identified by combining it with propagation curve fitting. The specific steps are as follows:
1. According to the spatio-temporal annotation of the sequences, the monthly proportion of each variant strain during virus propagation is counted, and a distribution trend chart of the high-frequency core mutation combinations is drawn with the Matplotlib package for Python, as shown in FIG. 5. FIG. 5 is the distribution trend analysis chart of the core mutation combinations, in which the abscissa is the date and the ordinate is the frequency of the mutation combination in that month. FIG. 5 shows how the monthly propagation proportion of each variant strain (high-frequency core mutation combination) changes. A variant strain whose propagation proportion rises steadily is at risk of becoming an epidemic variant strain, and the likelihood of this can be predicted by fitting and analyzing the propagation curve.
2. A propagation curve is fitted to the monthly proportions of each variant strain using the linear regression tool in the Python package scikit-learn (sklearn).
3. The propagation curvature of each variant is calculated, and variants above the defined threshold are regarded as epidemic variants for that month. The threshold formula for the propagation curvature is as follows:
$$T = 3 \times \beta_n \times M$$
where $\beta_n$ is the slope of the fitted propagation curve of the variant in month n and M is the total number of all variants. If T ≥ 1, the variant is judged to be an epidemic variant in month n. The result is shown in FIG. 6, a regression analysis chart of the identification and propagation curves of the newly epidemic variants, in which the abscissa is the date, the ordinate is the propagation proportion of the variant, and each dotted straight line is the regression line for the month in which the propagation curvature of the variant is greatest. The circular marker is the propagation curvature of variant {H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H} in January 2021, with T = 3.12; the device determines this variant to be an epidemic variant in January 2021. The triangular marker is the propagation curvature of variant {L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R} in June 2021, with T = 2.16; the device determines this variant to be an epidemic variant in June 2021. The square marker is the propagation curvature of variant {L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R} in June 2021, with T = 1.21; the device determines this variant to be an epidemic variant in June 2021.
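To make the threshold concrete, assume that M equals the 16 variant strains identified in the sixth step (an assumption; the text does not restate the value of M at this point). The test T ≥ 1 then reduces to a minimum slope:

$$T = 3 \times \beta_n \times 16 \ge 1 \iff \beta_n \ge \frac{1}{48} \approx 0.0208$$

Under the same assumption, the three reported indices correspond to fitted slopes of

$$\beta_n = \frac{3.12}{48} = 0.065, \qquad \beta_n = \frac{2.16}{48} = 0.045, \qquad \beta_n = \frac{1.21}{48} \approx 0.0252,$$

expressed in propagation proportion per unit of the time variable X.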
4. The epidemic variant strains identified for each month are consolidated and recorded, and those that have not appeared before are marked as new epidemic variant strains. This completes the automatic detection and early warning of new epidemic variant strains of the novel coronavirus.
8. Experimental results and verification
1. Through the above experimental steps, the method automatically identified 3 new epidemic variant strains. They are:
(1) The new epidemic variant strain of the novel coronavirus in January 2021, with Spike protein amino acid mutation characteristics {H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H};
(2) The new epidemic variant strain of the novel coronavirus in June 2021, with Spike protein amino acid mutation characteristics {L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R};
(3) The new epidemic variant strain of the novel coronavirus in June 2021, with Spike protein amino acid mutation characteristics {L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R}.
2. Consulting the novel coronavirus information published on the WHO official website and the GISAID data website shows that new epidemic variant strain (1) identified by the method matches the gene variation characteristics of the novel coronavirus Alpha variant, and that new epidemic variant strains (2) and (3) match the gene variation characteristics of the novel coronavirus Delta variant. In conclusion, the method can accurately and effectively identify new epidemic variant strains of the novel coronavirus and give early warning of whether they will spread on a large scale.
The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is made possible within the scope of the claims attached below.

Claims (9)

1. A device for detecting and analyzing a new epidemic variant strain of a virus, comprising a parameter acquisition device and a computer readable carrier; the parameter acquisition device is used for continuously acquiring amino acid sequence data of the virus to be detected in a fixed area over a period of time;
the readable carrier comprises:
the data storage module is used for storing the collected amino acid sequence data of the virus to be detected and the amino acid sequence of the reference strain;
the quality control module is used for performing quality control and length screening on the genes to be tested stored in the data storage module, obtaining a data set to be tested after screening and transmitting the data set to the statistical module;
the statistic module is used for receiving the data set to be tested transmitted by the quality control module, combining the genes to be tested in the data set to be tested with the reference genes, and counting mutation site information and mutation frequency to obtain an amino acid mutation frequency table;
the screening module receives the amino acid mutation frequency table transmitted by the counting module and screens the amino acid mutation frequency table to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;
the identification module is used for receiving the key mutation set of the core mutation site transmitted by the screening module to identify the gene sequence of the data set to be detected, so as to obtain a virus variant mutation feature set;
and the analysis module receives the virus variant mutation characteristic set obtained by the identification module, performs propagation trend analysis on the virus variant mutation characteristic set, and identifies the virus variant mutation characteristic set and the epidemic variant strain to obtain the epidemic variant strain.
2. The device of claim 1, wherein the quality control module performs quality control on the genes to be tested stored in the data storage module by: calculating the proportion P of abnormal characters in each sequence to be tested, and deleting the sequence from the amino acid sequence data set if P > 0.1, wherein the proportion of abnormal characters is calculated as:
$P = \mu / L$
where L is the total length of the sequence to be tested and $\mu$ is the number of abnormal characters in the sequence;
the quality control module performs length screening on the genes to be tested stored in the data storage module by: calculating the length integrity LP of each sequence to be tested, and deleting the sequence from the amino acid sequence data set if LP < 0.8, wherein the length integrity is calculated as:
$LP = L / L_0$
where L is the total length of the sequence to be tested and $L_0$ is the total length of the reference strain sequence.
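For illustration only, the two screening rules of claim 2 can be sketched as follows, assuming P is the number of abnormal characters divided by the sequence length and LP is the sequence length divided by the reference length. The set `ABNORMAL_CHARS` and all function names are hypothetical; the claim does not enumerate which characters count as abnormal.

```python
# Hypothetical set of "abnormal" characters; the claim does not list them.
ABNORMAL_CHARS = frozenset("X*?")


def abnormal_ratio(seq: str) -> float:
    """P: number of abnormal characters divided by sequence length."""
    if not seq:
        return 1.0  # treat an empty sequence as fully abnormal
    mu = sum(1 for ch in seq.upper() if ch in ABNORMAL_CHARS)
    return mu / len(seq)


def length_integrity(seq: str, reference: str) -> float:
    """LP: sequence length divided by reference strain length."""
    return len(seq) / len(reference)


def quality_filter(sequences: dict, reference: str,
                   max_abnormal: float = 0.1,
                   min_integrity: float = 0.8) -> dict:
    """Drop sequences with P > max_abnormal or LP < min_integrity."""
    return {
        name: seq
        for name, seq in sequences.items()
        if abnormal_ratio(seq) <= max_abnormal
        and length_integrity(seq, reference) >= min_integrity
    }
```

A sequence sitting exactly on a threshold (P equal to 0.1 or LP equal to 0.8) is kept, since the claim deletes only on strict inequality.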
3. The device of claim 1, wherein the amino acid mutation frequency table in the statistical module is obtained by:
(1) using a multiple sequence alignment tool to align the sequences in the data set to be tested, one by one, with the amino acid sequence of the reference strain, and calibrating the position information of each amino acid site in the sequence to be tested;
(2) according to the alignment result, comparing the amino acid sequence to be tested with the reference strain sequence position by position, recording the amino acid mutations in the sequence to be tested, and generating the mutation information set of the amino acid sequence to be tested:
$\mathrm{MSet}[\mathrm{Seq}_m] = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}_{l_n}\}$
wherein $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ represents an amino acid mutation, $l_1$ is the position of the mutated amino acid site, $\mathrm{Amino}_{l_1}$ is the amino acid code at position $l_1$ of the reference strain sequence, $\mathrm{Mutation}_{l_1}$ is the amino acid code at position $l_1$ of the sequence to be tested, MSet is a mutation information dictionary recording the mutation information sets of all sequences, and $\mathrm{MSet}[\mathrm{Seq}_m]$ denotes the mutation information set of sequence $\mathrm{Seq}_m$;
(3) analyzing the amino acid mutation records of all sequences to be tested from step (2), counting the number of occurrences of each amino acid mutation across all sequences, and establishing the amino acid mutation frequency table of the virus to be tested.
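Steps (2) and (3) of claim 3 could be sketched as below, assuming each sequence has already been aligned against the reference by an external tool, so that both strings have equal length and gaps are written as `-`. The function names, the `Ref_pos_Mut` string encoding, and the choice to skip insertions relative to the reference are assumptions of this sketch rather than details given in the claim.

```python
from collections import Counter


def mutation_set(ref_aln: str, seq_aln: str) -> set:
    """Compare an aligned sequence with the aligned reference, site by site."""
    if len(ref_aln) != len(seq_aln):
        raise ValueError("aligned sequences must have equal length")
    mutations = set()
    pos = 0  # 1-based position in the ungapped reference
    for ref_aa, seq_aa in zip(ref_aln, seq_aln):
        if ref_aa == "-":
            continue  # insertion relative to the reference: skipped here
        pos += 1
        if seq_aa != ref_aa:
            mutations.add(f"{ref_aa}_{pos}_{seq_aa}")
    return mutations


def build_mset(ref_aln: str, aligned: dict) -> dict:
    """MSet: sequence name -> set of mutations found in that sequence."""
    return {name: mutation_set(ref_aln, seq) for name, seq in aligned.items()}


def mutation_frequency_table(mset: dict) -> Counter:
    """Number of sequences in which each mutation occurs."""
    return Counter(m for muts in mset.values() for m in muts)
```

Because each sequence contributes a set, a mutation is counted at most once per sequence, which matches counting occurrences across sequences.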
4. The device of claim 1, wherein the screening module screens the high-frequency amino acid mutation set by:
(1) an amino acid mutation is a high-frequency amino acid mutation when its number of occurrences is greater than one thousandth of the total number N of sequences contained in the data set to be tested;
(2) judging all amino acid mutations in the amino acid mutation frequency table of the statistical module one by one according to step (1), recording all screened high-frequency amino acid mutations, and establishing the high-frequency amino acid mutation set HFMS:
$\mathrm{HFMS} = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_m}\_l_m\_\mathrm{Mutation}_{l_m}\}$.
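The one-thousandth rule of claim 4 amounts to a single filter over the mutation frequency table. A minimal sketch, with a hypothetical function name and made-up example counts, is:

```python
def high_frequency_mutations(freq_table: dict, n_sequences: int) -> set:
    """HFMS: mutations whose count exceeds one thousandth of N.

    The comparison is done in integers (count * 1000 > N) to avoid
    floating-point rounding at the threshold.
    """
    return {m for m, count in freq_table.items() if count * 1000 > n_sequences}


# Usage with made-up counts for N = 5000 sequences (threshold: more than 5).
example_table = {"D_614_G": 4000, "A_222_V": 6, "L_18_F": 5, "P_681_H": 1}
example_hfms = high_frequency_mutations(example_table, 5000)
```

A mutation seen exactly N/1000 times is not included, because the claim requires the count to be strictly greater than the threshold.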
5. The device according to claim 1, wherein the screening module screens the core mutation site key mutation set by:
(1) counting the frequency of the amino acid mutations at each amino acid position, calculated as:
Figure FDA0003982627910000021
Figure FDA0003982627910000022
wherein $AH_{l_n}$ represents the amino acid frequency set of the $l_n$-th amino acid position, N is the total number of sequences contained in the data set to be tested,
Figure FDA0003982627910000023
indicates the frequency of occurrence of the amino acid Amino_m at amino acid position $l_n$, and
Figure FDA0003982627910000024
represents the frequency of occurrence of Amino_m at amino acid position $l_n$;
(2) calculating the Shannon entropy of each amino acid site, calculated as:
Figure FDA0003982627910000031
if $H_{l_n} > 0.1$, amino acid position $l_n$ is recorded as a core mutation site, and the mutation with the highest mutation frequency at the $l_n$-th amino acid site, $\mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{\max}_{l_n}$, is selected as the key mutation of this site;
(3) according to the method in step (2), collecting the key mutations of all core mutation sites to obtain the core mutation site key mutation set KLMS:
$\mathrm{KLMS} = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}^{\max}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{\max}_{l_n}\}$.
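The formula images of claim 5 are not reproduced in this text, so the following is only a sketch under stated assumptions: the per-site frequency is taken to be the residue count divided by N, the entropy is the standard Shannon form with the natural logarithm (the base is not stated here and affects how the 0.1 threshold behaves), and the key mutation is the most frequent non-reference residue. All names are hypothetical, and the sequences are assumed to be aligned to the same length as the reference.

```python
import math
from collections import Counter


def site_entropy(column: str) -> float:
    """Shannon entropy (natural log, an assumption) of one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def key_mutations(reference: str, sequences: list,
                  threshold: float = 0.1) -> set:
    """KLMS: for each site with entropy above the threshold, keep the most
    frequent non-reference residue, encoded as 'Ref_pos_Mut'."""
    klms = set()
    for i, ref_aa in enumerate(reference):
        column = "".join(seq[i] for seq in sequences)
        if site_entropy(column) <= threshold:
            continue  # not a core mutation site
        mutants = Counter(aa for aa in column if aa != ref_aa)
        if mutants:
            top_aa, _ = mutants.most_common(1)[0]
            klms.add(f"{ref_aa}_{i + 1}_{top_aa}")
    return klms
```

Sites where every sequence carries the same residue have zero entropy and are never reported, whichever logarithm base is used.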
6. The device of claim 1, wherein each virus record in the data storage module comprises sampling time, sampling location, and sequence information.
7. The device according to claim 6, wherein the identification module identifies the gene sequences of the data set to be tested by combining the core mutation site key mutation set and the high-frequency amino acid mutation set, and the method for obtaining the virus variant mutation feature set comprises identifying the core mutation combinations in the mutation information dictionary MSet, followed by propagation trend analysis and epidemic variant strain identification in the analysis module.
8. The device of claim 7, wherein identifying the core mutation combinations in the mutation information dictionary MSet comprises:
(1) establishing a core mutation set: performing a union operation on the high-frequency amino acid mutation set HFMS and the core mutation site key mutation set KLMS obtained in step 5) to obtain the core mutation set KMS, calculated as:
$\mathrm{KMS} = \mathrm{HFMS} \cup \mathrm{KLMS}$;
(2) identifying the core mutation combinations in the mutation information dictionary MSet: performing an intersection operation between the core mutation set KMS and the mutation information set of each sequence in the mutation information dictionary MSet to obtain the core mutation combination of each sequence, calculated as:
Figure FDA0003982627910000032
Figure FDA0003982627910000033
wherein KMSet is the core mutation combination information dictionary, used for recording the core mutation combination contained in each sequence, and
Figure FDA0003982627910000034
is the core mutation combination contained in sequence $\mathrm{Seq}_m$;
(3) performing high-frequency screening on the core mutation combination dictionary from step (2):
Figure FDA0003982627910000035
for a core mutation combination
Figure FDA0003982627910000036
in the core mutation combination dictionary, when its number of occurrences
Figure FDA0003982627910000037
is greater than one thousandth of the total number N of sequences contained in the data set to be tested, the corresponding core mutation combination
Figure FDA0003982627910000038
is a high-frequency core mutation combination; a virus strain containing this core mutation combination is referred to as a variant strain $MC_m$ and is recorded in the virus variant strain list MCL.
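The set operations of claim 8 might be sketched as follows. The function names are hypothetical, and skipping sequences whose core mutation combination is empty is an assumption of this sketch that the claim does not address.

```python
from collections import Counter


def core_combinations(mset: dict, hfms: set, klms: set) -> dict:
    """KMSet: per-sequence intersection of its mutations with KMS = HFMS | KLMS."""
    kms = hfms | klms
    return {name: frozenset(muts & kms) for name, muts in mset.items()}


def variant_list(kmset: dict, n_sequences: int) -> list:
    """MCL: core mutation combinations carried by more than N/1000 sequences.

    Empty combinations are skipped (an assumption of this sketch).
    """
    counts = Counter(kmset.values())
    return [combo for combo, c in counts.items()
            if combo and c * 1000 > n_sequences]
```

Using `frozenset` makes each combination hashable, so identical combinations from different sequences are counted together regardless of the order in which their mutations were recorded.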
9. The device of claim 7, wherein the analysis module performs propagation trend analysis and epidemic variant strain identification by:
(1) counting the proportion of each variant strain during virus propagation according to the sampling time information of the sequences, calculated as:
Figure FDA0003982627910000041
where MC ∈ MCL, n denotes the month,
Figure FDA0003982627910000042
indicates the number of sequences of variant strain MC in month n in the sequence set, $AN_n$ indicates the number of all sequences collected in month n in the sequence set, and
Figure FDA0003982627910000043
indicates the number of sequences of variant strain MC in month n;
(2) fitting a propagation curve to the monthly proportions of each variant strain calculated in step (1) using a linear regression tool, calculated as:
$Y = \beta_n X + \varepsilon$
Figure FDA0003982627910000044
Figure FDA0003982627910000045
wherein X is the time variable matrix, Y is the propagation proportion matrix of variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, and $\varepsilon$ is the intercept of the fitted curve;
(3) identifying the epidemic variant strains of the virus by a threshold judgment on the slope $\beta_n$ of the fitted propagation curve from step (2), calculated as:
$T = 3 \times \beta_n \times M$
wherein $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, M is the total number of variant strains in the variant strain list MCL, and T is the propagation rising index of the epidemic trend of variant strain MC in month n; if T ≥ 1, the variant strain MC is judged to be an epidemic variant strain in month n.
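The trend analysis of claim 9 could be sketched as below. The definitions of the matrices X and Y are given only as formula images that are not reproduced here, so this sketch assumes the slope for month n is fitted by ordinary least squares over a trailing window of months; the window length of three, like all function names, is hypothetical.

```python
def monthly_ratio(variant_counts: list, total_counts: list) -> list:
    """Share of sequences assigned to one variant strain, month by month."""
    return [mc / an if an else 0.0
            for mc, an in zip(variant_counts, total_counts)]


def ols_slope(xs: list, ys: list) -> float:
    """Slope of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def rising_index(ratios: list, month: int, n_variants: int,
                 window: int = 3) -> float:
    """T = 3 * slope * M, with the slope fitted over a trailing window
    ending at `month` (window length is an assumption of this sketch)."""
    start = max(0, month - window + 1)
    ys = ratios[start:month + 1]
    if len(ys) < 2:
        return 0.0  # a single point has no slope
    xs = list(range(start, start + len(ys)))
    return 3 * ols_slope(xs, ys) * n_variants


def is_epidemic(ratios: list, month: int, n_variants: int,
                window: int = 3) -> bool:
    """A variant strain is flagged as epidemic in `month` when T >= 1."""
    return rising_index(ratios, month, n_variants, window) >= 1
```

With M variant strains in the list, T reaches 1 when the fitted monthly increase in proportion is at least 1/(3M), so the bar for flagging a strain drops as more variant strains are tracked.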

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211554331.0A CN115798578A (en) 2022-12-06 2022-12-06 Device and method for analyzing and detecting virus new epidemic variant strain

Publications (1)

Publication Number Publication Date
CN115798578A true CN115798578A (en) 2023-03-14

Family

ID=85445984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211554331.0A Pending CN115798578A (en) 2022-12-06 2022-12-06 Device and method for analyzing and detecting virus new epidemic variant strain

Country Status (1)

Country Link
CN (1) CN115798578A (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7255993B1 (en) * 2002-11-06 2007-08-14 The United States Of America As Represented By The Secretary Of The Department Of Health And Human Services Detection of mutational frequency and related methods
CN107423578A (en) * 2017-03-02 2017-12-01 北京诺禾致源科技股份有限公司 Detect the device of somatic mutation
US20200109452A1 (en) * 2017-03-31 2020-04-09 Premaitha Limited Method of detecting a fetal chromosomal abnormality
CN108733975A (en) * 2018-03-29 2018-11-02 深圳裕策生物科技有限公司 Tumor colonies mutation detection method, device and storage medium based on the sequencing of two generations
CN112313748A (en) * 2018-06-20 2021-02-02 香港中文大学 Measurement and prediction of viral gene mutation patterns
US20220068437A1 (en) * 2019-05-15 2022-03-03 Bgi Genomics Co., Ltd. Base mutation detection method and apparatus based on sequencing data, and storage medium
CN111524611A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Method, device and equipment for constructing infectious disease trend prediction model
CN111863271A (en) * 2020-06-08 2020-10-30 浙江大学 Early warning, prevention and control analysis system for serious infectious disease transmission risk of new coronary pneumonia
CN111933214A (en) * 2020-09-27 2020-11-13 至本医疗科技(上海)有限公司 Method and computing device for detecting RNA level somatic gene variation
CN113025759A (en) * 2021-05-11 2021-06-25 常州国药医学检验实验室有限公司 Novel coronavirus nucleic acid mutation detection technology based on fluorescent quantitative PCR technology and application thereof
CN113073106A (en) * 2021-06-04 2021-07-06 北京华芢生物技术有限公司 Novel coronavirus B.1.525 Nigeria mutant RBD gene and application thereof
CN113990390A (en) * 2021-06-07 2022-01-28 重庆南鹏人工智能科技研究院有限公司 Machine learning-based new coronavirus subgroup identification method
CN113593639A (en) * 2021-08-05 2021-11-02 湖南大学 Method and system for analyzing and monitoring virus genome variation
CN113470745A (en) * 2021-08-25 2021-10-01 南京立顶医疗科技有限公司 Screening method of potential mutation site of SARS-CoV2 and its application
CN113980100A (en) * 2021-12-23 2022-01-28 中国人民解放军军事科学院军事医学研究院 Adenovirus vector recombinant new coronavirus B.1.429 variant vaccine and application thereof
CN114464246A (en) * 2022-01-19 2022-05-10 华中科技大学同济医学院附属协和医院 Method for detecting mutation related to genetic increase based on CovMutt framework

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116741268A (en) * 2023-04-04 2023-09-12 中国人民解放军军事科学院军事医学研究院 Method, device and computer readable storage medium for screening key mutation of pathogen
CN116741268B (en) * 2023-04-04 2024-03-01 中国人民解放军军事科学院军事医学研究院 Method, device and computer readable storage medium for screening key mutation of pathogen

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination