CN115798578A - Device and method for analyzing and detecting virus new epidemic variant strain - Google Patents
- Publication number
- CN115798578A (application CN202211554331.0A)
- Authority
- CN
- China
- Prior art keywords
- mutation
- amino acid
- sequence
- core
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a device for detecting and analyzing new epidemic variant strains of a virus, comprising parameter acquisition equipment and a computer-readable carrier. Through mutation site analysis, core mutation statistics, high-frequency core mutation combination statistics, variant strain identification and propagation trend regression analysis, the device detects new epidemic variant strains of the novel coronavirus and issues early warnings quickly, efficiently and automatically. The method can analyze a large volume of genome data, runs efficiently, detects new epidemic variant strains accurately, and helps to track the variation trend and history of the virus quickly and comprehensively. The invention automatically and rapidly detects new epidemic variant strains of the novel coronavirus together with their core mutation characteristics, requires no manual participation, and can provide technical support for real-time monitoring of epidemic development, variant strain identification, and vaccine and drug research and development.
Description
Technical Field
The invention relates to the technical field of biology, in particular to an automatic detection and early warning method for a new epidemic variant strain of a virus based on a mutation combination distribution trend.
Background
The novel coronavirus has produced several new variant strains with enhanced transmissibility. The four new epidemic variant strains posing the greatest threat have been named the alpha, beta, gamma and delta variants by the World Health Organization; of these, the delta variant appeared late, spread quickly and caused severe harm. Insufficient capability to recognize variant strains and give early warning causes epidemic-control strategy to lag seriously behind the spread of the epidemic. Therefore, in the face of rapid virus mutation, automatically and rapidly detecting new epidemic variant strains of the novel coronavirus is a very important topic.
At present, new epidemic variant strains of a virus are mainly screened by constructing a phylogenetic tree and then manually observing the growth of each branch of the tree to determine whether a new epidemic strain exists. Standardized names are assigned according to the position of the strain in the evolutionary tree; for example, the delta variant of the novel coronavirus is designated B.1.617.2. This method has several disadvantages. First, it requires constructing a phylogenetic tree for the virus, and the computing resources required grow with the number of virus samples. Since more than 4 million novel coronavirus samples are available, the required computing resources and computing time are enormous, and the computing power of conventional computers cannot meet the requirement of constructing the phylogenetic tree of the novel coronavirus. Second, conventional methods rely on manual observation; the current evolutionary tree of the novel coronavirus is extremely complex, with more than 4 million leaf nodes to observe, so it is difficult to detect a new epidemic variant strain rapidly by manual observation. Third, conventional methods cannot directly reveal the mutation characteristics of new epidemic variant strains, and therefore cannot provide technical support for rapid variant identification and vaccine design, which ultimately slows the implementation of epidemic prevention strategy. In summary, conventional methods for detecting new epidemic variant strains can hardly meet the needs of epidemic prevention and control while the novel coronavirus is spreading widely.
An automatic, rapid detection method for new epidemic variant strains of the novel coronavirus is therefore urgently needed to support epidemic early warning and vaccine design.
Disclosure of Invention
It is an object of the present invention to provide an apparatus for detecting and analyzing a new epidemic variant of a virus.
The invention provides a device for detecting and analyzing new epidemic variant strains of a virus, comprising parameter acquisition equipment and a computer-readable carrier; the parameter acquisition equipment is used to continuously acquire amino acid sequence data of the virus to be detected in a fixed region over a period of time;
the readable carrier comprises:
the data storage module, which stores the collected amino acid sequence data of the virus to be detected and the amino acid sequence of the reference strain; the amino acid sequence data of the virus to be detected may be collected by the user or taken directly from a sequence database;
the quality control module, which performs quality control and length screening on the sequences to be detected stored in the data storage module to obtain a data set to be detected, and transmits the data set to the statistical module;
the statistical module, which receives the data set to be detected from the quality control module, compares the sequences to be detected with the reference sequence, and counts mutation site information and mutation frequency to obtain an amino acid mutation frequency table;
the screening module, which receives the amino acid mutation frequency table from the statistical module and screens it further to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;
the identification module, which identifies the sequences of the data set to be detected using the core mutation site key mutation set to obtain a virus variant mutation feature set;
and the analysis module, which receives the virus variant mutation feature set from the identification module and performs propagation trend analysis and epidemic variant identification on it to obtain the epidemic variant strains.
The quality control module performs quality control on the sequences to be detected stored in the data storage module as follows: the proportion of abnormal characters P is calculated for each sequence to be detected, and if $P > 0.1$ the sequence is deleted from the amino acid sequence data set. The proportion of abnormal characters is calculated as:
$$P = \frac{\mu}{L}$$
where $L$ is the total length of the sequence to be detected and $\mu$ is the number of abnormal characters in the sequence;
the quality control module performs length screening on the sequences to be detected stored in the data storage module as follows: the length integrity LP is calculated for each sequence to be detected, and if $LP < 0.8$ the sequence is deleted from the amino acid sequence data set. The length integrity is calculated as:
$$LP = \frac{L}{L_0}$$
where $L$ is the total length of the sequence to be detected and $L_0$ is the total length of the reference strain sequence.
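The two screening rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the text does not define what counts as an "abnormal character", so the sketch assumes it is any character outside the 20 standard amino acid codes, and the function names are hypothetical.

```python
# Assumption: an "abnormal character" is anything outside the 20 standard
# one-letter amino acid codes (e.g. X, *, -). The source does not define it.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def abnormal_ratio(seq):
    """P = mu / L: share of abnormal characters in the sequence."""
    if not seq:
        return 1.0  # treat an empty sequence as fully abnormal
    mu = sum(1 for ch in seq.upper() if ch not in STANDARD_AMINO_ACIDS)
    return mu / len(seq)


def length_integrity(seq, reference):
    """LP = L / L0: sequence length relative to the reference length."""
    return len(seq) / len(reference)


def quality_filter(sequences, reference, max_abnormal=0.1, min_integrity=0.8):
    """Keep sequences with P <= 0.1 and LP >= 0.8 (delete if P > 0.1 or LP < 0.8)."""
    return {
        name: seq
        for name, seq in sequences.items()
        if abnormal_ratio(seq) <= max_abnormal
        and length_integrity(seq, reference) >= min_integrity
    }
```

A sequence is dropped as soon as either rule fails, which matches the order-independent wording of the two conditions.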
The method for acquiring the amino acid mutation frequency table in the statistical module comprises the following steps:
(1) Aligning each sequence in the data set to be detected with the reference strain amino acid sequence using a multiple sequence alignment tool, and calibrating the position of each amino acid site in the sequence to be detected;
(2) According to the alignment result, comparing the amino acid sequence to be detected with the reference strain sequence position by position, recording the amino acid mutations in the sequence to be detected, and generating the mutation information set of that sequence:
$$MSet[Seq_m] = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \ldots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}_{ln}\}$$
where $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ represents one amino acid mutation, $l1$ is the position of the mutation site, $\mathrm{Amino}_{l1}$ is the amino acid code at position $l1$ of the reference strain sequence, and $\mathrm{Mutation}_{l1}$ is the amino acid code at position $l1$ of the sequence to be detected; MSet is the mutation information dictionary that records the mutation information sets of all sequences, and $MSet[Seq_m]$ is the mutation information set of sequence $Seq_m$;
(3) Analyzing the amino acid mutation records of all sequences to be detected from step (2), counting how often each amino acid mutation occurs across all sequences, and establishing the amino acid mutation frequency table of the virus to be detected.
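Steps (2) and (3) can be illustrated with a short Python sketch. It assumes the alignment has already been produced by an external tool and that every aligned row has the same length as the aligned reference row, with `-` marking gaps. The `D_614_G` string layout follows the Amino_position_Mutation pattern of the formula above; the function names are hypothetical.

```python
from collections import Counter


def mutation_set(aligned_ref, aligned_seq):
    """Compare an aligned sequence with the aligned reference, position by position."""
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned rows must have the same length")
    mutations = set()
    pos = 0  # position counted on the reference, ignoring alignment gaps
    for ref_aa, seq_aa in zip(aligned_ref, aligned_seq):
        if ref_aa == "-":
            continue  # insertion relative to the reference: skipped in this sketch
        pos += 1
        if seq_aa != ref_aa:
            mutations.add(f"{ref_aa}_{pos}_{seq_aa}")
    return mutations


def build_mset(aligned_ref, aligned_seqs):
    """MSet: dictionary mapping each sequence name to its mutation information set."""
    return {name: mutation_set(aligned_ref, seq) for name, seq in aligned_seqs.items()}


def mutation_frequency_table(mset):
    """Count in how many sequences each amino acid mutation occurs."""
    table = Counter()
    for mutations in mset.values():
        table.update(mutations)
    return table
```

Because each per-sequence record is a set, a mutation is counted at most once per sequence, so the table values are directly comparable with the total number of sequences N.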
The method for screening the high-frequency amino acid mutation set in the screening module comprises the following steps:
(1) When the occurrence frequency of the amino acid mutation is more than one thousandth of the total number N of the sequences contained in the data set to be detected, the amino acid mutation is high-frequency amino acid mutation;
(2) Judging all amino acid mutations in the amino acid mutation frequency table in the statistical module one by one according to the method in the step (1), recording all screened high-frequency amino acid mutations, and establishing a high-frequency amino acid mutation set HFMS:
$$HFMS = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \ldots, \mathrm{Amino}_{lm}\_lm\_\mathrm{Mutation}_{lm}\}$$
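The high-frequency rule is a single threshold on the mutation frequency table. A minimal sketch, assuming the table is a mapping from mutation code to the number of sequences carrying it (the function and parameter names are hypothetical):

```python
def high_frequency_mutations(freq_table, n_sequences, ratio=0.001):
    """Return the set HFMS: mutations occurring in more than ratio * N sequences."""
    return {
        mutation
        for mutation, count in freq_table.items()
        if count / n_sequences > ratio
    }
```

For example, with N = 2000 sequences the cutoff is 2 occurrences, so a mutation seen 3 times is retained and one seen once is not.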
The method for screening the core mutation site key mutation set in the screening module comprises the following steps:
(1) Counting the frequency of each amino acid at every amino acid position, with the calculation formula:
$$AH_{ln} = \{SP^{ln}_{Amino\_1}, \ldots, SP^{ln}_{Amino\_m}\}, \qquad SP^{ln}_{Amino\_m} = \frac{S^{ln}_{Amino\_m}}{N}$$
where $AH_{ln}$ represents the amino acid frequency set of the ln-th amino acid position, $N$ is the total number of sequences contained in the data set to be detected, $S^{ln}_{Amino\_m}$ is the number of occurrences of the amino acid Amino_m at position ln, and $SP^{ln}_{Amino\_m}$ is the frequency of occurrence of Amino_m at position ln;
(2) Calculating the Shannon entropy of each amino acid site, with the calculation formula:
$$H_{ln} = -\sum_{m} SP^{ln}_{Amino\_m} \log SP^{ln}_{Amino\_m}$$
if $H_{ln} > 0.1$, the amino acid position ln is recorded as a core mutation site, and the mutation with the highest frequency at the ln-th amino acid position, $\mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{max}_{ln}$, is selected as the key mutation of that site;
(3) According to the method in step (2), collecting the key mutations of all core mutation sites to obtain the core mutation site key mutation set KLMS:
$$KLMS = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}^{max}_{l1}, \ldots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{max}_{ln}\}$$
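The entropy-based site screening can be sketched as follows. Several points are assumptions of this sketch rather than statements from the text: the logarithm is taken to base 2, the sequences are already aligned to the reference with no gaps in the reference row, and the key mutation is the most frequent non-reference amino acid at the site. All function names are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (base 2, an assumption) of the amino acids observed at one site."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def key_mutations(reference, sequences, threshold=0.1):
    """Return KLMS: the most frequent mutation at every site whose entropy exceeds the threshold."""
    klms = set()
    for i, ref_aa in enumerate(reference):
        column = "".join(seq[i] for seq in sequences)
        if site_entropy(column) <= threshold:
            continue  # conserved site, not a core mutation site
        mutant_counts = Counter(aa for aa in column if aa != ref_aa)
        if mutant_counts:
            alt, _ = mutant_counts.most_common(1)[0]
            klms.add(f"{ref_aa}_{i + 1}_{alt}")
    return klms
```

A fully conserved site has entropy 0, while a site split evenly between two amino acids has entropy 1 bit, so the 0.1 threshold flags sites with even modest variability.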
Each virus record in the data storage module comprises a sampling time, a sampling location and sequence information.
The virus variant mutation feature set is obtained by identifying the core mutation combinations in the mutation information dictionary MSet; propagation trend analysis and epidemic variant identification are then performed in the analysis module.
The method for identifying the core mutation combination in the mutation information dictionary MSet comprises the following steps:
(1) Establishing a core mutation set: taking the union of the high-frequency amino acid mutation set HFMS and the core mutation site key mutation set KLMS to obtain the core mutation set KMS, with the calculation formula:
$$KMS = HFMS \cup KLMS$$
(2) Identifying the core mutation combinations in the mutation information dictionary MSet: intersecting the core mutation set KMS with the mutation information set of each sequence in MSet to obtain the core mutation combination of each sequence, with the calculation formula:
$$KMSet[Seq_m] = KMS \cap MSet[Seq_m]$$
where KMSet is the core mutation combination information dictionary, which records the core mutation combination contained in each sequence, and $KMSet[Seq_m]$ is the core mutation combination contained in sequence $Seq_m$;
(3) Performing high-frequency screening on the core mutation combination dictionary from step (2): when the number of entries in KMSet identical to the core mutation combination $KMSet[Seq_m]$ is greater than one thousandth of the total number N of sequences contained in the data set to be detected, that combination is a high-frequency core mutation combination; a virus strain containing this core mutation combination is called variant strain $MC_m$ and is recorded in the virus variant strain list MCL;
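The three steps above map directly onto set operations. The sketch below is illustrative only; it assumes MSet is a dictionary of Python sets keyed by sequence name, and the function names are hypothetical.

```python
from collections import Counter


def core_mutation_combinations(mset, hfms, klms):
    """KMSet[Seq_m] = KMS ∩ MSet[Seq_m], with KMS = HFMS ∪ KLMS."""
    kms = hfms | klms
    # frozenset makes each combination hashable so identical ones can be counted
    return {name: frozenset(mutations & kms) for name, mutations in mset.items()}


def high_frequency_variants(kmset, ratio=0.001):
    """Return the combinations shared by more than ratio * N sequences (the list MCL)."""
    n = len(kmset)
    counts = Counter(kmset.values())
    return [combo for combo, count in counts.items() if count / n > ratio]
```

Intersecting with KMS strips rare, sequence-specific mutations, so sequences that differ only in such mutations fall into the same variant strain.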
The method for propagation trend analysis and epidemic variant strain identification in the analysis module comprises the following steps:
(1) Counting the proportion of each variant strain during virus transmission according to the sampling time of the sequences, with the calculation formula:
$$SP^{MC}_{n} = \frac{S^{MC}_{n}}{AN_{n}}$$
where $MC \in MCL$, $n$ denotes the month, $S^{MC}_{n}$ is the number of sequences of variant strain MC collected in month n of the sequence set, $AN_{n}$ is the number of all sequences collected in month n of the sequence set, and $SP^{MC}_{n}$ is the proportion of sequences belonging to variant strain MC in month n;
(2) Fitting a propagation curve to the monthly proportions of each variant strain calculated in step (1) using a linear regression tool, with the calculation formula:
$$Y = \beta_n X + \varepsilon$$
where $X$ is the time variable matrix, $Y$ is the propagation proportion matrix of variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, and $\varepsilon$ is the intercept of the fitted curve;
(3) Identifying epidemic variant strains of the virus by applying a threshold to the slope $\beta_n$ of the propagation curve fitted in step (2), with the calculation formula:
$$T = 3 \times \beta_n \times M$$
where $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, $M$ is the total number of variant strains in the variant strain list MCL, and $T$ is the propagation rising index of variant strain MC in month n; if $T \geq 1$, variant strain MC is judged to be an epidemic variant strain in month n.
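The trend analysis can be sketched with an ordinary least-squares line fit. The text does not say which months enter the regression for month n, so this sketch assumes the fit uses all monthly proportions supplied to it, with the month index as the time variable; `numpy.polyfit` stands in for the unspecified linear regression tool, and the function names are hypothetical.

```python
import numpy as np


def monthly_proportions(variant_counts, total_counts):
    """Proportion of a variant per month: variant sequences / all sequences that month."""
    return [v / t if t else 0.0 for v, t in zip(variant_counts, total_counts)]


def propagation_slope(proportions):
    """Slope beta of the line Y = beta * X + eps fitted to the monthly proportions."""
    x = np.arange(len(proportions), dtype=float)  # month index as time variable
    y = np.asarray(proportions, dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)


def is_epidemic(slope, n_variants):
    """Apply T = 3 * beta * M and flag the variant when T >= 1."""
    return 3 * slope * n_variants >= 1
```

Because T scales with the number of variants M, the same slope is more likely to be flagged when the sequences are spread over many competing variants, which is consistent with comparing a variant's growth against an even share of 1/M.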
The invention also provides a method for automatic detection and early warning of new epidemic variant strains of the novel coronavirus based on the mutation combination distribution trend, comprising the following steps:
1) Obtaining from a source database the amino acid sequences of all genes to be detected with complete genomes, together with the reference strain amino acid sequence;
2) Performing quality control and length screening on each sequence obtained in step 1);
3) Performing mutation site statistics on the data set to be detected, DataBase_1, obtained in step 2) to obtain an amino acid mutation frequency table and a mutation information dictionary;
4) Screening high-frequency amino acid mutations from the amino acid mutation frequency table obtained in step 3);
5) Screening core mutation sites from the amino acid mutation frequency table obtained in step 3);
6) Identifying the core mutation combinations in the mutation information dictionary obtained in step 3), naming the virus strains containing a core mutation combination as variant strains, and storing all identified variant strains to obtain a novel coronavirus variant strain list.
The quality control and length screening method in step 2) comprises the following steps:
(1) Calculating the proportion of abnormal characters in the sequence. With the total length of the sequence denoted $L$ and the number of abnormal characters in the sequence denoted $\mu$, the calculation formula is:
$$P = \frac{\mu}{L}$$
if $P > 0.1$, the sequence is deleted from the amino acid sequence data set;
(2) Calculating the sequence length integrity. With the total length of the sequence denoted $L$ and the length of the reference strain sequence denoted $L_0$, the calculation formula is:
$$LP = \frac{L}{L_0}$$
if $LP < 0.8$, the sequence is deleted from the amino acid sequence data set; the amino acid sequences remaining after screening form the data set to be detected, denoted DataBase_1.
The mutation site statistics method in step 3) comprises the following steps:
(1) Aligning the sequences in the data set to be detected, DataBase_1, with the reference strain amino acid sequence using a multiple sequence alignment tool, and calibrating the position of each amino acid site in the sequence to be detected;
(2) According to the alignment result, comparing the amino acid sequence to be detected with the reference strain sequence position by position, recording the amino acid mutations in the sequence to be detected, and generating the mutation information set of that sequence:
$$MSet[Seq_m] = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \ldots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}_{ln}\}$$
where $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ represents one amino acid mutation, $l1$ is the position of the mutation site, $\mathrm{Amino}_{l1}$ is the amino acid code at position $l1$ of the reference strain sequence, and $\mathrm{Mutation}_{l1}$ is the amino acid code at position $l1$ of the sequence to be detected. For example, the amino acid mutation D614G indicates that the mutation occurs at position 614, where the reference sequence has D (aspartic acid) and the sequence to be detected has G (glycine). MSet is the mutation information dictionary that records the mutation information sets of all sequences, and $MSet[Seq_m]$ is the mutation information set of sequence $Seq_m$;
(3) Performing frequency statistics on the amino acid mutation records of all sequences to be detected from step (2), and establishing the amino acid mutation frequency table of the target gene.
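The D614G example in step (2) uses the conventional compact notation, while the set formula writes the same information as reference amino acid, position and mutated amino acid. A small sketch of converting between the two, assuming single-letter amino acid codes with `-` or `*` allowed for the mutated residue (the pattern and function names are hypothetical):

```python
import re

# Assumption: one reference letter, a position, then one letter, "-" or "*".
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z*-])$")


def parse_mutation(code):
    """Split a code such as 'D614G' into (reference amino acid, position, mutated amino acid)."""
    match = MUTATION_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a single-site amino acid mutation: {code!r}")
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa


def to_set_notation(code):
    """Rewrite 'D614G' in the Amino_position_Mutation layout used by the set formula."""
    ref_aa, position, alt_aa = parse_mutation(code)
    return f"{ref_aa}_{position}_{alt_aa}"
```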
The specific method in step 4) for screening high-frequency amino acid mutations from the amino acid mutation frequency table obtained in step 3) comprises the following steps:
(1) With the total number of sequences contained in the data set to be detected, DataBase_1, denoted $N$, and the number of occurrences of the amino acid mutation $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ denoted $S_{l1}$, the calculation formula is:
$$SP_{l1} = \frac{S_{l1}}{N}$$
if $SP_{l1} > 0.001$, the amino acid mutation $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ is identified as a high-frequency amino acid mutation;
(2) Judging the amino acid mutations in the amino acid mutation frequency table of step 3) one by one according to the method in step (1), recording all screened high-frequency amino acid mutations, and establishing the high-frequency amino acid mutation set HFMS:
$$HFMS = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \ldots, \mathrm{Amino}_{lm}\_lm\_\mathrm{Mutation}_{lm}\}$$
The method in step 5) for screening core mutation sites from the amino acid mutation frequency table of step 3) comprises the following steps:
(1) Counting the frequency of each amino acid at every amino acid position, with the calculation formula:
$$AH_{ln} = \{SP^{ln}_{Amino\_1}, \ldots, SP^{ln}_{Amino\_m}\}, \qquad SP^{ln}_{Amino\_m} = \frac{S^{ln}_{Amino\_m}}{N}$$
where $AH_{ln}$ represents the amino acid frequency set of the ln-th amino acid position, $N$ is the total number of sequences contained in the data set to be detected, DataBase_1, $S^{ln}_{Amino\_m}$ is the number of occurrences of the amino acid Amino_m at position ln, and $SP^{ln}_{Amino\_m}$ is the frequency of occurrence of Amino_m at position ln;
(2) Calculating the Shannon entropy of each amino acid site, with the calculation formula:
$$H_{ln} = -\sum_{m} SP^{ln}_{Amino\_m} \log SP^{ln}_{Amino\_m}$$
if $H_{ln} > 0.1$, the amino acid position ln is recorded as a core mutation site, and the mutation with the highest frequency at the ln-th amino acid position, $\mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{max}_{ln}$, is selected as the key mutation of that site;
(3) Establishing the core mutation site key mutation set KLMS:
$$KLMS = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}^{max}_{l1}, \ldots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{max}_{ln}\}$$
The method in step 6) for identifying the core mutation combinations in the mutation information dictionary MSet of step 3) comprises the following steps:
(1) Establishing a core mutation set: taking the union of the high-frequency amino acid mutation set HFMS obtained in step 4) and the core mutation site key mutation set KLMS obtained in step 5) to obtain the core mutation set KMS, with the calculation formula:
$$KMS = HFMS \cup KLMS$$
(2) Identifying the core mutation combinations in the mutation information dictionary MSet: intersecting the core mutation set KMS with the mutation information set of each sequence in MSet to obtain the core mutation combination $KMSet[Seq_m]$ of each sequence, with the calculation formula:
$$KMSet[Seq_m] = KMS \cap MSet[Seq_m]$$
where KMSet is the core mutation combination information dictionary, which records the core mutation combination contained in each sequence;
(3) Performing high-frequency screening on the core mutation combinations from step (2): with $S_{KMSet[Seq_m]}$ denoting the number of core mutation combinations in KMSet identical to $KMSet[Seq_m]$, and $N$ the total number of sequences contained in the data set to be detected, DataBase_1, the calculation formula is:
$$SP_{KMSet[Seq_m]} = \frac{S_{KMSet[Seq_m]}}{N}$$
if $SP_{KMSet[Seq_m]} > 0.001$, the core mutation combination $KMSet[Seq_m]$ is identified as a high-frequency core mutation combination; all virus strains containing the core mutation combination $KMSet[Seq_m]$ are called variant strain $MC_m$, and all identified variant strains are saved to the novel coronavirus variant strain list MCL.
Based on this method, the variant strain list MCL can be obtained, providing data support for the subsequent analysis and detection of new epidemic variant strains of the novel coronavirus, as well as for vaccine design and epidemic control.
Wherein, the method also comprises a step 7) of carrying out transmission trend analysis and epidemic variant strain identification on the novel coronavirus variant strain list obtained in the step 6) according to the following method:
(1) Counting the proportion of each variant strain in the virus propagation process according to the spatio-temporal labeling information of the sequences: for each variant strain MC ∈ MCL and each month n, the proportion of MC in the nth month is the number of sequences of the variant strain MC in the nth month of the sequence set divided by AN n, the number of all sequences collected in the nth month of the sequence set;
(2) Fitting a propagation curve to the monthly proportions of each variant strain calculated in step (1) by using a linear regression tool, wherein the calculation formula is as follows:
Y=β n X+ε
wherein X is the time variable matrix, Y is the propagation proportion matrix of the variant strain MC, β n is the slope of the fitted propagation curve of the variant strain MC in the nth month, and ε is the intercept of the fitted curve;
(3) Identifying the epidemic variant strains of the novel coronavirus by threshold judgment on the slope β n of the propagation curve fitted in step (2), wherein the calculation formula is as follows:
T=3×β n ×M
wherein β n is the slope of the fitted propagation curve of the variant strain MC in the nth month, M is the total number of variant strains in the variant strain list MCL, and T is the epidemic trend index of the variant strain MC in the nth month. If T ≥ 1, the variant strain MC is judged to be an epidemic variant strain in the nth month;
(4) Counting the epidemic variant strains identified in the step (3), and marking the epidemic variant strains which do not appear before as new epidemic variant strains. Thus, the whole work of automatically detecting the new epidemic variant strain of the novel coronavirus is completed.
Wherein, the propagation trend analysis of the variant strain comprises the following four steps:
1) Calculating the proportion of each variant strain in each month and drawing a propagation curve.
2) Calculating the rising slope of the propagation curve of each variant strain in the current month by a linear regression method, using the data of the three months before the current month as fitting data.
3) Applying a threshold judgment to the monthly rising slope of the propagation curve of each variant strain, and defining a variant strain exceeding the threshold as an epidemic strain of that month.
4) Recording the epidemic strains of the current month, and marking any epidemic strain that has not appeared before as a new epidemic variant strain of the novel coronavirus.
Wherein, the virus genome data is large-scale novel coronavirus genome amino acid sequence data that is as complete as possible.
The high-frequency amino acid mutation is an amino acid mutation of the novel coronavirus genome sequence whose mutation rate exceeds one thousandth, obtained through statistical screening.
Wherein, the core mutation site is an amino acid site of the novel coronavirus genome sequence whose Shannon entropy value is greater than 0.1, obtained by Shannon entropy calculation and screening. The key mutation of a core mutation site is the amino acid mutation that occurs most often at that amino acid site.
The core mutation is an important amino acid mutation with a high frequency of occurrence in the novel coronavirus genome, and the core mutation set is obtained by performing a union operation on the high-frequency amino acid mutation set and the key mutation set of the core mutation sites.
Wherein, the core mutation combination is a specific case of simultaneous occurrence between core mutations in the genome sequence of the novel coronavirus, and is also used as a gene mutation characteristic for describing a variant strain.
The application of the method in the identification and early warning operation of the new-epidemic variant strain of the novel coronavirus or the preparation of the identification and early warning operation product of the new-epidemic variant strain of the novel coronavirus also falls within the protection scope of the invention.
Experiments show that, compared with the prior art, the invention has the following advantages: 1. For large-scale genome data, analyzing the distribution trend of core mutation site combinations in novel coronavirus sequences greatly accelerates the identification of new epidemic variant strains of the novel coronavirus. 2. Analyzing the core mutation combinations accurately locates the evolutionary characteristics of new epidemic variant strains of the novel coronavirus, providing an effective data basis for variant strain identification, vaccine design and drug research and development. 3. Using a linear regression method, prediction and early warning of the propagation trend of new epidemic variant strains of the novel coronavirus can be completed more efficiently, providing a strong data basis for formulating epidemic prevention and control strategies. 4. The method for detecting new epidemic variant strains of the novel coronavirus based on the distribution trend of mutation combinations can be fully automated, which makes it easier to establish an efficient analysis and early warning system.
In conclusion, the invention realizes fast, efficient and automatic detection and early warning of new epidemic variant strains of the novel coronavirus through mutation site analysis, core mutation statistics, high-frequency core mutation combination statistics, variant strain identification and regression analysis of propagation trends. The method can analyze a large amount of genome data, runs efficiently, detects new epidemic variant strains accurately, and helps track the variation trend and history of the virus quickly and comprehensively. It addresses two weaknesses of the traditional approach: insufficient capacity to analyze and process large-scale genome data, and the low efficiency of identifying variant strains by constructing an evolutionary tree. Its main advantage is that it automatically and quickly detects new epidemic variant strains of the novel coronavirus and their core mutation characteristics without manual participation, and can thereby provide important technical support for real-time monitoring of epidemic development, variant strain identification, and vaccine and drug research and development.
Drawings
FIG. 1 is a schematic flow chart of a method for automatically detecting a new pandemic variant of a coronavirus based on a distribution trend of mutation combinations;
FIG. 2 is a statistical chart of amino acid mutation frequencies of novel coronavirus Spike proteins;
FIG. 3 is a Shannon entropy statistic chart of amino acid sites of novel coronavirus Spike protein;
FIG. 4 is a schematic flow chart of core mutation combinatorial screening and variant strain identification;
FIG. 5 is a graph showing the analysis of the distribution trend of the core mutation combinations;
FIG. 6 is a graph of regression analysis of the identification and propagation curves of newly circulating variants.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments, and the examples are given only for illustrating the present invention and not for limiting the scope of the present invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.
The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1
The invention provides a device for detecting and analyzing a virus new epidemic variant strain, which comprises parameter acquisition equipment and a computer readable carrier; the parameter acquisition equipment comprises equipment for acquiring or collecting the amino acid sequence of the virus to be detected, such as equipment for inputting the amino acid sequence of the virus to be detected or equipment for detecting the amino acid sequence of the virus to be detected;
the computer-readable carrier comprises:
the data storage module is used for storing the collected amino acid sequences of the virus gene to be tested and the reference gene;
the quality control module is used for performing quality control and length screening on the gene to be tested stored in the data storage module to obtain a data set to be tested and transmitting the data set to the statistical module;
the statistic module is used for receiving the data set to be tested transmitted by the quality control module, combining the gene to be tested with the reference gene, and counting mutation site information and mutation frequency to obtain an amino acid mutation frequency table;
the screening module receives the amino acid mutation frequency table transmitted by the statistical module and further screens to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;
the identification module is used for identifying the gene sequence of the data set to be detected by combining the core mutation site key mutation set to obtain a virus variant mutation characteristic set;
and the analysis module receives the virus variant mutation characteristic set obtained by the identification module, performs propagation trend analysis on the virus variant mutation characteristic set, and identifies the epidemic variant strain to obtain a new epidemic variant strain.
The method for controlling the quality of the gene to be tested stored in the data storage module in the quality control module comprises: calculating the proportion P of abnormal characters in each sequence to be tested, and deleting the sequence from the amino acid sequence data set if P > 0.1, wherein the proportion of abnormal characters is calculated as $P = \mu / L$, where L is the total length of the sequence to be tested and μ is the number of abnormal characters in the sequence;
the method for screening the length of the gene to be tested stored in the data storage module in the quality control module comprises: calculating the length integrity LP of each sequence to be tested, and deleting the sequence from the amino acid sequence data set if LP < 0.8, wherein the length integrity is calculated as $LP = L / L_0$, where L is the total length of the sequence to be tested and L 0 is the total length of the reference strain sequence.
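As an illustration only, the quality control and length screening rules above can be sketched in Python. The function names and the toy sequences are hypothetical and not part of the original disclosure; the abnormal character is assumed to be 'X', as in Example 2, and treating an empty sequence as fully abnormal is likewise an assumption.

```python
def abnormal_ratio(seq: str, abnormal: str = "X") -> float:
    """P = mu / L: share of abnormal characters in one sequence."""
    if not seq:
        return 1.0  # assumption: treat an empty sequence as fully abnormal
    mu = sum(seq.count(ch) for ch in abnormal)
    return mu / len(seq)


def length_integrity(seq: str, ref_len: int) -> float:
    """LP = L / L0: sequence length relative to the reference length."""
    return len(seq) / ref_len


def quality_filter(seqs: dict, ref_len: int,
                   max_p: float = 0.1, min_lp: float = 0.8) -> dict:
    """Delete sequences with P > 0.1 or LP < 0.8; keep the rest."""
    return {name: s for name, s in seqs.items()
            if abnormal_ratio(s) <= max_p
            and length_integrity(s, ref_len) >= min_lp}


# Toy data (hypothetical): reference length 10.
toy = {"ok": "MFVFLVLLPL", "noisy": "MFXXLVLLPL", "short": "MFVFLVL"}
kept = quality_filter(toy, ref_len=10)
print(sorted(kept))
```

In this toy run only the first sequence passes both filters: the second has P = 0.2 and the third has LP = 0.7.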
The method for acquiring the amino acid mutation frequency table in the statistical module comprises the following steps:
(1) Performing multi-sequence comparison on the sequences in the data set DataBase _1 to be detected and the amino acid sequences of the reference strains by using a multi-sequence comparison tool, and calibrating the position information of each amino acid site in the sequence to be detected;
(2) According to the result of multi-sequence comparison, comparing the amino acid sequence to be detected with the reference strain sequence bit by bit, recording the amino acid mutation condition in the sequence to be detected and generating a mutation information set of the amino acid sequence to be detected;
MSet[Seq m ]={Amino l1 _l1_Mutation l1 ,…,Amino ln _ln_Mutation ln }
wherein Amino l1 _l1_Mutation l1 represents one amino acid mutation, l1 is the position of the amino acid mutation site, Amino l1 is the amino acid code at position l1 of the reference strain sequence, Mutation l1 is the amino acid code at position l1 of the sequence to be tested, MSet is the mutation information dictionary recording the mutation information sets of all sequences, and MSet[Seq m ] is the mutation information set of the sequence Seq m ;
(3) Performing frequency statistics on the amino acid mutation records of all sequences to be tested obtained in step (2), and establishing the amino acid mutation frequency table of the target gene.
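The site-by-site comparison and the frequency table can be sketched as follows. This is a minimal sketch under stated assumptions: the sequences are already aligned to the reference by an external multiple sequence alignment tool, '-' marks a gap, insertions relative to the reference are skipped, abnormal 'X' characters are not counted as mutations, and mutations are written in the short form used in Example 2 (for example "D614G"). All function names and toy data are hypothetical.

```python
from collections import Counter


def mutation_set(ref_aln: str, seq_aln: str) -> set:
    """Compare one aligned sequence with the aligned reference, column by column."""
    if len(ref_aln) != len(seq_aln):
        raise ValueError("aligned sequences must have equal length")
    muts, pos = set(), 0
    for r, s in zip(ref_aln, seq_aln):
        if r == "-":              # assumption: insertions vs. the reference are skipped
            continue
        pos += 1                  # 1-based position in the reference sequence
        if s != r and s != "X":   # assumption: abnormal characters are not mutations
            muts.add(f"{r}{pos}{s}")
    return muts


def mutation_table(ref_aln: str, aligned: dict):
    """Build MSet (per-sequence mutation sets) and the mutation frequency table."""
    mset = {name: mutation_set(ref_aln, s) for name, s in aligned.items()}
    table = Counter(m for muts in mset.values() for m in muts)
    return mset, table


# Toy alignment (hypothetical four-residue "gene").
ref = "MDKG"
aligned = {"seq1": "MGKG", "seq2": "MG-G", "seq3": "MDKG"}
MSet, table = mutation_table(ref, aligned)
print(MSet["seq2"], dict(table))
```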
The method for screening the high-frequency amino acid mutation set in the screening module comprises the following steps:
(1) Recording the total number of sequences contained in the data set DataBase_1 to be tested as N and the number of occurrences of the amino acid mutation Amino l1 _l1_Mutation l1 as S l1 , wherein the calculation formula is as follows:
$SP_{l1} = S_{l1} / N$
if SP l1 > 0.001, the amino acid mutation Amino l1 _l1_Mutation l1 is identified as a high-frequency amino acid mutation;
(2) Identifying the amino acid mutations in the amino acid mutation frequency table in the statistical module one by one according to the method in the step (1), recording all screened high-frequency amino acid mutations, and establishing a high-frequency amino acid mutation set HFMS:
HFMS={Amino l1 _l1_Mutation l1 ,…,Amino lm _lm_Mutation lm }。
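A minimal sketch of the high-frequency screening follows, assuming the mutation frequency table is a mapping from mutation name to occurrence count. The function name and the toy counts are hypothetical.

```python
def high_frequency_mutations(table: dict, n_sequences: int,
                             threshold: float = 0.001) -> set:
    """HFMS: mutations whose share SP = S / N exceeds the threshold."""
    return {m for m, s in table.items() if s / n_sequences > threshold}


# Toy counts (hypothetical) for N = 1000 sequences.
toy_table = {"D614G": 950, "N501Y": 40, "Q52R": 1}
HFMS = high_frequency_mutations(toy_table, n_sequences=1000)
print(sorted(HFMS))
```

The comparison is strict, so a mutation seen in exactly one thousandth of the sequences is not retained.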
the method for screening the key mutation set of the core mutation sites in the screening module comprises the following steps:
(1) Counting the amino acid frequencies at each amino acid site: for the ln-th amino acid site, the amino acid frequency set AH ln records, for every amino acid Amino_m observed at that site, its frequency, that is, the number of occurrences of Amino_m at the site divided by N, where N is the total number of sequences contained in the data set DataBase_1 to be tested;
(2) Calculating the Shannon entropy of each amino acid site, wherein the calculation formula is as follows:
$H_{ln} = -\sum_{m} p_m \log p_m$, where $p_m$ is the frequency of Amino_m in AH ln;
if H ln > 0.1, the amino acid site ln is recorded as a core mutation site, and the mutation Amino ln _ln_Mutation max ln with the highest mutation frequency at the ln-th amino acid site is selected as the key mutation of that site;
(3) Establishing a key mutation set KLMS of a core mutation site,
KLMS={Amino l1 _l1_Mutation max l1 ,…,Amino ln _ln_Mutation max ln }。
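The entropy-based screening can be illustrated as below. This is a sketch, not the disclosed implementation. The base of the logarithm is not recoverable from the text, so the natural logarithm is assumed here, and the sequences are assumed to be aligned to the reference with equal length. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column: str) -> float:
    """Shannon entropy of the amino acid frequencies in one alignment column."""
    n = len(column)
    counts = Counter(column)
    # Assumption: natural logarithm; the source does not state the base.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def key_mutations(ref: str, seqs: list, threshold: float = 0.1) -> set:
    """KLMS: the most frequent mutation at every site whose entropy exceeds 0.1."""
    klms = set()
    for i, ref_aa in enumerate(ref):
        column = "".join(s[i] for s in seqs)
        if site_entropy(column) > threshold:
            # Entropy > 0 implies at least two symbols, so one differs from ref_aa.
            variants = Counter(a for a in column if a != ref_aa)
            mut, _ = variants.most_common(1)[0]
            klms.add(f"{ref_aa}{i + 1}{mut}")
    return klms


# Toy data (hypothetical): only the second site varies.
KLMS = key_mutations("MDK", ["MDK", "MGK", "MGK", "MDK"])
print(KLMS)
```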
the method by which the identification module identifies the gene sequences of the data set to be tested in combination with the core mutation site key mutation set to obtain the virus variant mutation characteristic set, that is, the method for identifying the core mutation combinations in the mutation information dictionary MSet, comprises the following steps:
(1) Establishing a core mutation set: performing a union operation on the high-frequency amino acid mutation set HFMS and the key mutation set KLMS of the core mutation sites obtained by the screening module to obtain the core mutation set KMS, wherein the calculation formula is as follows:
KMS=HFMS∪KLMS;
(2) Identifying core mutation combinations in the mutation information dictionary MSet: performing an intersection operation on the core mutation set KMS and the mutation information set of each sequence in the mutation information dictionary MSet to obtain the core mutation combination of each sequence, wherein the calculation formula is as follows:
$KMSet[Seq_m] = KMS \cap MSet[Seq_m]$
wherein KMSet is a core mutation combination information dictionary used for recording the core mutation combination contained in each sequence;
(3) Performing high-frequency screening on the core mutation combinations in step (2): for each core mutation combination KMSet[Seq m ], counting the number of core mutation combinations in KMSet that are identical to it, and dividing that number by N, the total number of sequences contained in the data set DataBase_1 to be tested, to obtain its frequency;
if the frequency exceeds the screening threshold, the core mutation combination is identified as a high-frequency core mutation combination, all virus strains containing that core mutation combination are regarded as one variant strain MC m , and all identified variant strains are saved to the novel coronavirus variant strain list MCL;
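The union, intersection and combination counting can be sketched with Python sets. The screening threshold for combinations is not legible in the text, so it is passed in as the hypothetical parameter min_share; skipping the empty combination is also an assumption. The mutation names in the toy data are used only as labels and do not reproduce the experimental sets.

```python
from collections import Counter


def core_combinations(hfms: set, klms: set, mset: dict, min_share: float):
    """Return KMS, KMSet and the list MCL of high-frequency core combinations."""
    kms = hfms | klms                                    # KMS = HFMS ∪ KLMS
    kmset = {name: frozenset(muts & kms)                 # KMS ∩ MSet[Seq_m]
             for name, muts in mset.items()}
    counts = Counter(kmset.values())
    n = len(mset)
    # Assumption: the empty combination is ignored; min_share is hypothetical.
    mcl = [combo for combo, c in counts.most_common()
           if combo and c / n > min_share]
    return kms, kmset, mcl


# Toy sets (hypothetical).
HFMS_toy = {"D614G", "N501Y"}
KLMS_toy = {"N501Y", "P681H"}
MSet_toy = {
    "s1": {"D614G", "N501Y", "P681H"},
    "s2": {"D614G", "N501Y", "P681H", "Q52R"},
    "s3": {"D614G"},
    "s4": {"D614G", "L54F"},
    "s5": {"N501Y"},
}
KMS, KMSet, MCL = core_combinations(HFMS_toy, KLMS_toy, MSet_toy, min_share=0.3)
print([sorted(c) for c in MCL])
```

Using frozenset makes each combination hashable, so identical combinations from different sequences are counted together regardless of mutation order.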
the method for analyzing the propagation trend and identifying the popular variant strains in the analysis module comprises the following steps:
(1) Counting the proportion of each variant strain in the virus propagation process according to the spatio-temporal labeling information of the sequences: for each variant strain MC ∈ MCL and each month n, the proportion of MC in the nth month is the number of sequences of the variant strain MC in the nth month of the sequence set divided by AN n , the number of all sequences collected in the nth month of the sequence set;
(2) Fitting a propagation curve to the monthly proportions of each variant strain calculated in step (1) by using a linear regression tool, wherein the calculation formula is as follows:
Y=β n X+ε
wherein X is a time variable matrix, Y is a propagation ratio matrix of the variant strain MC, beta n The slope of a transmission fitting curve of the nth variant strain MC is shown, and epsilon represents the intercept of the fitting curve;
(3) By fitting the slope beta of the propagation curve to that in step (2) n The threshold value judgment of (2) to finish the identification of the novel coronavirus epidemic variant: the calculation formula is as follows:
T=3×β n ×M
wherein beta is n The propagation fitting curve slope of the n-th month variant strain MC is obtained, M is the total number of all variant strains in the variant strain list MCL, T represents the propagation rising index of the epidemic trend of the n-th month variant strain MC, and if T is more than or equal to 1, the variant strain MC is judged to be the epidemic variant strain in the n-th month;
(4) Counting the epidemic variant strains identified in step (3), and marking any epidemic variant strain that has not appeared before as a new epidemic variant strain, thereby realizing automatic detection of new epidemic variant strains of the novel coronavirus.
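The trend analysis can be illustrated with the sketch below. It makes several assumptions that are not in the original disclosure: months are indexed 0, 1, 2, ...; the fit window is the three months immediately before the month being judged; and a closed-form least-squares slope in pure Python stands in for the linear regression tool. The function names and toy counts are hypothetical.

```python
def monthly_share(variant_counts, total_counts):
    """Share of one variant per month: its sequence count divided by AN_n."""
    return [v / t if t else 0.0 for v, t in zip(variant_counts, total_counts)]


def ls_slope(y):
    """Ordinary least-squares slope of y against x = 0, 1, ..., len(y) - 1."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    return sxy / sxx


def epidemic_index(shares, month, n_variants, window=3):
    """T = 3 * beta_n * M, with beta_n fitted on the months before `month`."""
    if month < window:
        raise ValueError("not enough preceding months for the fit window")
    beta = ls_slope(shares[month - window:month])
    return 3 * beta * n_variants


# Toy counts (hypothetical): one variant over four months, M = 16 variants.
shares = monthly_share([0, 20, 60, 150], [1000, 1000, 1000, 1000])
T = epidemic_index(shares, month=3, n_variants=16)
is_epidemic = T >= 1
print(round(T, 2), is_epidemic)
```

A variant flagged this way would then be compared with the epidemic variants recorded for earlier months, and marked as newly epidemic only if it has not appeared among them.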
Example 2 identification of new circulating variants of the novel coronavirus based on the trend of distribution of the combination of mutations in the S protein of the novel coronavirus.
Based on the device described in Example 1, the invention identifies new epidemic variant strains of the novel coronavirus from the distribution trend of mutation combinations in the S protein of the novel coronavirus, as shown in FIG. 1. First, quality control and length screening are performed on the amino acid sequence set of the gene to be tested. Mutation site statistics are then performed on the screened data set, and the core mutations present in the virus are analyzed by counting mutation frequencies and calculating Shannon entropy. Next, high-frequency core mutation combinations are screened out according to the distribution of the core mutations in the sequences to be tested, and each variant strain is identified and named accordingly. The propagation trend of each variant strain is then analyzed by a linear regression method, and finally the new epidemic variant strains of the novel coronavirus are detected automatically by screening the rising slope of the propagation curve. The specific steps are as follows:
1. preparation of novel coronavirus Spike protein genome sample
The preparation of the genome sequence sample aiming at the novel coronavirus Spike protein comprises the following steps:
1. construction of novel coronavirus Spike protein genome database
The amino acid sequences of all novel coronavirus Spike proteins with complete genomes and full-length sequences were obtained from public databases. As of September 1, 2021, about 3.2 million novel coronavirus Spike protein sequences had been collected from the databases.
2. Sequence standardization naming and reference strain sequence extraction
The amino acid sequences in the novel coronavirus S protein genome database obtained in step 1 are renamed one by one in a standardized way according to the sequence naming rule.
The sequence naming rule is:
(> Gene name | sequence name | time information | sequence number | geographic information)
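For illustration, a header following this naming rule could be split into its five fields as sketched below. The field names, the function name and the example header are hypothetical placeholders and do not correspond to any real database record.

```python
from typing import NamedTuple


class Header(NamedTuple):
    gene: str
    name: str
    time: str
    accession: str
    location: str


def parse_header(line: str) -> Header:
    """Split '>gene|name|time|number|location' into its five fields."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    fields = [f.strip() for f in line[1:].split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return Header(*fields)


# Hypothetical header, for illustration only.
h = parse_header(">Spike|hCoV-19/Example/1/2021|2021-06|EPI_ISL_0000000|ExampleLand")
print(h.gene, h.time, h.location)
```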
According to the reference strain recommended in the database, the sequence
"> Spike | hCoV-19/Wuhan/Hu-1/2019 Shanxi (EPI_ISL_402125) Tena (China)" is extracted and used as the reference strain sequence of the novel coronavirus Spike protein genome database.
2. Quality control and length screening of novel coronavirus Spike protein genome samples
Quality control and length screening against the novel coronavirus Spike protein genome database included the following steps:
1. Calculating the proportion of abnormal characters in each sequence in the database
The calculation formula is as follows: $P = \mu / L$
where μ is the number of anomalous characters 'X' in each sequence and L is the total sequence length. If P >0.1, the sequence is deleted.
2. Calculating the integrity of the genome sequence length
The total length of the sequence is recorded as L, and the total length of the reference strain sequence is recorded as L 0 . The calculation formula is as follows: $LP = L / L_0$
wherein the total length of the reference sequence of the novel coronavirus Spike protein is L 0 = 1274. If LP < 0.8, the sequence is deleted from the genome sequence data set. After screening, 2.46 million sequences remain in the novel coronavirus S protein genome database, and they are used as the data set DataBase_1 to be tested of the novel coronavirus Spike protein genome.
3. Statistics of mutation sites of novel coronavirus Spike protein genomes
Mutation site statistics is carried out on the data set DataBase _1 to be detected of the novel coronavirus Spike protein genome.
The method comprises the following specific steps:
1. The sequences in the data set DataBase_1 to be tested are aligned with the reference strain sequence using a multiple sequence alignment tool, and the position of each amino acid site in the sequences to be tested is calibrated.
2. According to the multiple sequence alignment result, the amino acid at each site of a sequence to be tested is compared with the amino acid at the corresponding site of the reference sequence. The amino acid mutations in the sequence to be tested are recorded and a mutation information set is generated.
3. Frequency statistics are performed on the amino acid mutations in the mutation information sets, and the Spike protein amino acid mutation frequency table is established.
4. High-frequency amino acid mutation screening of novel coronavirus Spike protein
Analyzing the Spike protein amino acid mutation frequency table, and screening the amino acid mutation with high frequency. The method comprises the following specific steps:
1. The number of occurrences of each amino acid mutation in the Spike protein amino acid mutation frequency table is recorded as S 1 , and the total number N of sequences in the data set DataBase_1 to be tested is about 2.46 million. The calculation formula is as follows:
$SP_1 = S_1 / N$
All amino acid mutations with SP 1 > 0.001 are screened out as high-frequency amino acid mutations.
2. All the high-frequency amino acid mutations from step 1 are recorded and the high-frequency amino acid mutation set HFMS is established. HFMS contains 151 high-frequency amino acid mutations. The experimental result is shown in FIG. 2, which is a statistical chart of the amino acid mutation frequencies of the novel coronavirus Spike protein; the abscissa is the name of the amino acid mutation and the ordinate is the amino acid mutation frequency. The 151 high-frequency amino acid mutations are L5F, S12F, S13I, L18F, T19R, T20N, T20I, R21I, T22I, P26S, A27S, T29I, H49Y, Q52R, L54F, A67V, I68-, I68V, H69-, V70-, V70I, G75V, G75R, D80A, D80G, D80Y, S94F, T95I, S98F, D138Y, D138H, G142D, V143-, V143F, Y144-, Y144V, Y145-, W152R, W152L, W152C, M153I, M153T, E154K, E154-, S155-, E156G, E156-, F157-, F157L, F157S, R158-, R158G, L176F, D178H, G181V, L189F, R190S, D215H, D215G, L216F, S221L, A222V, L241-, L242-, A243-, D253G, S254F, S255F, S256L, W258L, A262S, P272L, V367F, K417T, K417N, N439K, N440K, L452R, S477N, T478K, E484K, E484Q, F490S, S494P, N501T, N501Y, A520S, A522S, K558N, A570D, T572I, E583D, Q613H, D614G, V622F, S640F, A653V, H655Y, Q675R, Q675H, Q677P, Q677H, N679K, P681H, P681R, A688V, A701V, S704L, T716I, T732A, G769V, V772I, E780Q, T791I, D796H, P809S, P812L, A845S, T859I, T859N, F888L, A899S, D936Y, S939F, D950N, D950H, Q957R, S982A, A1020S, T1027I, Q1071L, Q1071H, K1073N, A1078S, V1104L, D1118H, G1124V, P1162S, D1163Y, G1167V, V1176F, K1191N, G1219C, G1219V, V1228L, M1229I, M1237I, S1252F, P1263L, V1264L. In each name the number is the position in the sequence, the letter before the number is the original amino acid, the letter after the number is the mutated amino acid, and "-" indicates an amino acid deletion.
5. Screening of core mutation sites of the novel coronavirus Spike protein
Analyzing the Spike protein amino acid mutation frequency table, and screening the core mutation sites in the Spike protein amino acid sequence. The method comprises the following specific steps:
1. The amino acid frequencies at each amino acid site are calculated: for the ln-th amino acid site, the amino acid frequency set AH ln records, for every amino acid Amino_m observed at that site, its frequency, that is, the number of occurrences of Amino_m at the site divided by N, the total number of sequences contained in the data set DataBase_1 to be tested. (Here Amino_m is a single symbol denoting one of the different amino acids appearing at the ln-th site; m has no meaning of its own and only distinguishes the different amino acid symbols.)
2. The Shannon entropy of each amino acid site is calculated. The calculation formula is as follows:
$H_{ln} = -\sum_{m} p_m \log p_m$, where $p_m$ is the frequency of Amino_m in AH ln.
If H ln > 0.1, the amino acid site ln is recorded as a core mutation site. The mutation Amino ln _ln_Mutation ln with the highest mutation frequency at the ln-th amino acid site is selected as the key mutation of that site.
3. The key mutations of the core mutation sites screened in step 2 are recorded and the set KLMS is established. KLMS contains 33 key mutations. The experimental result is shown in FIG. 3, which is a Shannon entropy statistical chart of the amino acid sites of the novel coronavirus Spike protein; the abscissa is the amino acid mutation site and the ordinate is the Shannon entropy value. The 33 key mutations are L5F, L18F, T19R, T20N, P26S, I68-, H69-, V70-, T95I, D138Y, G142D, Y144-, W152C, E156-, F157-, R158G, R190S, A222V, K417T, L452R, S477N, T478K, E484K, N501Y, A570D, H655Y, P681H, T716I, D950N, S982A, T1027I, D1118H, V6F. In each name the number is the position in the sequence, the letter before the number is the original amino acid, the letter after the number is the mutated amino acid, and "-" indicates an amino acid deletion.
6. Statistics and screening of novel coronavirus Spike protein core mutation combination
Comprehensively analyzing the amino acid mutation characteristics in the Spike protein genome, and screening the core mutation combination in the Spike protein by counting the occurrence frequency of different amino acid mutation combinations. The method comprises the following specific steps:
1. The high-frequency amino acid mutation set HFMS and the key mutation set KLMS obtained in sections 4 and 5 above are merged by a union operation to establish the core mutation set KMS. KMS contains 151 core mutations.
2. The mutation information set of each amino acid sequence is intersected with the core mutation set KMS, and the core mutation combination information dictionary KMSet is established. KMSet records the combination of core mutations contained in each sequence.
3. The core mutation combination information dictionary KMSet is counted and the high-frequency core mutation combinations are selected. The frequency of a core mutation combination is its number of occurrences in KMSet divided by N, the total number of sequences.
If the frequency exceeds the screening threshold, the core mutation combination is identified as a high-frequency core mutation combination.
4. The high-frequency core mutation combinations are recorded and a high-frequency core mutation combination information table is established. Each high-frequency core mutation combination is regarded as the gene mutation characteristic of one variant strain, and sequences containing the same high-frequency core mutation combination are regarded as the same variant strain. The high-frequency core mutation combination information table contains 16 high-frequency core mutation combinations. The experimental result is shown in FIG. 4, which is a frequency statistical chart of the high-frequency core mutation combinations; the abscissa is the mutation combination frequency and the ordinate is the name of the high-frequency core mutation combination. The 16 high-frequency core mutation combinations are { H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H }; { L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R }; { H69-, A570D, S982A, N501Y, T716I, D614G, D1118H, V70-, P681H }; { L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R }; { D253G, T95I, Q957R, L5F, D614G, S477N }; { A262S, P272L, A222V, D614G }; { G769V, W152L, D614G, E484K }; { H69-, A570D, S982A, N501Y, L5F, T716I, D614G, D1118H, Y144-, V70-, P681H }; { H69-, A570D, S982A, N501Y, K1191N, T716I, D614G, D1118H, Y144-, V70-, P681H }; { D253G, A701V, T95I, L5F, D614G, E484K }; { D614G, T732A, T478K, P681H }; { L452R, T95I, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R }; { L452R, S13I, W152C, D614G }; { D138Y, P26S, T20N, T1027I, K417T, R190S, N501Y, L18F, D614G, V1176F, H655Y, E484K }; { A222V, L18F, D614G }; { H69-, A570D, S982A, N501Y, T716I, D614G, D1118H, Y144-, V70-, P681H }. The experimental result shows that the device identifies 16 virus variant strains by calculation, and each variant strain is named by its corresponding high-frequency core mutation combination.
7. Distribution trend analysis of high-frequency core mutation combinations, and identification and early warning of new epidemic variant strains
Analyze the distribution trend of the high-frequency core mutation combinations obtained in step 6, and identify new epidemic variant strains by combining this analysis with propagation curve fitting. The specific steps are as follows:
1. According to the spatio-temporal annotation information of the sequences, the monthly proportion of each variant strain during virus propagation is counted, and a distribution trend chart of the high-frequency core mutation combinations is drawn using the Matplotlib module in Python, as shown in FIG. 5. FIG. 5 is the distribution trend analysis chart of the core mutation combinations, in which the abscissa is the date and the ordinate is the frequency of occurrence of the mutation combination in that month. FIG. 5 shows how the monthly propagation proportion of each variant strain (high-frequency core mutation combination) changes. A variant strain whose propagation proportion gradually increases is at risk of becoming an epidemic variant strain, and the likelihood of this can be predicted by fitting its propagation curve.
2. Propagation curve fitting is performed on the monthly proportion of each variant strain using the linear regression tool in the Python package scikit-learn (sklearn).
3. The propagation curvature T of each variant strain is calculated, and a variant strain whose value reaches the defined threshold is regarded as an epidemic variant strain in that month. The propagation curvature is calculated as follows:
$T = 3 \times \beta_n \times M$
where $\beta_n$ is the slope of the propagation fitting curve of the variant strain in the nth month, and M is the total number of all variant strains. If T ≥ 1, the variant strain is judged to be an epidemic variant strain in the nth month, as shown in FIG. 6. FIG. 6 is the regression analysis chart of new epidemic variant strain identification and propagation curves, in which the abscissa is the date and the ordinate is the propagation distribution proportion of the variant strain; each dotted straight line in FIG. 6 is the regression line for the month in which the corresponding variant strain reaches its maximum propagation curvature. The circular marker is the propagation curvature of variant strain { H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H } in January 2021, T = 3.12; the device judged this variant strain to be an epidemic variant strain in January 2021. The triangular marker is the propagation curvature of variant strain { L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R } in June 2021, T = 2.16; the device judged this variant strain to be an epidemic variant strain in June 2021. The square marker is the propagation curvature of variant strain { L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R } in June 2021, T = 1.21; the device judged this variant strain to be an epidemic variant strain in June 2021.
4. The epidemic variant strains identified for each month are consolidated and recorded, and any epidemic variant strain that has not appeared before is marked as a new epidemic variant strain. This completes the automatic detection and early warning of new epidemic variant strains of the novel coronavirus.
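Steps 2 to 4 can be sketched roughly as follows. This is an illustrative sketch only, not the implementation used in the experiment: the slope is computed with a plain least-squares formula standing in for the scikit-learn linear regression tool, the three-month fitting window and the use of the month index as the time variable are assumptions (the text does not state them), and the function names and sample proportions are hypothetical.

```python
def fit_slope(ys):
    """Least-squares slope of ys against x = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    return sxy / sxx


def rising_index(ys, total_variants):
    """T = 3 * beta_n * M, with beta_n the fitted slope and M the variant count."""
    return 3 * fit_slope(ys) * total_variants


def flag_new_epidemic(monthly_share, total_variants, window=3):
    """Return {variant: index of the first month in which T >= 1}.

    monthly_share maps each variant to its list of monthly proportions.
    The window length is an assumption made for this sketch.
    """
    first_hit = {}
    for variant, shares in monthly_share.items():
        for end in range(window, len(shares) + 1):
            t = rising_index(shares[end - window:end], total_variants)
            if t >= 1:
                # First month flagged as epidemic: treat as "new epidemic".
                first_hit[variant] = end - 1
                break
    return first_hit


# Hypothetical monthly proportions for two variants, with M = 16 variants.
shares = {
    "A": [0.00, 0.01, 0.02, 0.08],  # accelerating growth
    "B": [0.05, 0.05, 0.04, 0.05],  # roughly flat
}
new_epidemic = flag_new_epidemic(shares, total_variants=16)
```

With these made-up numbers only variant "A" is flagged, in the fourth month (index 3), where the fitted slope of 0.035 gives T = 3 × 0.035 × 16 = 1.68.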
8. Experimental results and verification
1. Through the above experimental steps, the method automatically identified 3 new epidemic variant strains, as follows:
(1) New epidemic variant strain of the novel coronavirus in January 2021, with Spike protein amino acid mutation characteristics { H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H };
(2) New epidemic variant strain of the novel coronavirus in June 2021, with Spike protein amino acid mutation characteristics { L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R };
(3) New epidemic variant strain of the novel coronavirus in June 2021, with Spike protein amino acid mutation characteristics { L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R }.
2. According to the information on the novel coronavirus published on the WHO official website and the GISAID data website, new epidemic variant strain (1) identified by the method matches the gene mutation characteristics of the novel coronavirus Alpha variant, and new epidemic variant strains (2) and (3) match the gene mutation characteristics of the novel coronavirus Delta variant. In conclusion, the method can accurately and effectively identify new epidemic variant strains of the novel coronavirus and give early warning of whether they will spread on a large scale.
The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is made possible within the scope of the claims attached below.
Claims (9)
1. An apparatus for detecting and analyzing new epidemic variant strains of a virus, comprising a parameter acquisition device and a computer-readable carrier; the parameter acquisition device is used for continuously acquiring amino acid sequence data of the virus to be detected in a fixed area over a period of time;
the computer-readable carrier comprises:
a data storage module, used for storing the collected amino acid sequence data of the virus to be detected and the amino acid sequence of the reference strain;
a quality control module, used for performing quality control and length screening on the genes to be tested stored in the data storage module, obtaining a data set to be tested after screening, and transmitting the data set to the statistical module;
a statistical module, used for receiving the data set to be tested transmitted by the quality control module, comparing the genes to be tested in the data set with the reference gene, and counting mutation site information and mutation frequency to obtain an amino acid mutation frequency table;
a screening module, used for receiving the amino acid mutation frequency table transmitted by the statistical module and screening it to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;
an identification module, used for receiving the core mutation site key mutation set transmitted by the screening module and identifying the gene sequences of the data set to be tested, so as to obtain a virus variant mutation feature set;
and an analysis module, used for receiving the virus variant mutation feature set obtained by the identification module, performing propagation trend analysis on it, and identifying epidemic variant strains from it.
2. The device of claim 1, wherein the quality control module performs quality control on the genes to be tested stored in the data storage module as follows: the proportion P of abnormal characters in each sequence to be tested is calculated, and if P > 0.1 the sequence is deleted from the amino acid sequence data set, where the proportion of abnormal characters is calculated as $P = \mu / L$, in which L is the total length of the sequence to be tested and $\mu$ is the number of abnormal characters in the sequence;
the quality control module performs length screening on the genes to be tested stored in the data storage module as follows: the length integrity LP of each sequence to be tested is calculated, and if LP < 0.8 the sequence is deleted from the amino acid sequence data set, where the length integrity is calculated as $LP = L / L_0$, in which L is the total length of the sequence to be tested and $L_0$ is the total length of the reference strain sequence.
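The two filters of claim 2 can be illustrated with a short sketch. It is only a sketch under stated assumptions: the claim does not define which characters are abnormal, so here any character outside the 20 standard amino acid letters (for example X) is treated as abnormal; the function names and the toy sequences are hypothetical.

```python
# Assumption: an "abnormal character" is anything other than the
# 20 standard one-letter amino acid codes.
VALID_AMINO = set("ACDEFGHIKLMNPQRSTVWY")


def abnormal_ratio(seq):
    """P = mu / L: share of abnormal characters in the sequence."""
    if not seq:
        return 1.0  # treat an empty sequence as fully abnormal
    mu = sum(1 for ch in seq.upper() if ch not in VALID_AMINO)
    return mu / len(seq)


def length_integrity(seq, reference):
    """LP = L / L0: sequence length relative to the reference length."""
    return len(seq) / len(reference)


def passes_quality_control(seq, reference):
    """Keep a sequence unless P > 0.1 or LP < 0.8."""
    return abnormal_ratio(seq) <= 0.1 and length_integrity(seq, reference) >= 0.8


reference = "ACDEFGHIKL"  # toy reference, 10 residues
candidates = ["ACDEFGHIKL", "ACDXXGHIKL", "ACDEFGH", "ACDEFGHIK"]
kept = [s for s in candidates if passes_quality_control(s, reference)]
```

In this toy run the second sequence is dropped for P = 0.2 and the third for LP = 0.7, while the first and fourth are kept.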
3. The apparatus of claim 1, wherein the table of the frequency of the amino acid mutations in the statistical module is obtained by:
(1) Using a multiple sequence alignment tool, aligning each sequence in the data set to be tested with the amino acid sequence of the reference strain, and calibrating the position of each amino acid site in the sequence to be tested;
(2) According to the multiple sequence alignment result, comparing the amino acid sequence to be tested with the reference strain sequence position by position, recording the amino acid mutations in the sequence to be tested, and generating the mutation information set of the amino acid sequence to be tested;
$MSet[Seq_m] = \{Amino_{l1}\_l1\_Mutation_{l1}, \ldots, Amino_{ln}\_ln\_Mutation_{ln}\}$
wherein $Amino_{l1}\_l1\_Mutation_{l1}$ represents an amino acid mutation, in which l1 is the position of the amino acid mutation site, $Amino_{l1}$ is the amino acid code at position l1 of the reference strain sequence, and $Mutation_{l1}$ is the amino acid code at position l1 of the sequence to be tested; MSet is the mutation information dictionary recording the mutation information sets of all sequences, and $MSet[Seq_m]$ denotes the mutation information set of sequence $Seq_m$;
(3) Analyzing the amino acid mutation records of all sequences to be tested from step (2), counting how many times each amino acid mutation occurs across all sequences, and establishing the amino acid mutation frequency table of the virus to be tested.
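Steps (2) and (3) of claim 3 can be sketched as below. The sketch assumes the sequences have already been aligned to the reference by an external multiple sequence alignment tool, so that every sequence has the same length as the aligned reference and "-" marks a gap. Mutations are written in the compact form used elsewhere in this document (for example D614G or H69-) rather than the Amino_l_Mutation form of the claim; insertions relative to the reference are ignored for brevity. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def mutation_set(aligned_seq, aligned_ref):
    """Compare one aligned sequence with the aligned reference, site by site."""
    mutations = set()
    position = 0  # 1-based position on the (ungapped) reference
    for ref_aa, seq_aa in zip(aligned_ref, aligned_seq):
        if ref_aa == "-":
            continue  # insertion relative to the reference: skipped in this sketch
        position += 1
        if seq_aa != ref_aa:
            mutations.add(f"{ref_aa}{position}{seq_aa}")
    return mutations


def build_frequency_table(aligned, aligned_ref):
    """Return (MSet, frequency table) for a {name: aligned sequence} mapping."""
    mset = {name: mutation_set(seq, aligned_ref) for name, seq in aligned.items()}
    freq = Counter(m for muts in mset.values() for m in muts)
    return mset, freq


# Toy alignment against a four-residue reference.
mset, freq = build_frequency_table(
    {"s1": "MKGH", "s2": "MK-H", "s3": "MKGH"},
    "MKDH",
)
```

Here `mset` plays the role of the mutation information dictionary MSet and `freq` the amino acid mutation frequency table; in the toy data the substitution D3G occurs twice and the deletion D3- once.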
4. The device of claim 1, wherein the screening module screens the high-frequency amino acid mutation set by:
(1) When the number of occurrences of an amino acid mutation is more than one thousandth of the total number N of sequences contained in the data set to be tested, the amino acid mutation is a high-frequency amino acid mutation;
(2) Judging all amino acid mutations in the amino acid mutation frequency table in the statistical module one by one according to the method in the step (1), recording all screened high-frequency amino acid mutations, and establishing a high-frequency amino acid mutation set HFMS:
$HFMS = \{Amino_{l1}\_l1\_Mutation_{l1}, \ldots, Amino_{lm}\_lm\_Mutation_{lm}\}$.
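The screening rule of claim 4 amounts to a single threshold comparison. A minimal sketch, assuming the amino acid mutation frequency table is held as a mapping from mutation label to occurrence count (the function name and the counts are hypothetical):

```python
def high_frequency_mutations(freq_table, total_sequences):
    """Return HFMS: mutations whose count exceeds one thousandth of N."""
    threshold = total_sequences / 1000
    return {m for m, count in freq_table.items() if count > threshold}


# Hypothetical counts from a data set of N = 1000 sequences.
freq_table = {"D614G": 950, "A222V": 2, "L5F": 1}
hfms = high_frequency_mutations(freq_table, total_sequences=1000)
```

With N = 1000 the threshold is one occurrence, so L5F (seen exactly once) is not kept, because the rule requires strictly more than N/1000.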
5. The apparatus according to claim 1, wherein the screening module screens the core mutation site key mutation set by:
(1) Counting the frequency of each amino acid at each amino acid site, calculated as follows:
$AH_{ln} = \{P_{ln}^{Amino\_1}, \ldots, P_{ln}^{Amino\_m}\}, \quad P_{ln}^{Amino\_m} = C_{ln}^{Amino\_m} / N$
wherein $AH_{ln}$ represents the amino acid frequency set of the ln-th amino acid site, N is the total number of sequences contained in the data set to be tested, $C_{ln}^{Amino\_m}$ is the number of occurrences of amino acid Amino_m at amino acid site ln, and $P_{ln}^{Amino\_m}$ is the frequency of occurrence of Amino_m at amino acid site ln;
(2) Calculating the Shannon entropy of each amino acid site, calculated as follows:
$H_{ln} = -\sum_{m} P_{ln}^{Amino\_m} \log P_{ln}^{Amino\_m}$
If $H_{ln} > 0.1$, amino acid site ln is recorded as a core mutation site, and the mutation with the highest mutation frequency at the ln-th amino acid site, $Amino_{ln}\_ln\_Mutation^{max}_{ln}$, is selected as the key mutation of this site;
(3) According to the method in step (2), the key mutations of all core mutation sites are collected to obtain the core mutation site key mutation set KLMS:
$KLMS = \{Amino_{l1}\_l1\_Mutation^{max}_{l1}, \ldots, Amino_{ln}\_ln\_Mutation^{max}_{ln}\}$.
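The site-wise entropy screening of claim 5 can be sketched as follows. Several points are assumptions made for illustration: the logarithm is taken to base 2 (the claim does not state the base, and the 0.1 threshold depends on it), the aligned reference contains no gap columns so that the column index plus one is the site number, and mutations are labelled in the compact D614G style. The function names and toy alignments are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (base 2, an assumption) of one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def key_mutations(aligned, aligned_ref, threshold=0.1):
    """Return KLMS: the most frequent mutation at every core mutation site."""
    klms = set()
    for idx, ref_aa in enumerate(aligned_ref):
        column = [seq[idx] for seq in aligned]
        if site_entropy(column) <= threshold:
            continue  # not a core mutation site
        mutant_counts = Counter(aa for aa in column if aa != ref_aa)
        if mutant_counts:
            top_aa, _ = mutant_counts.most_common(1)[0]
            klms.add(f"{ref_aa}{idx + 1}{top_aa}")
    return klms


# Toy data: site 3 is variable (D x5, G x4, N x1), sites 1 and 2 are conserved.
variable = ["MKD"] * 5 + ["MKG"] * 4 + ["MKN"]
klms = key_mutations(variable, "MKD")

# A site where only 1 of 100 sequences differs stays below the 0.1 threshold.
rare = ["MK"] * 99 + ["MR"]
klms_rare = key_mutations(rare, "MK")
```

In the first toy alignment only site 3 passes the entropy threshold, and its most frequent mutant residue G gives the key mutation D3G; in the second, the single K2R change yields an entropy of about 0.08, so no core site is recorded.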
6. The apparatus of claim 1, wherein each virus data record in the data storage module comprises sampling time, sampling location and sequence information.
7. The device according to claim 6, wherein the identification module identifies the gene sequences of the data set to be tested by combining the core mutation site key mutation set and the high-frequency amino acid mutation set, and the method for obtaining the virus variant mutation feature set comprises identifying the core mutation combinations in the mutation information dictionary MSet, with propagation trend analysis and epidemic variant strain identification performed in the analysis module.
8. The apparatus of claim 7, wherein the means for identifying core mutation combinations in the mutation information dictionary MSet comprises:
(1) Establishing a core mutation set: taking the union of the high-frequency amino acid mutation set HFMS and the core mutation site key mutation set KLMS obtained in step 5) to obtain the core mutation set KMS, calculated as follows:
$KMS = HFMS \cup KLMS$;
(2) Identifying core mutation combinations in the mutation information dictionary MSet: intersecting the core mutation set KMS with the mutation information set of each sequence in the mutation information dictionary MSet to obtain the core mutation combination of each sequence, calculated as follows:
$KMSet[Seq_m] = KMS \cap MSet[Seq_m]$
wherein KMSet is the core mutation combination information dictionary, which records the core mutation combination contained in each sequence, and $KMSet[Seq_m]$ is the core mutation combination contained in sequence $Seq_m$;
(3) Performing high-frequency screening on the core mutation combination dictionary of step (2): the number of occurrences of each core mutation combination $KMSet[Seq_m]$ in the core mutation combination dictionary is counted; when this number is more than one thousandth of the total number N of sequences contained in the data set to be tested, the corresponding core mutation combination is a high-frequency core mutation combination, and the virus strain comprising this core mutation combination is referred to as variant strain $MC_m$ and recorded in the virus variant strain list MCL.
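The three steps of claim 8 map naturally onto set operations. The following is a rough sketch, assuming MSet is held as a mapping from sequence name to a set of mutation labels; skipping the empty combination (sequences carrying no core mutation) is an extra assumption not stated in the claim, and the function name and toy data are hypothetical.

```python
from collections import Counter


def identify_variants(mset, hfms, klms):
    """Return (KMSet, MCL) from the mutation dictionary and the two mutation sets."""
    kms = hfms | klms  # KMS = HFMS ∪ KLMS
    # KMSet[Seq_m] = KMS ∩ MSet[Seq_m]; frozenset so combinations can be counted.
    kmset = {name: frozenset(muts & kms) for name, muts in mset.items()}
    combo_counts = Counter(kmset.values())
    threshold = len(mset) / 1000  # one thousandth of N
    # Assumption: the empty combination is not treated as a variant strain.
    mcl = [combo for combo, count in combo_counts.items()
           if combo and count > threshold]
    return kmset, mcl


# Toy mutation dictionary for four sequences.
mset = {
    "s1": {"D614G", "N501Y", "T95I"},
    "s2": {"D614G", "N501Y"},
    "s3": {"D614G"},
    "s4": set(),
}
kmset, mcl = identify_variants(mset, hfms={"D614G"}, klms={"N501Y"})
```

Sequences s1 and s2 end up with the same core mutation combination {D614G, N501Y} even though s1 carries an additional non-core mutation, which is how sequences are grouped into the same variant strain.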
9. The apparatus of claim 7, wherein the analysis module performs a propagation trend analysis and a popular variant identification by:
(1) According to the sampling time information of the sequences, counting the proportion of each variant strain during virus propagation, calculated as follows:
$P_n^{MC} = N_n^{MC} / AN_n$
where MC ∈ MCL, n denotes the month, $N_n^{MC}$ is the number of sequences of variant strain MC in the nth month in the sequence set, $AN_n$ is the number of all sequences collected in the nth month in the sequence set, and $P_n^{MC}$ is the proportion of variant strain MC in the nth month;
(2) Using a linear regression tool, fitting a propagation curve to the monthly proportion of each variant strain calculated in step (1), calculated as follows:
$Y = \beta_n X + \varepsilon$
wherein X is the time variable matrix, Y is the propagation proportion matrix of variant strain MC, $\beta_n$ is the slope of the propagation fitting curve of variant strain MC in the nth month, and $\varepsilon$ is the intercept of the fitting curve;
(3) Completing the identification of epidemic variant strains of the virus by applying a threshold to the slope $\beta_n$ of the propagation fitting curve from step (2), calculated as follows:
$T = 3 \times \beta_n \times M$
wherein $\beta_n$ is the slope of the propagation fitting curve of variant strain MC in the nth month, M is the total number of variant strains in the variant strain list MCL, and T is the propagation rising index of the epidemic trend of variant strain MC in the nth month; if T ≥ 1, variant strain MC is judged to be an epidemic variant strain in the nth month.
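Step (1) of claim 9, the monthly proportion of each variant strain, can be sketched as below. The input layout is an assumption made for illustration: each sequence is reduced to a (month, variant) pair, with the variant set to None when the sequence matches no entry of the variant strain list MCL. The function name and the records are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_proportions(records):
    """Return {variant: {month: share of that month's sequences}}.

    records is an iterable of (month, variant) pairs; variant is None for
    sequences that match no listed variant strain (they still count
    towards the monthly total).
    """
    totals = Counter()
    per_variant = defaultdict(Counter)
    for month, variant in records:
        totals[month] += 1
        if variant is not None:
            per_variant[variant][month] += 1
    return {
        variant: {month: counts[month] / totals[month] for month in sorted(totals)}
        for variant, counts in per_variant.items()
    }


# Hypothetical records: four sequences in January, two in February.
records = [
    ("2021-01", "MC1"), ("2021-01", "MC1"), ("2021-01", None), ("2021-01", "MC2"),
    ("2021-02", "MC2"), ("2021-02", "MC2"),
]
proportions = monthly_proportions(records)
```

The resulting per-variant series (here MC1 falls from 0.5 to 0.0 while MC2 rises from 0.25 to 1.0) is the kind of input on which the linear regression of step (2) and the threshold test of step (3) would then operate.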
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211554331.0A CN115798578A (en) | 2022-12-06 | 2022-12-06 | Device and method for analyzing and detecting virus new epidemic variant strain |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115798578A (en) | 2023-03-14 |
Family
ID=85445984
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211554331.0A Pending CN115798578A (en) | 2022-12-06 | 2022-12-06 | Device and method for analyzing and detecting virus new epidemic variant strain |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115798578A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||