CN114582523A - Novel coronavirus genome feature similarity measurement method - Google Patents

Novel coronavirus genome feature similarity measurement method

Info

Publication number
CN114582523A
Authority
CN
China
Prior art keywords
coronavirus
novel coronavirus
genome
novel
infected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210219753.6A
Other languages
Chinese (zh)
Inventor
山丹
张永锋
丛国涛
李鹤楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Neusoft University of Information
Original Assignee
Dalian Neusoft University of Information
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Neusoft University of Information
Priority to CN202210219753.6A
Publication of CN114582523A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for measuring the similarity of novel coronavirus genome features, comprising the following steps: obtaining the novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals; extracting the viral genome features of each coronavirus genome; calculating the frequency with which each viral genome feature occurs; traversing the features to find those shared with the novel coronavirus, then converting them to numerical values and normalizing them; performing fuzzy clustering on the normalized shared features of the novel coronavirus to obtain the novel coronavirus clustering center; and calculating the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center, thereby quantifying the similarity of the viruses and predicting their homology and relatedness from the quantified result. The method is low in cost, fast, and yields experimental results easily.

Description

Novel coronavirus genome feature similarity measurement method
Technical Field
The invention relates to the field of virus genomes, in particular to a novel coronavirus genome feature similarity measurement method.
Background
At present, genome similarity in medicine is mostly measured with traditional bioinformatics tools such as BLAST sequence alignment. However, traditional alignment methods require large investment, are slow, take a long time and are difficult to carry out, so they cannot measure gene similarity both quickly and accurately. In particular, when a virus is spreading rapidly, they cannot support a quick judgment, cannot analyze virus homology in a timely and effective manner, and cannot provide a timely and reliable basis for treatment.
Disclosure of Invention
The invention provides a novel coronavirus genome feature similarity measurement method, which aims to solve the technical problems of high investment, low speed, long period and high difficulty of the traditional gene similarity comparison method.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A method for measuring the similarity of novel coronavirus genome features, characterized by comprising the following steps:
step 1, obtaining the novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals;
step 2, extracting the genome features of the novel coronavirus and the genome features of the coronaviruses that infect other animals;
step 3, calculating the frequency of occurrence of the novel coronavirus genome features and of the genome features of the coronaviruses that infect other animals, and converting both sets of features to numerical values;
step 4, using the numerical features from step 3, traversing the genome features of the coronaviruses that infect other animals to find those shared with the novel coronavirus, and normalizing the numerical genome features of the coronaviruses that infect other animals to obtain their normalized numerical features;
step 5, normalizing the shared genome features of the novel coronavirus, and performing fuzzy clustering on the normalized shared features to obtain the novel coronavirus clustering center;
step 6, calculating the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center, quantifying the similarity of the viruses, and predicting the homology and relatedness of the viruses from the quantified result.
Further, obtaining the novel coronavirus clustering center in step 5 specifically comprises:
step 5.1, initializing the novel coronavirus feature membership matrix u_ij;
step 5.2, obtaining the novel coronavirus feature clustering center v_i from the novel coronavirus genome features x_j and the feature membership matrix u_ij;
step 5.3, updating the feature membership matrix u_ij according to the feature clustering center v_i;
step 5.4, obtaining an objective function value from the updated feature membership matrix u_ij and the feature clustering center v_i, and comparing the objective function value with a preset value: if the objective function value is smaller than the preset value, outputting the novel coronavirus feature clustering center v_i; if the objective function value is greater than or equal to the preset value, returning to step 5.2 to obtain the feature clustering center v_i again.
Further, the membership matrix u_ij of the novel coronavirus features is initialized in step 5.1 according to the following formula:
[Formula image BDA0003536476860000021 not reproduced in this text]
wherein c represents the number of fuzzy clusters, u_ij represents the membership of the i-th novel coronavirus genome sample in the j-th class, and n represents the number of novel coronavirus genome samples.
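Because the formula image is missing here, the following is only the usual fuzzy c-means initialization constraint, offered as an assumption rather than the patent's own equation. It follows the index convention of the later formulas, in which j indexes samples and i indexes clusters (the sentence above states the indices the other way round).

```latex
\sum_{i=1}^{c} u_{ij} = 1, \qquad 0 \le u_{ij} \le 1, \qquad j = 1, \dots, n
```

In words: the memberships of each sample are typically drawn at random and then rescaled so that they sum to 1 across the c clusters.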
Further, the novel coronavirus feature clustering center v_i is obtained in step 5.2 according to the following formula:
[Formula image BDA0003536476860000022 not reproduced in this text]
wherein m is a real number greater than 1, and the symbol in inline formula image BDA0003536476860000023 (not reproduced in this text) represents the membership with which the j-th feature belongs to the i-th class in the membership matrix u_ij.
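For orientation, the textbook fuzzy c-means center update is given below. It is consistent with the symbols defined in the text (v_i, x_j, u_ij, m), but it is an assumed stand-in for the missing formula image, not a transcription of it.

```latex
v_i = \frac{\sum_{j=1}^{n} u_{ij}^{\,m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{\,m}}, \qquad i = 1, \dots, c
```

Each center is a weighted mean of the feature vectors, where the weight of sample j is its membership in cluster i raised to the power m.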
Further, the membership matrix u_ij of the novel coronavirus features is updated in step 5.3 according to the following formula:
[Formula image BDA0003536476860000031 not reproduced in this text]
wherein v_k represents the k-th clustering center.
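The standard fuzzy c-means membership update, which uses a sum over all centers v_k as the sentence above suggests, has the form below. It is shown under the assumption that the patent follows the textbook algorithm; the missing image may differ in notation.

```latex
u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\lVert x_j - v_i \rVert}{\lVert x_j - v_k \rVert} \right)^{\frac{2}{m-1}} \right]^{-1}
```

A sample that lies much closer to v_i than to the other centers receives a membership near 1 in cluster i.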
Further, the objective function value is obtained in step 5.4 from the updated membership matrix u_ij of the novel coronavirus features and the novel coronavirus feature clustering center v_i according to the following formula:
[Formula image BDA0003536476860000032 not reproduced in this text]
wherein Q is the objective function value.
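In textbook fuzzy c-means the objective is the membership-weighted sum of squared distances shown below. This is an assumed reconstruction in the document's symbols, given only that Q depends on u_ij and v_i.

```latex
Q = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\,m}\, \lVert x_j - v_i \rVert^{2}
```

Q decreases as the centers and memberships are alternately updated, which is why it can serve as the quantity checked in step 5.4.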
Further, the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center is calculated in step 6 according to the following formula:
[Formula image BDA0003536476860000033 not reproduced in this text]
wherein Distance is the Euclidean distance to the novel coronavirus clustering center, and x_j' is the normalized numerical feature of a coronavirus that infects another animal.
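Written out by components, the Euclidean distance between a normalized feature vector and a clustering center is as follows. The component index d and the dimension D are introduced here for illustration only, and the patent's x_j' is written x' for one sample.

```latex
\mathrm{Distance}(x', v_i) = \lVert x' - v_i \rVert = \sqrt{\sum_{d=1}^{D} \left( x'_{d} - v_{i,d} \right)^{2}}
```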
Beneficial effects: the invention infers virus homology from gene similarity. First, the frequency with which each coronavirus genome feature occurs in the gene character sequence is calculated to obtain the numerical features of the gene sequence. Data normalization then converts the absolute values of the shared gene features into relative values, which simplifies the calculation. Fuzzy c-means clustering is used to compute the clustering center and thereby characterize the genome features, and similarity and homology are judged by calculating the Euclidean distance between the normalized values of the other genome features and the clustering center. In verification, the results are consistent with those of the conventional alignment method. Existing similarity and homology comparison is essentially the alignment of fragment pairs, with the following basic process: first, all fragment pairs whose degree of matching between the query sequence and the target sequence exceeds a certain threshold are found; the fragment pairs are then extended according to a given similarity threshold to obtain similar fragments of a certain length; and finally the high-scoring fragment pairs are reported, from which the similarity and homology of the sequences are judged.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the method for measuring similarity of genome features of the novel coronavirus of the present invention;
FIG. 2 is a graph showing the analysis of the similarity results after the present invention is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides a method for measuring the similarity of novel coronavirus genome features, as shown in FIG. 1, comprising the following steps:
step 1, obtaining the novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals; the other animals are chicken, duck, cattle and bat; specifically, the coronavirus genomes are obtained as genome data in the common FASTA format;
step 2, extracting the genome features of the novel coronavirus and the genome features of the coronaviruses that infect other animals;
step 3, calculating the frequency of occurrence of the novel coronavirus genome features and of the genome features of the coronaviruses that infect other animals, and converting both sets of features to numerical values;
step 4, using the numerical features from step 3, traversing the genome features of the coronaviruses that infect other animals to find those shared with the novel coronavirus, and normalizing the numerical genome features of the coronaviruses that infect other animals to obtain their normalized numerical features; data normalization converts the absolute values of the shared gene features into relative values, which simplifies the calculation;
step 5, normalizing the shared genome features of the novel coronavirus, and performing fuzzy clustering on the normalized shared features to obtain the novel coronavirus clustering centers;
step 6, calculating the Euclidean distances between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering centers, and quantifying the similarity of the viruses, wherein, when there are multiple Euclidean distances, their average is used for the quantification; and predicting the homology and relatedness of the viruses from the quantified result.
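Steps 1 to 4 can be sketched in Python as follows. The text does not define what a "genome feature" is or which normalization is used, so this sketch assumes overlapping k-mers (with a hypothetical default of k = 3) and min-max scaling; all function names are illustrative and not taken from the patent.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Step 1: parse FASTA text into {record_id: sequence}."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first word of the header as the record id.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def kmer_frequencies(seq, k=3):
    """Steps 2-3: count each k-mer (assumed feature) and convert to relative frequency."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def common_features(profiles):
    """Step 4: features present in every profile, in a fixed (sorted) order."""
    shared = set.intersection(*(set(p) for p in profiles))
    return sorted(shared)


def normalize(profile, features):
    """Step 4: min-max scale the selected features of one profile to [0, 1]."""
    values = [profile[f] for f in features]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(v - lo) / span for v in values]


if __name__ == "__main__":
    demo = StringIO(">novel\nACGTACGT\n>other\nACGTTT\n")  # toy sequences, not real genomes
    profiles = {name: kmer_frequencies(seq, k=2) for name, seq in read_fasta(demo).items()}
    shared = common_features(list(profiles.values()))
    print(shared, [normalize(p, shared) for p in profiles.values()])
```

Restricting every genome to the same sorted list of shared features gives vectors of equal length, which the later clustering and distance steps require.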
In a specific embodiment, obtaining the novel coronavirus clustering center in step 5 specifically comprises:
step 5.1, initializing the novel coronavirus feature membership matrix u_ij;
step 5.2, obtaining the novel coronavirus feature clustering center v_i from the virus genome features x_j and the feature membership matrix u_ij;
step 5.3, updating the feature membership matrix u_ij according to the feature clustering center v_i;
step 5.4, obtaining an objective function value from the updated feature membership matrix u_ij and the feature clustering center v_i, and comparing the objective function value with a preset value ε: if the objective function value is smaller than ε, outputting the novel coronavirus feature clustering center v_i; if the objective function value is greater than or equal to ε, returning to step 5.2 to obtain the feature clustering center v_i again.
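The loop in steps 5.1 to 5.4 can be sketched in plain Python as below, assuming the textbook fuzzy c-means update rules. The text compares the objective value itself with ε; this sketch stops on that condition or when the objective changes by less than ε between iterations (a common variant, added here as an assumption), with a hypothetical `max_iter` as a safeguard.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_membership(c, n, rng):
    """Step 5.1: random memberships; each sample's column sums to 1."""
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= col
    return u


def update_centers(u, xs, m):
    """Step 5.2: each center is a membership-weighted mean of the samples."""
    dim = len(xs[0])
    centers = []
    for row in u:
        w = [uij ** m for uij in row]
        total = sum(w)
        centers.append([sum(wj * x[d] for wj, x in zip(w, xs)) / total for d in range(dim)])
    return centers


def update_membership(centers, xs, m):
    """Step 5.3: recompute memberships from distances to the centers."""
    c, n = len(centers), len(xs)
    power = 1.0 / (m - 1.0)  # exponent applied to squared distances
    u = [[0.0] * n for _ in range(c)]
    for j, x in enumerate(xs):
        d2 = [sq_dist(x, v) for v in centers]
        hits = [i for i, d in enumerate(d2) if d == 0.0]
        if hits:  # sample coincides with one or more centers
            for i in hits:
                u[i][j] = 1.0 / len(hits)
            continue
        inv = [(1.0 / d) ** power for d in d2]
        s = sum(inv)
        for i in range(c):
            u[i][j] = inv[i] / s
    return u


def objective(u, centers, xs, m):
    """Step 5.4: membership-weighted sum of squared distances."""
    return sum(u[i][j] ** m * sq_dist(xs[j], centers[i])
               for i in range(len(centers)) for j in range(len(xs)))


def fuzzy_c_means(xs, c=2, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Alternate steps 5.2-5.4 until the stopping rule described above is met."""
    rng = random.Random(seed)
    u = init_membership(c, len(xs), rng)
    prev = None
    for _ in range(max_iter):
        centers = update_centers(u, xs, m)
        u = update_membership(centers, xs, m)
        q = objective(u, centers, xs, m)
        if q < eps or (prev is not None and abs(prev - q) < eps):
            break
        prev = q
    return centers, u
```

For example, `fuzzy_c_means(vectors, c=2, m=2.0)` returns the clustering centers and the final membership matrix for a list of equal-length normalized feature vectors.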
In a specific embodiment, the membership matrix u_ij of the novel coronavirus features is initialized in step 5.1 according to the following formula:
[Formula image BDA0003536476860000051 not reproduced in this text]
wherein c represents the number of fuzzy clusters, u_ij represents the membership of the i-th novel coronavirus genome sample in the j-th class, and n represents the number of novel coronavirus genome samples.
In a specific embodiment, the novel coronavirus feature clustering center v_i is obtained in step 5.2 according to the following formula:
[Formula image BDA0003536476860000061 not reproduced in this text]
wherein m is a real number greater than 1, and the symbol in inline formula image BDA0003536476860000062 (not reproduced in this text) represents the membership with which the j-th feature belongs to the i-th class in the membership matrix u_ij.
In a specific embodiment, the membership matrix u_ij of the novel coronavirus features is updated in step 5.3 according to the following formula:
[Formula image BDA0003536476860000063 not reproduced in this text]
wherein v_k represents the k-th clustering center.
In a specific embodiment, the objective function value is obtained in step 5.4 from the updated membership matrix u_ij of the novel coronavirus features and the novel coronavirus feature clustering center v_i according to the following formula:
[Formula image BDA0003536476860000064 not reproduced in this text]
wherein Q is the objective function value. To further verify the feasibility of the method and the consistency of its conclusions, the number of clusters c and the coefficient m can be traversed over a range of values.
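The traversal of c and m can be sketched as a simple grid sweep. Here `closest_host` is a hypothetical callable that runs steps 5 and 6 for one (c, m) pair and returns the label of the most similar control virus; neither that name nor the agreement score appears in the patent.

```python
from collections import Counter
from itertools import product


def sweep(closest_host, cs, ms):
    """Run closest_host(c, m) on every grid point and tally the answers."""
    results = {(c, m): closest_host(c, m) for c, m in product(cs, ms)}
    tally = Counter(results.values())
    winner, votes = tally.most_common(1)[0]
    # Fraction of grid points that agree with the most frequent answer.
    return results, winner, votes / len(results)
```

An agreement close to 1.0 would indicate that the conclusion does not depend strongly on the particular choice of c and m.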
In a specific embodiment, the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center is calculated in step 6 according to the following formula:
[Formula image BDA0003536476860000065 not reproduced in this text]
wherein Distance is the Euclidean distance to the novel coronavirus clustering center, and x_j' is the normalized numerical feature of a coronavirus that infects another animal. The smaller the distance, the higher the similarity and the more likely the homology; conversely, the larger the distance, the lower the similarity and the less likely the homology.
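A minimal sketch of step 6 follows, including the averaging over several clustering centers mentioned in the embodiment. The function names and the dictionary layout of `samples` are assumptions made for illustration.

```python
import math


def euclidean(x, v):
    """Euclidean distance between a feature vector and one clustering center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v)))


def mean_distance(x, centers):
    """Average the distances to all clustering centers."""
    return sum(euclidean(x, v) for v in centers) / len(centers)


def rank_by_similarity(samples, centers):
    """Return (name, distance) pairs, most similar (smallest distance) first."""
    scored = [(name, mean_distance(x, centers)) for name, x in samples.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

The first entry of the returned list is the control virus judged most similar to the novel coronavirus under this distance.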
FIG. 2 shows the results of similarity analysis using the present invention, specifically the similarity (distance) between the coronaviruses infecting the four animals and the novel coronavirus. As can be seen from FIG. 2, when the gene similarity between each animal coronavirus sample and the novel coronavirus sample is analyzed, the bat coronavirus sample shows the highest gene similarity to the novel coronavirus sample (the smallest Euclidean distance), from which it can be inferred that the novel coronavirus is most likely to have originated from bats.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for measuring the similarity of novel coronavirus genome features, characterized by comprising the following steps:
step 1, obtaining the novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals;
step 2, extracting the genome features of the novel coronavirus and the genome features of the coronaviruses that infect other animals;
step 3, calculating the frequency of occurrence of the novel coronavirus genome features and of the genome features of the coronaviruses that infect other animals, and converting both sets of features to numerical values;
step 4, using the numerical features from step 3, traversing the genome features of the coronaviruses that infect other animals to find those shared with the novel coronavirus, and normalizing the numerical genome features of the coronaviruses that infect other animals to obtain their normalized numerical features;
step 5, normalizing the shared genome features of the novel coronavirus, and performing fuzzy clustering on the normalized shared features to obtain the novel coronavirus clustering center;
step 6, calculating the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center, quantifying the similarity of the viruses, and predicting the homology and relatedness of the viruses from the quantified result.
2. The method for measuring the similarity of novel coronavirus genome features according to claim 1, wherein obtaining the novel coronavirus clustering center in step 5 comprises:
step 5.1, initializing the novel coronavirus feature membership matrix u_ij;
step 5.2, obtaining the novel coronavirus feature clustering center v_i from the shared genome features x_j of the novel coronavirus and the feature membership matrix u_ij;
step 5.3, updating the feature membership matrix u_ij according to the feature clustering center v_i;
step 5.4, obtaining an objective function value from the updated feature membership matrix u_ij and the feature clustering center v_i, and comparing the objective function value with a preset value: if the objective function value is smaller than the preset value, outputting the novel coronavirus feature clustering center v_i; if the objective function value is greater than or equal to the preset value, returning to step 5.2 to obtain the feature clustering center v_i again.
3. The method of claim 2, wherein the membership matrix u_ij of the novel coronavirus features is initialized in step 5.1 according to the following formula:
[Formula image FDA0003536476850000021 not reproduced in this text]
wherein c represents the number of fuzzy clusters, u_ij represents the membership of the i-th novel coronavirus genome sample in the j-th class, and n represents the number of novel coronavirus genome samples.
4. The method of claim 3, wherein the novel coronavirus feature clustering center v_i is obtained in step 5.2 according to the following formula:
[Formula image FDA0003536476850000022 not reproduced in this text]
wherein m is a real number greater than 1, and the symbol in inline formula image FDA0003536476850000023 (not reproduced in this text) represents the membership with which the j-th feature belongs to the i-th class in the membership matrix u_ij.
5. The method of claim 4, wherein the membership matrix u_ij of the novel coronavirus features is updated in step 5.3 according to the following formula:
[Formula image FDA0003536476850000024 not reproduced in this text]
wherein v_k represents the k-th clustering center.
6. The method of claim 5, wherein the objective function value is obtained in step 5.4 from the updated membership matrix u_ij of the novel coronavirus features and the novel coronavirus feature clustering center v_i according to the following formula:
[Formula image FDA0003536476850000031 not reproduced in this text]
wherein Q is the objective function value.
7. The method for measuring the similarity of novel coronavirus genome features according to claim 6, wherein the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center is calculated in step 6 according to the following formula:
[Formula image FDA0003536476850000032 not reproduced in this text]
wherein Distance is the Euclidean distance to the novel coronavirus clustering center, and x_j' is the normalized numerical feature of a coronavirus that infects another animal.
CN202210219753.6A 2022-03-08 2022-03-08 Novel coronavirus genome feature similarity measurement method Pending CN114582523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210219753.6A CN114582523A (en) 2022-03-08 2022-03-08 Novel coronavirus genome feature similarity measurement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210219753.6A CN114582523A (en) 2022-03-08 2022-03-08 Novel coronavirus genome feature similarity measurement method

Publications (1)

Publication Number Publication Date
CN114582523A true CN114582523A (en) 2022-06-03

Family

ID=81779338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210219753.6A Pending CN114582523A (en) 2022-03-08 2022-03-08 Novel coronavirus genome feature similarity measurement method

Country Status (1)

Country Link
CN (1) CN114582523A (en)

Similar Documents

Publication Publication Date Title
CN111209563B (en) Network intrusion detection method and system
CN111785328B (en) Coronavirus sequence identification method based on gated cyclic unit neural network
CN111800430B (en) Attack group identification method, device, equipment and medium
CN113221112B (en) Malicious behavior identification method, system and medium based on weak correlation integration strategy
CN111640468B (en) Method for screening disease-related protein based on complex network
WO2019213811A1 (en) Method, apparatus, and system for detecting chromosomal aneuploidy
Wang et al. Microarray missing value imputation: A regularized local learning method
CN111400713B (en) Malicious software population classification method based on operation code adjacency graph characteristics
CN112464232A (en) Android system malicious software detection method based on mixed feature combination classification
CN110046501B (en) Malicious code detection method inspired by biological genes
CN112259167A (en) Pathogen analysis method and device based on high-throughput sequencing and computer equipment
CN114266046A (en) Network virus identification method and device, computer equipment and storage medium
CN114242178A (en) Method for quantitatively predicting biological activity of ER alpha antagonist based on gradient lifting decision tree
Shivakumar et al. Sigmoni: classification of nanopore signal with a compressed pangenome index
CN111737694B (en) Malicious software homology analysis method based on behavior tree
CN114582523A (en) Novel coronavirus genome feature similarity measurement method
CN111783088A (en) Malicious code family clustering method and device and computer equipment
CN113836526B (en) Intrusion detection method based on improved immune network algorithm and application thereof
Nasser et al. Multiple sequence alignment using fuzzy logic
He et al. Inference of RNA structural contacts by direct coupling analysis
Kim et al. Computational prediction of pathogenic network modules in Fusarium verticillioides
CN113380330B (en) PHMM model-based differential identifiability gene sequence clustering method
Yan et al. Optimizing the accuracy of randomized embedding for sequence alignment
CN114124437B (en) Encrypted flow identification method based on prototype convolutional network
WO2023283967A1 (en) Optimized kraken2 algorithm and application thereof in second-generation sequencing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination