CN114582523A - Novel coronavirus genome feature similarity measurement method

Info

Publication number: CN114582523A
Application number: CN202210219753.6A
Authority: CN
Inventors: 山丹; 张永锋; 丛国涛; 李鹤楠
Original assignee: Dalian Neusoft University of Information
Current assignee: Dalian Neusoft University of Information
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2022-06-03

Abstract

The invention discloses a novel coronavirus genome feature similarity measurement method comprising the following steps: obtaining the novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals; extracting the viral genome features of each coronavirus genome; calculating the frequency of occurrence of the viral genome features; traversing to find the viral genome features shared with the novel coronavirus, converting them to numerical values and normalizing them; performing fuzzy clustering on the normalized shared features of the novel coronavirus to obtain a novel coronavirus clustering center; and calculating the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center, quantifying the similarity of the viruses, and predicting their homology and relatedness from the quantified result. The method is low-cost and fast, and readily yields experimental results.

Description

Novel coronavirus genome feature similarity measurement method

Technical Field

The invention relates to the field of virus genomes, in particular to a novel coronavirus genome feature similarity measurement method.

Background

Most current conventional medical practice relies on conventional bioinformatics tools, such as BLAST sequence alignment, to measure genome similarity. However, conventional alignment methods are costly, slow, time-consuming and difficult, and cannot measure gene similarity both rapidly and accurately. In particular, when a virus is spreading rapidly, they cannot deliver a quick judgment, cannot analyze virus homology in a timely and effective manner, and cannot provide a timely and reliable basis for treatment.

Disclosure of Invention

The invention provides a novel coronavirus genome feature similarity measurement method, which aims to solve the technical problems of high investment, low speed, long period and high difficulty of the traditional gene similarity comparison method.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A novel coronavirus genome feature similarity measurement method, characterized by comprising the following steps:

step 1, obtaining a novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals;

step 2, extracting the genome features of the novel coronavirus and the genome features of the coronaviruses that infect other animals;

step 3, calculating the frequency of occurrence of the novel coronavirus genome features and of the genome features of the coronaviruses that infect other animals, and converting both sets of features to numerical values;

step 4, using the numerical novel coronavirus genome features and the numerical genome features of the coronaviruses that infect other animals, traversing to find the viral genome features shared with the novel coronavirus, and normalizing the numerical genome features of the coronaviruses that infect other animals to obtain their normalized numerical features;

step 5, normalizing the shared viral genome features of the novel coronavirus, and performing fuzzy clustering on the normalized shared features to obtain a novel coronavirus clustering center;

step 6, calculating the Euclidean distance between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering center, quantifying the similarity of the viruses, and predicting the homology and relatedness of the viruses from the quantified result.

Further, obtaining the novel coronavirus clustering center in step 5 specifically comprises:

step 5.1, initializing a novel coronavirus feature membership matrix u_ij;

step 5.2, obtaining the novel coronavirus feature clustering center v_i from the novel coronavirus genome features x_j and the novel coronavirus feature membership matrix u_ij;

step 5.3, updating the novel coronavirus feature membership matrix u_ij according to the novel coronavirus feature clustering center v_i;

step 5.4, obtaining an objective function value from the updated novel coronavirus feature membership matrix u_ij and the novel coronavirus feature clustering center v_i, and comparing the objective function value with a preset value: if the objective function value is smaller than the preset value, outputting the novel coronavirus feature clustering center v_i; if the objective function value is greater than or equal to the preset value, returning to step 5.2 to obtain the novel coronavirus feature clustering center v_i again.

Further, the specific formula for initializing the novel coronavirus feature membership matrix u_ij in step 5.1 is as follows:

wherein c represents the number of fuzzy clusters, u_ij represents the membership of the ith novel coronavirus genome sample in the jth class, and n represents the number of novel coronavirus genome samples.
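The formula for this step is not reproduced in the present text. Assuming the standard fuzzy c-means convention, with i indexing the c clusters and j indexing the n samples as in steps 5.2 and 5.3, the initialization constraint would read:

```latex
\sum_{i=1}^{c} u_{ij} = 1, \qquad u_{ij} \in [0, 1], \qquad j = 1, \dots, n
```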

Further, the specific formula for obtaining the novel coronavirus feature clustering center v_i in step 5.2 is as follows:

wherein m is a real number greater than 1, and the entry of the membership matrix u_ij represents the membership of the jth feature in the ith class.
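The center formula itself did not survive in this text. A form consistent with the symbols defined here, assuming the usual fuzzy c-means membership-weighted mean, is:

```latex
v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{m}}, \qquad i = 1, \dots, c
```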

Further, the specific formula for updating the novel coronavirus feature membership matrix u_ij in step 5.3 is as follows:

wherein v_k represents the kth clustering center.
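The update formula is absent from this text. Under the assumption that the standard fuzzy c-means update is intended, with the norm taken as the Euclidean distance, it can be written as:

```latex
u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\lVert x_j - v_i \rVert}{\lVert x_j - v_k \rVert} \right)^{\frac{2}{m-1}} \right]^{-1}
```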

Further, the specific formula for obtaining the objective function value in step 5.4 from the updated novel coronavirus feature membership matrix u_ij and the novel coronavirus feature clustering center v_i is as follows:

wherein Q is the objective function value.
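The objective formula is not shown in this text. Assuming the conventional fuzzy c-means objective built from the same u_ij, v_i, x_j and m, it would be:

```latex
Q = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m}\, \lVert x_j - v_i \rVert^{2}
```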

Further, the specific formula for calculating, in step 6, the Euclidean distance between the normalized numerical features of the other coronaviruses and the novel coronavirus clustering center is as follows:

wherein Distance is the Euclidean distance between the virus clustering center and the novel coronavirus clustering center, and x_j' is the normalized numerical feature of a coronavirus that infects other animals.
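The distance formula is missing from this text. Assuming x_j' and v_i are vectors over the same D shared features (D and the component index d are notation introduced here, not taken from the source), the Euclidean distance would be:

```latex
\mathrm{Distance} = \lVert x_j' - v_i \rVert = \sqrt{\sum_{d=1}^{D} \left( x'_{j,d} - v_{i,d} \right)^{2}}
```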

Advantageous effects: the invention infers virus homology from gene similarity. First, the frequency with which each coronavirus genome feature occurs in the gene character sequence is calculated to obtain the numerical features of the gene sequence. Data normalization then converts the absolute values of the shared gene features into relative values, which simplifies the calculation. A clustering center is computed by fuzzy mean clustering to characterize the genome features, and the similarity and homology of the genomes are judged by calculating the Euclidean distance between the normalized values of the other genome features and the clustering center. In verification, the results are consistent with those of the conventional comparison method. Existing similarity and homology comparison, by contrast, is essentially the alignment of fragment pairs. Its basic process is as follows: first, all segment pairs whose matching degree between the query sequence and the target sequence exceeds a certain threshold are found; these segment pairs are then extended according to a given similarity threshold to obtain similar segments of a certain length; finally, the high-scoring segment pairs are reported, and the similarity and homology of the sequences are judged from them.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of the method for measuring similarity of genome features of the novel coronavirus of the present invention;

FIG. 2 is a graph showing the analysis of the similarity results after the present invention is applied.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

This example provides a method for measuring similarity of genome features of a novel coronavirus, as shown in fig. 1, comprising the following steps:

step 1, obtaining a novel coronavirus genome to be analyzed and, as controls, the genomes of coronaviruses that infect other animals; the other animals are chickens, ducks, cattle and bats; specifically, the coronavirus genomes are obtained as data in the common genome FASTA format;

step 4, using the numerical novel coronavirus genome features and the numerical genome features of the coronaviruses that infect other animals, traversing to find the viral genome features shared with the novel coronavirus, and normalizing the numerical genome features of the coronaviruses that infect other animals to obtain their normalized numerical features; data normalization converts the absolute values of the shared gene features into relative values, which simplifies the calculation;

step 5, normalizing the shared viral genome features of the novel coronavirus, and performing fuzzy clustering on the normalized shared features to obtain novel coronavirus clustering centers;

step 6, calculating the Euclidean distances between the normalized numerical features of the coronaviruses that infect other animals and the novel coronavirus clustering centers, and quantifying the similarity of the viruses, wherein, when there are multiple groups of Euclidean distances, their average value is used for quantification; and predicting the homology and relatedness of the viruses from the quantified result.
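The feature steps (frequency calculation, search for shared features, normalization) can be sketched as follows. The text does not say what a genome feature is or which normalization is used, so this sketch assumes overlapping k-mers (k = 3) as features, relative frequency as the numerical value, the intersection across all genomes as the shared set, and min-max scaling. All function names are hypothetical.

```python
from collections import Counter

def kmer_frequencies(sequence, k=3):
    """Relative frequency of each overlapping k-mer (assumed feature type)."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def common_features(profiles):
    """Features present in every profile, returned in sorted order."""
    shared = set(profiles[0])
    for profile in profiles[1:]:
        shared &= set(profile)
    return sorted(shared)

def normalise(profile, features):
    """Min-max scale the selected feature values to [0, 1] (assumed scheme)."""
    values = [profile[f] for f in features]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant vectors
    return [(v - lo) / span for v in values]

# Toy sequences for illustration only, not real genomes.
novel = [kmer_frequencies("ATGCGATACGATGC"), kmer_frequencies("ATGCGATTCGATGC")]
other = kmer_frequencies("ATGCGATACGTTGC")

features = common_features(novel + [other])
novel_vectors = [normalise(p, features) for p in novel]
other_vector = normalise(other, features)
```

In practice the sequences would be read from FASTA files as described in step 1; the toy strings only keep the sketch self-contained.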

In a specific embodiment, obtaining the novel coronavirus clustering center in step 5 specifically comprises:

step 5.2, obtaining the novel coronavirus feature clustering center v_i from the viral genome features x_j and the novel coronavirus feature membership matrix u_ij;

step 5.4, obtaining an objective function value from the updated novel coronavirus feature membership matrix u_ij and the novel coronavirus feature clustering center v_i, and comparing the objective function value with a preset value epsilon: if the objective function value is smaller than epsilon, outputting the novel coronavirus feature clustering center v_i; if the objective function value is greater than or equal to epsilon, returning to step 5.2 to obtain the novel coronavirus feature clustering center v_i again.
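The iteration of steps 5.1 to 5.4 can be sketched in plain Python as below. This is a minimal sketch that assumes the standard fuzzy c-means update rules, since the formulas are not reproduced in this text. The stopping rule follows the wording above (objective value below epsilon); the random initialization, the `max_iter` safety cap, the `seed` parameter and the function name are assumptions added for the example.

```python
import math
import random

def fuzzy_c_means(samples, c=2, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Fuzzy clustering per steps 5.1-5.4; returns (centers, memberships, q)."""
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])

    # Step 5.1: random memberships; each sample's memberships sum to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= col

    for _ in range(max_iter):  # cap is an assumption, not in the source
        # Step 5.2: centers as membership-weighted means of the samples.
        centers = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            centers.append([sum(w[j] * samples[j][d] for j in range(n)) / sum(w)
                            for d in range(dim)])

        # Step 5.3: update memberships from the distances to each center.
        dist = [[max(math.dist(samples[j], centers[i]), 1e-12) for j in range(n)]
                for i in range(c)]
        for i in range(c):
            for j in range(n):
                u[i][j] = 1.0 / sum((dist[i][j] / dist[k][j]) ** (2.0 / (m - 1.0))
                                    for k in range(c))

        # Step 5.4: compare the objective value with the preset value eps.
        q = sum(u[i][j] ** m * dist[i][j] ** 2 for i in range(c) for j in range(n))
        if q < eps:
            break
    return centers, u, q

# Toy two-dimensional data for illustration only.
samples = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
centers, u, q = fuzzy_c_means(samples, c=2)
```

Many fuzzy c-means implementations stop when the change in Q between iterations falls below a threshold rather than when Q itself does; the cap on iterations keeps this literal reading from looping indefinitely when Q never drops below epsilon.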

In a specific embodiment, the specific formula for initializing the novel coronavirus feature membership matrix u_ij in step 5.1 is as follows:

In a specific embodiment, the specific formula for obtaining the novel coronavirus feature clustering center v_i in step 5.2 is as follows:

wherein m is a real number greater than 1.

In a specific embodiment, the specific formula for updating the novel coronavirus feature membership matrix u_ij in step 5.3 is as follows:

wherein v_k represents the kth clustering center.

In a specific embodiment, the specific formula for obtaining the objective function value in step 5.4 from the updated novel coronavirus feature membership matrix u_ij and the novel coronavirus feature clustering center v_i is as follows:

wherein Q is the objective function value. To further verify the feasibility of the method and the consistency of its conclusions, the clustering parameter c and the coefficient m can be traversed within a certain range.

In a specific embodiment, the specific formula for calculating, in step 6, the Euclidean distance between the normalized numerical features of the other coronaviruses and the novel coronavirus clustering center is as follows:

wherein Distance is the Euclidean distance between the virus clustering center and the novel coronavirus clustering center, and x_j' is the normalized numerical feature of a coronavirus that infects other animals. The smaller the distance, the higher the similarity and the higher the probability of homology; conversely, the larger the distance, the lower the similarity and the lower the probability of homology.
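The distance-based comparison can be illustrated with a short sketch. It assumes that, when several clustering centers exist, the distances to them are averaged as described in step 6. The vectors, host labels and function name are hypothetical and are not results from this document.

```python
import math

def mean_distance_to_centers(vector, centers):
    """Average Euclidean distance from one normalized vector to all centers."""
    return sum(math.dist(vector, center) for center in centers) / len(centers)

# Illustrative values only.
centers = [[0.2, 0.8, 0.5], [0.3, 0.7, 0.4]]
candidates = {
    "host_a": [0.25, 0.75, 0.45],
    "host_b": [0.9, 0.1, 0.6],
}

scores = {name: mean_distance_to_centers(vec, centers)
          for name, vec in candidates.items()}
ranking = sorted(scores, key=scores.get)  # smallest distance first
```

The first entry of `ranking` is the candidate closest to the clustering centers, which under this method is read as the most similar and most likely homologous virus.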

FIG. 2 shows the results of similarity analysis using the present invention, specifically the similarity (distance) between the coronaviruses infecting the four animals and the novel coronavirus. As can be seen from FIG. 2, when the gene similarity between each animal-infecting coronavirus sample and the novel coronavirus sample was analyzed, the bat-infecting coronavirus sample showed the highest gene similarity to the novel coronavirus sample (the smallest Euclidean distance), from which it can be inferred that the novel coronavirus is most likely derived from bats.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A novel coronavirus genome feature similarity measurement method is characterized by comprising the following steps:

2. The novel coronavirus genome feature similarity measurement method according to claim 1, wherein obtaining the novel coronavirus clustering center in step 5 specifically comprises:

step 5.2, obtaining the novel coronavirus feature clustering center v_i from the shared viral genome features x_j of the novel coronavirus and the novel coronavirus feature membership matrix u_ij;

step 5.3, updating the novel coronavirus feature membership matrix u_ij according to the novel coronavirus feature clustering center v_i;

step 5.4, obtaining an objective function value from the updated novel coronavirus feature membership matrix u_ij and the novel coronavirus feature clustering center v_i, and comparing the objective function value with a preset value: if the objective function value is smaller than the preset value, outputting the novel coronavirus feature clustering center v_i; if the objective function value is greater than or equal to the preset value, returning to step 5.2 to obtain the novel coronavirus feature clustering center v_i again.

3. The method according to claim 2, wherein the specific formula for initializing the novel coronavirus feature membership matrix u_ij in step 5.1 is as follows:

4. The method according to claim 3, wherein the specific formula for obtaining the novel coronavirus feature clustering center v_i in step 5.2 is as follows:

wherein m is a real number greater than 1.

5. The method according to claim 4, wherein the specific formula for updating the novel coronavirus feature membership matrix u_ij in step 5.3 is as follows:

wherein v_k represents the kth clustering center.

6. The method according to claim 5, wherein the specific formula for obtaining the objective function value in step 5.4 from the updated novel coronavirus feature membership matrix u_ij and the novel coronavirus feature clustering center v_i is as follows:

wherein Q is the objective function value.

7. The novel coronavirus genome feature similarity measurement method according to claim 6, wherein the specific formula for calculating, in step 6, the Euclidean distance between the normalized numerical features of the other coronaviruses and the novel coronavirus clustering center is as follows:

wherein Distance is the Euclidean distance between the virus clustering center and the novel coronavirus clustering center, and x_j' is the normalized numerical feature of a coronavirus that infects other animals.