CN112086127B

CN112086127B - Group genetic difference comparison method based on mutation function

Info

Publication number: CN112086127B
Application number: CN202010979785.7A
Authority: CN
Inventors: 尹继业; 王蕾云; 郭成贤
Original assignee: Xiangya Hospital of Central South University
Current assignee: Xiangya Hospital of Central South University
Priority date: 2020-09-17
Filing date: 2020-09-17
Publication date: 2023-03-10
Anticipated expiration: 2040-09-17
Also published as: CN112086127A

Abstract

The invention discloses a method for comparing genetic variation differences between populations with respect to functionally important mutations. Mutation function is first evaluated with prediction software or experimental functional validation to obtain a set of functional mutations, which are given a higher weight value. Genotype information is then obtained for each individual in a population by next-generation sequencing or a genotyping technology such as an SNP chip, and the frequency of each polymorphic site in the population is calculated. The method raises the genetic unit of comparison from a single polymorphic site to a single gene, so that the degree of difference of all functional mutations of a gene can be compared between populations. This helps predict differences in gene-related phenotypes between populations and guides genetic research on those genes across populations.

Description

Group genetic difference comparison method based on mutation function

Technical Field

The invention relates to a method for comparing population genetic differences based on mutation function. The method calculates the between-population differences in mutation frequency at mutation sites that potentially affect gene function, and uses these differences to evaluate differences in gene function between populations. It belongs to the technical field of bioinformatics.

Background

As next-generation sequencing has become widespread, genetic data are being generated in large quantities. In this big-data context, a growing body of research compares genetic material between populations. Such comparisons are currently made by directly comparing single mutation sites. However, a gene sequence often contains thousands of base pairs, and the mutation frequency difference at one or a few sites is not sufficient to describe how the gene as a whole differs between populations.

Moreover, not every mutation site in a gene affects the function of that gene, so it is not appropriate to treat all mutation sites equally. In genetic studies, determining how much a gene differs between populations helps screen for genes that differ markedly between populations. Focusing functional research on the sites that differ most within those genes can then help geneticists find population-specific functional mutation sites more quickly and accurately.

In summary, establishing a high-throughput, simple and specific means of evaluating gene differences between populations can improve the screening efficiency of genetic research and identify the genes that differ most between populations, so that they can be studied first in the candidate populations.

Disclosure of Invention

The invention aims to close the gap between existing comparison methods and what such comparisons require. It provides a comparison method that evaluates differences in gene function between populations by calculating the between-population differences in mutation frequency at mutation sites that potentially affect gene function.

To achieve this purpose, the invention provides the following technical schemes.

The first technical scheme of the invention is as follows:

A method for comparing the difference score of each gene between populations, comprising the following steps:

(1) A weight value, for example 1, is assigned to each mutation site judged to be functional, and a weight value of 0 is assigned to each mutation site judged not to significantly affect gene function;

(2) The genetic information of each individual is detected and the genotype information is recorded; individuals with poor typing quality or failed typing are excluded from subsequent analysis;

(3) From the typing results, the frequency of a given mutation site on a single gene in population a is tallied and recorded as F_a, and the genotype frequency of the same site in the population b to be compared is recorded as F_b;

(4) The statistics of step (3) are repeated for all mutation sites of all genes, F_a - F_b is calculated gene by gene, and the differences are sorted from largest to smallest to obtain G_ij, where i is the number of a gene, j is the number of a site on that gene, and G_i1 is the maximum value of F_a - F_b;

(5) For the sites among all G_ij whose weight value is 1, the mutation frequency sum S_{i,G=1} of all important mutation sites on gene i is calculated, where G is the weight value, j_1 is the number of a G=1 mutation site on gene i (distinct from the number j; j_1 numbers only the G=1 mutation sites), and n_1 is the number of all G=1 mutation sites on gene i:

[Formula image not reproduced in this text]

(6) For the sites among all G_ij whose weight value is 0, the sum S_{i,G=0} of the number of all non-important mutation sites on gene i is calculated, where G is the weight value and n_0 is the number of all G=0 mutation sites on gene i:

[Formula image not reproduced in this text]

(7) The mutation score S_ij of each mutation site on gene i is calculated, where i is the number of a gene, j is the number of a site on that gene, j_1 is the number of a G=1 mutation site on gene i (distinct from the number j; j_1 numbers only the G=1 mutation sites), q_1 is the number of the nearest G=1 mutation site before site j on gene i (distinct from the number j; q_1 numbers only the G=1 mutation sites), and q_0 is the number of the nearest G=0 mutation site before site j on gene i (distinct from the number j; q_0 numbers only the G=0 mutation sites):

[Formula image not reproduced in this text]

(8) The score S_i of a gene is calculated, where i is the number of a gene, j is the number of a site on that gene, and n is the number of all mutation sites on gene i:

[Formula image not reproduced in this text]

A positive S_i indicates that the functional mutations of the gene affect the gene to a greater extent in population a than in population b; a negative S_i indicates that the functional mutations of the gene affect the gene to a smaller extent in population a than in population b;

(9) If the typing results of population a (mutation site frequency F_a) are to be compared with another population c (mutation site frequency F_c), F_a - F_c is calculated instead in step (4) to obtain G_ij, and steps (5) to (8) remain unchanged. The formulas can also be implemented as a computer program, so that gene difference values between different populations can be compared in batches for a large number of genes.
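Steps (1) to (4), together with the site counts used in steps (5) to (7), can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the `Site` record, the function name `rank_sites` and the toy frequencies are hypothetical, and the running counts q_1 and q_0 are taken up to and including site j, as in the worked ABCA1 example of the detailed description. The scoring formulas of steps (5) to (8) are not implemented, so the sketch stops at the ranked differences and the counts.

```python
from dataclasses import dataclass


@dataclass
class Site:
    site_id: str      # hypothetical site label
    freq_a: float     # F_a: mutation frequency in population a
    freq_b: float     # F_b: mutation frequency in population b
    functional: bool  # step (1): judged to affect gene function


def rank_sites(sites):
    """Sort the sites of one gene by F_a - F_b, largest first (step 4),
    and attach the weight and the running counts q1 and q0."""
    ordered = sorted(sites, key=lambda s: s.freq_a - s.freq_b, reverse=True)
    rows, q1, q0 = [], 0, 0
    for j, s in enumerate(ordered, start=1):
        weight = 1 if s.functional else 0
        q1 += weight        # number of weight-1 sites among ranks 1..j
        q0 += 1 - weight    # number of weight-0 sites among ranks 1..j
        rows.append({"j": j, "site": s.site_id,
                     "G_ij": s.freq_a - s.freq_b,
                     "weight": weight, "q1": q1, "q0": q0})
    return rows


# Toy data for a single gene (made-up frequencies, for illustration only).
toy_gene = [
    Site("A", 0.30, 0.10, True),
    Site("B", 0.05, 0.20, False),
    Site("C", 0.12, 0.10, False),
    Site("D", 0.40, 0.10, True),
]
rows = rank_sites(toy_gene)
n1 = sum(r["weight"] for r in rows)  # number of weight-1 sites on the gene
n0 = len(rows) - n1                  # number of weight-0 sites on the gene
for r in rows:
    print(r)
```

To compare population a with a third population c, as in step (9), the same function could be called with F_c supplied in place of F_b.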

The second technical scheme of the invention is as follows:

a population genetic difference comparison method based on mutation functions comprises the following steps:

(I) Evaluating the importance of each mutation site and whether it may affect the function of the gene.

(II) Detecting the genetic information of each individual and recording the genotype information.

(III) Tallying the genotype frequencies of the individual samples of the different populations. Where the genetic data of different populations were obtained by different sequencing methods, the gene loci that were not detected must be flagged.

(IV) Calculating, gene by gene, the difference at each site between two populations, and ranking the sites by their difference in mutation frequency.

(V) Calculating the difference score of each gene between populations by combining the functional information of the mutations with the mutation frequency differences of the mutation sites, so as to compare the degree of difference of the functional mutations of each gene between populations.
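Steps (II) and (III) can be illustrated with a short Python sketch. It assumes, purely for illustration, that each individual's genotype at a biallelic site is coded as the number of mutant alleles carried (0, 1 or 2) and that a failed or low-quality call is coded as `None`; this coding and the function name `mutation_frequency` are assumptions and are not specified by the method.

```python
def mutation_frequency(genotypes):
    """Mutant-allele frequency at one site of one population.

    genotypes: list with one entry per individual, each 0, 1 or 2
    (count of mutant alleles) or None for a failed call.
    Returns None when no individual was typed successfully, so that
    an undetected site is never mistaken for a frequency of 0.
    """
    called = [g for g in genotypes if g is not None]  # drop failed calls
    if not called:
        return None
    # Each successfully typed diploid individual contributes two alleles.
    return sum(called) / (2 * len(called))


# Five individuals, one of whom failed typing.
print(mutation_frequency([0, 1, 2, None, 1]))
```

Returning `None` rather than 0 for a site with no successful calls keeps undetected loci distinguishable from loci at which the mutation is truly absent.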

Preferably, the method used in step (I) to evaluate the importance of a mutation site and whether it may affect the function of the gene includes, but is not limited to: molecular biology experiments, animal models, published gene association studies, or functional prediction software.

Preferably, the method used in step (II) to detect the genetic information of an individual includes, but is not limited to, next-generation sequencing or a genotyping technology such as the MassARRAY nucleic acid mass spectrometry system or an SNP chip.

Preferably, the undetected gene loci in step (III) refer to the case where the coverage of the detected sites differs between the two populations; only the common mutations that can be detected in both populations are considered.

Preferably, the gene ranges used when working gene by gene in step (IV) are delimited according to the current version of the genome. The reference database includes, but is not limited to, public databases such as NCBI and Ensembl. The delimitation of gene intervals may change as versions are updated.

Preferably, the calculation in step (V) that combines the functional information of the mutations with the mutation frequency differences of the mutation sites is carried out by the specific method set out in claim 1.
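The handling of undetected loci described for step (III) can be sketched as follows. This is a minimal illustration, assuming each population's frequencies are held in a dictionary keyed by a site identifier, with `None` marking a site that was flagged as undetected; the function name `common_site_differences` and the data layout are hypothetical.

```python
def common_site_differences(freq_a, freq_b):
    """Return F_a - F_b for sites detected in both populations only.

    freq_a, freq_b: dict mapping a site identifier to a mutation
    frequency, or to None if the site was flagged as undetected.
    Sites missing from, or undetected in, either population are
    skipped rather than being treated as frequency 0.
    """
    diffs = {}
    for site in freq_a.keys() & freq_b.keys():
        fa, fb = freq_a[site], freq_b[site]
        if fa is None or fb is None:
            continue  # undetected in one population: not comparable
        diffs[site] = fa - fb
    return diffs


pop_a = {"s1": 0.5, "s2": 0.25, "s3": None}
pop_b = {"s1": 0.25, "s3": 0.1, "s4": 0.3}
print(common_site_differences(pop_a, pop_b))
```

Only `s1` survives in this toy input: `s2` and `s4` were each typed in one population only, and `s3` is flagged as undetected in population a.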

The invention provides formulas for high-throughput calculation of gene scores between populations. Calculation software can be developed from these formulas and used as a tool for comparing degrees of genetic difference in population genetics. At present, the degree of genetic difference between populations is mostly assessed by comparing the mutation frequency differences of single mutation sites, and simply averaging over all mutations is not suitable. The invention addresses this by providing an analysis method that compares the degree of population genetic difference gene by gene, offering a new approach to comparing the genetic characteristics of populations.

Drawings

FIG. 1 is a flow chart of the method of the invention as carried out in practice.

Detailed description of the preferred embodiments

The invention relates to a method for comparing population genetic differences based on mutation function. The method compares the between-population differences in mutation frequency at mutation sites that potentially affect gene function, and uses these differences to evaluate differences in gene function between populations. It belongs to the technical field of bioinformatics.

Taking the comparison of the ABCA1 gene between population A and population B (or population C) as an example, the population genetic difference comparison method based on mutation function comprises the following steps:

(I) The importance of the mutation sites on the ABCA1 gene, and whether they may affect the function of the gene, is evaluated with the prediction software SIFT and PROVEAN to obtain a damage score for each mutation.

(II) Genetic information for the exon regions of population B (PoB) and population C (PoC) is obtained from the ExAC database, and the genotype information is recorded.

(III) The genotype frequencies of the PoB and PoC populations are tallied. Because exome sequencing rarely covers introns and intergenic regions, and the functional impact of synonymous mutations is difficult to predict with SIFT and PROVEAN, only missense mutations are analyzed in this example; all synonymous mutations are flagged and excluded from the analysis.

(IV) The missense mutations on the ABCA1 gene are classified by function according to the cutoff values of SIFT and PROVEAN (SIFT deleteriousness cutoff: 0.05; PROVEAN deleteriousness cutoff: -2.5).

(V) A weight value of 1 is assigned to each mutation site judged to be functional by SIFT or PROVEAN, and a weight value of 0 is assigned to each mutation site judged not to significantly affect gene function.

(VI) The genetic information of the population under test is detected and the genotype information is recorded. Individuals with poor typing quality or failed typing are excluded from subsequent analysis. In this example, the mutation frequencies of population A (PoA) in the ExAC database are used in place of the quality-controlled mutation frequencies of the population under test.

(VII) The frequency of a given mutation site on the ABCA1 gene in the typing results of population A (PoA) is recorded as F_a, and the genotype frequency of the same site in the population B (PoB) to be compared is recorded as F_b.

(VIII) The calculation of step (VII) is repeated for all mutation sites on the ABCA1 gene. If more genes are to be analyzed, steps (VII) and (VIII) are repeated for each gene in turn. F_a - F_b is calculated and the differences are sorted from largest to smallest to obtain G_ij, where i is the number of a gene (in this example the ABCA1 gene is number 1) and j is the number of a site on the ABCA1 gene. In this example, the missense mutation of the ABCA1 gene with the greatest difference in mutation frequency between the PoA and PoB populations (chromosome 9, physical position 107588033, base C before mutation, base T after mutation) is number 1.

(IX) For the sites with weight value 1 on the ABCA1 gene, the mutation frequency sum S_{1,G=1} of all important mutation sites is calculated, where j_1 is the number of a G=1 mutation site on the ABCA1 gene (distinct from the number j; j_1 numbers only the G=1 mutation sites). In this example n_1, the number of all G=1 mutation sites on the ABCA1 gene, i.e. the number of all potentially deleterious mutations, is 713:

[Formula image not reproduced in this text]

(X) Likewise, for the sites with weight value 0 on the ABCA1 gene (gene number 1), the sum S_{1,G=0} of the number of all non-important mutation sites is calculated. In this example n_0, the number of all G=0 mutation sites on the ABCA1 gene, i.e. the number of all mutations predicted to be harmless, is 607:

[Formula image not reproduced in this text]

(XI) The mutation score S_1j of each mutation site on the ABCA1 gene is calculated. In S_1j, 1 is the number of the ABCA1 gene and j is the number of a site on the ABCA1 gene; j_1 is the number of a G=1 mutation site on the ABCA1 gene (distinct from the number j; j_1 numbers only the G=1 mutation sites); q_1 is the number of the nearest G=1 mutation site before site j on the ABCA1 gene (distinct from the number j; q_1 numbers only the G=1 mutation sites); and q_0 is the number of the nearest G=0 mutation site before site j on the ABCA1 gene (distinct from the number j; q_0 numbers only the G=0 mutation sites). The missense mutation at j = 10 (chromosome 9, physical position 107593982, base T before mutation, base C after mutation) is taken as an example: of the 10 missense mutations up to and including the 10th (j = 10), 2 are important missense mutations, i.e. q_1 = 2, and 8 are non-important missense mutations, i.e. q_0 = 8:

[Formula image not reproduced in this text]

(XII) The score S_1 of the ABCA1 gene is calculated, where 1 is the number of the ABCA1 gene, j is the number of a site on the gene, and n, the number of all mutation sites on the ABCA1 gene, is 1320 in this example:

[Formula image not reproduced in this text]

A positive S_1 indicates that the functional mutations of the gene affect the gene to a greater extent in the PoA population than in the PoB population; a negative S_1 indicates that they affect the gene to a smaller extent in the PoA population than in the PoB population. In this example S_1 is negative, indicating that the ABCA1 gene carries a smaller degree of functional mutation in the PoA population than in the PoB population.

(XIII) If the typing results of the population (PoA, mutation site frequency F_a) are to be compared with a second comparison population (PoC, mutation site frequency F_c), F_a - F_c is calculated instead in step (VIII) to obtain a new G_ij, and steps (IX) to (XII) remain unchanged. The formulas can also be implemented as a computer program, so that gene difference values between different populations can be compared in batches for a large number of genes.
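The weighting by SIFT and PROVEAN cutoffs in this example can be sketched in Python. The cutoff values 0.05 and -2.5 are those given in the example; reading a score at or below the cutoff as deleterious follows the usual convention of both tools, but treating the boundary itself as deleterious, and treating a missing score as not deleterious, are assumptions of this sketch. The function name `site_weight` is hypothetical.

```python
SIFT_CUTOFF = 0.05      # SIFT deleteriousness cutoff from the example
PROVEAN_CUTOFF = -2.5   # PROVEAN deleteriousness cutoff from the example


def site_weight(sift_score, provean_score):
    """Return weight 1 if SIFT or PROVEAN calls the missense mutation
    deleterious, otherwise 0. A score of None (no prediction
    available) is treated as not deleterious in this sketch."""
    sift_hit = sift_score is not None and sift_score <= SIFT_CUTOFF
    provean_hit = (provean_score is not None
                   and provean_score <= PROVEAN_CUTOFF)
    return 1 if (sift_hit or provean_hit) else 0


# Made-up scores for three hypothetical missense mutations.
for sift, provean in [(0.01, -1.0), (0.40, -3.0), (0.40, -1.0)]:
    print(sift, provean, site_weight(sift, provean))
```

Applied to every missense mutation of a gene, the count of sites with weight 1 gives n_1 and the count with weight 0 gives n_0 (713 and 607 for ABCA1 in this example, 1320 in total).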

The above examples are only specific embodiments of the invention. The invention is evidently not limited to these embodiments, and modifications of the formulas also fall within the scope of protection of the invention.

Claims

1. A method for comparing the difference score of each gene between populations, characterized by comprising the following steps:

(1) assigning a weight value of 1 to each mutation site judged to be functional, and a weight value of 0 to each mutation site judged not to significantly affect gene function;

(3) from the typing results, tallying the frequency of a given mutation site on a single gene in population a and recording it as F_a, and recording the genotype frequency of the same site in the population b to be compared as F_b;

(4) repeating the statistics of step (3) for all mutation sites of all genes, calculating F_a - F_b gene by gene, and sorting the differences from largest to smallest to obtain G_ij; wherein i is the number of a gene, j is the number of a site on that gene, and G_i1 is the maximum value of F_a - F_b;

(5) for the sites among all G_ij whose weight value is 1, calculating the mutation frequency sum S_{i,G=1} of all important mutation sites on gene i, wherein G is the weight value, j_1 is the number of a G=1 mutation site on gene i, j_1 is distinct from the number j and numbers only the G=1 mutation sites, and n_1 is the number of all G=1 mutation sites on gene i:

[Formula image not reproduced in this text]

(6) for the sites among all G_ij whose weight value is 0, calculating the sum S_{i,G=0} of the number of all non-important mutation sites on gene i, wherein G is the weight value and n_0 is the number of all G=0 mutation sites on gene i:

[Formula image not reproduced in this text]

(7) calculating the mutation score S_ij of each mutation site on gene i; wherein i is the number of a gene, j is the number of a site on that gene, j_1 is the number of a G=1 mutation site on gene i, is distinct from the number j and numbers only the G=1 mutation sites; q_1 is the number of the nearest G=1 mutation site before site j on gene i, is distinct from the number j and numbers only the G=1 mutation sites; and q_0 is the number of the nearest G=0 mutation site before site j on gene i, is distinct from the number j and numbers only the G=0 mutation sites:

[Formula image not reproduced in this text]

[Formula image not reproduced in this text]

(9) if the typing results of population a (mutation site frequency F_a) are to be compared with another population c (mutation site frequency F_c), calculating F_a - F_c instead in step (4) to obtain G_ij, with steps (5) to (8) unchanged; the formulas are also implemented as a computer program, so that gene difference values between different populations can be calculated in batches for a large number of genes.

2. A method for comparing population genetic differences based on mutation function, said method comprising the following steps:

(I) evaluating the importance of each mutation site and whether it may affect the function of the gene;

(II) detecting the genetic information of each individual and recording the genotype information;

(III) tallying the genotype frequencies of the individual samples of the different populations; where the genetic data of different populations were obtained by different sequencing methods, flagging the gene loci that were not detected;

(IV) calculating, gene by gene, the difference at each site between two populations, and ranking the sites by their difference in mutation frequency;

(V) calculating the difference score of each gene between populations by combining the functional information of the mutations with the mutation frequency differences of the mutation sites, so as to compare in batches the degree of difference of the functional mutations of each gene between populations, wherein the calculation specifically comprises the following steps:

(1) assigning a weight value of 1 to each mutation site judged to be functional, and a weight value of 0 to each mutation site judged not to significantly affect gene function;

(3) from the typing results, tallying the frequency of a given mutation site on a single gene in population a and recording it as F_a, and recording the genotype frequency of the same site in the population b to be compared as F_b;

[Formula image not reproduced in this text]

[Formula image not reproduced in this text]

(7) calculating the mutation score S_ij of each mutation site on gene i; wherein i is the number of a gene, j is the number of a site on that gene, j_1 is the number of a G=1 mutation site on gene i, is distinct from the number j and numbers only the G=1 mutation sites; q_1 is the number of the nearest G=1 mutation site before site j on gene i, is distinct from the number j and numbers only the G=1 mutation sites; and q_0 is the number of the nearest G=0 mutation site before site j on gene i, is distinct from the number j and numbers only the G=0 mutation sites:

[Formula image not reproduced in this text]

[Formula image not reproduced in this text]

a positive S_i indicates that the functional mutations of the gene affect the gene to a greater extent in population a than in population b; a negative S_i indicates that the functional mutations of the gene affect the gene to a smaller extent in population a than in population b;

(9) if the typing results of population a (mutation site frequency F_a) are to be compared with another population c (mutation site frequency F_c), calculating F_a - F_c instead in step (4) to obtain G_ij, with steps (5) to (8) unchanged; the formulas are also implemented as a computer program, so that gene difference values between different populations can be calculated in batches for a large number of genes.

3. The method of claim 2, wherein the evaluation of the importance of a mutation site and of whether it may affect the function of the gene includes, but is not limited to, the use of: molecular biology experiments, animal models, published gene association studies, or functional prediction software.

4. The method of claim 2, wherein the method for detecting the genetic information of an individual includes, but is not limited to, next-generation sequencing or a genotyping technology, the genotyping technology including, but not limited to, a MassARRAY nucleic acid mass spectrometry typing system or an SNP chip.

5. The method of claim 2, wherein the undetected gene loci arise where the coverage of the detected loci differs between two populations, for example where one population was typed by next-generation sequencing with wide coverage while the other was typed with a MassARRAY nucleic acid mass spectrometry system that detects only part of the mutation sites; the mutation frequency of an undetected locus cannot be taken to be 0 and cannot be compared directly, so such loci are flagged and removed.

6. The method of claim 2, wherein the gene ranges used when working gene by gene are delimited according to the current version of the genome, and the reference database includes, but is not limited to, the public databases NCBI and Ensembl.