CN111192626A

CN111192626A - Construction method, device, server and storage medium of Parkinson disease genomics association model

Info

Publication number: CN111192626A
Application number: CN201911424932.8A
Authority: CN
Inventors: 赵贵虎; 李津臣; 李滨; 唐北沙
Original assignee: Xiangya Hospital of Central South University
Current assignee: Xiangya Hospital of Central South University
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-22
Anticipated expiration: 2039-12-31
Also published as: CN111192626B

Abstract

The application discloses a method for constructing a Parkinson's disease genomics association model. The method comprises the following steps: acquiring first genetic data and second genetic data; annotating the first genetic data to obtain annotated data; scoring the annotated data and the second genetic data according to a preset scoring rule; and dividing priority levels according to the scoring result to construct a Parkinson association model. The application solves the technical problem that the association between Parkinson's disease and genes cannot be judged simply and quickly.

Description

Construction method, device, server and storage medium of Parkinson disease genomics association model

Technical Field

The application relates to the field of model construction, in particular to a construction method, a device, a server and a storage medium of a Parkinson disease genomics association model.

Background

Parkinson's disease (PD), also known as parkinsonism, is the second most common neurodegenerative disease after dementia. It occurs mainly in middle-aged and elderly people, its incidence increases with age, and it is widely distributed among different ethnic groups worldwide.

Epidemiological surveys show that the incidence of PD in developed countries is about 8-18 per 100,000 per year. The prevalence in the general population is about 0.3%, and reaches 1% and 3% in people over 60 and over 80 years old, respectively. An epidemiological survey published in The Lancet early this century reported a prevalence of 1.7% among people over 65 years old in China.

Clinically, Parkinson's disease is characterized by four cardinal features: resting tremor, muscular rigidity, bradykinesia, and postural and gait abnormalities. It may also be accompanied by various non-motor symptoms (NMS) such as hyposmia, constipation, depression and sleep disorders. Pathologically, it is mainly characterized by degeneration and loss of dopaminergic (DA) neurons in the midbrain; in addition, accumulation of α-synuclein and formation of Lewy bodies are important manifestations of PD. PD progresses slowly, and the condition gradually worsens as the disease develops.

Currently, the pathogenesis of PD is not clear. Research shows that genetic factors, environmental factors and aging act together to cause the disease. Approximately 10%-15% of PD patients have a family history. Since the first PD pathogenic gene, PARK1 (SNCA), was identified in a Parkinson's disease pedigree in 1997, the role of genetic factors in PD has received increasing attention. With the rapid development of high-throughput sequencing technology and bioinformatics analysis methods, 20 PD pathogenic genes have been successfully cloned to date through the combined and complementary use of linkage and association analysis, homozygosity mapping, whole-exome sequencing, whole-genome sequencing and other methods. Of these, 10 genes are related to autosomal dominant hereditary Parkinson's disease (SNCA, UCHL1, LRRK2, HTRA2, GIGYF2, VPS35, EIF4G1, DNAJC13, CHCHD2 and TMEM230); 9 genes are related to autosomal recessive hereditary Parkinson's disease (PRKN, DJ-1, PINK1, ATP13A2, PLA2G6, FBXO7, DNAJC6, SYNJ1 and VPS13C); and 1 gene is related to X-linked hereditary Parkinson's disease (RAB39B). Meanwhile, as multiple PD-related genome-wide association studies (GWAS) have been carried out, more than 20 susceptibility genes and loci have been discovered, providing a large amount of data for explaining the heritability of PD from the perspective of population genetics.

According to incomplete statistics, more than 3000 SCI-indexed publications have so far reported rare variants, common variants and copy number variants of PD pathogenic genes found in families or sporadic cases. With the development of epigenetics, DNA methylation has also been shown to have a significant impact on gene translation and expression. Meanwhile, a number of publications report studies that search for pathogenic genes related to Parkinson's disease through differential gene expression in population cohorts. However, the inventors find that this body of research lacks a platform for comprehensive collation and data analysis, so the association between Parkinson's disease and genes cannot be judged simply and quickly.

In the related art, no effective solution has yet been proposed for the problem that the association between Parkinson's disease and genes cannot be judged simply and quickly.

Disclosure of Invention

The application mainly aims to provide a construction method, a device, a server and a storage medium of a parkinsonism genomics association model so as to solve the problem that parkinsonism and gene association judgment cannot be simply and quickly realized.

In order to achieve the above objects, according to one aspect of the present application, there is provided a method for constructing a parkinsonism genomics association model.

The construction method of the parkinsonism genomics correlation model comprises the following steps: acquiring first genetic data and second genetic data; annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data; scoring the first annotated data and the second genetic data according to a preset scoring rule; and dividing priority levels according to the scoring result to construct a Parkinson association model.

Further, obtaining the first genetic data and the second genetic data comprises: acquiring literature data in PubMed and gene data submitted by a user; cleaning, denoising and homogeneity processing are carried out on the document data to obtain rare variation data; the genetic data and rare variation data are taken as first genetic data and the genetically related data are taken as second genetic data.

Further, annotating the first genetic data and the second genetic data to obtain the first annotated data and the second annotated data comprises: annotating the rare variant data with 23 pieces of software in combination with algorithms to obtain first annotated data; and annotating the second genetic data with 63 pieces of software to obtain second annotated data.

Further, scoring the annotated data and the second genetic data according to a preset scoring rule comprises: identifying genomic data information for the first annotated data and the second genetic data; determining a first score for a single occurrence of the species of gene or variation in a single document from the genomic data information in a preset score-data table; counting the occurrence frequency of the gene or the variation of the species in a single document; and inputting the genomics data information, the occurrence times and the first score into a scoring model to obtain the total score of the type of the gene or the variation.

Further, the step of establishing the parkinson association model by dividing the priority levels according to the scoring results comprises: dividing the scoring result into a plurality of score areas from high to low by adopting a dividing algorithm; giving confidence level to each score area according to the scores; and constructing a Parkinson association model according to the confidence level.

Further, after dividing priority levels according to the scoring results and constructing the Parkinson association model, the method further comprises: building an online database and an analysis framework with the PHP-based web framework Laravel; and performing an evaluation analysis on the first genetic data and the second genetic data in the analysis framework in combination with 63 pieces of software or algorithms.

Further, after dividing priority levels according to the scoring results and constructing the Parkinson association model, the method further comprises: receiving a genomics file uploaded by a user; and evaluating the reliability of the association between the genes in the genomics file and Parkinson's disease according to the Parkinson association model.

In order to achieve the above object, according to another aspect of the present application, there is provided an apparatus for constructing a parkinsonism genomics related model.

The device for constructing the parkinsonism genomics correlation model comprises the following components: an acquisition module for acquiring first genetic data and second genetic data; the annotation module is used for annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data; a scoring module for scoring the first annotated data and the second genetic data according to a preset scoring rule; and the construction module is used for dividing priority levels according to the scoring results and constructing the Parkinson association model.

To achieve the above object, according to another aspect of the present application, there is provided a server.

The server according to the present application includes: the construction method of the Parkinson disease genomics correlation model.

In order to achieve the above object, according to another aspect of the present application, there is provided a storage medium.

The storage medium according to the present application includes: the construction method of the Parkinson disease genomics correlation model.

In the embodiment of the application, a method for constructing a Parkinson association model is adopted, and first genetic data and second genetic data are obtained; annotating the first genetic data to obtain annotated data; scoring the annotated data and the second genetic data according to a preset scoring rule; dividing priority levels according to the scoring results to construct a Parkinson association model; the purpose of judging the correlation between the Parkinson's disease and the gene through the Parkinson correlation model is achieved, so that the technical effect of simply and quickly realizing the correlation judgment between the Parkinson's disease and the gene is realized, and the technical problem that the correlation judgment between the Parkinson's disease and the gene cannot be simply and quickly realized is solved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:

FIG. 1 is a schematic flow chart of a method for constructing a parkinsonism genomics relevance model according to a first embodiment of the present application;

FIG. 2 is a schematic flow chart of a method for constructing a parkinsonism genomics relevance model according to a second embodiment of the present application;

FIG. 3 is a schematic flow chart of a method for constructing a Parkinson's disease genomics correlation model according to a third embodiment of the present application;

FIG. 4 is a schematic flow chart of a method for constructing a parkinsonism genomics relevance model according to a fourth embodiment of the present application;

FIG. 5 is a schematic flow chart of a method for constructing a parkinsonism genomics relevance model according to a fifth embodiment of the present application;

FIG. 6 is a schematic flow chart of a method for constructing a parkinsonism genomics relevance model according to a sixth embodiment of the present application;

FIG. 7 is a schematic flow chart of a method for constructing a Parkinson's disease genomics correlation model according to a seventh embodiment of the present application;

FIG. 8 is a schematic structural diagram of a device for constructing a Parkinson's disease genomics correlation model according to a first embodiment of the present application;

FIG. 9 is a table diagram according to a preferred embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the invention and its embodiments and are not intended to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.

Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meanings of these terms in the present invention can be understood by those skilled in the art as appropriate.

Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to specific situations.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

According to an embodiment of the present invention, there is provided a method for constructing a Parkinson's disease genomics association model. As shown in FIG. 1, the method includes steps S100 to S106 as follows:

step S100, acquiring first genetic data and second genetic data;

according to an embodiment of the present invention, preferably, as shown in fig. 2, the acquiring the first genetic data and the second genetic data includes:

step S200, acquiring literature data in PubMed and gene data submitted by a user;

step S202, cleaning, denoising and homogeneity processing are carried out on the document data to obtain rare variation data;

and step S204, taking the gene data and the rare variation data as first genetic data, and taking the genetic related data as second genetic data.

PubMed is a literature database that records relevant data information; a connection to PubMed is established through an interface so that literature data can be acquired over it. The gene data is uploaded to the server by a user through an app or computer software and then stored by the server. The literature data contains many types of data, a large part of which is not required by the invention, and the data types are mutually incompatible. The acquired literature data is therefore cleaned, denoised and homogenized to obtain rare variation data, so that the acquired data is more accurate and can be conveniently used in the next step.

In this embodiment, the rare variant data is first genetic data, and these data need to be annotated by matching with software and algorithm; namely, the data which needs to be annotated and then scored is classified as the first genetic data, and the annotation and scoring are waited for.

In this embodiment, the genetic data submitted by the user is the second genetic data, and the data is scored without annotation; can be directly applied to the next grade division.

The second genetic data includes, but is not limited to, rare variations, single nucleotide polymorphisms, copy number variations, differentially expressed genes, DNA methylation genes, and the like.
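The cleaning, noise reduction and homogenization of step S202 can be illustrated with a minimal Python sketch. The record layout (`pmid`, `gene`, `variant`) and the specific rules (drop incomplete rows, upper-case gene symbols, strip spaces, remove duplicates) are assumptions made for illustration; the application does not specify them.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class VariantRecord:
    """Hypothetical homogenized record extracted from one document."""
    pmid: str      # identifier of the source document
    gene: str      # gene symbol, normalized to upper case
    variant: str   # variant description with whitespace removed


def clean_records(raw: Iterable[dict]) -> List[VariantRecord]:
    """Drop incomplete rows, normalize formatting, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in raw:
        pmid = str(row.get("pmid") or "").strip()
        gene = str(row.get("gene") or "").strip().upper()
        variant = str(row.get("variant") or "").replace(" ", "")
        if not (pmid and gene and variant):
            continue  # noise reduction: skip rows missing a required field
        record = VariantRecord(pmid, gene, variant)
        if record in seen:
            continue  # the same finding reported twice in one document
        seen.add(record)
        cleaned.append(record)
    return cleaned
```

Normalizing before deduplicating means that rows differing only in case or spacing collapse into a single record, which is one plausible reading of "homogeneity processing".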

Step S102, annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;

according to an embodiment of the present invention, preferably, as shown in fig. 3, annotating the first genetic data and the second genetic data, and obtaining the first annotated data and the second annotated data includes:

step S300, annotating the rare variant data with 23 pieces of software in combination with algorithms to obtain first annotated data;

and step S302, annotating the second genetic data by using 63 pieces of software to obtain second annotated data.

The 23 software programs are: SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, MetaSVM, MetaLR, VEST3, M-CAP, CADD, GERP++, DANN, FATHMM-MKL, Eigen, GenoCanyon, fitCons, phyloP, phastCons, SiPhy, REVEL and ReVe. These tools annotate the rare variant data with frequencies in different populations, the amino acid changes caused in different transcripts, the deleteriousness predicted by different prediction software, and the like. As a result, when a subsequent model is built or a person calls and uses the data directly, the annotated data can be used as-is, which improves both research efficiency and model-building efficiency.

Similarly, 63 pieces of software are used to annotate gene data, including basic gene information (UniProt, NCBI Gene, BioSystems); gene intolerance to mutation (RVIS, LoFtool); protein interactions (InBio Map); differential tissue expression of genes (GTEx); drug-gene interaction information (DGIdb); and the like. As a result, when a subsequent model is built or a person calls and uses the data directly, the annotated data can be used as-is, which improves both research efficiency and model-building efficiency.
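As a rough illustration of how the outputs of several prediction tools might be gathered into one annotation, the following Python sketch treats each tool as a lookup function. The tool names `tool_a` and `tool_b`, the score values and the thresholds are all hypothetical; they do not reflect the interfaces or score conventions of the real programs named above.

```python
from typing import Callable, Dict, Optional

# A predictor maps a variant key to a score, or None when it has no prediction.
Predictor = Callable[[str], Optional[float]]


def annotate_variant(
    variant_key: str, predictors: Dict[str, Predictor]
) -> Dict[str, Optional[float]]:
    """Collect one score per tool for a single variant."""
    return {name: predict(variant_key) for name, predict in predictors.items()}


def count_damaging(
    annotation: Dict[str, Optional[float]], thresholds: Dict[str, float]
) -> int:
    """Count tools whose score meets or exceeds that tool's threshold.

    Assumes, for illustration only, that higher scores mean more damaging.
    """
    return sum(
        1
        for name, score in annotation.items()
        if score is not None and name in thresholds and score >= thresholds[name]
    )


# Hypothetical stand-ins for precomputed score tables.
predictors = {
    "tool_a": lambda key: {"v1": 0.9}.get(key),
    "tool_b": lambda key: None,
}
annotation = annotate_variant("v1", predictors)
```

Keeping missing predictions as `None` rather than zero avoids treating "no prediction" as "predicted benign" when the annotation is later scored.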

Step S104, scoring the annotated data and the second genetic data according to a preset scoring rule;

according to an embodiment of the present invention, preferably, as shown in fig. 4, scoring the first annotated data and the second genetic data according to a preset scoring rule includes:

step S400, identifying genomics data information of the first annotated data and the second genetic data;

step S402, determining a first score of the single occurrence of the type of gene or variation in a single document in a preset score-data table according to the genomics data information;

step S404, counting the occurrence frequency of the gene or the mutation of the species in a single document;

step S406, the genomics data information, the occurrence times and the first score are input into a scoring model, and the total score of the type of the gene or the variation is obtained.

Using an identification algorithm, the genetic type and mutation sites in the annotated rare variation data and in the gene data submitted by the user can be identified. For example, rare variations can be divided into four types: loss-of-function (LoF) variants, deleterious variants, tolerated missense variants and other variants. After identification by the algorithm, the type to which the currently identified data belongs can be determined.

After identification, the score of a mutation or gene of that category can be looked up according to the identification result in the score-data table shown in FIG. 9 (partial, for illustration only). Within one document, the type of gene or mutation may well appear many times, which clearly affects the final total score, so the number of occurrences is counted per document. Finally, the number of occurrences and the mutation site (the position at which the mutation occurs) are input into the scoring model to calculate the total score of a given type of mutation or gene at a given position; this score reflects, to some extent, the probability of Parkinson's disease.
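Steps S400 to S406 can be sketched as follows. The application does not disclose the scoring model's formula, so this Python sketch assumes the simplest possibility: the total score of a gene is the sum, over documents and categories, of the per-occurrence score multiplied by the number of occurrences. The values in `SCORE_TABLE` are placeholders, not the values of FIG. 9.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical per-occurrence scores; the real values are those of FIG. 9.
SCORE_TABLE: Dict[str, float] = {
    "LoF": 4.0,
    "deleterious": 3.0,
    "tolerated_missense": 1.0,
    "other": 0.5,
}

# One observation = (document id, gene, variant category).
Observation = Tuple[str, str, str]


def total_scores(observations: Iterable[Observation]) -> Dict[str, float]:
    """Sum first_score * occurrences per gene (assumed scoring model)."""
    # Step S404: count occurrences of each category per gene within each document.
    counts = Counter(observations)
    totals: Dict[str, float] = {}
    for (_doc, gene, category), occurrences in counts.items():
        first_score = SCORE_TABLE[category]  # step S402: table lookup
        totals[gene] = totals.get(gene, 0.0) + first_score * occurrences
    return totals


observations = [
    ("doc1", "LRRK2", "LoF"),
    ("doc1", "LRRK2", "LoF"),
    ("doc2", "LRRK2", "other"),
    ("doc1", "SNCA", "deleterious"),
]
scores = total_scores(observations)
```

With these placeholder values, `LRRK2` receives 4.0 × 2 from the first document plus 0.5 from the second, and `SNCA` receives 3.0.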

Step S106, dividing the priority level according to the scoring result, and constructing a Parkinson correlation model.

According to the embodiment of the present invention, preferably, as shown in fig. 5, the establishing the parkinson association model by dividing the priority levels according to the scoring results includes:

step S500, dividing a scoring result into a plurality of score areas from high to low by adopting a division algorithm;

step S502, giving a confidence level to each score area according to the level of the score;

and step S504, constructing a Parkinson association model according to the confidence level.

The higher the score, the greater the probability of disease. Based on this logic, the scores can be divided into five areas: a high area, a middle-high area, a middle area, a middle-low area and a low area. Confidence levels are then associated with the five score areas: the high-score area corresponds to the highest confidence of association with Parkinson's disease, and the confidence level decreases in order across the five areas. A Parkinson association model is thus constructed, with which a person can check whether the genes they provide are associated with Parkinson's disease and thereby judge the probability of the disease.
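The division into score areas can be sketched in Python as below. The application does not name the division algorithm, so equal-width bands between the lowest and highest score are assumed here, and the five band labels are likewise chosen for illustration.

```python
from typing import Dict

# Assumed labels for the five score areas, ordered from highest to lowest.
LEVELS = ["high", "middle-high", "middle", "middle-low", "low"]


def assign_levels(scores: Dict[str, float]) -> Dict[str, str]:
    """Map each gene's total score to one of five equal-width bands."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    width = (highest - lowest) / len(LEVELS)
    levels = {}
    for gene, score in scores.items():
        if width == 0:
            index = 0  # all scores equal: everything falls in the top band
        else:
            # Distance from the top score, measured in band widths.
            index = min(int((highest - score) / width), len(LEVELS) - 1)
        levels[gene] = LEVELS[index]
    return levels


model = assign_levels({"A": 10.0, "B": 7.0, "C": 5.0, "D": 0.0})
```

The resulting gene-to-level mapping plays the role of the association model in this sketch. A quantile-based split would be an equally plausible choice and would give each band roughly the same number of genes.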

According to the embodiment of the present invention, preferably, as shown in fig. 6, the prioritizing according to the scoring result, and after the parkinson association model is constructed, the method further includes:

step S600, building an analysis framework with the PHP-based web framework Laravel;

and step S602, performing evaluation analysis on the first genetic data and the second genetic data in the analysis framework by combining 63 pieces of software or algorithms.

A web-based online analysis framework is built with the PHP-based web framework Laravel to enable online analysis of data. More than 60 software tools or algorithms are used to evaluate and analyze the frequency of mutation sites in different ethnic groups, the tolerance of genes, gene-gene interactions, protein-protein interactions, signaling pathways and the like. Furthermore, by establishing a gene frequency database, pathogenicity analysis software, gene expression resources and other related data analysis software, more diversified software services are provided to users, which improves the user experience and research efficiency.

According to the embodiment of the present invention, preferably, as shown in fig. 7, the prioritizing according to the scoring result, and after the parkinson association model is constructed, the method further includes:

step S700, receiving a genomics file uploaded by a user;

and S702, evaluating the reliability of the genes in the genomics file related to the Parkinson disease according to a Parkinson association model.

Users only need to visit the website from a computer or mobile phone and upload a genomics data file in VCF4 format; the reliability of the association between the provided genes and Parkinson's disease is then evaluated automatically. This achieves the purpose of simply and quickly judging the association between Parkinson's disease and genes.
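Steps S700 and S702 can be sketched in Python as follows. The sketch assumes that the uploaded VCF has already been annotated with a gene symbol under an INFO key named `GENE`, and that the association model is a plain mapping from gene symbol to confidence level; both are assumptions for illustration, and the coordinates in the usage example are made up.

```python
from typing import Dict, Iterable, List, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into key/value pairs (flags map to '')."""
    fields = {}
    for item in info.split(";"):
        if item and item != ".":
            key, _, value = item.partition("=")
            fields[key] = value
    return fields


def evaluate_vcf(
    lines: Iterable[str], model: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (variant, gene, confidence level) for each annotated record."""
    results = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank, meta-information and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF data line
        chrom, pos, _id, ref, alt = cols[:5]
        gene = parse_info(cols[7]).get("GENE")  # hypothetical INFO key
        if gene is None:
            continue  # record carries no gene annotation
        level = model.get(gene, "not in model")
        results.append((f"{chrom}:{pos}{ref}>{alt}", gene, level))
    return results


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "12\t12345\t.\tG\tA\t.\tPASS\tGENE=LRRK2;DP=30",
    "1\t100\t.\tC\tT\t.\tPASS\tDP=10",
]
report = evaluate_vcf(vcf_lines, {"LRRK2": "high"})
```

Records without a gene annotation are skipped rather than reported, so the report only lists variants that the model can actually be consulted for.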

From the above description, it can be seen that the present invention achieves the following technical effects:

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

According to an embodiment of the present invention, there is also provided an apparatus for implementing the method for constructing a parkinson's disease genomics association model, as shown in fig. 8, the apparatus including:

an acquisition module 10 for acquiring first genetic data and second genetic data;

according to an embodiment of the present invention, preferably, the acquiring the first genetic data and the second genetic data includes:

acquiring literature data in PubMed and gene data submitted by a user;

cleaning, denoising and homogeneity processing are carried out on the document data to obtain rare variation data;

the genetic data and rare variation data are taken as first genetic data and the genetically related data are taken as second genetic data.

An annotating module 20 for annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;

according to an embodiment of the present invention, preferably, annotating the first genetic data and the second genetic data, and obtaining the first annotated data and the second annotated data includes:

annotating the rare variant data with 23 pieces of software in combination with algorithms to obtain first annotated data;

and annotating the second genetic data with 63 pieces of software to obtain second annotated data.

A scoring module 30 for scoring the annotated data and the second genetic data according to a preset scoring rule;

according to an embodiment of the present invention, preferably, scoring the first annotated data and the second genetic data according to a preset scoring rule comprises:

identifying genomic data information for the first annotated data and the second genetic data;

determining a first score for a single occurrence of the species of gene or variation in a single document from the genomic data information in a preset score-data table;

counting the occurrence frequency of the gene or the variation of the species in a single document;

and inputting the genomics data information, the occurrence times and the first score into a scoring model to obtain the total score of the type of the gene or the variation.

And the building module 40 is used for dividing the priority levels according to the scoring results and building the Parkinson correlation model.

According to the embodiment of the present invention, preferably, the establishing the parkinson association model by dividing the priority levels according to the scoring results includes:

dividing the scoring result into a plurality of score areas from high to low by adopting a dividing algorithm;

giving confidence level to each score area according to the scores;

and constructing a Parkinson association model according to the confidence level.

According to the embodiment of the present invention, preferably, the prioritizing according to the scoring result, and after the parkinson association model is constructed, the method further includes:

building an analysis framework with the PHP-based web framework Laravel;

performing an evaluation analysis on the first genetic data and the second genetic data in the analysis framework in combination with 63 pieces of software or algorithms.

receiving a genomics file uploaded by a user;

and evaluating the reliability of the genes in the genomics file related to the Parkinson's disease according to a Parkinson association model.
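The way the four modules of FIG. 8 hand data to one another can be sketched in Python as a simple pipeline. The class name `AssociationModelBuilder` and the callable-based design are assumptions for illustration; the application describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AssociationModelBuilder:
    """Chains the four modules; each module is supplied as a callable."""
    acquire: Callable[[], Any]       # acquisition module 10
    annotate: Callable[[Any], Any]   # annotation module 20
    score: Callable[[Any], Any]      # scoring module 30
    construct: Callable[[Any], Any]  # construction module 40

    def run(self) -> Any:
        data = self.acquire()
        annotated = self.annotate(data)
        scores = self.score(annotated)
        return self.construct(scores)


# Toy stand-ins that only demonstrate the data flow between modules.
builder = AssociationModelBuilder(
    acquire=lambda: [1, 2],
    annotate=lambda data: [x * 2 for x in data],
    score=lambda annotated: sum(annotated),
    construct=lambda total: {"total": total},
)
result = builder.run()
```

Passing the modules in as callables keeps each stage replaceable, which matches the statement that the modules may run on one computing device or be distributed across several.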

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A construction method of a Parkinson disease genomics correlation model is characterized by comprising the following steps:

acquiring first genetic data and second genetic data;

annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;

scoring the first annotated data and the second genetic data according to a preset scoring rule;

and dividing the priority level according to the scoring result to construct a Parkinson association model.

2. The construction method according to claim 1, wherein acquiring the first genetic data and the second genetic data comprises:

acquiring literature data in PubMed and gene data submitted by a user;

3. The construction method according to claim 1, wherein annotating the first genetic data and the second genetic data, resulting in first annotated data and second annotated data comprises:

4. The construction method according to claim 1, wherein scoring the annotated data and the second genetic data according to a preset scoring rule comprises:

5. The construction method of claim 1, wherein the prioritizing according to the scoring results and constructing the parkinson correlation model comprises:

giving confidence level to each score area according to the scores;

6. The construction method according to claim 1, wherein the prioritizing according to the scoring results further comprises, after constructing the parkinson correlation model:

adopting a PHP-based web framework Laravel to build an online database and an analysis framework;

7. The construction method according to claim 1, wherein the prioritizing according to the scoring results further comprises, after constructing the parkinson correlation model:

receiving a genomics file uploaded by a user;

8. A device for constructing a Parkinson disease genomics correlation model is characterized by comprising the following components:

an acquisition module for acquiring first genetic data and second genetic data;

the annotation module is used for annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;

a scoring module for scoring the first annotated data and the second genetic data according to a preset scoring rule;

and the construction module is used for dividing priority levels according to the scoring results and constructing the Parkinson association model.

9. A server, comprising: the method of constructing a parkinsonism genomics linked model of any one of claims 1 to 7.

10. A storage medium, comprising: the method of constructing a parkinsonism genomics linked model of any one of claims 1 to 7.