CN111192626A - Construction method, device, server and storage medium of Parkinson disease genomics association model - Google Patents

Construction method, device, server and storage medium of Parkinson disease genomics association model

Info

Publication number
CN111192626A
Authority
CN
China
Prior art keywords
data
parkinson
genetic data
scoring
genetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911424932.8A
Other languages
Chinese (zh)
Other versions
CN111192626B (en)
Inventor
赵贵虎
李津臣
李滨
唐北沙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangya Hospital of Central South University
Original Assignee
Xiangya Hospital of Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangya Hospital of Central South University
Priority to CN201911424932.8A
Publication of CN111192626A
Application granted
Publication of CN111192626B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Physiology (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application discloses a method for constructing a Parkinson's disease genomics association model. The method comprises the following steps: acquiring first genetic data and second genetic data; annotating the first genetic data to obtain annotated data; scoring the annotated data and the second genetic data according to a preset scoring rule; and dividing priority levels according to the scoring result to construct a Parkinson association model. The application solves the technical problem that the association between Parkinson's disease and genes cannot be judged simply and quickly.

Description

Construction method, device, server and storage medium of Parkinson disease genomics association model
Technical Field
The application relates to the field of model construction, in particular to a construction method, a device, a server and a storage medium of a Parkinson disease genomics association model.
Background
Parkinson's disease (PD), also known as parkinsonism, is the second most common neurodegenerative disease after dementia. It occurs mostly in middle-aged and elderly people, its incidence increases with age, and it is widely distributed among different ethnic groups worldwide.
Epidemiological investigations show that the incidence of PD in developed countries is about 8-18 per 100,000 per year. The prevalence in the general population is about 0.3%, and reaches 1% and 3% in people over 60 and over 80 years old, respectively. An epidemiological paper published in The Lancet early this century reported a prevalence of 1.7% among people over 65 years old in China. Clinically, Parkinson's disease is characterized by four cardinal features: resting tremor, muscular rigidity, bradykinesia, and postural and gait abnormalities. It can also be accompanied by various non-motor symptoms (NMS) such as hyposmia, constipation, depression, and sleep disorders. Pathologically, it manifests mainly as degeneration and loss of dopamine (DA) neurons in the midbrain; the accumulation of α-synuclein and the formation of Lewy bodies are also important manifestations of PD. PD progresses slowly, and the condition gradually worsens as the disease develops, is gradually aggravated until the long-term self-organizing delay of PD is gradually delayed, PD is still treated by high-aged PD, PD patients are seriously treated by high-aged PD, PD is proved by high-aged patients with high-aged PD, PD is proved to be treated by high-aged people, and the clinical diagnosis about PD is proved by about 94.
Currently, the pathogenesis of PD is not clear. Research shows that genetic factors, environmental factors, and aging act together to cause the disease. Approximately 10%-15% of PD patients have a family history. Since the first PD pathogenic gene, PARK1 (SNCA), was identified in a Parkinson's disease pedigree in 1997, the role of genetic factors in PD has been highlighted. With the rapid development of high-throughput sequencing technology and bioinformatics analysis methods, 20 PD pathogenic genes have so far been successfully cloned by combining complementary methods such as linkage and association analysis, homozygosity mapping, whole exome sequencing, and whole genome sequencing. Of these, 10 genes are related to autosomal dominant hereditary Parkinson's disease (SNCA, UCHL1, LRRK2, HTRA2, GIGYF2, VPS35, EIF4G1, DNAJC13, CHCHD2 and TMEM230); 9 genes are related to autosomal recessive hereditary Parkinson's disease (PRKN, DJ-1, PINK1, ATP13A2, PLA2G6, FBXO7, DNAJC6, SYNJ1 and VPS13C); and 1 gene is related to X-linked hereditary Parkinson's disease (RAB39B). Meanwhile, as multiple genome-wide association studies (GWAS) related to PD have been carried out, the discovery of more than 20 susceptibility genes and loci has provided a large amount of data for explaining the heritability of PD from the perspective of population genetics.
According to incomplete statistics, more than 3000 SCI-indexed publications have so far reported rare variants, common variants, and copy number variations of PD pathogenic genes found in families or sporadic cases. In addition, with the development of epigenetics, DNA methylation has been shown to have a significant impact on gene translation and expression. Meanwhile, a number of publications report studies that search for pathogenic genes related to Parkinson's disease through differential gene expression in population cohorts. However, the inventors found that this research lacks a platform for comprehensive curation and data analysis, so the association between Parkinson's disease and genes cannot be judged simply and quickly.
For the problem in the related art that the association between Parkinson's disease and genes cannot be judged simply and quickly, no effective solution has yet been proposed.
Disclosure of Invention
The main purpose of the application is to provide a construction method, a device, a server and a storage medium of a Parkinson's disease genomics association model, so as to solve the problem that the association between Parkinson's disease and genes cannot be judged simply and quickly.
In order to achieve the above object, according to one aspect of the present application, there is provided a method for constructing a Parkinson's disease genomics association model.
The method for constructing the Parkinson's disease genomics association model comprises the following steps: acquiring first genetic data and second genetic data; annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data; scoring the first annotated data and the second genetic data according to a preset scoring rule; and dividing priority levels according to the scoring result to construct a Parkinson association model.
Further, acquiring the first genetic data and the second genetic data comprises: acquiring literature data from PubMed and gene data submitted by a user; cleaning, de-noising, and homogenizing the literature data to obtain rare variant data; and taking the gene data and the rare variant data as the first genetic data and the genetics-related data as the second genetic data.
Further, annotating the first genetic data and the second genetic data to obtain the first annotated data and the second annotated data comprises: annotating the rare variant data using 23 software tools in combination with algorithms to obtain the first annotated data; and annotating the second genetic data using 63 software tools to obtain the second annotated data.
Further, scoring the annotated data and the second genetic data according to a preset scoring rule comprises: identifying genomics data information of the first annotated data and the second genetic data; determining, from a preset score-data table and according to the genomics data information, a first score for a single occurrence of that type of gene or variant in a single document; counting the number of occurrences of that type of gene or variant in a single document; and inputting the genomics data information, the number of occurrences, and the first score into a scoring model to obtain the total score of that type of gene or variant.
Further, dividing priority levels according to the scoring result to construct the Parkinson association model comprises: dividing the scoring results into a plurality of score areas from high to low using a division algorithm; assigning a confidence level to each score area according to its scores; and constructing the Parkinson association model according to the confidence levels.
Further, after dividing priority levels according to the scoring result and constructing the Parkinson association model, the method further comprises: building an online database and an analysis framework using the PHP-based web framework Laravel; and performing evaluation and analysis on the first genetic data and the second genetic data in the analysis framework in combination with 63 software tools or algorithms.
Further, after dividing priority levels according to the scoring result and constructing the Parkinson association model, the method further comprises: receiving a genomics file uploaded by a user; and evaluating, according to the Parkinson association model, the reliability with which the genes in the genomics file are related to Parkinson's disease.
In order to achieve the above object, according to another aspect of the present application, there is provided a device for constructing a Parkinson's disease genomics association model.
The device for constructing the Parkinson's disease genomics association model comprises: an acquisition module for acquiring first genetic data and second genetic data; an annotation module for annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data; a scoring module for scoring the first annotated data and the second genetic data according to a preset scoring rule; and a construction module for dividing priority levels according to the scoring result and constructing the Parkinson association model.
In order to achieve the above object, according to another aspect of the present application, there is provided a server.
The server according to the present application implements the above method for constructing the Parkinson's disease genomics association model.
In order to achieve the above object, according to another aspect of the present application, there is provided a storage medium.
The storage medium according to the present application stores the above method for constructing the Parkinson's disease genomics association model.
In the embodiments of the application, the Parkinson association model is constructed by acquiring first genetic data and second genetic data; annotating the first genetic data to obtain annotated data; scoring the annotated data and the second genetic data according to a preset scoring rule; and dividing priority levels according to the scoring result. The association between Parkinson's disease and a gene can then be judged through the Parkinson association model, which achieves the technical effect of judging that association simply and quickly and solves the technical problem that it could not previously be judged simply and quickly.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a first embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a second embodiment of the present application;
FIG. 3 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a third embodiment of the present application;
FIG. 4 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a fourth embodiment of the present application;
FIG. 5 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a fifth embodiment of the present application;
FIG. 6 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a sixth embodiment of the present application;
FIG. 7 is a schematic flow chart of a method for constructing a Parkinson's disease genomics association model according to a seventh embodiment of the present application;
FIG. 8 is a schematic structural diagram of a device for constructing a Parkinson's disease genomics association model according to a first embodiment of the present application;
FIG. 9 is a table diagram according to a preferred embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the invention and its embodiments and are not intended to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meanings of these terms in the present invention can be understood by those skilled in the art as appropriate.
Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to specific situations.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
According to an embodiment of the present invention, there is provided a method for constructing a Parkinson's disease genomics association model. As shown in FIG. 1, the method includes steps S100 to S106 as follows:
Step S100, acquiring first genetic data and second genetic data;
According to an embodiment of the present invention, preferably, as shown in FIG. 2, acquiring the first genetic data and the second genetic data includes:
Step S200, acquiring literature data from PubMed and gene data submitted by a user;
Step S202, cleaning, de-noising, and homogenizing the literature data to obtain rare variant data;
Step S204, taking the gene data and the rare variant data as first genetic data, and taking the genetics-related data as second genetic data.
PubMed is a literature database that records the relevant data; a connection to PubMed is established through an interface so that literature data can be acquired over it. The gene data is uploaded to the server by a user through an app or computer software and then stored by the server. The literature data contains many types of data, a large part of which is not needed by the invention, and the data types are mutually incompatible. The acquired literature data is therefore cleaned, de-noised, and homogenized to obtain rare variant data, so that the acquired data is more accurate and convenient to use in the next step.
In this embodiment, the rare variant data is the first genetic data, which needs to be annotated using software in combination with algorithms. That is, data that must be annotated before it is scored is classified as first genetic data and queued for annotation and scoring.
In this embodiment, the gene data submitted by the user is the second genetic data; this data is scored without annotation and can be used directly in the subsequent level division.
The second genetic data includes, but is not limited to, rare variants, single nucleotide polymorphisms, copy number variations, differentially expressed genes, DNA methylation genes, and the like.
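The cleaning, de-noising, and homogenizing of step S202 can be pictured with a minimal Python sketch. The patent does not specify the record layout or the cleaning rules, so the `VariantRecord` fields, the `clean_records` helper, and the normalization choices (upper-casing gene symbols, stripping whitespace, dropping incomplete or duplicate rows) are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    """Hypothetical homogenized record; the field names are assumptions."""
    gene: str
    variant: str  # free-text variant description taken from the article
    pmid: str     # PubMed identifier of the source article


def clean_records(raw_rows):
    """Clean, de-noise, and homogenize raw literature rows (illustrative rules)."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        # Homogenization: one spelling per gene symbol, no stray whitespace.
        gene = (row.get("gene") or "").strip().upper()
        variant = (row.get("variant") or "").strip().replace(" ", "")
        pmid = str(row.get("pmid") or "").strip()
        # Noise reduction: skip rows that lack any required field.
        if not (gene and variant and pmid):
            continue
        record = VariantRecord(gene, variant, pmid)
        # Cleaning: skip exact duplicates of an already accepted record.
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned
```

With rows that differ only in spelling or type (for example `" lrrk2 "` versus `"LRRK2"`, or a numeric versus a string PMID), the sketch keeps a single normalized record.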
Step S102, annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;
According to an embodiment of the present invention, preferably, as shown in FIG. 3, annotating the first genetic data and the second genetic data to obtain the first annotated data and the second annotated data includes:
Step S300, annotating the rare variant data using 23 software tools in combination with algorithms to obtain first annotated data;
Step S302, annotating the second genetic data using 63 software tools to obtain second annotated data.
The 23 software tools are: SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, MetaSVM, MetaLR, VEST3, M-CAP, CADD, GERP++, DANN, FATHMM-MKL, Eigen, GenoCanyon, fitCons, phyloP, phastCons, SiPhy, REVEL, and ReVe. These tools annotate the rare variant data with the variant frequencies in different populations, the amino acid changes caused in different transcripts, the deleteriousness predicted by the different prediction tools, and so on. When a subsequent model is built, or when a person calls and uses the data directly, the annotated data can then be used as-is, which improves both research efficiency and model construction efficiency.
Similarly, 63 software tools are used to annotate the gene data, including basic gene information (UniProt, NCBI Gene, BioSystems); gene intolerance to mutation (RVIS, LoFtool); protein interactions (InBio Map); differential tissue expression of genes (GTEx); drug-gene interaction information (DGIdb); and so on. When a subsequent model is built, or when a person calls and uses the data directly, the annotated data can then be used as-is, which improves both research efficiency and model construction efficiency.
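As a rough illustration of the annotation step, the sketch below attaches predictor scores to a variant and counts how many predictors flag it as deleterious. It is not the patent's implementation: the in-memory lookup tables, the helper names, and the cutoffs used in the usage note are assumptions, and real tools are normally run as separate programs or queried from precomputed score files.

```python
def annotate_variant(variant_key, predictor_tables):
    """Attach each predictor's score to a variant; missing scores stay None.

    predictor_tables maps a tool name to a {variant_key: score} lookup.
    These in-memory tables are hypothetical stand-ins for real tool output.
    """
    annotation = {"variant": variant_key}
    for tool, table in predictor_tables.items():
        annotation[tool] = table.get(variant_key)
    return annotation


def count_deleterious(annotation, rules):
    """Count the predictors whose rule flags the variant as deleterious.

    rules maps a tool name to a function score -> bool. The cutoffs inside
    those functions are supplied by the caller and are not from the patent.
    """
    votes = 0
    for tool, is_deleterious in rules.items():
        score = annotation.get(tool)
        if score is not None and is_deleterious(score):
            votes += 1
    return votes
```

For example, with `rules = {"SIFT": lambda s: s < 0.05, "CADD": lambda s: s >= 20}` (commonly used cutoffs, assumed here), a variant with a SIFT score of 0.01 and a CADD score of 25 receives two deleterious votes, while a variant with no SIFT score and a CADD score of 3 receives none.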
Step S104, scoring the annotated data and the second genetic data according to a preset scoring rule;
According to an embodiment of the present invention, preferably, as shown in FIG. 4, scoring the first annotated data and the second genetic data according to a preset scoring rule includes:
Step S400, identifying genomics data information of the first annotated data and the second genetic data;
Step S402, determining, from a preset score-data table and according to the genomics data information, a first score for a single occurrence of that type of gene or variant in a single document;
Step S404, counting the number of occurrences of that type of gene or variant in a single document;
Step S406, inputting the genomics data information, the number of occurrences, and the first score into a scoring model to obtain the total score of that type of gene or variant.
An identification algorithm identifies the genetic type and the mutation sites in the annotated rare variant data and in the gene data submitted by the user. For example, rare variants can be divided into four types: loss-of-function (LoF) variants, deleterious variants, tolerated missense variants, and other variants. Once the algorithm has run, the type of the data being processed is known.
After identification, the score of a given variant or gene of that type can be looked up from the identification result in the score-data table shown in FIG. 9 (a partial table, for illustration only). The same type of gene or variant is likely to appear many times within one document, which clearly affects the final total score, so the number of occurrences is counted per document. Finally, the number of occurrences and the mutation site (the position at which the mutation occurs) are input into the scoring model, which calculates the total score of a given type of variant or gene at a given position. This score reflects, to some extent, the probability of Parkinson's disease.
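Steps S402 to S406 can be sketched as a table lookup followed by an accumulation. The patent does not disclose the scoring model or reproduce the values of FIG. 9 here, so the `SCORE_TABLE` values and the "score times number of occurrences, summed per gene" rule below are assumptions chosen only to make the data flow concrete.

```python
from collections import Counter

# Hypothetical per-occurrence scores by variant type. The actual values of
# the score-data table in FIG. 9 are not reproduced in the text.
SCORE_TABLE = {
    "LoF": 4.0,
    "deleterious": 3.0,
    "tolerated_missense": 1.0,
    "other": 0.5,
}


def total_scores(observations, score_table=SCORE_TABLE):
    """Sum per-occurrence scores into a total score per gene.

    observations is an iterable of (gene, variant_type) pairs, one pair per
    occurrence in the literature. The additive rule is an assumed stand-in
    for the patent's unspecified scoring model.
    """
    counts = Counter(observations)  # number of occurrences per (gene, type)
    totals = {}
    for (gene, variant_type), n in counts.items():
        first_score = score_table[variant_type]
        totals[gene] = totals.get(gene, 0.0) + first_score * n
    return totals
```

Under these assumed values, two LoF occurrences and one "other" occurrence in the same gene give a total of 4.0 × 2 + 0.5 = 8.5.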
Step S106, dividing priority levels according to the scoring result to construct a Parkinson association model.
According to an embodiment of the present invention, preferably, as shown in FIG. 5, dividing priority levels according to the scoring result to construct the Parkinson association model includes:
Step S500, dividing the scoring results into a plurality of score areas from high to low using a division algorithm;
Step S502, assigning a confidence level to each score area according to its scores;
Step S504, constructing the Parkinson association model according to the confidence levels.
The higher the score, the greater the probability of disease. Based on this logic, the scores can be divided into five areas: a high area, a middle-high area, a middle area, a middle-low area, and a low area. Confidence levels are then associated with the five score areas: the high area corresponds to the highest confidence of association with Parkinson's disease, and the confidence level decreases from one area to the next. This yields the Parkinson association model, with which a person can check whether the genes they provide are associated with the disease and thereby judge the probability of Parkinson's disease.
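A minimal sketch of steps S500 to S504 follows. The patent does not name the division algorithm, so rank-based splitting into equally sized groups is an assumption, as are the level labels (in particular "middle", which the text above does not list explicitly among its five areas).

```python
# Level labels from high to low; "middle" is an assumed fifth label.
LEVELS = ["high", "middle-high", "middle", "middle-low", "low"]


def assign_levels(scores, levels=LEVELS):
    """Map each gene to a confidence level based on the rank of its score.

    scores is a {gene: total_score} dict. Genes are sorted from the highest
    score to the lowest and split into len(levels) groups of near-equal size.
    This rank-based split is an illustrative choice, not the patent's method.
    """
    # Sort by descending score; ties are broken by gene name for determinism.
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    n = len(ranked)
    result = {}
    for rank, gene in enumerate(ranked):
        index = rank * len(levels) // n
        result[gene] = levels[index]
    return result
```

One consequence of splitting by rank is that genes with identical scores can land in different levels; fixed score thresholds would avoid that, at the cost of having to choose the thresholds.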
According to an embodiment of the present invention, preferably, as shown in FIG. 6, after dividing priority levels according to the scoring result and constructing the Parkinson association model, the method further includes:
Step S600, building an analysis framework using the PHP-based web framework Laravel;
Step S602, performing evaluation and analysis on the first genetic data and the second genetic data in the analysis framework in combination with 63 software tools or algorithms.
An online analysis framework is built with the PHP-based web framework Laravel to enable online analysis of data, and more than 60 software tools or algorithms are used to evaluate and analyze the frequency of mutation sites in different ethnic groups, gene tolerance, gene-gene interactions, protein-protein interactions, signaling pathways, and the like. Furthermore, by integrating gene frequency databases, pathogenicity analysis software, gene expression models, and other related data analysis software, the platform offers users a more diverse set of software services, which improves both user experience and research efficiency.
According to an embodiment of the present invention, preferably, as shown in FIG. 7, after dividing priority levels according to the scoring result and constructing the Parkinson association model, the method further includes:
Step S700, receiving a genomics file uploaded by a user;
Step S702, evaluating, according to the Parkinson association model, the reliability with which the genes in the genomics file are related to Parkinson's disease.
A user only needs to open the website on a computer or mobile phone and upload a genomics data file in VCF4 format; the reliability with which the provided genes are related to Parkinson's disease is then evaluated automatically. This achieves the purpose of judging the association between Parkinson's disease and genes simply and quickly.
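The evaluation of an uploaded file can be sketched as parsing the VCF text and looking each gene up in the association model. The sketch assumes the file has already been annotated with a gene symbol in a hypothetical `GENE=` INFO key (this key is not part of the VCF specification) and that the model is a simple `{gene: level}` dict; neither detail comes from the patent.

```python
def evaluate_vcf(vcf_text, gene_levels):
    """Look up the confidence level of each variant's gene in the model.

    vcf_text is the content of a VCF file. gene_levels is a {gene: level}
    dict standing in for the Parkinson association model (an assumption).
    """
    results = []
    for line in vcf_text.splitlines():
        # Skip blank lines, meta-information lines (##), and the header (#).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:  # the eight fixed VCF columns are required
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # INFO is a semicolon-separated list of KEY=VALUE pairs and flags.
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        gene = info.get("GENE")  # hypothetical annotation key
        results.append(
            {
                "variant": f"{chrom}:{pos}{ref}>{alt}",
                "gene": gene,
                "level": gene_levels.get(gene, "not in model"),
            }
        )
    return results
```

Variants whose gene is missing from the model, or that carry no gene annotation at all, are reported as "not in model" rather than dropped, so the user can see every uploaded record in the output.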
From the above description, it can be seen that the present invention achieves the following technical effects:
In the embodiments of the application, the Parkinson association model is constructed by acquiring first genetic data and second genetic data; annotating the first genetic data to obtain annotated data; scoring the annotated data and the second genetic data according to a preset scoring rule; and dividing priority levels according to the scoring result. The association between Parkinson's disease and a gene can then be judged through the Parkinson association model, which achieves the technical effect of judging that association simply and quickly and solves the technical problem that it could not previously be judged simply and quickly.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to an embodiment of the present invention, there is also provided an apparatus for implementing the method for constructing a parkinson's disease genomics association model, as shown in fig. 8, the apparatus including:
an acquisition module 10 for acquiring first genetic data and second genetic data;
according to an embodiment of the present invention, preferably, the acquiring the first genetic data and the second genetic data includes:
acquiring literature data in PubMed and gene data submitted by a user;
cleaning, denoising and homogenizing the literature data to obtain rare variation data;
the genetic data and rare variation data are taken as first genetic data and the genetically related data are taken as second genetic data.
PubMed is a literature database that records relevant data; a connection to PubMed is established through an interface, so that literature data can be acquired over that connection. The gene data is uploaded to the server by a user through an app or computer software and then stored by the server. The literature data contains many types of data, a large part of which is not required by the invention, and the data types are mutually incompatible. The acquired literature data is therefore cleaned, denoised and homogenized to obtain rare variant data, which makes the data more accurate and easier to use in the next step.
In this embodiment, the rare variant data is the first genetic data, which needs to be annotated using software in combination with algorithms; that is, data that must be annotated before it is scored is classified as first genetic data and awaits annotation and scoring.
In this embodiment, the gene data submitted by the user is the second genetic data, which is scored without annotation and can be used directly in the subsequent level division.
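The cleaning, denoising and homogenization step can be illustrated with a minimal Python sketch. The patent does not specify the record layout or the cleaning rules, so the field names (`gene`, `chrom`, `pos`, `ref`, `alt`, `pmid`), the `VariantRecord` type and the `clean_records` function are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantRecord:
    # Hypothetical record layout; the patent does not define one.
    gene: str
    chrom: str
    pos: int
    ref: str
    alt: str
    pmid: str


REQUIRED = ("gene", "chrom", "pos", "ref", "alt", "pmid")


def clean_records(raw_rows):
    """Drop incomplete rows, normalize field formats, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        # Denoising: skip rows that lack any required field.
        if any(not str(row.get(key, "")).strip() for key in REQUIRED):
            continue
        try:
            pos = int(str(row["pos"]).replace(",", ""))
        except ValueError:
            continue  # position is not a number
        # Homogenization: one spelling per chromosome, gene symbol and allele.
        chrom = str(row["chrom"]).strip()
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        record = VariantRecord(
            gene=str(row["gene"]).strip().upper(),  # simplification: upper-case symbols
            chrom=chrom.upper(),
            pos=pos,
            ref=str(row["ref"]).strip().upper(),
            alt=str(row["alt"]).strip().upper(),
            pmid=str(row["pmid"]).strip(),
        )
        # Cleaning: keep only the first copy of identical records.
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned
```

Upper-casing gene symbols is a simplification chosen for the sketch; a real pipeline would more likely map symbols to an official nomenclature.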
The second genetic data includes, but is not limited to, rare variations, single nucleotide polymorphisms, copy number variations, differentially expressed genes, DNA methylation genes, and the like.
An annotating module 20 for annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;
according to an embodiment of the present invention, preferably, annotating the first genetic data and the second genetic data, and obtaining the first annotated data and the second annotated data includes:
annotating the rare variant data using 23 software tools in combination with algorithms to obtain first annotated data;
and annotating the second genetic data with 63 software tools to obtain second annotated data.
The 23 software tools are: SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, MetaSVM, MetaLR, VEST3, M-CAP, CADD, GERP++, DANN, FATHMM-MKL, Eigen, GenoCanyon, fitCons, PhyloP, phastCons, SiPhy, REVEL, and ReVe. These tools annotate the rare variant data with allele frequencies in different populations, the amino acid changes caused in different transcripts, the harmfulness predicted by the different prediction tools, and so on. The annotated data can then be used directly when a model is subsequently built or when a person retrieves the data, which improves both research efficiency and model-building efficiency.
Similarly, 63 software tools are used to annotate the gene data, including basic gene information (UniProt, NCBI Gene, BioSystems); gene intolerance to mutation (RVIS, LoFtool); protein interactions (InBio Map); differential tissue expression of genes (GTEx); drug-gene interaction information (DGIdb); and so on. As with the variant annotations, the annotated gene data can be used directly in subsequent model building or retrieval, which improves both research efficiency and model-building efficiency.
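The idea of attaching the output of several annotation sources to one record can be sketched as follows. This is only an illustrative pattern, not the patent's implementation: the `make_lookup` and `annotate` helpers are hypothetical, and the in-memory tables stand in for the real tools, whose interfaces are not described in the source.

```python
def make_lookup(table, key_fields):
    """Build an annotator that looks a record up in an in-memory table (a stand-in for a real tool)."""
    def lookup(record):
        return table[tuple(record[field] for field in key_fields)]
    return lookup


def annotate(record, annotators):
    """Return a copy of `record` with one entry per annotation source."""
    annotations = {}
    for name, annotator in annotators.items():
        try:
            annotations[name] = annotator(record)
        except KeyError:
            annotations[name] = None  # this source has no entry for the record
    annotated = dict(record)
    annotated["annotations"] = annotations
    return annotated
```

Each source is kept under its own key, so a later scoring step can read whichever predictions it needs without re-running the tools.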
A scoring module 30 for scoring the annotated data and the second genetic data according to a preset scoring rule;
according to an embodiment of the present invention, preferably, scoring the first annotated data and the second genetic data according to a preset scoring rule comprises:
identifying genomic data information for the first annotated data and the second genetic data;
determining, from the genomic data information and a preset score-data table, a first score for a single occurrence of that type of gene or variation in a single document;
counting the number of occurrences of that type of gene or variation in a single document;
and inputting the genomics data information, the occurrence times and the first score into a scoring model to obtain the total score of the type of the gene or the variation.
Using an identification algorithm, the genetic type and mutation sites in the annotated rare variation data and in the gene data submitted by the user can be identified. For example, rare variations can be divided into four types: loss-of-function (LoF) variation, harmful variation, tolerated missense mutation and other variation; once the algorithm has processed a rare variation, the type to which it belongs is determined.
After identification, the score of a given type of mutation or gene can be looked up from the identification result in the score-data table shown in Fig. 9 (partial, for explanation only). Within one document, the same type of gene or mutation is likely to appear many times, which affects the final total score, so the number of occurrences is counted per document. Finally, the number of occurrences and the mutation site (the position at which the mutation occurs) are input into a scoring model to calculate the total score of a given type of mutation or gene at a given position; this score reflects, to some extent, the probability of Parkinson's disease.
A construction module 40 for dividing priority levels according to the scoring results and constructing the Parkinson association model.
According to the embodiment of the present invention, preferably, dividing the priority levels according to the scoring results and constructing the Parkinson association model includes:
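The scoring step can be sketched in Python under explicit assumptions. The patent does not disclose the values in the score-data table of Fig. 9 or the formula of the scoring model, so the `FIRST_SCORE` values below are placeholders and the rule "total score = first score multiplied by the number of occurrences, summed over documents" is an assumption chosen only to make the example concrete.

```python
from collections import Counter, defaultdict

# Placeholder score-data table: score for ONE occurrence of a category in ONE document.
# The values are invented for illustration and are not taken from Fig. 9.
FIRST_SCORE = {
    "LoF": 4.0,
    "harmful": 2.0,
    "tolerated_missense": 1.0,
    "other": 0.5,
}


def total_scores(observations, first_score=FIRST_SCORE):
    """Compute a total score per (gene, category).

    `observations` is an iterable of (gene, category, document_id) tuples,
    one tuple per reported occurrence.
    """
    # Count how often each gene/category pair occurs in each single document.
    per_document = Counter(observations)
    totals = defaultdict(float)
    for (gene, category, _document_id), occurrences in per_document.items():
        # Assumed scoring model: first score times occurrence count, summed over documents.
        totals[(gene, category)] += first_score[category] * occurrences
    return dict(totals)
```

A different scoring model, for example one that caps the contribution of a single document, would change only the line inside the loop.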
dividing the scoring result into a plurality of score areas from high to low by adopting a dividing algorithm;
giving confidence level to each score area according to the scores;
and constructing a Parkinson association model according to the confidence level.
The higher the score, the greater the probability of disease. On this basis, five score areas can be defined: a high area, a middle-high area, a middle area, a middle-low area and a low area. A confidence level is then associated with each of the five score areas: the high area corresponds to high confidence of association with Parkinson's disease, and the confidence level decreases in order across the five areas. The Parkinson association model is constructed in this way, so that a person can check whether a gene they provide is associated with the disease and thereby judge the probability of Parkinson's disease.
According to the embodiment of the present invention, preferably, after dividing the priority levels according to the scoring results and constructing the Parkinson association model, the method further includes:
building an analysis framework using the PHP-based web framework Laravel;
performing an evaluation analysis on the first genetic data and the second genetic data in the analysis framework in combination with 63 software tools or algorithms.
An online analysis framework is built with the PHP-based web framework Laravel to analyze data online, and more than 60 software tools or algorithms are used to evaluate and analyze the frequency of mutation sites in different populations, gene tolerance, gene-gene interactions, protein-protein interactions, signaling pathways, and so on. Furthermore, integrating a gene frequency database, pathogenicity analysis software, gene expression models and other related data-analysis software offers users a more diverse set of services, improving both user experience and research efficiency.
According to the embodiment of the present invention, preferably, after dividing the priority levels according to the scoring results and constructing the Parkinson association model, the method further includes:
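The division of scores into five areas can be sketched as follows. The patent names a "dividing algorithm" without describing it, so equal-width bins between the lowest and highest score are an assumption, as are the function name `assign_levels` and the convention that level 1 denotes the highest-scoring, highest-confidence area.

```python
def assign_levels(scores, n_levels=5):
    """Map each item to a level from 1 (highest scores) to n_levels (lowest scores)."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    width = (highest - lowest) / n_levels
    levels = {}
    for item, score in scores.items():
        if width == 0:
            # All scores are equal, so there is only one area.
            levels[item] = 1
            continue
        # Bin index 0 is the lowest area; the maximum score is clamped into the top bin.
        index = min(int((score - lowest) / width), n_levels - 1)
        levels[item] = n_levels - index
    return levels
```

Quantile-based or expert-chosen thresholds would be equally plausible readings of the text; only the bin-index calculation would differ.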
receiving a genomics file uploaded by a user;
and evaluating the reliability of the genes in the genomics file related to the Parkinson's disease according to a Parkinson association model.
Users only need to open the website on a computer or mobile phone and upload a genomics data file in VCF4 format; the reliability with which the genes in the file are associated with Parkinson's disease is then evaluated automatically. This achieves the purpose of simply and quickly judging the association between Parkinson's disease and genes.
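How an uploaded file might be checked against the model can be sketched in Python. Standard VCF data lines are tab-separated with the columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, but the format does not mandate a gene field; the `GENE=` key in the INFO column, the `gene_levels` mapping and both function names are therefore assumptions for illustration.

```python
def genes_in_vcf(lines, gene_key="GENE"):
    """Collect gene symbols from the INFO column of VCF data lines.

    `gene_key` is a hypothetical INFO key; real files carry gene names
    under whatever key their annotation tool writes.
    """
    genes = set()
    for line in lines:
        # Skip blank lines and header lines (both "##" meta lines and the "#CHROM" line).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed data line
        for entry in fields[7].split(";"):
            key, _, value = entry.partition("=")
            if key == gene_key and value:
                genes.add(value)
    return genes


def evaluate_file(lines, gene_levels):
    """Report the model's confidence level for every gene found in the file."""
    return {
        gene: gene_levels.get(gene, "not in model")
        for gene in sorted(genes_in_vcf(lines))
    }
```

In a web deployment, `lines` would be the decoded contents of the uploaded file and `gene_levels` the level assignment produced when the model was built.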
From the above description, it can be seen that the present invention achieves the following technical effects:
in the embodiment of the application, a method for constructing a Parkinson association model is adopted, and first genetic data and second genetic data are obtained; annotating the first genetic data to obtain annotated data; scoring the annotated data and the second genetic data according to a preset scoring rule; dividing priority levels according to the scoring results to construct a Parkinson association model; the purpose of judging the correlation between the Parkinson's disease and the gene through the Parkinson correlation model is achieved, so that the technical effect of simply and quickly realizing the correlation judgment between the Parkinson's disease and the gene is realized, and the technical problem that the correlation judgment between the Parkinson's disease and the gene cannot be simply and quickly realized is solved.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A construction method of a Parkinson disease genomics correlation model is characterized by comprising the following steps:
acquiring first genetic data and second genetic data;
annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;
scoring the first annotated data and the second genetic data according to a preset scoring rule;
and dividing priority levels according to the scoring results to construct a Parkinson association model.
2. The construction method according to claim 1, wherein acquiring the first genetic data and the second genetic data comprises:
acquiring literature data in PubMed and gene data submitted by a user;
cleaning, denoising and homogenizing the literature data to obtain rare variation data;
the genetic data and rare variation data are taken as first genetic data and the genetically related data are taken as second genetic data.
3. The construction method according to claim 1, wherein annotating the first genetic data and the second genetic data, resulting in first annotated data and second annotated data comprises:
annotating the rare variant data using 23 software tools in combination with algorithms to obtain first annotated data;
and annotating the second genetic data with 63 software tools to obtain second annotated data.
4. The construction method according to claim 1, wherein scoring the annotated data and the second genetic data according to a preset scoring rule comprises:
identifying genomic data information for the first annotated data and the second genetic data;
determining, from the genomic data information and a preset score-data table, a first score for a single occurrence of that type of gene or variation in a single document;
counting the number of occurrences of that type of gene or variation in a single document;
and inputting the genomics data information, the occurrence times and the first score into a scoring model to obtain the total score of the type of the gene or the variation.
5. The construction method according to claim 1, wherein dividing the priority levels according to the scoring results and constructing the Parkinson association model comprises:
dividing the scoring result into a plurality of score areas from high to low by adopting a dividing algorithm;
giving confidence level to each score area according to the scores;
and constructing a Parkinson association model according to the confidence level.
6. The construction method according to claim 1, wherein, after dividing the priority levels according to the scoring results and constructing the Parkinson association model, the method further comprises:
adopting the PHP-based web framework Laravel to build an online database and an analysis framework;
performing an evaluation analysis on the first genetic data and the second genetic data in the analysis framework in combination with 63 pieces of software or algorithms.
7. The construction method according to claim 1, wherein, after dividing the priority levels according to the scoring results and constructing the Parkinson association model, the method further comprises:
receiving a genomics file uploaded by a user;
and evaluating the reliability of the genes in the genomics file related to the Parkinson's disease according to a Parkinson association model.
8. A device for constructing a Parkinson disease genomics correlation model is characterized by comprising the following components:
an acquisition module for acquiring first genetic data and second genetic data;
the annotation module is used for annotating the first genetic data and the second genetic data to obtain first annotated data and second annotated data;
a scoring module for scoring the first annotated data and the second genetic data according to a preset scoring rule;
and the construction module is used for dividing priority levels according to the scoring results and constructing the Parkinson association model.
9. A server, comprising: the method of constructing a parkinsonism genomics linked model of any one of claims 1 to 7.
10. A storage medium, comprising: the method of constructing a parkinsonism genomics linked model of any one of claims 1 to 7.
CN201911424932.8A 2019-12-31 2019-12-31 Construction method, device, server and storage medium of Parkinson disease genomics association model Active CN111192626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911424932.8A CN111192626B (en) 2019-12-31 2019-12-31 Construction method, device, server and storage medium of Parkinson disease genomics association model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911424932.8A CN111192626B (en) 2019-12-31 2019-12-31 Construction method, device, server and storage medium of Parkinson disease genomics association model

Publications (2)

Publication Number Publication Date
CN111192626A true CN111192626A (en) 2020-05-22
CN111192626B CN111192626B (en) 2021-05-28

Family

ID=70710662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911424932.8A Active CN111192626B (en) 2019-12-31 2019-12-31 Construction method, device, server and storage medium of Parkinson disease genomics association model

Country Status (1)

Country Link
CN (1) CN111192626B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114496080A (en) * 2022-01-17 2022-05-13 中国人民解放军总医院第一医学中心 Deafness pathogenicity gene screening method and device, storage medium and server
CN114496072A (en) * 2022-01-17 2022-05-13 北京安琪尔基因医学科技有限公司 Deafness pathogenic analysis grade classification method and device, computer readable storage medium and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169310A (en) * 2017-03-20 2017-09-15 上海基银生物科技有限公司 A kind of genetic test construction of knowledge base method and system
CN109686439A (en) * 2018-12-04 2019-04-26 东莞博奥木华基因科技有限公司 Data analysing method, system and the storage medium of hereditary disease genetic test
CN110570905A (en) * 2019-07-22 2019-12-13 中国人民解放军总医院 method and device for constructing omics data analysis platform and computer equipment

Also Published As

Publication number Publication date
CN111192626B (en) 2021-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Li Bin, Li Jinchen, Zhao Guihu, Tang Beisha
Inventor before: Zhao Guihu, Li Jinchen, Li Bin, Tang Beisha
GR01 Patent grant