CN111128308B - New mutation information knowledge platform for neuropsychiatric diseases - Google Patents

Info

Publication number
CN111128308B
Authority
CN
China
Prior art keywords
data
module
mutation
platform
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911365589.4A
Other languages
Chinese (zh)
Other versions
CN111128308A (en
Inventor
林关宁
王晗
王卫娣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mental Health Center Shanghai Psychological Counselling Training Center
Original Assignee
Shanghai Mental Health Center Shanghai Psychological Counselling Training Center

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Acyclic And Carbocyclic Compounds In Medicinal Compositions (AREA)

Abstract

The invention discloses a new mutation information knowledge platform for neuropsychiatric diseases, which comprises a data acquisition and processing layer, a multi-dimensional data storage layer, a multi-dimensional data integration processing layer and a data visualization and access layer that are in communication connection with one another; the data acquisition and processing layer comprises a multi-dimensional data acquisition module, a data analysis module and a data storage module; the multidimensional data storage layer is a big data platform controlled by a high-performance NoSQL database management system; the multidimensional data integration processing layer comprises a redundancy removal processing module, a feature analysis module and a classification management module; and the data visualization and access layer displays queried data as real-time maps or plots through a web interface. By constructing the new mutation information knowledge platform for neuropsychiatric diseases on top of the big data platform, historical and newly published data can be entered and retrieved, and queried data can be mapped or plotted in real time for users researching neuropsychiatric diseases, which improves the visualization and efficiency of scientific research.

Description

New mutation information knowledge platform for neuropsychiatric diseases
Technical Field
The invention relates to a mutation information processing technology of mental diseases, in particular to a new mutation information knowledge platform of neuropsychiatric diseases.
Background
In addition to inheriting half of each parent's genome, every individual carries a small set of new genetic changes that arise during gametogenesis, called new (de novo) variants (DNVs). These variants have been identified in parent-offspring studies. They range in size from single-nucleotide variants and small insertions and deletions (indels), together referred to as new (de novo) mutations (DNMs), to larger structural variants such as new copy number variations (CNVs), and they have been implicated in various human diseases.
Over the past few years, a large number of DNVs have been discovered by whole exome sequencing (WES) and whole genome sequencing (WGS) and have been explored and analyzed at the gene level, with great success in assessing their contribution to complex diseases. However, it is estimated that up to 95% of genes undergo alternative splicing (AS) to produce multiple transcripts, which increases the diversity of the human transcriptome and proteome, with approximately 4 to 7 transcripts per gene. Transcript expression is highly specific and is usually restricted to certain organs, tissues, or even cell types within the same tissue. Notably, alternative splicing occurs at high frequency in brain tissue and regulates biological processes during neural development, including cell fate determination, neuronal migration, axon guidance and synaptogenesis. At present, no biological knowledge base supports exploration at this transcript level; the main shortcomings of existing knowledge bases are as follows:
1. Because exons are used differently in transcripts of the same gene, a disease mutation may selectively affect only those transcripts whose exons carry the mutation. Furthermore, if certain transcripts are not expressed in a particular developmental stage or tissue, disease mutations affecting these transcripts may not exhibit their functional effects in that stage or tissue. However, no known database or knowledge platform correlates tissue-specific transcripts with disease mutations;
2. Since the brain is one of the tissues richest in AS events, mutations and brain-specific expression associated with brain diseases need to be studied at the level of transcript isoforms. However, because of the tissue specificity of samples, the association between transcripts and DNMs in developmental and neuropsychiatric diseases, such as autism (ASD), schizophrenia (SCZ), early-onset Alzheimer's disease (AD) and congenital heart disease (CHD), has rarely been studied on a large scale.
Therefore, research on neuropsychiatric diseases urgently needs an information platform that offers efficient, fast, one-stop data query and data feature extraction, so as to provide data support for rapid and efficient study of these relationships.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a new mutation information knowledge platform for neuropsychiatric diseases, which can solve the related problems.
The purpose of the invention is realized by adopting the following technical scheme:
A new mutation information knowledge platform for neuropsychiatric diseases, characterized in that: the information knowledge platform comprises a data acquisition and processing layer, a multi-dimensional data storage layer, a multi-dimensional data integration processing layer and a data visualization and access layer which are in communication connection;
the data acquisition and processing layer comprises a multi-dimensional data acquisition module, a data analysis module and a data storage module; the multi-dimensional data acquisition module acquires historical new mutation data for neuropsychiatric diseases, the data analysis module parses the acquired data into sample identifiers, chromosome positions of the reference and alternative alleles, and validation status information, and the parsing results are stored by the data storage module;
the multidimensional data storage layer is a big data platform controlled by a high-performance NoSQL database management system; the big data platform receives the new mutation data parsed and stored by the data acquisition and processing layer, and new mutation research literature and data for neuropsychiatric diseases are acquired in real time through manual collection and over the network;
the multidimensional data integration processing layer comprises a redundancy removal processing module, a feature analysis module and a classification management module; the redundancy removal processing module uses a built-in script written in Python to deduplicate and standardize mutation, gene and expression data; the feature analysis module collects the biological data types and performs classified feature processing; the classification management module is written in Python and processes, stores and retrieves the original data and the intermediate results produced by the redundancy removal processing module and the feature analysis module;
and the data visualization and access layer displays queried data as real-time maps or plots through a web interface.
Preferably, in the multidimensional data acquisition module, the new mutations are classified into two types: new mutations (DNMs), comprising new single-site mutations and small indels, and new copy number variations (CNVs), wherein a CNV is a deletion or duplication of a DNA region.
Preferably, the high-performance NoSQL database management system is MongoDB, so that the big data platform supports real-time updating, data integration and module expansion.
Preferably, the algorithm flow of the built-in script includes: (1) standardizing the data; (2) performing deduplication and data compression according to the unique identifier of each data item in the data source and the corresponding key value.
Preferably, the classified feature processing is divided into: (1) scoring DNMs; (2) selecting regulatory elements and constructing a mutation map; (3) constructing the protein interaction network in which the mutation is located.
Preferably, the algorithm of the classification management module includes: (1) preprocessing the original data and the intermediate results of feature analysis according to data type, and finally integrating them by the unique gene identifier (Entrez ID) in the data to generate a dictionary keyed by Entrez ID; (2) calling the PyMongo module in Python to control MongoDB and storing the dictionary generated in the previous step into the corresponding collection; (3) calling the PyMongo module in Python to control MongoDB for data reading.
Compared with the prior art, the invention has the following beneficial effects: by constructing the new mutation information knowledge platform for neuropsychiatric diseases on top of a big data platform, the knowledge platform covers the entry and retrieval of historical and newly published data, provides genetic and expression information centered on new mutations, and maps or plots queried data in real time for users researching neuropsychiatric diseases, which improves the visualization and efficiency of research.
Drawings
FIG. 1 is a flow chart of a model framework of a new mutation information knowledge platform for neuropsychiatric diseases according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Referring to fig. 1, the information knowledge platform (PsyMuKB) for new onset mutation of neuropsychiatric diseases comprises a data acquisition and processing layer, a multidimensional data storage layer, a multidimensional data integration processing layer and a data visualization and access layer which are in communication connection.
Data acquisition and processing layer
The data acquisition and processing layer is responsible for acquiring the original data: it downloads, parses and stores data according to the configuration file of each data source, and comprises a multidimensional data acquisition module, a data analysis module and a data storage module.
The multi-dimensional data acquisition module acquires historical new mutation data for neuropsychiatric diseases and downloads the data automatically for each configured data source; the supported download mechanisms include FTP, HTTP and other source-specific implementations.
The data analysis module parses the collected historical new mutation data into sample identifiers, chromosome positions of the reference and alternative alleles, and validation status information, and stores the parsing results through the data storage module.
Further, the data analysis module automatically assigns the appropriate parser to each downloaded data file according to the file information; the parsing results are sent to the data storage module in a uniform data transmission format, and the data storage module stores them according to the storage design of the PsyMuKB website.
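The parser assignment and the uniform record format described above can be illustrated with a minimal Python sketch. The file suffixes, column names and the `DnmRecord` type are hypothetical choices made for illustration; the patent does not specify the input formats or the transmission format.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DnmRecord:
    """Uniform record handed from the parsing step to the storage step (assumed layout)."""
    sample_id: str
    chrom: str
    pos: int
    ref: str
    alt: str
    validation: str


def parse_tsv(text):
    """Parse a tab-separated table with a header row (hypothetical column names)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        DnmRecord(row["sample"], row["chr"], int(row["pos"]),
                  row["ref"], row["alt"], row.get("validation") or "unknown")
        for row in reader
    ]


def parse_vcf_like(text):
    """Parse minimal VCF-style lines (CHROM POS ID REF ALT); ID is reused as sample id here."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        chrom, pos, ident, ref, alt = line.split("\t")[:5]
        records.append(DnmRecord(ident, chrom, int(pos), ref, alt, "unknown"))
    return records


# Registry that maps file information (here simply the suffix) to a parser.
PARSERS = {".tsv": parse_tsv, ".vcf": parse_vcf_like}


def parse_file(filename, text):
    """Pick the parser for this file and return records as plain dictionaries."""
    for suffix, parser in PARSERS.items():
        if filename.endswith(suffix):
            return [asdict(record) for record in parser(text)]
    raise ValueError(f"no parser registered for {filename}")
```

Adding support for a new data source then only requires registering one more parser in `PARSERS`; the storage step keeps receiving the same dictionary layout.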
Further, in the multidimensional data acquisition module, the new mutations are divided into two types: new mutations (DNMs), comprising new single-site mutations and small insertions and deletions, and new copy number variations (CNVs), wherein a CNV is a deletion or duplication of a DNA region.
In one example, the coordinates of all DNM and CNV variants in the new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) are given in GRCh37 (human reference genome hg19) to ensure consistency of annotation.
Multidimensional data storage layer
The multidimensional data storage layer is a big data platform controlled by a high-performance NoSQL database management system. The big data platform receives the new mutation data parsed and stored by the data acquisition and processing layer, and new mutation research literature and data for neuropsychiatric diseases are acquired in real time through manual collection and over the network.
Furthermore, the high-performance NoSQL database management system is MongoDB, so that the big data platform supports real-time updating, data integration and module expansion.
Multidimensional data integration processing layer
The multidimensional data integration processing layer comprises a redundancy removal processing module, a feature analysis module and a classification management module.
The redundancy removal processing module uses a built-in script written in Python to deduplicate and standardize mutation, gene and expression data. Further, the algorithm flow of the built-in script includes: (1) standardizing the data; (2) performing deduplication and data compression according to the unique identifier of each data item in the data source and the corresponding key value.
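The two steps of the script can be sketched as follows. This is an illustrative sketch only: the field names, the normalization rules (stripping a `chr` prefix, upper-casing alleles) and the choice of deduplication key are assumptions, since the patent does not disclose the script itself.

```python
def normalize(record):
    """Step 1: bring one mutation record into a standard form (assumed rules)."""
    chrom = str(record["chrom"]).strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # "chr1" and "1" should compare equal
    return {
        **record,
        "chrom": chrom.upper(),
        "pos": int(record["pos"]),
        "ref": record["ref"].upper(),
        "alt": record["alt"].upper(),
    }


def deduplicate(records):
    """Step 2: keep one entry per unique key and remember every source that reported it."""
    unique = {}
    for rec in map(normalize, records):
        # Assumed unique identifier: sample plus variant coordinates and alleles.
        key = (rec["sample_id"], rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        base = {k: v for k, v in rec.items() if k != "source"}
        entry = unique.setdefault(key, {**base, "sources": []})
        source = rec.get("source")
        if source and source not in entry["sources"]:
            entry["sources"].append(source)
    return list(unique.values())
```

Standardizing before deduplicating matters: without the first step, the same variant written as `chr1` in one study and `1` in another would survive as two entries.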
The feature analysis module collects the biological data types and performs classified feature processing. The classified feature processing is divided into: (1) DNM scoring; (2) selecting regulatory elements and constructing a mutation map; (3) constructing the protein interaction network in which the mutation is located.
The classification management module is written in Python; it processes, stores and retrieves the original data and the intermediate results produced by the redundancy removal processing module and the feature analysis module.
Further, the algorithm of the classification management module includes: (1) preprocessing the original data and the intermediate results of feature analysis according to data type, and finally integrating them by the unique gene identifier (Entrez ID) in the data to generate a dictionary keyed by Entrez ID; (2) calling the PyMongo module in Python to control MongoDB and storing the dictionary generated in the previous step into the corresponding collection; (3) calling the PyMongo module in Python to control MongoDB for data reading.
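A minimal sketch of these three steps is given below, under stated assumptions: the document layout, the database name `psymukb_demo`, the collection name `genes` and the connection URI are all hypothetical, and the storage functions require a running MongoDB server and the PyMongo package. Only standard PyMongo calls (`MongoClient`, `replace_one` with `upsert=True`, `find_one`) are used.

```python
from collections import defaultdict


def build_gene_documents(mutations, expression):
    """Step 1: integrate per-gene data into one dictionary keyed by Entrez ID."""
    genes = defaultdict(lambda: {"mutations": [], "expression": {}})
    for mut in mutations:
        genes[str(mut["entrez_id"])]["mutations"].append(mut)
    for entrez_id, tissues in expression.items():
        genes[str(entrez_id)]["expression"].update(tissues)
    return dict(genes)


def store_gene_documents(genes, uri="mongodb://localhost:27017", db_name="psymukb_demo"):
    """Step 2: write one document per gene into a MongoDB collection via PyMongo."""
    from pymongo import MongoClient  # imported here so step 1 works without PyMongo

    collection = MongoClient(uri)[db_name]["genes"]
    for entrez_id, doc in genes.items():
        # Upsert so that re-running the import updates existing genes in place.
        collection.replace_one({"_id": entrez_id}, {"_id": entrez_id, **doc}, upsert=True)
    return collection


def read_gene(collection, entrez_id):
    """Step 3: read the integrated document for one gene back from MongoDB."""
    return collection.find_one({"_id": str(entrez_id)})
```

Using the Entrez ID as the document `_id` means a gene page can be served with a single lookup, which fits the gene-centered queries described for the access layer.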
In a specific example, most of the published DNM studies included in the new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) used massively parallel sequencing, primarily WES or WGS, combined with large sample sizes (hundreds to thousands of samples). The samples were collected primarily from families (pedigrees); by comparing DNA sequences obtained from affected children with those obtained from their parents, false-positive DNMs can be filtered out. During data collection and curation, it was ensured that all DNM data in PsyMuKB came from discovery methods with reasonable quality parameters, such as those used by Werling et al. in their 2018 study. Next, all collected DNMs were batch-processed with the ANNOVAR annotation platform for systematic annotation, including variant function (exonic, intronic, intergenic, UTR, etc.), exonic variant function (non-synonymous, synonymous, etc.), amino acid change, frequency in the 1000 Genomes and ExAC databases, and functional prediction by SIFT, PolyPhen-2, GERP++ and CADD. Since many of the available functional annotations for a mutation focus on non-coding regions, the DeepSEA score is included in the mutation annotation table to help the user assess the impact of mutations at non-coding positions. In addition, each gene is given a haploinsufficiency score, which assesses the likelihood that the gene exhibits haploinsufficiency, and a probability of loss-of-function (LoF) intolerance (pLI) score, which assesses the likelihood that the gene is intolerant to LoF variation.
Selection of regulatory elements and construction of mutation maps. More than 90% of all reported DNMs are located in non-coding regions of the genome. Unlike for coding regions, there is no clear hypothesis for determining which non-coding regions harbor rare disease-causing variation in humans, nor is it understood which particular alleles in those regions are intolerant to mutation. To facilitate the use of these variants and to better explore the potential impact of mutations in non-coding genomic regions, the new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) also provides regulatory element annotations that help determine whether a non-coding mutation is located on a regulatory element and may therefore affect downstream genes or isoforms. GeneHancer defines 250,733 gene enhancer regions, and FANTOM5 defines 82,149 promoters. DNMs located in non-coding regions of the genome are mapped to all of these regulatory regions, and the overlaps are listed as part of the mutation annotation.
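The mapping of non-coding DNMs onto regulatory regions amounts to an interval-overlap lookup, sketched below. The region dictionaries, the 1-based inclusive coordinates and the element names are assumptions for illustration; the actual GeneHancer and FANTOM5 file formats are not reproduced here.

```python
from collections import defaultdict


def index_regions(regions):
    """Group regulatory regions (enhancers, promoters) by chromosome for faster lookup."""
    by_chrom = defaultdict(list)
    for region in regions:
        by_chrom[region["chrom"]].append(region)
    return by_chrom


def annotate(dnm, by_chrom):
    """Attach the names of all regulatory regions that contain the DNM position."""
    hits = [
        region["name"]
        for region in by_chrom.get(dnm["chrom"], [])
        if region["start"] <= dnm["pos"] <= region["end"]  # assumed 1-based, inclusive
    ]
    return {**dnm, "regulatory_elements": hits}
```

The linear scan per chromosome keeps the sketch short; with hundreds of thousands of regions, a sorted index or an interval tree would be the more practical choice.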
PsyMuKB extracts protein-protein interaction (PPI) data from BioGRID to construct a comprehensive map of human interacting proteins. After removing the non-physical interactions defined in BioGRID, PsyMuKB obtained 409,173 PPIs for annotation integration, enabling the user to explore potential functional pathways involving the affected proteins.
Data visualization and access layer
The data visualization and access layer maps or plots queried data for the user in real time through a web interface, and mainly comprises a visualization processing module, a data deployment module and a data access module. The web interface and data visualization of the new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) are mainly implemented in Python scripts based on HTML5, Cascading Style Sheets (CSS) and JavaScript (JS). Expression data visualization and regulatory element mapping are implemented with Plotly. The interaction network is visualized with Cytoscape. A 3D view of mutation sites in the protein structure is available through a link to the corresponding visualization on the MuPIT Interactive web server (http://mupit.icm.jhu.edu/muPIT_Interactive/).
In a specific example, all forms of metadata in PsyMuKB are stored in the MongoDB database. PsyMuKB first surveys all published studies of human DNVs identified on a genome-wide basis and then obtains the basic information for each DNV, including the sample identifier, the chromosomal positions of the reference and alternative alleles, the validation status, and the like. For DNMs and new CNVs, the coordinates of all variants are given in GRCh37 (human reference genome hg19). If the source variant coordinates were not originally provided in GRCh37, they are converted with the "LiftOver" tool of the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgLiftOver). When relevant data are queried, graphical representations such as expression profiles, mutations mapped onto transcripts and PPI networks are generated in real time.
In summary, PsyMuKB allows users to search by gene ID, gene symbol or genomic coordinates and to review, for each gene, the genomic positions of its mutations and the features and genomic structure of its transcripts. It provides detailed gene information, including a description and summary, the exon and intron structures of the transcripts, gene and protein expression in various tissues, and protein interactions. Thus, PsyMuKB is a comprehensive resource for exploring disease risk factors through transcriptional and translational information and the related visualizations.
The new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) has collected DNM variations from various studies based on four major clinical phenotypes: psychiatric disorders, neurological disorders, congenital defect disorders and control studies.
Of the 8 major progressive psychiatric diseases, most DNMs (93.7%) were from ASD studies (n = 312,167), followed by developmental delay (DD) (n = 8,513), SCZ (n = 3,610) and intellectual disability (ID) (n = 2,585). Among neurological diseases, most DNMs were from epileptic encephalopathy (EE) (n = 564) and developmental and epileptic encephalopathy (DEE) (n = 508). Among congenital defect diseases, most DNMs were from congenital heart disease (CHD) (97%, n = 1,884). Half of the DNMs were located in intergenic regions (n = 442,200), mutations affecting exonic regions accounted for only about 4.3% (n = 28,259), mutations located in UTRs, introns, or regions upstream or downstream of transcripts accounted for 38.7%, and the remaining 6.6% were located in non-coding RNA. PsyMuKB collected 841 new CNVs from reported genome-scale studies, covering 8 different clinical phenotypes and affecting 369 non-overlapping genomic regions ranging from 1Kb to 600Mb.
Finally, the console, i.e., the data visualization and access layer, provides selection modes and displays the processing results in an interactive interface, allowing flexible filtering and exploration of the transcripts affected by the user's choices of mutations and/or brain expression.
Features and advantages of the invention
The new neuropsychiatric disease mutation information knowledge platform (PsyMuKB) contains a database and a Web interface, and a set of network interfaces that support the options of searching, filtering, visualizing, and sharing query data.
Gene-level information in the new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) can be retrieved and visualized in three different ways: by "gene ID" or "gene symbol", by "chromosomal region" and by "mutation". "Gene ID" or "gene symbol" searches are provided in both the basic and the advanced search. When the user is interested in retrieving all genes and variants located within a particular region, this functionality is provided in both the "basic region" and the "advanced" search. In addition, the "browse" tab lets users browse genes in alphabetical order of their official gene symbols, and also browse different sets of genes associated with neurodevelopmental and psychiatric diseases. After a gene is selected, the results are displayed in the same way as for the "search" option.
When a user makes a gene query, PsyMuKB presents a page with a table showing all genes whose ID or gene symbol matches completely or partially. The table provides two clickable links: "gene information" and "mutation information". The first links to a gene information page, which contains five subsections: (1) "gene information", which gives a detailed functional description; (2) "expression", which covers gene and protein expression in different tissues; (3) "new mutations", which summarizes the available DNVs of the queried gene; (4) "transcripts", which provides genomic structural information for all transcripts of the queried gene; (5) "protein-protein interactions", which lists all physical interactions involving the queried gene. In the "gene information" section, PsyMuKB also provides an "assessment table" that includes genetic features related to the brain or to disease, such as the pLI score, the haploinsufficiency score rank, and expression in brain tissue.
PsyMuKB gives access to DNVs in two different ways: (1) through the "new mutations" summary table on the gene information page, which lists all reported variants of the gene, after searching by "gene ID" or "gene symbol"; (2) by narrowing the results in the advanced search by chromosomal region, variant type or clinical phenotype. Variants are grouped by the gene with which they are associated. Thus, if a user queries a gene, all relevant variants are displayed together in two tables: coding mutations and non-coding mutations. The mutation table contains information about each mutation, such as location, mutation type, case or control, disease phenotype, mutation site in the protein structure, validation status, and frequency in the major population databases (1000 Genomes, ExAC, gnomAD). Importantly, PsyMuKB provides an annotation of "potential severity level", for which three severity levels are defined: 1) High severity: the coding variant is a LoF mutation, or is predicted to be deleterious by at least three of the five widely used pathogenicity prediction tools (SIFT, PolyPhen-2, GERP++, CADD and ClinVar); 2) Medium severity: the coding variant is predicted to be deleterious by one or two of the five prediction tools; 3) Low severity: all other coding variants.
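The three severity levels can be expressed as a small decision function. The sketch below assumes that each tool's output has already been reduced to a yes/no "deleterious" call; the score thresholds behind those calls are not given in the text and are therefore left out.

```python
def severity_level(is_lof, tool_calls):
    """Return 'high', 'medium' or 'low' for a coding variant.

    is_lof     -- True if the variant is a loss-of-function mutation
    tool_calls -- mapping of tool name to True/False ("predicted deleterious"),
                  e.g. for SIFT, PolyPhen-2, GERP++, CADD and ClinVar
    """
    harmful = sum(1 for called in tool_calls.values() if called)
    if is_lof or harmful >= 3:
        return "high"
    if harmful >= 1:  # one or two tools predict a deleterious effect
        return "medium"
    return "low"
```

For example, a missense variant flagged by SIFT and CADD only falls into the medium level, while any LoF variant is rated high regardless of the tool calls.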
The new mutation information knowledge platform for neuropsychiatric diseases (PsyMuKB) also provides basic genomic information for annotated regulatory elements (such as promoters and enhancers) by visualizing their locations relative to the mRNA transcripts of the gene. Furthermore, all reported DNMs are located and visualized on the exon-intron structure of the mRNA transcripts and on their regulatory elements. In addition, PsyMuKB annotates alternatively spliced isoforms with tissue-specific expression information.
PsyMuKB also provides an interaction map for the queried protein. The interaction network is constructed from primary and secondary interactions and is visualized with Cytoscape. Primary interactions are defined as all interactions between the queried protein and other proteins. Secondary interactions are defined as all interactions among the interaction partners of the queried protein. The line thickness of an interaction represents the number of pieces of evidence supporting it, where PsyMuKB counts each single report or single supporting experiment as one piece of evidence. If the PPI network of the queried protein has more than 200 nodes, the network shows only the interactions supported by at least two pieces of evidence. In addition to the visualization, a PPI table lists all interaction information, including the experimental detection methods, the literature sources of the reports, and the total evidence counts.
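The construction rule for the displayed network can be sketched as follows. The input format (a list of `(protein_a, protein_b, evidence_count)` tuples) and the function name are assumptions for illustration; only the 200-node threshold and the two-evidence cutoff come from the description above.

```python
def build_network(query, interactions, max_nodes=200, min_evidence=2):
    """Collect primary and secondary interactions for the queried protein.

    interactions -- iterable of (protein_a, protein_b, evidence_count) tuples
    Returns (nodes, edges); edges keep their evidence count for line thickness.
    """
    interactions = list(interactions)
    # Primary: every interaction that involves the queried protein itself.
    primary = [(a, b, n) for a, b, n in interactions if query in (a, b) and a != b]
    partners = {b if a == query else a for a, b, _ in primary}
    # Secondary: interactions among the partners of the queried protein.
    secondary = [
        (a, b, n) for a, b, n in interactions
        if a in partners and b in partners and a != b
    ]
    edges = primary + secondary
    nodes = partners | {query}
    if len(nodes) > max_nodes:
        # Large networks: show only interactions with enough supporting evidence.
        edges = [edge for edge in edges if edge[2] >= min_evidence]
        nodes = {query} | {protein for a, b, _ in edges for protein in (a, b)}
    return nodes, edges
```

The evidence count carried on each edge can be passed directly to the drawing layer as the line width.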
PsyMuKB annotates each DNM according to whether the affected transcripts are expressed in the brain, labeling it as either a "brain expression" mutation or a "non-brain expression" mutation. Although DNMs may occur anywhere in the genome, studies of human disease typically examine the exome, that is, the protein-coding regions of the genome, first.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A new mutation information knowledge platform for neuropsychiatric diseases, characterized in that: the information knowledge platform comprises a data acquisition and processing layer, a multidimensional data storage layer, a multidimensional data integration processing layer, and a data visualization and access layer, which are communicatively connected;
the data acquisition and processing layer comprises a multidimensional data acquisition module, a data analysis module, and a data storage module; the multidimensional data acquisition module acquires historical new mutation data for neuropsychiatric diseases; the data analysis module parses the acquired data using sample identifiers, chromosomal positions, reference and alternative alleles, and validation status information; and the data storage module stores the parsed results;
the multidimensional data storage layer is a big data platform managed by a high-performance NoSQL database management system; the big data platform receives the new mutation data parsed and stored by the data acquisition and processing layer, and research literature and data on new mutations in neuropsychiatric diseases are acquired in real time through manual collection and over the network;
the multidimensional data integration processing layer comprises a redundancy removal module, a feature analysis module, and a classification management module; the redundancy removal module uses a built-in script written in Python to deduplicate and standardize mutation, gene, and expression data; the feature analysis module collects the biological data types and performs classified feature processing; the classification management module is written in Python and processes, stores, and retrieves the original data and the intermediate results produced by the redundancy removal module and the feature analysis module;
and the data visualization and access layer displays the queried data through a web interface, with real-time mapping or plotting.
2. The information knowledge platform of claim 1, wherein: in the multidimensional data acquisition module, new mutations are divided into two classes, namely de novo mutations (DNMs), which comprise new point mutations and small insertions and deletions, and de novo copy number variations (CNVs), wherein a CNV comprises a deletion or duplication of a DNA region.
3. The information knowledge platform of claim 1, wherein: the high-performance NoSQL database management system is MongoDB, which gives the big data platform real-time updating, data integration, and module extension capabilities.
4. The information knowledge platform of claim 1, wherein: the algorithm of the built-in script comprises the following steps: (1) standardizing the data; (2) deduplicating and compressing the data according to the unique identifier of each record in the data source and the corresponding key value.
5. The information knowledge platform of claim 1, wherein: the classified feature processing comprises: (1) scoring DNMs; (2) selecting regulatory elements and constructing a mutation map; (3) constructing the protein interaction network in which the mutation is located.
6. The information knowledge platform of claim 1, wherein: the algorithm of the classification management module comprises the following steps: (1) preprocessing the original data and the intermediate feature analysis results according to data type, and integrating the unique gene identifiers in the data to generate a dictionary keyed by unique gene identifier; (2) calling the PyMongo module in Python to control MongoDB and store the dictionary generated in the previous step in the corresponding collection; (3) calling the PyMongo module in Python to control MongoDB to read the data.
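For illustration only, the steps recited in claims 4 and 6 might be sketched in Python as below. Everything specific here is an assumption rather than the claimed implementation: the record field names, the choice of deduplication key, the document layout, and the database and collection names in the usage comment are all hypothetical. The storage functions use only the PyMongo `Collection.replace_one` and `Collection.find_one` calls and accept any collection object passed in.

```python
def normalize(record):
    """Claim 4, step (1): standardize one raw record (field names are assumed)."""
    chrom = str(record["chrom"]).strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # "chr1" and "1" become the same value
    return {
        "gene_id": str(record["gene_id"]).strip(),
        "sample_id": str(record["sample_id"]).strip(),
        "chrom": chrom.upper(),
        "pos": int(record["pos"]),
        "ref": str(record["ref"]).strip().upper(),
        "alt": str(record["alt"]).strip().upper(),
    }


def deduplicate(records):
    """Claim 4, step (2): keep one record per unique key (key choice is assumed)."""
    unique = {}
    for rec in map(normalize, records):
        key = (rec["sample_id"], rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        unique.setdefault(key, rec)  # first occurrence wins
    return list(unique.values())


def group_by_gene(records):
    """Claim 6, step (1): build a dictionary keyed by unique gene identifier."""
    by_gene = {}
    for rec in records:
        by_gene.setdefault(rec["gene_id"], []).append(rec)
    return by_gene


def store(collection, by_gene):
    """Claim 6, step (2): write one document per gene into a MongoDB collection."""
    for gene_id, variants in by_gene.items():
        collection.replace_one(
            {"_id": gene_id},
            {"_id": gene_id, "variants": variants},
            upsert=True,  # insert if the gene is new, overwrite otherwise
        )


def load(collection, gene_id):
    """Claim 6, step (3): read the document for one gene, or None if absent."""
    return collection.find_one({"_id": gene_id})


# Usage with a running MongoDB server (names are hypothetical):
#   from pymongo import MongoClient
#   genes = MongoClient("mongodb://localhost:27017")["psymukb"]["genes"]
#   store(genes, group_by_gene(deduplicate(raw_records)))
#   doc = load(genes, "1234")
```

Using the gene identifier as `_id` makes each write idempotent, so rerunning the pipeline after new data is collected updates documents in place.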
CN201911365589.4A 2019-12-26 2019-12-26 New mutation information knowledge platform for neuropsychiatric diseases Active CN111128308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911365589.4A CN111128308B (en) 2019-12-26 2019-12-26 New mutation information knowledge platform for neuropsychiatric diseases

Publications (2)

Publication Number Publication Date
CN111128308A CN111128308A (en) 2020-05-08
CN111128308B CN111128308B (en) 2023-03-24

Family

ID=70503027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911365589.4A Active CN111128308B (en) 2019-12-26 2019-12-26 New mutation information knowledge platform for neuropsychiatric diseases

Country Status (1)

Country Link
CN (1) CN111128308B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160996B (en) * 2021-01-19 2021-12-07 北京安智因生物技术有限公司 Entity-based cardiovascular disease data integration method
CN113628681A (en) * 2021-07-21 2021-11-09 哈尔滨星云医学检验所有限公司 Family denovo mutation-based analysis method and application thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437004A (en) * 2017-08-07 2017-12-05 深圳华大基因研究院 A kind of system intelligently understood for tumour individuation genetic test
CN108364124A (en) * 2018-01-26 2018-08-03 天津中科智能识别产业技术研究院有限公司 International production capacity Cooperation Risk assessment based on big data and Decision Making Service System
CN108681569A (en) * 2018-05-04 2018-10-19 亚洲保理(深圳)有限公司 A kind of automatic data analysis system and its method
CN109086573A (en) * 2018-07-30 2018-12-25 东北师范大学 Multi-source biology big data convergence platform

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7953613B2 (en) * 2007-01-03 2011-05-31 Gizewski Theodore M Health maintenance system
US10796010B2 (en) * 2017-08-30 2020-10-06 MyMedicalImages.com, LLC Cloud-based image access systems and methods

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
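Design of a MongoDB-based big data storage system for proteomics; Zhang Lin et al.; Journal of Computer Applications; 2016-06-10; full text *
Current status and strategies of neuropsychiatric disease research; Luo Jianhong et al.; Journal of Zhejiang University (Medical Sciences); 2008-09-25 (No. 05); full text *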

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant