CN110111850B - Biological information database annotation method and system - Google Patents

Biological information database annotation method and system

Publication number
CN110111850B
CN110111850B CN201810017510.8A CN201810017510A CN110111850B CN 110111850 B CN110111850 B CN 110111850B CN 201810017510 A CN201810017510 A CN 201810017510A CN 110111850 B CN110111850 B CN 110111850B
Authority
CN
China
Prior art keywords
database
data
annotation
pool
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810017510.8A
Other languages
Chinese (zh)
Other versions
CN110111850A (en)
Inventor
黄金艳
李剑峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ruinjin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Original Assignee
Ruinjin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ruinjin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Priority to CN201810017510.8A
Publication of CN110111850A
Application granted
Publication of CN110111850B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a biological information database annotation method comprising the following steps: establishing a data filtering script pool; establishing a database pool comprising a plurality of databases; receiving input data, initializing it, and matching it to a database through the database pool; judging, according to the matched database, whether the input data is in the standard format; if not, standardizing the data through the data filtering script pool; and if so, performing the data annotation step and returning the result. A biological information database annotation system is also disclosed. The method can automatically download and install a large number of biological information databases and complete the corresponding annotation work. Introducing a database pool, a data filtering script pool and a data annotation function library greatly improves the efficiency and flexibility of gene annotation, and integrating multiple database collections makes related biological information annotation work much more convenient.

Description

Biological information database annotation method and system
Technical Field
The invention relates to the technical field of biological information databases, in particular to a biological information database annotation method and system.
Background
Biological information database resources have developed rapidly over recent decades, growing steadily in variety and capability. Examples include comprehensive databases such as NCBI, which offers search tools for nucleic acids, proteins, gene names and genome names, the PubMed literature database, Taxonomy data and the COG protein family library; genome browsers such as UCSC and Ensembl, which provide a large number of genome-associated databases; ontology and pathway resources such as DAVID, which integrates many databases, including GO (Gene Ontology), KEGG and gene ID information, for biological information mining; and genetic variation annotation tools such as ANNOVAR, which integrates nearly 50 databases related to gene variation and thereby greatly facilitates the annotation of genomics data.
These databases have greatly facilitated related work. However, several difficulties remain in using them:
1) Compatibility of input data with reference data. Input data comes from many sources and in many formats, and reference databases also come in various formats, such as plain text files and SQL-like databases. Before annotation and analysis, the input data and the reference database must therefore be standardized. Until now, users have mainly written conversion scripts or adjusted the data manually; no systematic, complete set of data filters has been available to match the various types of input data and reference databases.
2) The single-threaded mode of annotation tools is poorly suited to big data analysis. A considerable number of data annotation tools still annotate in a single thread or a single data stream, which matches neither mainstream cluster computing nor the demands of massive data.
3) Database sources are scattered. Annotating against several databases requires multiple separate operations, which greatly slows annotation and prolongs the data analysis cycle.
4) Only a few database annotation tools allow users to integrate their own databases, and those that do offer limited flexibility.
Disclosure of Invention
In view of the above-mentioned shortcomings, the present invention provides a method for annotating a biological information database, which can solve the above problems.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
a biological information database annotation process, said biological information database annotation process comprising the steps of:
establishing a data filtering script pool;
establishing a database pool comprising a plurality of databases;
receiving input data and initializing the input data to match a database through a database pool;
judging whether the input data is standard or not according to database matching;
if not, carrying out data standardization through a data filtering script pool;
and if the standard is met, returning after the data annotation step is carried out.
According to one aspect of the invention, establishing a database pool comprising a plurality of databases comprises: integrating the existing ANNOVAR databases and public databases to form the database pool.
According to one aspect of the invention, establishing a database pool comprising a plurality of databases comprises: downloading and installing the data in the database pool, and generating meta information for the corresponding database for management and updating.
According to one aspect of the invention, returning after the data annotation step is carried out comprises: annotating the data with the corresponding annotation function in the data annotation function library, and returning the data according to other parameters input by the user.
According to one aspect of the invention, judging whether the input data is standard according to database matching comprises: judging whether the input data conforms to the format of the corresponding reference database.
According to one aspect of the invention, the data annotation step comprises: completing single-database annotation by linking the database pool, the data standardization function library and the data annotation function library through the annotation function.
According to one aspect of the invention, the data annotation step comprises: linking the database pool, the data filtering script pool and the data annotation function library through the annotation function, and completing multiple database annotations simultaneously.
According to one aspect of the invention, the biological information database annotation method comprises: automatically splitting the input data and processing the resulting segments in parallel to complete the analysis against multiple databases.
A biological information database annotation system comprises a database pool, a data filtering script pool, a database annotation module and a database management module, wherein the database annotation module comprises a data annotation function library, the database pool is formed by integrating an ANNOVAR existing database and a public database, and the data filtering script pool comprises a standardized function library.
The advantages of the invention are as follows. The biological information database annotation method comprises the following steps: establishing a data filtering script pool; establishing a database pool comprising a plurality of databases; receiving input data and initializing the input data to match a database through the database pool; judging whether the input data is standard according to database matching; if not, performing data standardization through the data filtering script pool; and if the standard is met, returning after the data annotation step is carried out. The method can automatically download and install a large number of biological information databases and complete the corresponding annotation work. In particular, introducing a database pool, a data filtering script pool and a data annotation function library greatly improves the efficiency and flexibility of gene annotation. In addition, using the parallel computing packages and big-data processing packages in R greatly improves the performance of the R-based annotation tool. Moreover, integrating multiple database collections such as ANNOVAR and DAVID makes related biological information annotation work much more convenient. A data filtering script pool is established for standardizing different input data against the reference database; the user only needs to select the corresponding function from the data filtering script pool and pass it to annovarR, and data standardization is completed automatically so that the relevant data can be annotated and analyzed.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for annotating a biological information database according to the present invention;
FIG. 2 is a schematic diagram of a biological information database annotation system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1 and FIG. 2, a biological information database annotation method includes the following steps:
Step S1: establishing a database pool comprising a plurality of databases;
Step S1 is specifically implemented as follows: the database pool is formed by integrating the existing ANNOVAR databases with other public databases. The other public databases include all published biological information databases currently in use, such as the NCBI public databases, the DAVID database, UniProt and the Pfam database.
In practical applications, the databases in the established database pool can be downloaded, managed and automatically built through the database downloading, building and managing module, as follows.
In practical applications, the data in the database pool is downloaded and installed through a download function, and meta information for the corresponding database is generated to facilitate later management and updating.
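The generation of meta information at install time can be illustrated with a minimal Python sketch. The description states that the tool itself is R-based and does not disclose the metadata layout, so the field names, the `.meta.json` file suffix, the `write_meta` and `needs_update` helpers and the example URL are all assumptions made for illustration. The download itself is replaced by a locally written file.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_meta(db_file: Path, name: str, version: str, source_url: str) -> Path:
    """Write a JSON sidecar file describing an installed database file (hypothetical layout)."""
    data = db_file.read_bytes()
    meta = {
        "name": name,
        "version": version,
        "source_url": source_url,
        "file": db_file.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "installed_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = db_file.with_name(db_file.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


def needs_update(meta_path: Path, latest_version: str) -> bool:
    """Compare the recorded version with the latest known version."""
    return json.loads(meta_path.read_text())["version"] != latest_version


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a downloaded reference file; no network access is performed.
    db = Path(tmp) / "example_db.txt"
    db.write_text("chr1\t100\tA\tG\t0.01\n")
    meta_file = write_meta(db, "example_db", "2018-01", "https://example.org/example_db.txt")
    meta = json.loads(meta_file.read_text())
    outdated = needs_update(meta_file, "2018-02")
```

Recording a checksum and version next to each installed file is one simple way to support the later management and updating that the paragraph above mentions.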
The database pool may further include the following functions:
auto build in sqlite: automatically builds a database in SQLite format; the input file is a text file;
mysql.: automatically builds a database in MySQL format; the input file is a text file;
sql2sqlite: automatically builds a database in SQLite format; the input file is an SQL file;
sql2mysql: automatically builds a database in MySQL format; the input file is an SQL file;
arrival. Names: returns the names of all databases supported in the current database pool.
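As an illustration of building an SQLite database automatically from a text file, the following Python sketch loads tab-delimited rows into a table using the standard `sqlite3` and `csv` modules. It is not the R implementation referred to above; the function name `build_sqlite_from_text`, the table name and the column names are hypothetical, and an in-memory database stands in for a database file.

```python
import csv
import io
import sqlite3


def build_sqlite_from_text(conn, table, text_stream, columns, delimiter="\t"):
    """Create `table` with TEXT columns and load delimited rows into it.

    `table` and `columns` are interpolated into SQL, so they must come from
    trusted configuration, not from user input.
    """
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    # Skip malformed rows whose field count does not match the column list.
    rows = (row for row in csv.reader(text_stream, delimiter=delimiter)
            if len(row) == len(columns))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Two-line stand-in for a reference database distributed as plain text.
reference_text = io.StringIO("chr1\t100\tA\tG\t0.01\nchr2\t200\tC\tT\t0.25\n")
conn = sqlite3.connect(":memory:")
n_rows = build_sqlite_from_text(
    conn, "example_db", reference_text, ["chr", "pos", "ref", "alt", "freq"]
)
```

All columns are stored as text here to keep the sketch short; a real builder would presumably assign column types and indexes per database.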
Step S2: establishing a data filtering script pool;
Step S2 is specifically implemented as follows: the data filtering script pool is built in the R language according to the databases in the database pool, mainly by writing a data filtering script for each of the different data formats used by those databases.
Specifically, a data normalization function is generated for each database data format, to normalize data in the corresponding database format.
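One way to picture a pool of normalization functions keyed by database is a simple registry. The Python sketch below is only an analogy to the R script pool described above; the registry `NORMALIZERS`, the decorator `register`, the database name `example_variant_db` and the specific normalization rules (dropping a `chr` prefix, upper-casing alleles) are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical "data filtering script pool": database name -> normalization function.
NORMALIZERS: Dict[str, Callable[[Record], Record]] = {}


def register(db_name: str):
    """Decorator that adds a normalization function to the pool."""
    def decorator(func: Callable[[Record], Record]) -> Callable[[Record], Record]:
        NORMALIZERS[db_name] = func
        return func
    return decorator


@register("example_variant_db")
def normalize_variant(record: Record) -> Record:
    """Drop a leading 'chr' from the chromosome and upper-case the alleles."""
    out = dict(record)
    chrom = out["chr"]
    out["chr"] = chrom[3:] if chrom.lower().startswith("chr") else chrom
    out["ref"] = out["ref"].upper()
    out["alt"] = out["alt"].upper()
    return out


def normalize(db_name: str, records: List[Record]) -> List[Record]:
    """Apply the normalization function registered for `db_name` to every record."""
    func = NORMALIZERS[db_name]
    return [func(r) for r in records]


normalized = normalize(
    "example_variant_db",
    [{"chr": "chr1", "pos": "100", "ref": "a", "alt": "g"}],
)
```

With this layout, supporting a new database format amounts to registering one more function, which matches the idea of generating one normalization function per data format.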
Step S3: receiving input data and initializing the input data to match a database through a database pool;
Step S3 may be specifically implemented as follows: the input data is received and initialized, mainly by obtaining its data format and matching that format against the data formats of the databases in the database pool to find the matching reference database.
Step S4: judging whether the input data is standard or not according to database matching;
Step S4 is specifically implemented as follows: the input data is checked against the standard data format of the matched reference database. If it conforms, the input data is judged to be standardized data; if not, it is judged to be non-standard data that needs standardization.
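Steps S3 and S4 can be sketched in Python as a column-based match followed by a per-field format check. The description does not specify how formats are represented, so the `DATABASE_FORMATS` table, its regular expressions and the helper names `match_database` and `is_standard` are assumptions made for this example.

```python
import re

# Hypothetical format specification: database name -> {column: required pattern}.
DATABASE_FORMATS = {
    "example_variant_db": {
        "chr": re.compile(r"^([1-9]|1[0-9]|2[0-2]|X|Y|MT)$"),
        "pos": re.compile(r"^[1-9][0-9]*$"),
        "ref": re.compile(r"^[ACGT]+$"),
        "alt": re.compile(r"^[ACGT]+$"),
    }
}


def match_database(columns):
    """S3: return the pool databases whose required columns all appear in the input."""
    return [name for name, spec in DATABASE_FORMATS.items() if set(spec) <= set(columns)]


def is_standard(db_name, records):
    """S4: True only if every required field of every record matches the database format."""
    spec = DATABASE_FORMATS[db_name]
    return all(
        pattern.match(rec.get(col, "")) is not None
        for rec in records
        for col, pattern in spec.items()
    )


raw = [{"chr": "chr1", "pos": "100", "ref": "a", "alt": "g"}]
clean = [{"chr": "1", "pos": "100", "ref": "A", "alt": "G"}]

matched = match_database(["chr", "pos", "ref", "alt", "sample"])
raw_ok = is_standard("example_variant_db", raw)
clean_ok = is_standard("example_variant_db", clean)
```

In this sketch the raw record fails the check because of the `chr` prefix and lower-case alleles, so it would be routed to the standardization step, while the clean record would go straight to annotation.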
Step S5: if not, carrying out data standardization through a data filtering script pool;
Step S5 is specifically implemented as follows: the different input data is normalized against the reference database, mainly by means of a normalization function.
Step S6: if the standard is met, returning after the data annotation step is carried out.
In practical applications, returning after the data annotation step comprises: annotating the data with the corresponding annotation function in the data annotation function library, and returning the data according to other parameters input by the user.
In practical applications, the data annotation step includes: completing single-database annotation by linking the database pool, the data standardization function library and the data annotation function library through the annotation function.
Specifically, the function annotation: the annotation function, which links the database pool, the data standardization function library and the data annotation function library to complete single-database annotation.
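How a single annotation call might link a database pool, a normalization step and an annotation lookup can be sketched in Python with an in-memory SQLite table. This is an illustrative analogy rather than the R function named above; `open_pool`, `normalize`, `annotate`, the table `example_db` and its contents are all hypothetical.

```python
import sqlite3

KNOWN_TABLES = {"example_db"}  # hypothetical database pool contents


def open_pool():
    """Stand-in for the database pool: one small reference table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE example_db (chr TEXT, pos INTEGER, ref TEXT, alt TEXT, freq REAL)"
    )
    conn.executemany(
        "INSERT INTO example_db VALUES (?, ?, ?, ?, ?)",
        [("1", 100, "A", "G", 0.01), ("2", 200, "C", "T", 0.25)],
    )
    return conn


def normalize(record):
    """Stand-in for the standardization library: strip 'chr', upper-case alleles."""
    chrom = record["chr"]
    return {
        **record,
        "chr": chrom[3:] if chrom.startswith("chr") else chrom,
        "ref": record["ref"].upper(),
        "alt": record["alt"].upper(),
    }


def annotate(conn, db_name, records):
    """Normalize each record, then look it up in one reference table."""
    if db_name not in KNOWN_TABLES:  # the table name is interpolated into SQL below
        raise ValueError(f"unknown database: {db_name}")
    query = f"SELECT freq FROM {db_name} WHERE chr = ? AND pos = ? AND ref = ? AND alt = ?"
    results = []
    for rec in map(normalize, records):
        row = conn.execute(
            query, (rec["chr"], int(rec["pos"]), rec["ref"], rec["alt"])
        ).fetchone()
        results.append(row[0] if row else None)
    return results


conn = open_pool()
annotated = annotate(
    conn,
    "example_db",
    [
        {"chr": "chr1", "pos": "100", "ref": "a", "alt": "g"},
        {"chr": "chr3", "pos": "5", "ref": "T", "alt": "C"},
    ],
)
```

Records with no match in the reference table are returned as `None` here; how unmatched records are reported in the actual tool is not stated in the description.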
In practical applications, the data annotation step may further include: linking the database pool, the data filtering script pool and the data annotation function library through the annotation function, and completing multiple database annotations simultaneously (in parallel).
Specifically, the function association. is the annotation function that links the database pool, the data filtering script pool and the data annotation function library and completes multiple database annotations simultaneously (in parallel).
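Annotating against several databases at once can be sketched in Python with a thread pool, one task per database. The description attributes parallelism to R packages, so the use of `concurrent.futures` here is a substitution for illustration; the dictionary-based `DATABASE_POOL`, the database names and the helper names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical database pool: each database maps a variant key to an annotation.
DATABASE_POOL = {
    "example_freq_db": {("1", 100, "A", "G"): 0.01},
    "example_gene_db": {("1", 100, "A", "G"): "GENE1", ("2", 200, "C", "T"): "GENE2"},
}


def annotate_one(db_name, variants):
    """Annotate all variants against a single database."""
    table = DATABASE_POOL[db_name]
    return [table.get(v) for v in variants]


def annotate_many(db_names, variants, max_workers=4):
    """Run one annotation task per database concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(annotate_one, name, variants) for name in db_names}
        return {name: future.result() for name, future in futures.items()}


variants = [("1", 100, "A", "G"), ("2", 200, "C", "T")]
merged = annotate_many(["example_freq_db", "example_gene_db"], variants)
```

Each database yields one result column aligned with the input order, so the per-database outputs can be merged side by side afterwards.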
In practical applications, the biological information database annotation method includes: automatically splitting the input data and processing the resulting segments in parallel to complete the analysis against multiple databases.
Through the function paraannotation. Big: merge, input data can be automatically split and processed in parallel, generating ff objects (fast access to big data on disk) to accelerate big data annotation.
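The split-and-process-in-parallel idea can be sketched in Python as follows. The disk-backed ff objects mentioned above are specific to R and are not reproduced; this sketch only shows chunking and ordered recombination, and the names `chunked`, `annotate_chunk`, `annotate_in_chunks` and the toy `REFERENCE` table are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Toy reference: every third position has an annotation.
REFERENCE = {pos: f"site_{pos}" for pos in range(0, 1000, 3)}


def annotate_chunk(chunk):
    """Annotate one segment of the input."""
    return [REFERENCE.get(pos) for pos in chunk]


def annotate_in_chunks(positions, chunk_size=100, max_workers=4):
    """Split the input, annotate segments concurrently, and rejoin in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        parts = pool.map(annotate_chunk, chunked(positions, chunk_size))
        return [value for part in parts for value in part]


positions = list(range(250))
result = annotate_in_chunks(positions)
```

Threads are used to keep the example self-contained; for CPU-bound annotation a process pool or a cluster scheduler would be the more natural choice.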
The ANNOVAR-related databases and the other most frequently used biological information databases are integrated, so that analysis against multiple databases can be completed in a single operation.
A programmable interface is provided in steps such as data preprocessing and data annotation, so that users can conveniently develop their own annotation databases.
Example two
As shown in FIG. 2, a biological information database annotation system includes a database pool, a data filtering script pool, a database annotation module and a database management module, wherein the database annotation module includes a data annotation function library, the database pool is formed by integrating the existing ANNOVAR databases and public databases, and the data filtering script pool includes a standardization function library. The database annotation module is mainly used for data annotation; the input data is matched against the databases and annotated using efficient R packages such as RSQLite, RMySQL, data. .
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention disclosed herein are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (7)

1. A biological information database annotation method, characterized in that it comprises the following steps:
establishing a data filtering script pool for standardizing different input data and a reference database, wherein the user only needs to select the corresponding function from the data filtering script pool and pass it to annovarR for data standardization to be completed automatically, so that the relevant data can be annotated and analyzed;
establishing a database pool comprising a plurality of databases;
receiving input data and initializing the input data to match a database through a database pool;
judging whether the input data is standard according to database matching, comprising: judging whether the input data conforms to the format of the corresponding reference database;
if not, carrying out data standardization through the data filtering script pool;
if the standard is met, returning after the data annotation step is carried out, comprising: annotating the data with the corresponding annotation function in the data annotation function library, and returning the data according to other parameters input by the user.
2. The biological information database annotation method according to claim 1, wherein establishing a database pool comprising a plurality of databases comprises: integrating the existing ANNOVAR databases and public databases to form the database pool.
3. The biological information database annotation method according to claim 2, wherein establishing a database pool comprising a plurality of databases comprises: downloading and installing the data in the database pool, and generating meta information for the corresponding database for management and updating.
4. The biological information database annotation method according to claim 1, wherein the data annotation step comprises: completing single-database annotation by linking the database pool, the data standardization function library and the data annotation function library through the annotation function.
5. The biological information database annotation method according to claim 1, wherein the data annotation step comprises: linking the database pool, the data filtering script pool and the data annotation function library through the annotation function, and completing multiple database annotations simultaneously.
6. The biological information database annotation method according to one of claims 1 to 5, characterized in that it comprises: automatically splitting the input data and processing the resulting segments in parallel to complete the analysis against multiple databases.
7. A biological information database annotation system, characterized by comprising a database pool, a data filtering script pool, a database annotation module and a database management module, wherein: input data is received and initialized to match a database through the database pool; whether the input data is standard is judged according to database matching, which comprises judging whether the input data conforms to the format of the corresponding reference database; if the input data is not standard, data standardization is performed through the data filtering script pool; if the input data is standard, data annotation is performed and the result is returned; the database annotation module comprises a data annotation function library; the database pool is formed by integrating the existing ANNOVAR databases and public databases; the data filtering script pool comprises a standardization function library and is used for standardizing different input data and the reference database; and the user only needs to select the corresponding function from the standardization function library and pass it to annovarR for data standardization to be completed automatically, so that the relevant data can be annotated and analyzed.
CN201810017510.8A 2018-01-09 2018-01-09 Biological information database annotation method and system Active CN110111850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810017510.8A CN110111850B (en) 2018-01-09 2018-01-09 Biological information database annotation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810017510.8A CN110111850B (en) 2018-01-09 2018-01-09 Biological information database annotation method and system

Publications (2)

Publication Number Publication Date
CN110111850A CN110111850A (en) 2019-08-09
CN110111850B CN110111850B (en) 2023-04-07

Family

ID=67483009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810017510.8A Active CN110111850B (en) 2018-01-09 2018-01-09 Biological information database annotation method and system

Country Status (1)

Country Link
CN (1) CN110111850B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795080A (en) * 2019-10-21 2020-02-14 山东舜知信息科技有限公司 Automatic code generation system based on database annotation and construction method
CN117059179B (en) * 2023-08-30 2024-08-13 北京星云医学检验实验室有限公司 Biological information database annotation method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5404295A (en) * 1990-08-16 1995-04-04 Katz; Boris Method and apparatus for utilizing annotations to facilitate computer retrieval of database material
US6519603B1 (en) * 1999-10-28 2003-02-11 International Business Machine Corporation Method and system for organizing an annotation structure and for querying data and annotations
US20030036857A1 (en) * 2001-08-01 2003-02-20 Xiang Yao Methods and systems of biomolecular sequence matching
US20060212227A1 (en) * 2005-03-16 2006-09-21 Xiaoliang Han An Analysis Platform for Annotating Comprehensive Functions of Genes on high throughput and Integrated Bioarray System
CN102142064B (en) * 2011-04-21 2013-04-10 华东师范大学 Biomolecular network exhibition analysis system and analysis method thereof
CN106407407B (en) * 2016-09-22 2019-10-15 江苏通付盾科技有限公司 A kind of file labeling system and method
CN107103205A (en) * 2017-05-27 2017-08-29 湖北普罗金科技有限公司 A kind of bioinformatics method based on proteomic image data notes eukaryotic gene group

Also Published As

Publication number Publication date
CN110111850A (en) 2019-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant