CN110111850B - Biological information database annotation method and system

Info

Publication number: CN110111850B
Application number: CN201810017510.8A
Authority: CN
Inventors: 黄金艳; 李剑峰
Original assignee: Ruijin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Current assignee: Ruijin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Priority date: 2018-01-09
Filing date: 2018-01-09
Publication date: 2023-04-07
Anticipated expiration: 2038-01-09
Also published as: CN110111850A

Abstract

The invention discloses a biological information database annotation method comprising the following steps: establishing a data filtering script pool; establishing a database pool comprising a plurality of databases; receiving input data, initializing it, and matching it to a database through the database pool; judging, according to the matched database, whether the input data is in standard form; if not, standardizing the data through the data filtering script pool; and if so, performing the data annotation step and returning the result. A biological information database annotation system is also disclosed. The invention can automatically download and install a large number of biological information databases and complete the corresponding annotation work. Introducing a database pool, a data filtering script pool and a data annotation function library greatly improves the efficiency and flexibility of gene annotation, and integrating multiple database collections makes biological information data annotation considerably more convenient.

Description

Biological information database annotation method and system

Technical Field

The invention relates to the technical field of biological information databases, in particular to a biological information database annotation method and system.

Background

Over the past decades, biological information database resources have developed rapidly, growing in both variety and capability. Comprehensive databases: for example, NCBI provides search tools for nucleic acids, proteins, gene names, genome names and so on, together with the PubMed literature database, Taxonomy data, the COG protein family library, etc. Genome browsers: UCSC and Ensembl, for example, provide a large number of genome-associated databases. Ontology/pathway resources: for example, DAVID integrates many databases such as GO (Gene Ontology), KEGG and gene ID information for biological information mining. Genetic variation annotation tools: for example, ANNOVAR integrates nearly 50 databases related to gene variation, greatly facilitating the annotation of genomics data.

At present, a large number of biological information databases greatly facilitate related work. However, several difficulties remain when using these databases:

1) Compatibility of input data with reference data. Input data come from many sources and in many formats, and reference databases likewise come in various formats, such as plain text files and SQL-like databases. Before any annotation or analysis, the input data and the reference database must therefore be standardized. Until now, users have mostly written their own conversion scripts or adjusted the data manually; no systematic, complete set of data filters exists to match the various types of input data to reference databases.

2) The single-threaded mode of annotation tools is poorly suited to large-scale data analysis. A considerable number of data annotation tools still annotate in a single thread or a single data stream, which matches neither today's mainstream cluster computing nor the demands of massive data sets.

3) Database sources are scattered. When several kinds of annotation are needed, multiple separate operations are required, which greatly slows annotation and lengthens the data analysis cycle.

4) Only a few database annotation tools can integrate a user's own databases, and those that can offer little flexibility in doing so.

Disclosure of Invention

In view of the above-mentioned shortcomings, the present invention provides a method for annotating a biological information database, which can solve the above problems.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

A biological information database annotation method, comprising the following steps:

establishing a data filtering script pool;

establishing a database pool comprising a plurality of databases;

receiving input data and initializing the input data to match a database through a database pool;

judging whether the input data is standard or not according to database matching;

if not, carrying out data standardization through a data filtering script pool;

and if the standard is met, returning after the data annotation step is carried out.
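The control flow of these steps can be sketched as follows. This is a minimal illustration written in Python, not the R-based implementation the invention describes; the function names (`annotate_input`, `is_standard`, `normalize`, `annotate`) and the toy record layout are assumptions made for the example. The sketch also assumes that data which has just been standardized proceeds to the annotation step.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def annotate_input(
    records: List[Record],
    is_standard: Callable[[Record], bool],
    normalize: Callable[[Record], Record],
    annotate: Callable[[Record], Record],
) -> List[Record]:
    """Standardize records that fail the format check, then annotate every record."""
    results = []
    for rec in records:
        if not is_standard(rec):       # judged against the matched reference format
            rec = normalize(rec)       # function taken from the data filtering script pool
        results.append(annotate(rec))  # data annotation step, then return
    return results


# Toy reference database (hypothetical): chromosomes are stored without a "chr" prefix.
reference = {("1", "100"): "geneA"}


def is_standard(rec: Record) -> bool:
    return not rec["chr"].startswith("chr")


def normalize(rec: Record) -> Record:
    return {**rec, "chr": rec["chr"][3:]}


def annotate(rec: Record) -> Record:
    return {**rec, "gene": reference.get((rec["chr"], rec["pos"]), "NA")}


annotated = annotate_input(
    [{"chr": "chr1", "pos": "100"}, {"chr": "2", "pos": "5"}],
    is_standard,
    normalize,
    annotate,
)
```

Passing the check, the standardization function and the annotation function in as parameters mirrors the idea that each is selected from a pool or library rather than hard-coded.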

According to one aspect of the invention, the establishing a database pool comprising a plurality of databases comprises: integrating the ANNOVAR existing database and the public database to form a database pool.

According to one aspect of the invention, establishing a database pool comprising a plurality of databases comprises: downloading and installing the data in the database pool, and generating meta information for each corresponding database for management and updating.

According to one aspect of the invention, returning after the data annotation step comprises: annotating the data with the corresponding annotation function in the data annotation function library, and returning the data according to the other parameters input by the user.

According to one aspect of the invention, judging whether the input data is standard according to database matching comprises: judging whether the input data conforms to the format of the corresponding reference database.

According to one aspect of the invention, the data annotation step comprises: linking the database pool, the data standardization function library and the data annotation function library through the annotation function to complete annotation against a single database.

According to one aspect of the invention, the data annotation step comprises: linking the database pool, the data filtering script pool and the data annotation function library through the annotation function to complete annotation against a plurality of databases simultaneously.

According to one aspect of the invention, the biological information database annotation method comprises: automatically splitting the input data and processing the resulting pieces in parallel to complete the analysis against a plurality of databases.

A biological information database annotation system comprises a database pool, a data filtering script pool, a database annotation module and a database management module, wherein the database annotation module comprises a data annotation function library, the database pool is formed by integrating an ANNOVAR existing database and a public database, and the data filtering script pool comprises a standardized function library.

The advantages of implementing the invention are as follows. The biological information database annotation method of the invention comprises: establishing a data filtering script pool; establishing a database pool comprising a plurality of databases; receiving input data, initializing it, and matching it to a database through the database pool; judging, according to the matched database, whether the input data is in standard form; if not, standardizing the data through the data filtering script pool; and if so, performing the data annotation step and returning the result. The downloading and installation of a large number of biological information databases, and the corresponding annotation work, can thus be completed automatically. In particular, introducing a database pool, a data filtering script pool and a data annotation function library greatly improves the efficiency and flexibility of gene annotation. In addition, using R packages for parallel computation and for large-data processing greatly improves the performance of the R-based annotation tool. Moreover, integrating multiple database collections such as ANNOVAR and DAVID makes biological information data annotation considerably more convenient. Finally, a data filtering script pool is established for standardizing different input data against the reference database: the user only needs to select the corresponding function from the data filtering script pool and pass it to annovarR, and data standardization is completed automatically so that the relevant data can be annotated and analyzed.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a method for annotating a biological information database according to the present invention;

FIG. 2 is a schematic diagram of a biological information database annotation system according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1 and FIG. 2, a biological information database annotation method includes the following steps:

Step S1: establishing a database pool comprising a plurality of databases;

Step S1 may be implemented as follows: the database pool is formed by integrating the existing ANNOVAR databases with other public databases. The other public databases include all published biological information databases currently in use, such as the NCBI public databases, the DAVID database, UniProt, the Pfam database, and the like.

In practical applications, the databases in the established pool can be downloaded, managed and automatically constructed through the database downloading, construction and management module, as follows.

In practical applications, the data in the database pool is downloaded and installed through a download function, which also generates meta information for the corresponding database to facilitate later management and updating.
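As an illustration of the meta information idea, the following Python sketch writes a small JSON record next to an installed database file and later uses it to decide whether an update is needed. The field names, the `.meta.json` suffix, and the helpers `write_meta` and `needs_update` are hypothetical; the patent does not specify the content or format of the meta information, and the actual download step is omitted here.

```python
import hashlib
import json
import os
import tempfile
from datetime import date


def write_meta(db_path: str, name: str, version: str) -> str:
    """Write a JSON sidecar describing an installed database file; return its path."""
    with open(db_path, "rb") as fh:
        checksum = hashlib.sha256(fh.read()).hexdigest()
    meta = {
        "name": name,
        "version": version,
        "file": os.path.basename(db_path),
        "size_bytes": os.path.getsize(db_path),
        "sha256": checksum,
        "installed": date.today().isoformat(),
    }
    meta_path = db_path + ".meta.json"
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump(meta, fh, indent=2)
    return meta_path


def needs_update(meta_path: str, latest_version: str) -> bool:
    """Compare the recorded version with the latest known version."""
    with open(meta_path, encoding="utf-8") as fh:
        return json.load(fh)["version"] != latest_version


# Stand-in for a downloaded database file (no network access in this sketch).
workdir = tempfile.mkdtemp()
db_file = os.path.join(workdir, "example_db.txt")
with open(db_file, "w", encoding="utf-8") as fh:
    fh.write("chr\tpos\tgene\n1\t100\tgeneA\n")

meta_file = write_meta(db_file, name="example_db", version="2018-01")
```

Keeping the record beside the data file means the management module can inspect or refresh a database without opening the database itself.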

The database pool may further include the following functions:

auto build in sqlite: automatically constructs a database in SQLite format, wherein the input file is a text file;

mysql. Automatically constructing a database in a MySQL format, wherein an input file is a text file;

sql2sqlite: automatically constructing a database in an SQLite format, wherein an input file is an SQL file;

sql2mysql: automatically constructing a database in MySQL format, wherein the input file is an SQL file;

arrival. Names: obtains the names of all databases supported in the current database pool.
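The automatic construction of an SQLite database from a text file, and the listing of available database names, can be sketched as follows. The sketch uses Python's standard `sqlite3` and `csv` modules instead of the R functions listed above; `build_sqlite_from_text` and `list_database_names` are hypothetical names, and the tab-delimited layout with a header row is an assumption.

```python
import csv
import io
import sqlite3


def build_sqlite_from_text(conn, table, text_stream, delimiter="\t"):
    """Create `table` from a delimited text stream whose first row is the header."""
    reader = csv.reader(text_stream, delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    # The remaining rows of the reader are inserted one by one.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return len(header)


def list_database_names(conn):
    """Return the names of all tables (databases in the pool) in this SQLite file."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
text = io.StringIO("chr\tpos\tgene\n1\t100\tgeneA\n2\t200\tgeneB\n")
build_sqlite_from_text(conn, "example_db", text)
```

An in-memory connection is used so the example leaves no files behind; a real pool would pass a file path to `sqlite3.connect`.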

Step S2: establishing a data filtering script pool;

Step S2 may be implemented as follows: the data filtering script pool is established in the R language according to the databases in the database pool; data filtering scripts are written for the different data formats of the different databases in the pool, and together they form the data filtering script pool.

Specifically, a data normalization function is generated for each database data format, so that data can be normalized into the format of the corresponding database.
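One way to picture a data filtering script pool is as a registry that maps each database format to its normalization function. The Python sketch below is illustrative only (the described implementation is in R); the format keys, the two filters and the helper names `register_filter` and `standardize` are all assumptions.

```python
from typing import Callable, Dict

Record = Dict[str, str]

# The pool: one normalization function per (hypothetical) database format key.
FILTER_POOL: Dict[str, Callable[[Record], Record]] = {}


def register_filter(db_format: str):
    """Decorator that adds a normalization function to the pool under `db_format`."""
    def decorator(func):
        FILTER_POOL[db_format] = func
        return func
    return decorator


@register_filter("no_chr_prefix")
def strip_chr_prefix(rec: Record) -> Record:
    """For reference databases that store '7' rather than 'chr7'."""
    chrom = rec["chr"]
    return {**rec, "chr": chrom[3:] if chrom.lower().startswith("chr") else chrom}


@register_filter("uppercase_alleles")
def uppercase_alleles(rec: Record) -> Record:
    """For reference databases that store alleles in upper case."""
    return {**rec, "ref": rec["ref"].upper(), "alt": rec["alt"].upper()}


def standardize(rec: Record, db_format: str) -> Record:
    """Select the function for `db_format` from the pool and apply it."""
    return FILTER_POOL[db_format](rec)
```

With this layout, supporting a new database format amounts to registering one more function, which matches the flexibility the script pool is meant to provide.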

Step S3: receiving input data and initializing the input data to match a database through a database pool;

Step S3 may be implemented as follows: the input data is received and initialized, which mainly involves obtaining the data format of the input data and comparing it with the data formats of the databases in the database pool to find the matching reference database.
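A simple way to match input data to a reference database by format is to compare the input's column names with the columns each database requires. The Python sketch below assumes this column-based notion of "data format"; the pool contents and the function `match_databases` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Set

# Hypothetical pool: database name -> columns the input must provide to be matched.
DATABASE_POOL: Dict[str, Set[str]] = {
    "variant_db": {"chr", "start", "end", "ref", "alt"},
    "gene_db": {"gene"},
}


def match_databases(input_columns: List[str]) -> List[str]:
    """Return the names of all pool databases whose required columns are present."""
    columns = {col.strip().lower() for col in input_columns}
    return sorted(
        name for name, required in DATABASE_POOL.items() if required <= columns
    )
```

For example, an input with the columns `Chr`, `Start`, `End`, `Ref` and `Alt` would be matched to `variant_db` but not to `gene_db`.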

Step S4: judging whether the input data is standard or not according to database matching;

Step S4 may be implemented as follows: using the standard data format of the matched reference database, it is judged whether the input data conforms to that format. If so, the input data is treated as standardized data; if not, it is treated as non-standard data that requires standardization.
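The format check in this step can be sketched as a per-field validation against the reference database's expected format. In the Python sketch below, the field names and regular expressions are assumptions chosen for a toy variant table; the patent does not define the concrete rules.

```python
import re
from typing import Dict

# Hypothetical standard format of a matched reference database.
REFERENCE_FORMAT = {
    "chr": re.compile(r"([1-9]|1[0-9]|2[0-2]|X|Y|MT)"),  # no "chr" prefix
    "start": re.compile(r"[0-9]+"),
    "ref": re.compile(r"[ACGT-]+"),
    "alt": re.compile(r"[ACGT-]+"),
}


def is_standard(rec: Dict[str, str]) -> bool:
    """True if every required field is present and fully matches its pattern."""
    return all(
        field in rec and pattern.fullmatch(rec[field]) is not None
        for field, pattern in REFERENCE_FORMAT.items()
    )
```

A record that fails this check would be routed to the data filtering script pool for standardization before annotation.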

Step S5: if not, carrying out data standardization through a data filtering script pool;

Step S5 may be implemented as follows: the different input data are normalized against the reference database, mainly by means of a normalization function.

Step S6: if the standard is met, returning after the data annotation step is carried out.

In practical applications, returning after the data annotation step includes: annotating the data with the corresponding annotation function in the data annotation function library, and returning the data according to the other parameters input by the user.

In practical applications, the data annotation step includes: completing annotation against a single database by linking the database pool, the data standardization function library and the data annotation function library through the annotation function.

Specifically, the function annotation is the annotation function: it links the database pool, the data standardization function library and the data annotation function library to complete annotation against a single database.
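Annotation against a single database can be pictured as a keyed lookup of each input record in one reference table. The following Python sketch uses the standard `sqlite3` module and a toy table keyed by chromosome and position; the function `annotate_single`, the table and its columns are assumptions, and the sketch does not reproduce the R annotation function described above.

```python
import sqlite3


def annotate_single(conn, input_rows, db_name, value_col):
    """Look up each (chr, pos) input row in one reference table of the pool."""
    query = f'SELECT "{value_col}" FROM "{db_name}" WHERE chr = ? AND pos = ?'
    annotated = []
    for chrom, pos in input_rows:
        hit = conn.execute(query, (chrom, pos)).fetchone()
        # Rows without a match keep None as their annotation value.
        annotated.append((chrom, pos, hit[0] if hit else None))
    return annotated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_db (chr TEXT, pos TEXT, gene TEXT)")
conn.executemany(
    "INSERT INTO example_db VALUES (?, ?, ?)",
    [("1", "100", "geneA"), ("2", "200", "geneB")],
)
single_result = annotate_single(conn, [("1", "100"), ("3", "5")], "example_db", "gene")
```

The input rows are assumed to be already standardized, so the lookup keys can be compared directly with the reference table.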

In practical applications, the data annotation step may further include: linking the database pool, the data filtering script pool and the data annotation function library through the annotation function to complete annotation against several databases simultaneously (in parallel).

Specifically, the function association. And annotating the function, namely linking the database pool, the data filtering script pool and the data annotating function pool, and simultaneously completing a plurality of database annotations (paralleling).
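Annotating against several databases at once can be sketched by running one lookup task per database and merging the resulting columns. The Python sketch below uses a thread pool from the standard library as a stand-in for the R parallel computation packages mentioned in this document; the in-memory databases and the names `annotate_one` and `annotate_many` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical reference databases, keyed by (chr, pos).
DATABASES = {
    "gene_db": {("1", "100"): "geneA"},
    "freq_db": {("1", "100"): "0.01", ("2", "200"): "0.20"},
}


def annotate_one(db_name, rows):
    """Annotate all rows against one database; return its name and value column."""
    reference = DATABASES[db_name]
    return db_name, [reference.get(row) for row in rows]


def annotate_many(rows, db_names, workers=2):
    """Run one annotation task per database in parallel and merge the columns."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(annotate_one, name, rows) for name in db_names]
        columns = dict(future.result() for future in futures)
    return [
        {"chr": chrom, "pos": pos, **{name: columns[name][i] for name in db_names}}
        for i, (chrom, pos) in enumerate(rows)
    ]


merged = annotate_many([("1", "100"), ("2", "200")], ["gene_db", "freq_db"])
```

Each database contributes one column to the merged output, so the user receives a single table regardless of how many databases were queried.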

In practical application, the biological information database annotation method comprises the following steps: and automatically segmenting input data and processing the segmented input data in parallel to finish the analysis work of a plurality of databases.

File by the functional function paraannotation. Big: merge can automatically divide and process input data in parallel, generate ff (fast access to big data on disk) objects, and accelerate big data annotation.
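The split-and-process-in-parallel idea can be sketched as follows. This Python example divides the input into fixed-size chunks, annotates the chunks concurrently, and concatenates the results in input order. It is an illustration under stated assumptions only: it uses threads instead of the R parallel packages, it does not reproduce the on-disk ff objects, and the names `split_into_chunks`, `annotate_chunk` and `annotate_big` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def split_into_chunks(rows, chunk_size):
    """Yield successive lists of at most `chunk_size` items from `rows`."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def annotate_chunk(chunk, reference):
    """Annotate one chunk by dictionary lookup; unmatched keys map to None."""
    return [(key, reference.get(key)) for key in chunk]


def annotate_big(rows, reference, chunk_size=2, workers=4):
    """Split the input, annotate the chunks in parallel, and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in submission order, so input order is kept.
        parts = pool.map(
            lambda chunk: annotate_chunk(chunk, reference),
            split_into_chunks(rows, chunk_size),
        )
        return [item for part in parts for item in part]
```

Because each chunk is independent, the chunk size can be tuned to the available memory while the number of workers is tuned to the available cores.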

ANNOVAR-related databases and the other biological information databases that people use most frequently are integrated, so that analysis against a plurality of databases can be completed with a single operation.

A programmable interface is provided for steps such as data preprocessing and data annotation, so that users can conveniently develop their own annotation databases.

Example two

As shown in FIG. 2, a biological information database annotation system includes a database pool, a data filtering script pool, a database annotation module and a database management module. The database annotation module includes a data annotation function library; the database pool is formed by integrating the existing ANNOVAR databases and public databases; and the data filtering script pool includes a standardization function library. The database annotation module is mainly used for data annotation: the input data is matched against the databases and annotated using efficient R packages such as RSQLite, RMySQL, data. .

The advantages of implementing the invention are as follows. The biological information database annotation method of the invention comprises: establishing a data filtering script pool; establishing a database pool comprising a plurality of databases; receiving input data, initializing it, and matching it to a database through the database pool; judging, according to the matched database, whether the input data is in standard form; if not, standardizing the data through the data filtering script pool; and if so, performing the data annotation step and returning the result. The downloading and installation of a large number of biological information databases, and the corresponding annotation work, can thus be completed automatically. In particular, introducing a database pool, a data filtering script pool and a data annotation function library greatly improves the efficiency and flexibility of gene annotation. In addition, using R packages for parallel computation and for large-data processing greatly improves the performance of the R-based annotation tool. Moreover, integrating multiple database collections such as ANNOVAR and DAVID makes biological information data annotation considerably more convenient. Finally, a data filtering script pool is established for standardizing different input data against the reference database: the user only needs to select the corresponding function from the data filtering script pool and pass it to annovarR, and data standardization is completed automatically so that the relevant data can be annotated and analyzed.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention disclosed herein are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A biological information database annotation method, characterized in that it comprises the following steps:

establishing a data filtering script pool for standardizing different input data against a reference database, wherein a user only needs to select the corresponding function from the data filtering script pool and pass it to annovarR for data standardization to be completed automatically, so that the relevant data can be annotated and analyzed;

establishing a database pool comprising a plurality of databases;

judging whether the input data are standard according to database matching, comprising the following steps: judging whether the input data conforms to the format of the corresponding reference database;

if not, carrying out data standardization through a data filtering script pool;

if the standard is met, returning after the data annotation step is carried out, which comprises: annotating the data with the corresponding annotation function in the data annotation function library, and returning the data according to the other parameters input by the user.

2. The method for annotating a bioinformatic database according to claim 1, wherein said establishing a database pool comprising a plurality of databases comprises: the ANNOVAR existing database and the public database are integrated to form a database pool.

3. The method for annotating a biological information database according to claim 2, wherein said establishing a database pool comprising a plurality of databases comprises: downloading and installing the data in the database pool, and generating meta information for each corresponding database for management and updating.

4. The bioinformatic database annotation method of claim 1, wherein said step of performing data annotation comprises: completing annotation against a single database by linking the database pool, the data standardization function library and the data annotation function library through the annotation function.

5. The bioinformatic database annotation method of claim 1, wherein said step of performing data annotation comprises: linking the database pool, the data filtering script pool and the data annotation function library through the annotation function to complete annotation against a plurality of databases simultaneously.

6. The biological information database annotation method according to one of claims 1 to 5, characterized in that it comprises: automatically splitting the input data and processing the resulting pieces in parallel to complete the analysis against a plurality of databases.

7. A biological information database annotation system, characterized by comprising a database pool, a data filtering script pool, a database annotation module and a database management module, wherein: the database pool receives input data and initializes it to match a database; whether the input data is standard is judged according to database matching, which comprises judging whether the input data conforms to the format of the corresponding reference database; if the input data is not standard, data standardization is performed through the data filtering script pool; if the input data is standard, data annotation is performed and the result is returned; the database annotation module comprises a data annotation function library; the database pool is formed by integrating the existing ANNOVAR databases and public databases; and the data filtering script pool comprises a standardization function library used for standardizing different input data against the reference database, a user only needing to select the corresponding function from the standardization function library and pass it to annovarR for data standardization to be completed automatically, so that the relevant data can be annotated and analyzed.