CN107169310A - Method and system for constructing a genetic testing knowledge base

Method and system for constructing a genetic testing knowledge base

Info

Publication number
CN107169310A
Authority
CN
China
Prior art keywords
database
knowledge base
contingency table
genetic test
public database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710165871.2A
Other languages
Chinese (zh)
Other versions
CN107169310B (en
Inventor
江月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Base Biotechnology Co Ltd
Original Assignee
Shanghai Base Biotechnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Base Biotechnology Co Ltd
Priority to CN201710165871.2A
Publication of CN107169310A
Application granted
Publication of CN107169310B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Physiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and system for constructing a genetic testing knowledge base, comprising: building the database entity tables, the association tables derived from public databases, and the association tables derived from text mining; building an association table scoring system; and building a matching database management system. The knowledge base is built in a semi-automated, semi-manual manner that takes into account both the bias and omissions of manual curation and the differing credibility of public databases. Text mining algorithms and public database collection software are integrated into the automatic collection step. A core library and a candidate library are designed, together with a check library in which entries with ambiguous scores are verified manually. Database content is managed through automatic updates, and the matching knowledge base management system makes it easy to search, verify, modify, and version the database content.

Description

Method and system for constructing a genetic testing knowledge base
Technical field
The present invention relates to the field of bioinformatics databases, and in particular to a method and system for constructing a genetic testing knowledge base.
Background
The development of second-generation sequencing has made it increasingly easy to obtain the basic genome sequence and to learn one's own genetic information. Many scientific studies have shown that many human diseases, phenotypes, and drug responses originate in differences in individual genetic background, that is, differences in each person's DNA sequence. Since the Human Genome Project in 2000, more and more human genomes have been decoded, and together they make up the human reference genome sequences. The direct purpose of genetic testing is to sequence an individual's genome (or selected regions), obtain the DNA-level sequence differences between it and the reference genome, and then compare those differences against an existing knowledge base to predict possible associations with disease, phenotype, and drug response.
Information associating DNA sequence variants with disease, phenotype, and drug response generally comes from the following sources:
GWAS (genome-wide association study): genetic information is mined from a group of people who share a specific disease, phenotype, or drug response to find the differences between them and the reference sequence. It can then be inferred that these DNA-level differences may ultimately cause the disease, the phenotype, or the differing response to a drug.
Biochemical and genetic experiments in traditional biological research: traditional research methods also identify a large number of gene, protein, and enzyme variants related to disease, phenotype, or drug response. Mapped back to the DNA level, each corresponds to a class of DNA-level variants. For example, if a disease is caused by a variant in the transmembrane region of a protein that disrupts an ion channel, then any DNA-level variant in the sequence encoding that protein that affects its transmembrane domain may cause the disease. Some of these experiments are carried out in model organisms and cell lines.
Other sources: differences in gene and protein expression between disease samples and control samples, function prediction by computational biology methods, and so on.
These sources of evidence differ in credibility, and the elements they associate with disease and other outcomes also differ. Applying them to the relationship between DNA-level sequence variants and disease therefore requires different recording and downstream processing, and a knowledge base must be designed and built accordingly.
Building the knowledge base is the core step in setting up a genetic testing analysis pipeline. Existing knowledge bases, however, have the following problems:
Building a knowledge base consumes considerable manpower and resources; it usually requires integrating public databases and having professionals read the scientific literature.
Public databases vary in quality, and manual collection always introduces some bias.
Many knowledge bases mix information from different levels such as DNA, RNA, and protein. For example, GWAS results are at the DNA level, research on some disease-related genes is at the protein level, and some research on drug metabolism is at the enzyme and metabolite level. Although all of these are eventually attributed to DNA-level variants, they should be distinguished in advance when the database is built, so that subsequent algorithms can be developed.
Knowledge bases can be built automatically with algorithms such as text mining, but a knowledge base that relies entirely on algorithms has high false positive and false negative rates, which strongly affects downstream disease risk analysis in genetic testing.
After many knowledge bases are completed, existing association entries still need to be updated automatically as new scientific results appear, but because the structure of these databases is fixed, automatic updating is difficult to implement.
Many knowledge bases rely on general-purpose database management languages, which are inconvenient for researchers with a biology background.
Bioinformatics-related databases among existing invention patent applications under examination include: a bioinformatics database system and data processing method (application number: 201410009130.1) and a bioinformatics database construction method and device (application number: 201410742604.3).
The first discloses a bioinformatics database system and data processing method that enables unified management of bioinformatics data. The system contains sample, project, and experiment modules, and its main purpose is to facilitate experimental design and data processing and to improve working efficiency. The second discloses a method and device for building a bioinformatics database, mainly by text mining of PubMed abstracts: disease-related abstracts are decomposed, gene and mutation information is extracted and classified according to a semantic library of mutation regular expressions, a disease-related semantic library is built, gene mutation scores are determined, and the bioinformatics database is finally built.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a method and system for constructing a genetic testing knowledge base, including a method for building the genetic testing knowledge base and, once the database is built, a management system based on it.
The object of the present invention is achieved by the following technical solutions:
The present invention provides a method for constructing a genetic testing knowledge base, the improvement being that the method comprises:
Building the database entity tables, the association tables derived from public databases, and the association tables derived from text mining;
Building an association table scoring system;
Building a matching database management system that makes it easy for experts to search, verify, and modify the association tables.
Further, before building the database entity tables, the association tables derived from public databases, and the association tables derived from text mining, the method also includes: collecting and organizing public databases, and determining the database structure according to the public databases.
Further, building the database entity tables includes building tables of genetic sample information covering disease, phenotype, and environmental factors;
The environmental factor table is entered manually;
When integrating ID matching information between different databases, biological elements are mapped to genomic positions, and two records are matched by whether the intersection of their positions exceeds 0.5 of their union;
Disease names are matched using the matching forms recorded by the databases themselves; names that cannot be matched are not entered.
Further, building the association tables derived from public databases includes:
Dividing the public databases into tiers;
Scoring according to tier, with scores ranging from -1 to 1; a score above 0 indicates positive correlation and a score below 0 indicates negative correlation.
Further, building the association tables derived from text mining includes:
Collecting literature abstracts from the past 25 years;
Classifying the abstracts;
Scoring the evidence level, including: scoring the positive or negative relationship expressed by the verb; summing the scores from different documents and also summing their absolute values; if the directly summed score is below the threshold, extracting the evidence-level abstract and placing it in the check table. The final score is at most 1 and at least -1.
Further, the association table scoring system is built from the association tables derived from public databases and the association tables derived from text mining, including:
Integrating the scores obtained from public database mining with the scores obtained from literature mining;
The final association entries come from these two sources; all association entries are aggregated and ranked by association score, a threshold is set, and the entries are placed in the core association table or the candidate association table accordingly;
If the score of an entry from the public databases and the score of the corresponding entry from the literature differ by more than the threshold, the entry is placed in the check table;
For disease association entries with a definite OR value, if no conflicting or widely differing entry exists (the fold difference between OR values is within 0.5 to 2), the entry goes directly into the core association table; if such an entry exists, it is placed in the check table.
Further, the matching database management system provides database viewing, database search, database verification, database version management, a log query page, and an operating specification page.
Further, after the matching database management system is built, the method also includes manual verification of the database check table; verification is done by browsing and manually modifying entries in the database management system.
In summary, the invention also provides a system for constructing a genetic testing knowledge base, the improvement being that the system comprises:
A first building module, for building the database entity tables, the association tables derived from public databases, and the association tables derived from text mining;
A second building module, for building the association table scoring system;
A third building module, for building the matching database management system.
Further, the system also includes a collection module that collects and organizes public databases and determines the database structure according to the public databases.
Compared with the closest prior art, the technical solution provided by the present invention achieves the following beneficial effects:
1. The knowledge base is built in a semi-automated, semi-manual manner that takes into account both the bias and omissions of manual curation and the differing credibility of public databases.
2. Database content is managed through automatic updates.
3. Text mining algorithms and public database collection software are integrated into the automatic collection step.
4. A core library and a candidate library are designed, and a check library is used to manually verify entries with ambiguous scores.
5. A matching knowledge base management system is designed, making it easy to search, verify, modify, and version the database content.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for constructing a genetic testing knowledge base;
Fig. 2 is a schematic diagram of the knowledge database structure.
Detailed description of the embodiments
To make the object, technical solutions, and advantages of the present invention clearer, the technical solutions are described in detail below. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the present invention.
First preferred technical solution
Tables in a knowledge base generally fall into two categories: entity tables and association tables. An entity table records the elements in the database and their basic information. The elements in a typical biological knowledge base include DNA-level sequence differences, genes, transcripts, proteins, metabolites, environmental factors, diseases, phenotypes, drugs, and so on. DNA-level sequence differences can be divided by complexity into:
* SNP (single nucleotide polymorphism): the most common DNA sequence difference, a single-base mutation that accounts for more than 90% of known DNA polymorphisms; SNPs are widely distributed and comparatively well studied. Following the ACMG guidelines, SNPs are generally standardized according to the dbSNP database. The frequency of a SNP in the population is also important for risk calculation; 1000 Genomes and HapMap data are generally used.
* indel (insertion and deletion): insertion or deletion of a small fragment, typically 2 to 16 bp in length and not exceeding 1 kb. Naming also follows the dbSNP database.
* CNV (copy number variation): at least 50 bp long, mostly 1 kb to 3 Mb, and may include large-fragment CNVs. Its subclasses are DEL (deletion), DUP (duplication), INS (insertion), and LCV (large-scale CNV). The dbVar database standard recommended by ACMG is generally used; if no entry exists there, the DGV database is used.
* SV (structural variation): structural variation of large fragments. In the broad definition SV includes CNV; to distinguish the two, SV here includes only INV (inversion) and translocation, that is, chromosomal inversion and translocation. The dbVar database standard recommended by ACMG is generally used here as well; if no entry exists there, the DGV database is used.
If no matching entry exists in the databases above, the present invention provides a separate standardized entry format.
Genes, transcripts, and proteins follow the RefSeq database; if there is no match, the Ensembl ENSG (ENST, ENSP) identifiers are used, and the Official Gene Symbol is recorded for use in text mining.
Enzymes and metabolites follow the annotation of the KEGG database.
Disease names mainly follow the OMIM database recommended by ACMG, supplemented by the ICD10, MalaCards, MedGen, and Orphanet databases.
Phenotype names mainly follow the HPO (Human Phenotype Ontology) annotation, together with the MeSH and UMLS databases.
Drug names mainly follow the DrugBank database.
The above make up the entity tables of the knowledge base.
An association table records the associations between elements in the database. The first group is the associations between biological elements:
* Among DNA sequence variants, the association between SNPs is their linkage disequilibrium (LD). It is used to analyze SNP independence in disease risk calculation, based on data from the 1000 Genomes Project.
* Gene -> transcript -> protein (enzyme): the elements are linked by transcription and translation under the central dogma. These links are used to predict the effect of a DNA sequence variant on the encoded protein, and hence its association with protein-related disease. They rely mainly on the RefSeq database; the effect of a DNA sequence change on the encoded protein is determined with the ANNOVAR algorithm.
* Gene-gene associations rely mainly on genetic interaction and coevolution information (BioGRID, MINT); protein-protein associations rely mainly on protein interaction networks (BioGRID, MINT); associations between proteins (enzymes) and metabolites rely mainly on the KEGG metabolite database. These relationships mainly serve to interpret mutations and to predict their relationship to disease (phenotype, drug response).
* The relationship between DNA sequence variants and disease relies mainly on the ClinVar, HGMD, and DECIPHER databases; the relationship between genes (proteins) and disease relies mainly on databases such as OMIM, UniProt, GAD, and DisGeNET;
* The relationship between DNA sequence variants/genes/proteins and phenotype relies mainly on the HPO database;
* The relationship between DNA sequence variants/genes/proteins and drugs relies mainly on databases such as DrugBank and PharmGKB.
Other relationships, such as that between environmental factors and disease, have no ready-made public database; they are all obtained by text mining of PubMed.
Building the knowledge base in a semi-automatic, semi-manual manner
The knowledge base uses a modular, layered design. The database entities are divided by biological element into five levels (DNA, RNA, protein, enzyme and metabolite, and environmental factors) and by research purpose into three modules (disease, phenotype, and drug). For the elements in each level and module, the entity tables the knowledge base needs are built preferentially from information collected in public databases recognized by ACMG (the American College of Medical Genetics and Genomics). In addition, a dedicated evidence library records the source of every record in the knowledge base.
For the association tables between entities, experts first screen the public databases from which evidence is to be mined and score the credibility of each database. Automatic database collection software then gathers the association information, and an evidence assessment program gives a credibility score for each piece of evidence (combining the credibility of the database, the credibility of the database entry, and the degree of overlap between records from different databases); the score is normalized to the range -1 to 1.
* To account for false positives and false negatives in public database information, the present invention includes a text mining algorithm for the PubMed literature database. The algorithm downloads the latest full set of PubMed abstracts and first matches each document by keyword against the content of the three modules, building an association table between documents and diseases (phenotypes, drugs). All elements in the five levels are then matched within the abstracts related to each disease (phenotype, drug); associations are filtered (uncorrelated) and scored (positive or negative correlation) according to semantics, and for complex diseases with a definite OR value, the source and the sentence giving the OR value are recorded. The scores from different documents are then aggregated and corrected by the total number of documents, and the score is normalized to the range -1 to 1.
The final association entries come from these two sources. All entries are aggregated and ranked by association score, a threshold is set, and the entries are placed in the core association table or the candidate association table accordingly. If an entry from the public databases and the corresponding entry from the literature differ substantially (score difference greater than 0.5; the threshold is adjustable), the entry is placed in the check table. For complex-disease association entries with a definite OR value, if no conflicting or widely differing association entry exists (the fold difference between OR values is within 0.5 to 2; the threshold is adjustable), the entry goes directly into the core association table; if such an entry exists, it is placed in the check table.
The contents of the check table are handed to human experts, and according to their decision each association entry is placed in the core association table or the candidate association table.
The knowledge base is updated automatically every month (the update cycle is adjustable). Each update repeats the following work: check whether the public databases have been updated and, if so, rescore the entries; update to the latest full set of PubMed abstracts, rerun the algorithm above, and score the entries; and, according to the results, place the association entries back into the core association table or the candidate association table. If an entry's table assignment changes, however, it is placed in the check table for manual verification by a designated person.
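The update rule in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the table labels, the function names `assign_table` and `refresh`, the use of the score magnitude, and the 0.5 core threshold are hypothetical, since the text says only that a threshold is set.

```python
from typing import Dict

CORE, CANDIDATE, CHECK = "core", "candidate", "check"


def assign_table(score: float, core_threshold: float = 0.5) -> str:
    """Place an entry by the magnitude of its association score.

    The 0.5 default is an assumed value; the source only says a threshold is set.
    """
    return CORE if abs(score) >= core_threshold else CANDIDATE


def refresh(previous: Dict[str, str],
            new_scores: Dict[str, float],
            core_threshold: float = 0.5) -> Dict[str, str]:
    """Re-assign entries after an update cycle.

    `previous` maps entry IDs to the table they were in before the update.
    An entry whose table would change is routed to the check table for
    manual verification instead of being moved automatically.
    """
    updated: Dict[str, str] = {}
    for entry_id, score in new_scores.items():
        table = assign_table(score, core_threshold)
        old = previous.get(entry_id)
        if old is not None and old != table:
            updated[entry_id] = CHECK   # table assignment changed
        else:
            updated[entry_id] = table   # unchanged, or a brand-new entry
    return updated
```

Routing changed entries to the check table rather than moving them directly keeps a human in the loop exactly where the automatic scores disagree with the previous release.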
So that biology experts without a database background can manually verify and query the database, the present invention provides a dedicated supporting database management system with query, modification, and logging functions, allowing biology experts to verify the database content in a timely manner. In addition, the database is backed up on a schedule depending on whether it has been modified, and versions can be fixed according to product development progress, ensuring that the database used in the production environment is traceable.
Second preferred technical solution
The present invention provides a method for constructing a genetic testing knowledge base. Its flow chart is shown in Fig. 1, and it comprises the following steps:
1. The first step is the collection and organization of public databases. The databases collected so far are: dbSNP, 1000 Genomes Project, dbVar, DGV, RefSeq, ENSG, ClinVar, HGMD, DECIPHER, OMIM, HPO, MeSH, UMLS, DrugBank, BioGRID, MINT, UniProt, GAD, DisGeNET, PharmGKB, KEGG, and PubMed Abstract. Except for PubMed Abstract, the raw data of each database is processed into tab-delimited files in the following format:
2. The second step is the structural design of the knowledge database; the database structure is shown in Fig. 2. In each module, a designated person enters the name and primary key of each public database, and independently developed software automatically generates the initial integrated tab-delimited files of the database. Fig. 2 lists the names of all files that need to be prepared and the source each file is associated with. At this stage there are also auxiliary files that store information such as ID matching between different databases.
3. The third step is building the entity tables in the database. The entity tables are Snp (based on dbSNP), Var (DNA sequence variants other than SNPs, based on dbVar), Gene (including transcript, protein, and enzyme information, based on RefSeq), Env (environmental factor table), Disease (disease information, based on the OMIM and ICD10 databases), Pheno (phenotype information, based on HPO), and Drug (drugs, based on DrugBank). The environmental factor table is entered manually from experience. When integrating ID matching information between different databases, biological elements are mapped to genomic positions, and two records are matched by whether the intersection of their positions exceeds 0.5 of their union. Disease names are matched using the matching forms recorded by databases such as OMIM themselves; disease names from disease information databases other than OMIM, Orphanet, and ICD10 are not entered if they cannot be matched.
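The position-based matching rule for biological elements amounts to an intersection-over-union test on genomic intervals. The sketch below illustrates that test; the `Region` type, the function names, and the choice of 1-based inclusive coordinates are assumptions made for the example, not details given in the text.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive


def overlap_ratio(a: Region, b: Region) -> float:
    """Length of the intersection divided by the length of the union."""
    if a.chrom != b.chrom:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start) + 1
    if inter <= 0:
        return 0.0
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union


def same_element(a: Region, b: Region, threshold: float = 0.5) -> bool:
    """Treat two database records as the same element when the
    intersection exceeds `threshold` of the union (0.5 in the text)."""
    return overlap_ratio(a, b) > threshold
```

For example, two 100 bp records on the same chromosome that share 50 bp have a ratio of 50/150 and are kept separate, while records that share 90 of 100 bp are merged.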
4. The fourth step is building the association tables based on public databases. Independently developed software scores the association evidence (scores range from -1 to 1; a score above 0 indicates positive correlation and a score below 0 indicates negative correlation; most databases record positive correlations). The scoring method is:
* Databases are tiered and scored. Taking disease-gene associations as an example, the databases are divided into two tiers: UniProt and OMIM form the first tier, and GAD and DisGeNET form the second. The first tier is assigned a score of 0.3 and the second a score of 0.1. If an association in the original database already carries a score, that score is normalized to the range 0 to 1 and multiplied by the database's own score. The scores of identical records from different databases are then summed, up to a maximum of 1.
* For each association table, the databases are tiered in advance by experts.
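A minimal sketch of the tiered scoring just described, using the disease-gene tier weights from the text. The record layout, the function names, and the requirement that native scores arrive already normalized to the range 0 to 1 are assumptions for illustration; the disease and gene names in any usage are placeholders, not real associations.

```python
from typing import Dict, Iterable, Optional, Tuple

# Tier weights for disease-gene associations, as given in the text.
TIER_WEIGHT = {"UniProt": 0.3, "OMIM": 0.3, "GAD": 0.1, "DisGeNET": 0.1}

# (disease, gene, database, native score already normalized to 0..1, or None)
Record = Tuple[str, str, str, Optional[float]]


def record_score(database: str, native_score: Optional[float] = None) -> float:
    """Score one database record: the tier weight, multiplied by the
    database's own normalized score when it provides one."""
    weight = TIER_WEIGHT[database]
    if native_score is None:
        return weight
    if not 0.0 <= native_score <= 1.0:
        raise ValueError("native_score must be normalized to the range 0..1")
    return weight * native_score


def combine(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Sum the scores of identical associations across databases, capped at 1."""
    totals: Dict[Tuple[str, str], float] = {}
    for disease, gene, database, native in records:
        key = (disease, gene)
        totals[key] = totals.get(key, 0.0) + record_score(database, native)
    return {key: min(total, 1.0) for key, total in totals.items()}
```

With these weights, an association reported by OMIM without a native score and by DisGeNET with a normalized native score of 0.5 receives 0.3 + 0.1 × 0.5 = 0.35.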
5. The fifth step is building the association tables based on text mining. First, the API provided by the official PubMed website is used to download the abstracts of all documents published since 1990. Text mining is performed with independently developed software (written in Python with the NLTK natural language processing package). The basic procedure is as follows:
* Read in the abstract file (plain text format) and determine whether it carries MH (MeSH database) annotations. If it does, extract them and compare them against all entity tables (and all alias tables) in the database. If not, tokenize the abstract (AB) directly, classify the tokens by part of speech, and compare all nouns against all entity tables (and all alias tables). At this point the comparison is mainly against the disease_info and disease_alias tables (drug and pheno are handled similarly). The main purpose of this step is to classify the abstracts (one document may fall into several categories).
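The classification step can be sketched as below for a record in MEDLINE text format, where fields appear as tagged lines such as `MH  - ...` and `AB  - ...`. This is a simplified stand-in, not the patented software: it uses only the standard library, and where the text calls for NLTK part-of-speech tagging to select nouns it instead matches word n-grams directly against the alias table. The alias map and entity identifiers are hypothetical.

```python
import re
from typing import Dict, List, Set


def parse_medline(record: str) -> Dict[str, List[str]]:
    """Collect tagged fields (e.g. MH, AB) from one MEDLINE-format record."""
    fields: Dict[str, List[str]] = {}
    tag = None
    for line in record.splitlines():
        match = re.match(r"^([A-Z]{2,4})\s*- (.*)$", line)
        if match:
            tag = match.group(1)
            fields.setdefault(tag, []).append(match.group(2).strip())
        elif tag and line.startswith(" "):
            # Continuation line of the previous field.
            fields[tag][-1] += " " + line.strip()
    return fields


def ngrams(tokens: List[str], max_n: int = 3) -> Set[str]:
    """All word n-grams of length 1..max_n, joined by spaces."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def classify(record: str, alias_to_entity: Dict[str, str]) -> Set[str]:
    """Return every entity the abstract is classified under.

    MeSH headings are used when present; otherwise the abstract text is
    tokenized and matched against the alias table.
    """
    fields = parse_medline(record)
    if "MH" in fields:
        # Drop the major-topic marker '*' and any '/qualifier' suffix.
        terms = {h.split("/")[0].lstrip("*").strip().lower() for h in fields["MH"]}
    else:
        text = " ".join(fields.get("AB", [])).lower()
        terms = ngrams(re.findall(r"[a-z0-9]+", text))
    return {alias_to_entity[t] for t in terms if t in alias_to_entity}
```

Because the function returns a set, one abstract can be assigned to several categories at once, as the text allows.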
* For DNA sequence-level variants, the Snp_alias.txt file uses the RS number of a SNP as the primary key and collects its HGVS names, in the order rsID (e.g. rs1024323) > base change (e.g. NC_000004.11:g.3006043C>T) > transcript change (e.g. NM_001004056.1:c.329C>T) > amino acid change (e.g. NP_001004057.1:p.Ala142Val, NP_001004057.1:p.A142V) > position (e.g. 37:chr4;3006043;C;T), plus the current common gene name GRK4. A sentence matches if rs1024323 appears in it directly, or if GRK4 and A142V appear in it together, as in the following:
e.g.: We hypothesized that 3 nonsynonymous GRK4 single-nucleotide polymorphisms, R65L (rs2960306), A142V (rs1024323), and A486V (rs1801058), would be associated with blood pressure response to atenolol, but not hydrochlorothiazide, and would be associated with long-term cardiovascular outcomes (all-cause death, nonfatal myocardial infarction, nonfatal stroke) in participants treated with an atenolol-based versus verapamil-SR-based antihypertensive strategy.
e.g.: METHODS: Participants from the African American Study of Kidney Disease and Hypertension (AASK) trial were genotyped at three GRK4 polymorphisms: R65L, A142V, and A486V.
The corresponding sentence is then extracted.
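The matching rule for variant mentions (the rsID on its own, or the gene symbol together with the amino acid change) can be illustrated as follows. The alias dictionary mirrors the rs1024323 example from the text, but its layout, the function names, and the naive sentence splitter are assumptions made for the sketch.

```python
import re
from typing import Dict, List

# Hypothetical in-memory form of one Snp_alias.txt entry.
ALIASES: Dict[str, Dict[str, object]] = {
    "rs1024323": {"gene": "GRK4", "protein_changes": ["A142V", "Ala142Val"]},
}


def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def has_word(word: str, sentence: str) -> bool:
    """True when `word` occurs as a whole token in `sentence`."""
    return re.search(rf"\b{re.escape(word)}\b", sentence) is not None


def mentions(sentence: str, rs_id: str) -> bool:
    """True if the sentence names the rsID directly, or names the gene
    symbol and one of the amino acid changes together."""
    if re.search(rf"\b{re.escape(rs_id)}\b", sentence, re.IGNORECASE):
        return True
    alias = ALIASES[rs_id]
    gene_hit = has_word(str(alias["gene"]), sentence)
    change_hit = any(has_word(c, sentence) for c in alias["protein_changes"])
    return gene_hit and change_hit


def extract(text: str, rs_id: str) -> List[str]:
    """Return the sentences of `text` that mention the variant."""
    return [s for s in split_sentences(text) if mentions(s, rs_id)]
```

Requiring the gene symbol alongside the amino acid change avoids false matches, since a short label such as A142V is not unique to one gene.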
* For documents in the same category, for example atenolol here, the first example above shows a relationship with A142V. The relevant descriptive sentence, "A142V (rs1024323), ... would be associated with blood pressure response to atenolol", is extracted from the abstract; tokenization then yields the verb and its positive or negative sentiment classification, and the relationship between the mutation and the drug is recorded together with the whole sentence.
* The evidence level can also be scored at this point, including scoring the positive or negative relationship expressed by the verb. An explicit statement of association such as "associated with" is given 0.3 points; wording that weakens the relationship, such as "might", is given 0.1 points. Negative associations are given -0.3 and -0.1 points respectively. The scores from different documents are summed, and their absolute values are summed as well; if the directly summed score is less than 0.5 times the sum of absolute values (the threshold is adjustable), the record is extracted and placed in the check table. The final score is at most 1 and at least -1.
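The evidence scoring and conflict test can be sketched as below. The point values (0.3, 0.1, and their negatives), the 0.5 ratio, and the clamp to the range -1 to 1 come from the text. The cue word lists, the handling of "not associated" as an uncorrelated statement scored 0, and the comparison on the magnitude of the signed sum are assumptions of this sketch.

```python
import re
from typing import List, Tuple

HEDGE = re.compile(r"\b(might|may|could)\b")                 # assumed weakening cues
NEGATIVE = re.compile(r"\b(inversely|negatively)\b")         # assumed negative-direction cues
UNRELATED = re.compile(r"\bnot\s+(\w+\s+)?associated\b")     # treated as no association


def sentence_score(sentence: str) -> float:
    """Score one extracted sentence: 0.3 for an explicit association,
    0.1 when hedged, with the sign flipped for a negative association."""
    s = sentence.lower()
    if UNRELATED.search(s) or "associated with" not in s:
        return 0.0
    magnitude = 0.1 if HEDGE.search(s) else 0.3
    return -magnitude if NEGATIVE.search(s) else magnitude


def aggregate(doc_scores: List[float], conflict_ratio: float = 0.5) -> Tuple[float, bool]:
    """Combine per-document scores.

    Returns the final score clamped to [-1, 1] and a flag saying whether the
    record should go to the check table because the documents disagree, i.e.
    the signed sum is small relative to the sum of absolute values.
    """
    signed = sum(doc_scores)
    absolute = sum(abs(s) for s in doc_scores)
    needs_check = absolute > 0 and abs(signed) < conflict_ratio * absolute
    final = max(-1.0, min(1.0, signed))
    return final, needs_check
```

Comparing the signed sum with the sum of absolute values measures disagreement: when every document points the same way the two are equal in magnitude, and the more the documents contradict each other, the smaller the signed sum becomes relative to the absolute sum.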
6. The sixth step is the contingency table scoring system. Its main purpose is to integrate the scores obtained from public database mining with the scores obtained from literature mining. The final associated entries come from these two sources. All entries are pooled and sorted by relation score, a threshold is set, and the entries are placed in the core contingency table or the candidate contingency table accordingly. If an entry from a public database and the corresponding entry from the literature differ greatly (score difference greater than 0.5; the threshold is adjustable), the entry is placed in the checklist. For complex diseases with associated entries that have definite OR (odds ratio) values: if no associated entries conflict or differ greatly (OR fold difference within 0.5~2; the threshold is adjustable), the entry goes directly into the core contingency table; otherwise it is placed in the checklist.
7. The seventh step is building the accompanying database management system. The system's code is written in advance and matches the structure of the original database; secondary development can be carried out according to the actual needs of the database. The configuration required by the accompanying system is as follows:
(1) The database is developed with MySQL.
(2) The management website is developed with Django; it handles searching and modifying database data and manages all the HTML pages.
(3) Pages are written in HTML5, with CSS defining the style sheets and JavaScript implementing complex page actions.
(4) The system is backed up using Linux shell scripts.
The system is built on a Linux CentOS platform with a 4-core CPU, 4 GB of memory and a 1 TB hard disk.
8. The eighth step is manual verification of the database checklist. This verification is carried out in the management system: after logging in with an account, R&D experts browse the entries and modify them manually.
9. The ninth step is testing the accompanying management system. The management system includes: database viewing, database search, database verification, database version management, a log query page and an operating specification page.
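The scoring logic of the sixth step above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the system's actual code: averaging the two source scores, the core threshold of 0.6 (the text only says that a threshold is set) and all function names are hypothetical, and conflicts in the direction of OR values are not modelled, only their fold difference.

```python
def combined_score(db_score, lit_score):
    """Average the available scores; one source may be missing (None), not both."""
    scores = [s for s in (db_score, lit_score) if s is not None]
    return sum(scores) / len(scores)


def classify_entry(db_score, lit_score, core_threshold=0.6, max_gap=0.5):
    """Return 'check', 'core' or 'candidate' for one associated entry."""
    if db_score is not None and lit_score is not None:
        # The two sources disagree too much: send to manual verification.
        if abs(db_score - lit_score) > max_gap:
            return "check"
    strength = abs(combined_score(db_score, lit_score))
    return "core" if strength >= core_threshold else "candidate"


def odds_ratios_consistent(odds_ratios, low=0.5, high=2.0):
    """True if the fold difference between the reported OR values stays within [low, high]."""
    smallest, largest = min(odds_ratios), max(odds_ratios)
    return smallest / largest >= low and largest / smallest <= high


def build_tables(entries):
    """entries: iterable of (name, db_score, lit_score) tuples."""
    tables = {"core": [], "candidate": [], "check": []}
    # Pool all entries and sort them by combined relation score, highest first.
    ranked = sorted(entries, key=lambda e: combined_score(e[1], e[2]), reverse=True)
    for name, db_score, lit_score in ranked:
        tables[classify_entry(db_score, lit_score)].append(name)
    return tables
```

Both thresholds are keyword arguments because the text describes them as adjustable.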
The data in the database built by the method of the present invention come not only from literature abstracts but are also mined directly from public databases, and the mined information is not limited to diseases but also covers phenotypes, environmental factors and so on, which makes it more comprehensive. In addition, the present invention includes a management system accompanying the database.
3rd optimal technical scheme
Based on the same inventive concept, the present invention also provides a genetic test knowledge base construction system, comprising:
a first building module, for building the database entity table, the contingency table of public databases and the contingency table of text mining;
a second building module, for building the contingency table scoring system;
a third building module, for building the accompanying database management system.
The system further comprises a collection module for collecting and organizing public databases and determining the database structure according to the public databases.
The foregoing is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (10)

1. A genetic test knowledge base construction method, characterized in that the method comprises:
building a database entity table, a contingency table of public databases and a contingency table of text mining;
building a contingency table scoring system;
building an accompanying database management system, so that experts can search, verify and modify the contingency tables.
2. The genetic test knowledge base construction method as claimed in claim 1, characterized in that, before building the database entity table, the contingency table of public databases and the contingency table of text mining, the method further comprises: collecting and organizing public databases, and determining the database structure according to the public databases.
3. The genetic test knowledge base construction method as claimed in claim 1, characterized in that building the database entity table includes tables of cDNA sample information covering disease, phenotype and environmental factors;
the environmental factor table is entered manually;
when integrating the matching information between the IDs of different databases, biological elements are mapped to genomic positions and matched by judging whether the intersection of the positions is greater than 0.5 of their union;
for disease names, the matching tables recorded in the databases themselves are used, and entries that cannot be matched are not entered.
4. The genetic test knowledge base construction method as claimed in claim 1, characterized in that building the contingency table of public databases comprises:
dividing the public databases into layers;
scoring according to the layer, with scores ranging from -1 to 1, where a score greater than 0 indicates a positive correlation and a score less than 0 indicates a negative correlation.
5. The genetic test knowledge base construction method as claimed in claim 1, characterized in that building the contingency table of text mining comprises:
collecting literature abstracts from the past 25 years;
classifying the literature abstracts;
scoring the evidence level, where the score reflects whether the verb phrase expresses a positive or a negative relation; summing the scores of different documents and separately summing their absolute values; if the direct sum is less than the threshold, extracting the literature abstracts for that evidence level and placing them in the checklist; the final score is at most 1 point and at least -1 point.
6. The genetic test knowledge base construction method as claimed in claim 1, characterized in that building the contingency table scoring system according to the contingency table of public databases and the contingency table of text mining comprises:
integrating the scores obtained from public database mining with the scores obtained from literature mining;
taking the final associated entries from these two sources, pooling and sorting all associated entries by relation score, setting a threshold, and placing the entries in the core contingency table or the candidate contingency table accordingly;
if the score difference between an entry from a public database and the corresponding entry from the literature is greater than the threshold, placing the entry in the checklist;
for diseases with associated entries that have definite OR values, if there is no conflict and the OR fold difference is within 0.5~2, placing the entry directly in the core contingency table; otherwise, placing it in the checklist.
7. The genetic test knowledge base construction method as claimed in claim 1, characterized in that the accompanying database management system provides database viewing, database search, database verification, database version management, a log query page and an operating specification page.
8. The genetic test knowledge base construction method as claimed in claim 1, characterized in that, after building the accompanying database management system, the method further comprises: manual verification of the database checklist, in which entries are browsed and manually modified in the database management system.
9. A genetic test knowledge base construction system, characterized in that the system comprises:
a first building module, for building the database entity table, the contingency table of public databases and the contingency table of text mining;
a second building module, for building the contingency table scoring system;
a third building module, for building the accompanying database management system.
10. The genetic test knowledge base construction system as claimed in claim 1, characterized in that the system further comprises a collection module for collecting and organizing public databases and determining the database structure according to the public databases.
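The position-based matching rule of claim 3 (intersection greater than 0.5 of the union) can be illustrated with a short Python sketch. It is a sketch under assumptions, not the claimed implementation: the (chromosome, start, end) tuple layout, the inclusive coordinates and the function name same_element are all hypothetical.

```python
def same_element(a, b, min_ratio=0.5):
    """Decide whether two database records describe the same biological element.

    a, b: (chromosome, start, end) tuples with inclusive coordinates (assumed layout).
    Returns True when the overlap is more than min_ratio of the union.
    """
    if a[0] != b[0]:
        return False  # different chromosomes never match
    intersection = min(a[2], b[2]) - max(a[1], b[1]) + 1
    if intersection <= 0:
        return False  # the intervals do not overlap
    length_a = a[2] - a[1] + 1
    length_b = b[2] - b[1] + 1
    union = length_a + length_b - intersection
    return intersection / union > min_ratio
```

For example, chr4:100-199 and chr4:110-199 share 90 of 100 covered positions and would be treated as the same element, whereas chr4:100-199 and chr4:150-249 share only 50 of 150 and would not.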
CN201710165871.2A 2017-03-20 2017-03-20 Gene detection knowledge base construction method and system Expired - Fee Related CN107169310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710165871.2A CN107169310B (en) 2017-03-20 2017-03-20 Gene detection knowledge base construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710165871.2A CN107169310B (en) 2017-03-20 2017-03-20 Gene detection knowledge base construction method and system

Publications (2)

Publication Number Publication Date
CN107169310A true CN107169310A (en) 2017-09-15
CN107169310B CN107169310B (en) 2020-06-26

Family

ID=59848863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710165871.2A Expired - Fee Related CN107169310B (en) 2017-03-20 2017-03-20 Gene detection knowledge base construction method and system

Country Status (1)

Country Link
CN (1) CN107169310B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108172296A (en) * 2018-01-23 2018-06-15 上海其明信息技术有限公司 A kind of method for building up of database and the Risk Forecast Method of genetic disease
CN108229105A (en) * 2018-02-05 2018-06-29 南京组慧生物科技有限公司 A kind of construction method of disease genome database and disease genome database
CN108363902A (en) * 2018-01-30 2018-08-03 成都奇恩生物科技有限公司 A kind of accurate prediction technique of pathogenic hereditary variation
CN109086574A (en) * 2018-08-16 2018-12-25 国家卫生计生委科学技术研究所 Disease related protein database
CN109460464A (en) * 2018-11-23 2019-03-12 北京羽扇智信息科技有限公司 Knowledge Discovery Method, device, electronic equipment and storage medium
CN110827916A (en) * 2019-10-24 2020-02-21 南方医科大学南方医院 Schizophrenia gene-gene interaction network and construction method thereof
CN110931084A (en) * 2018-08-31 2020-03-27 国际商业机器公司 Extraction and normalization of mutant genes from unstructured text for cognitive search and analysis
CN111192626A (en) * 2019-12-31 2020-05-22 中南大学湘雅医院 Construction method, device, server and storage medium of Parkinson disease genomics association model
CN111597161A (en) * 2020-05-27 2020-08-28 北京诺禾致源科技股份有限公司 Information processing system, information processing method and device
CN112687326A (en) * 2019-10-18 2021-04-20 北京希望组生物科技有限公司 Gene and phenotype associated knowledge base, construction method and application thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194187A1 (en) * 2001-05-16 2002-12-19 Mcneil John Multi-paradigm knowledge-bases
CN1720524A (en) * 2002-10-29 2006-01-11 埃里·阿博 Knowledge system method and apparatus
CN102622346A (en) * 2011-01-26 2012-08-01 中国科学院上海生命科学研究院 Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database
US20130144882A1 (en) * 2011-12-03 2013-06-06 Medeolinx, LLC Multidimensional integrative expression profiling for sample classification
CN103714180A (en) * 2014-01-08 2014-04-09 浪潮(北京)电子信息产业有限公司 Bioinformatics database system and data processing method
CN105740243A (en) * 2014-12-08 2016-07-06 深圳华大基因研究院 Method and device for constructing biological information database

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MAHDI ASADPOUR et al.: 《Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007)》, 18 December 2007 *
ROCÍO C. ROMERO-ZALIZ et al.: "A Multiobjective Evolutionary Conceptual Clustering Methodology for Gene Annotation Within Structural Databases: A Case of Study on the Gene Ontology Database", 《IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION》 *
何飞 et al.: "RiceDB:基于Web界面的水稻基因芯片注释整合数据库", 《中国水稻科学》 *
王凤格 et al.: "玉米品种DNA指纹数据库构建的标准化规范", 《分子植物育种》 *
王栋 et al.: "抑郁症相关基因数据库的构建", 《中国临床康复》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108172296A (en) * 2018-01-23 2018-06-15 上海其明信息技术有限公司 A kind of method for building up of database and the Risk Forecast Method of genetic disease
CN108363902A (en) * 2018-01-30 2018-08-03 成都奇恩生物科技有限公司 A kind of accurate prediction technique of pathogenic hereditary variation
CN108363902B (en) * 2018-01-30 2022-02-25 成都奇恩生物科技有限公司 Accurate prediction method for pathogenic genetic variation
CN108229105A (en) * 2018-02-05 2018-06-29 南京组慧生物科技有限公司 A kind of construction method of disease genome database and disease genome database
CN109086574A (en) * 2018-08-16 2018-12-25 国家卫生计生委科学技术研究所 Disease related protein database
CN109086574B (en) * 2018-08-16 2022-01-07 国家卫生健康委科学技术研究所 Disease-associated protein database
CN110931084A (en) * 2018-08-31 2020-03-27 国际商业机器公司 Extraction and normalization of mutant genes from unstructured text for cognitive search and analysis
CN110931084B (en) * 2018-08-31 2024-04-16 国际商业机器公司 Extraction and normalization of mutant genes from unstructured text for cognitive searching and analysis
CN109460464A (en) * 2018-11-23 2019-03-12 北京羽扇智信息科技有限公司 Knowledge Discovery Method, device, electronic equipment and storage medium
CN112687326A (en) * 2019-10-18 2021-04-20 北京希望组生物科技有限公司 Gene and phenotype associated knowledge base, construction method and application thereof
CN110827916A (en) * 2019-10-24 2020-02-21 南方医科大学南方医院 Schizophrenia gene-gene interaction network and construction method thereof
CN111192626A (en) * 2019-12-31 2020-05-22 中南大学湘雅医院 Construction method, device, server and storage medium of Parkinson disease genomics association model
CN111597161A (en) * 2020-05-27 2020-08-28 北京诺禾致源科技股份有限公司 Information processing system, information processing method and device

Also Published As

Publication number Publication date
CN107169310B (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN107169310A (en) A kind of genetic test construction of knowledge base method and system
CA3018186C (en) Genetic variant-phenotype analysis system and methods of use
Chen et al. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline
Seifuddin et al. lncRNAKB, a knowledgebase of tissue-specific functional annotation and trait association of long noncoding RNA
US7856317B2 (en) Systems and methods for constructing genomic-based phenotypic models
Verspoor et al. Annotating the biomedical literature for the human variome
CN110570905A (en) method and device for constructing omics data analysis platform and computer equipment
Knowles et al. Grape RNA-Seq analysis pipeline environment
Gefen et al. Syndrome to gene (S2G): in‐silico identification of candidate genes for human diseases
Alterovitz et al. Knowledge-based bioinformatics: from analysis to interpretation
Tripp et al. Sleepless nights: When you can't find anything to use but molecules to describe new taxa
Tang et al. MAC: merging assemblies by using adjacency algebraic model and classification
Srivastava et al. NetSeekR: a network analysis pipeline for RNA-Seq time series data
Szakonyi et al. The KnownLeaf literature curation system captures knowledge about Arabidopsis leaf growth and development and facilitates integrated data mining
Li et al. PanSVR: Pan-genome augmented short read realignment for sensitive detection of structural variations
Martin et al. The genolevures database
Pastor et al. A conceptual modeling approach to improve human genome understanding
Leveugle et al. ParaDB: a tool for paralogy mapping in vertebrate genomes
Padilla-Iglesias et al. Deep history of cultural and linguistic evolution among Central African hunter-gatherers
Lesk et al. Database annotation in molecular biology
CN118280456B (en) Mitochondrial DNA data normalization method and integrated application platform
Mielczarek et al. Extraordinary Command Line: Basic Data Editing Tools for Biologists Dealing with Sequence Data
Wang et al. Shenjie Wang1, 2, Yuqian Liu1, 2, Juan Wang1, 2, 3*, Xiaoyan Zhu 1, 2, Yuzhi Shi3, Xuwen Wang1, 2, Tao Liu3, Xiao Xiao2, 4 and Jiayin Wang1, 2
Becker Machine Learning Methods for Complex Structural Variation Analysis
Tripathi Harmonization of SNP identifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200626