CN106295252B - Search method for gene products

Info

Publication number
CN106295252B
Authority
CN
China
Prior art keywords
gene
keyword
prod
unique features
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610687440.8A
Other languages
Chinese (zh)
Other versions
CN106295252A (en)
Inventor
刘杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Bree Lan Technology Co Ltd
Original Assignee
Hangzhou Bree Lan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Bree Lan Technology Co Ltd
Priority to CN201610687440.8A
Publication of CN106295252A
Application granted
Publication of CN106295252B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention provides a search method for gene products and belongs to the field of information retrieval. The method comprises: constructing a homologous gene database; obtaining a keyword to be searched; determining the unique feature label corresponding to the keyword; expanding the keyword according to the unique feature label to obtain an expanded keyword; and performing a network search with the expanded keyword. Because the unique feature label is obtained from the keyword to be searched and the keyword is expanded on the basis of that label, the expanded keyword carries multiple constraints corresponding to the original keyword. The web-wide search performed with the expanded keyword can therefore find the Internet resources most relevant to the keyword, and interference from unrelated resources in the search results is reduced.

Description

Search method for gene products
Technical field
The invention belongs to the field of information retrieval and in particular relates to a search method for gene products.
Background art
With the development of sequencing technology, genome sequencing of many species has been completed one after another. With the rapid development of Internet technology, Internet-based search for genes and related material such as gene literature and gene products has become an industry trend.
To date, the number of gene entries in the gene database of the U.S. National Center for Biotechnology Information (NCBI) exceeds 1,003,000,000. However, because of historical naming rules and the existence of homologous genes, each gene may have, besides its gene number (gene ID), a gene full name, a gene symbol, and aliases (alias, synonym) used in the industry. Gene literature and gene products therefore cannot be indexed under one unified name. As a result, searching for information and products related to a specific gene with a single gene name as the keyword is inefficient, and the results tend to contain irrelevant data or to miss relevant data. This makes later searching very difficult.
Summary of the invention
To overcome the shortcomings and defects of the prior art, the invention provides a search method for gene products that improves retrieval efficiency.
To achieve this technical purpose, the invention provides a search method for gene products, the search method comprising:
constructing a homologous gene database from gene numbers, gene symbols, gene full names and aliases;
obtaining a keyword to be searched, and determining from the homologous gene database the unique feature label corresponding to the keyword;
expanding the keyword according to the unique feature label, in combination with the gene number, gene symbol, gene full name and aliases, to obtain an expanded keyword;
performing a network search with the expanded keyword and outputting the search results.
Optionally, the search method further comprises:
constructing a search database containing gene literature and gene products, the search database holding a unique feature label for each item of gene literature and each gene product.
Optionally, the search method further comprises:
selecting from the search database the search results, including gene literature and/or gene products, that correspond to the unique feature label;
outputting the search results.
Optionally, expanding the keyword according to the unique feature label, in combination with the gene number, gene symbol, gene full name and aliases, to obtain the expanded keyword comprises:
determining, according to the unique feature label, the target gene number, target gene symbol, target gene full name and aliases corresponding to the unique feature label;
starting from the keyword, combining the target gene number, the target gene symbol, the target gene full name and the aliases through OR logic to obtain the expanded keyword.
Optionally:
the unique feature label is a character string containing a sequence byte and a check byte.
Optionally, the homologous gene database holds labels corresponding to the gene number, gene symbol, gene full name and aliases.
Optionally, the expanded keyword is a character string that contains at least the gene number, gene symbol, gene full name and aliases.
Optionally, the method further comprises:
obtaining species gene data from a gene sub-database and screening the species gene data against a comparison database to obtain cross-species orthologous (directly homologous) genes;
based on the cross-species orthologous genes, performing expansion matching in the gene sub-database, using an identical gene full name or gene number as the criterion, to obtain an orthologous-gene keyword data set, and building a non-redundant database from that data set;
selecting from the non-redundant database the expanded keyword that matches the keyword.
Optionally, screening the species gene data against the comparison database to obtain the cross-species orthologous genes comprises:
extracting from the comparison database the sample gene data corresponding to the species gene data, and de-duplicating the species gene data against the sample gene data to obtain the screened cross-species orthologous genes.
Optionally, each item of orthologous-gene keyword data stored in the non-redundant database is unique.
The technical solution provided by the invention has the following beneficial effects:
The unique feature label is obtained from the keyword to be searched, the keyword is expanded on the basis of that label, and a web-wide search is then performed with the resulting expanded keyword. Because the expanded keyword carries multiple constraints corresponding to the keyword to be searched, the search can find the Internet resources most relevant to the keyword, and interference from unrelated resources in the search results is reduced.
Brief description of the drawings
To illustrate the technical solution of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the search method for gene products provided by the invention;
Fig. 2 is a flow diagram of the way of obtaining the expanded keyword provided by the invention.
Detailed description of the embodiments
To make the structure and advantages of the invention clearer, the invention is further described below with reference to the drawings.
Embodiment one
The invention provides a search method for gene products. As shown in Fig. 1, the search method comprises:
11. Constructing a homologous gene database from gene numbers, gene symbols, gene full names and aliases.
12. Obtaining a keyword to be searched, and determining from the homologous gene database the unique feature label corresponding to the keyword.
13. Expanding the keyword according to the unique feature label, in combination with the gene number, gene symbol, gene full name and aliases, to obtain an expanded keyword.
14. Performing a network search with the expanded keyword and outputting the search results.
In implementation, so that the search results obtained from a keyword are as rich as possible and relevant to the gene, the search method first constructs a homologous gene database. This database contains a large number of gene numbers, gene symbols, gene full names and aliases, so that in later steps the gene number, gene symbol, gene full name and aliases associated with a keyword can be determined from the keyword's specific content. Next, once the keyword to be searched has been obtained, the unique feature label corresponding to the keyword is determined from the homologous gene database constructed in the previous step. The keyword is then expanded with the gene number and other content corresponding to the unique feature label, which yields the expanded keyword. Finally, a web-wide search is performed with the expanded keyword to obtain the search results.
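Steps 11 and 12 can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: the `GeneRecord` class, the `build_index` and `find_labels` helpers, the label format `L0001` and the sample entry are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """One keyword group: all names recorded for a single gene."""
    label: str        # unique feature label (format is hypothetical)
    gene_id: str      # gene number
    symbol: str
    full_name: str
    aliases: list = field(default_factory=list)

    def names(self):
        return [self.gene_id, self.symbol, self.full_name, *self.aliases]


def build_index(records):
    """Step 11: map each lower-cased name to the labels of the records using it."""
    index = {}
    for rec in records:
        for name in rec.names():
            index.setdefault(name.lower(), set()).add(rec.label)
    return index


def find_labels(index, keyword):
    """Step 12: return the labels whose keyword group contains the keyword."""
    return index.get(keyword.strip().lower(), set())


# Illustrative sample entry; not taken from the patent.
records = [GeneRecord("L0001", "7157", "TP53", "tumor protein p53", ["p53", "LFS1"])]
index = build_index(records)
print(find_labels(index, "P53"))
```

Indexing every name of a gene under the same label means that the gene number, the symbol, the full name and any alias all lead to the same entry, which is the property the later expansion step relies on.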
The step of obtaining the unique feature label is included so that the resources in the homologous gene database (gene numbers, gene symbols, gene full names and aliases) can be used to expand the keyword and constrain it precisely. This allows the search to find the Internet resources most relevant to the keyword and reduces interference from unrelated resources in the search results.
Note that when the unique feature label is determined in step 12, the keyword may correspond to exactly one keyword group in the homologous gene database, in which case the unique feature label is determined directly from that keyword group. If more than one keyword group in the homologous gene database corresponds to the keyword to be searched, the closest keyword group must be chosen from among them, and the unique feature label of the chosen keyword group is then used for the subsequent processing steps.
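The text does not say how the "closest" keyword group is chosen. The sketch below assumes one possible rule, purely for illustration: a match on the gene number outranks a match on the symbol, which outranks the full name, which outranks an alias. The function names, the dictionary layout and the sample groups are hypothetical.

```python
from typing import Optional


def match_rank(group: dict, keyword: str) -> Optional[int]:
    """Return the priority of the best-matching field, or None if nothing matches."""
    kw = keyword.strip().lower()
    fields = [
        (0, [group["gene_id"]]),
        (1, [group["symbol"]]),
        (2, [group["full_name"]]),
        (3, group["aliases"]),
    ]
    for rank, values in fields:
        if any(kw == v.lower() for v in values):
            return rank
    return None


def choose_label(groups: list, keyword: str) -> Optional[str]:
    """Pick the label of the closest group (assumed rule: lowest field rank wins)."""
    ranked = [(match_rank(g, keyword), g["label"]) for g in groups]
    ranked = [(r, lbl) for r, lbl in ranked if r is not None]
    # Ties on rank are broken by label order so the result is deterministic.
    return min(ranked)[1] if ranked else None


# Hypothetical groups: "XYZ" is an alias in one group and the symbol in another.
groups = [
    {"label": "L1", "gene_id": "1001", "symbol": "ABC1",
     "full_name": "example gene one", "aliases": ["XYZ"]},
    {"label": "L2", "gene_id": "1002", "symbol": "XYZ",
     "full_name": "example gene two", "aliases": []},
]
print(choose_label(groups, "xyz"))
```

Other closeness measures, such as string similarity or usage frequency, would fit the description equally well.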
Obtaining the expanded keyword in step 13 specifically comprises:
determining, according to the unique feature label, the target gene number, target gene symbol, target gene full name and aliases corresponding to the unique feature label;
starting from the keyword, combining the target gene number, the target gene symbol, the target gene full name and the aliases through OR logic to obtain the expanded keyword.
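The OR expansion can be sketched as below. The quoted-term `OR` syntax is an assumption about the target search engine's query language, and `expand_keyword` and the sample values are hypothetical.

```python
def expand_keyword(keyword, gene_id, symbol, full_name, aliases):
    """Join the keyword and all names of the target gene with OR, without repeats."""
    terms = []
    seen = set()
    for term in [keyword, gene_id, symbol, full_name, *aliases]:
        key = term.strip().lower()
        if key and key not in seen:   # skip blanks and case-insensitive duplicates
            seen.add(key)
            terms.append(term.strip())
    return " OR ".join(f'"{t}"' for t in terms)


# Illustrative sample values; not taken from the patent.
query = expand_keyword("p53", "7157", "TP53", "tumor protein p53", ["p53", "LFS1"])
print(query)
```

The original keyword comes first, and a name that repeats it (here the alias `p53`) is emitted only once.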
The unique feature label is a character string containing a sequence byte and a check byte, so that after the unique feature label has been determined, the computed sequence byte can be verified with the check byte. In addition, the homologous gene database holds labels corresponding to the gene number, gene symbol, gene full name and aliases. The expanded keyword that is obtained is a character string containing at least the gene number, gene symbol, gene full name and aliases.
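The patent does not give the layout of the label or the check algorithm. As a sketch under stated assumptions, the example below encodes a 4-byte sequence number followed by one check byte, where the check byte is the sum of the sequence bytes modulo 256, and writes the result as 10 hexadecimal characters. Both the layout and the functions `make_label` and `verify_label` are hypothetical.

```python
def make_label(sequence: int) -> str:
    """Encode a sequence number as 8 hex digits plus a 2-digit check byte."""
    seq_bytes = sequence.to_bytes(4, "big")
    check = sum(seq_bytes) % 256          # assumed check rule: byte sum mod 256
    return seq_bytes.hex().upper() + f"{check:02X}"


def verify_label(label: str) -> bool:
    """Recompute the check byte from the sequence bytes and compare."""
    if len(label) != 10:
        return False
    try:
        raw = bytes.fromhex(label)
    except ValueError:
        return False
    return len(raw) == 5 and sum(raw[:4]) % 256 == raw[4]


label = make_label(7157)
print(label, verify_label(label))
```

A simple byte sum only catches some corruptions; a CRC would be a stronger choice if accidental damage to stored labels is a concern.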
Specifically, the search method further comprises: constructing a search database containing gene literature and gene products, the search database holding a unique feature label for each item of gene literature and each gene product.
In implementation, in addition to expanding the keyword and performing a web-wide search with the expanded keyword as described above, the method also builds a search database and searches it by unique feature label to obtain the search results.
The search database in this step is a database of gene literature and gene products. The gene literature and gene products that may serve as search results are collected into the database in advance, and the content corresponding to each gene in the search database is assigned a unique feature label. After the unique feature label has been determined from the keyword, the search results corresponding to that label, including gene literature and/or gene products, can be selected from the search database and output. Selecting the content that corresponds to the keyword from the search database by unique feature label is faster and more accurate than a web-wide search over the Internet.
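A minimal in-memory sketch of such a search database follows. The `SearchDatabase` class, its method names, the label `L0001` and the sample titles are assumptions for illustration; a real system would more likely use a database table indexed on the label column.

```python
from collections import defaultdict


class SearchDatabase:
    """Gene literature and gene products, each stored under a unique feature label."""

    KINDS = ("literature", "product")

    def __init__(self):
        self._by_label = defaultdict(list)

    def add(self, label, kind, title):
        if kind not in self.KINDS:
            raise ValueError(f"unknown kind: {kind}")
        self._by_label[label].append({"kind": kind, "title": title})

    def search(self, label, kinds=KINDS):
        """Return the entries stored under the label, filtered by kind."""
        return [e for e in self._by_label.get(label, []) if e["kind"] in kinds]


# Hypothetical entries.
db = SearchDatabase()
db.add("L0001", "literature", "Example review article")
db.add("L0001", "product", "Example antibody product")
print(db.search("L0001", kinds=("product",)))
```

Because the lookup is a single key access on the label, no text matching against gene names is needed at query time.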
The first search mode above performs a web-wide search with the expanded keyword. Another way of obtaining the expanded keyword is described below; the detailed process is shown in Fig. 2.
21. Obtaining species gene data from a gene sub-database and screening the species gene data against a comparison database to obtain cross-species orthologous (directly homologous) genes.
22. Based on the cross-species orthologous genes, performing expansion matching in the gene sub-database, using an identical gene full name or gene number as the criterion, to obtain an orthologous-gene keyword data set, and building a non-redundant database from that data set.
23. Selecting from the non-redundant database the expanded keyword that matches the keyword.
In implementation, gene data for several species are collected from the gene sub-database of the National Center for Biotechnology Information (NCBI), and cross-species orthologous genes are screened out with the help of the HomoloGene database. The orthologous gene data are then expanded by matching in the gene sub-database, using an identical gene symbol (Symbol) or full name (full name) as the criterion. This produces the orthologous-gene keyword data set, from which a non-redundant database of gene symbols and names is built, and the expanded keyword that matches the keyword is selected from it.
In step 21, the species gene data are screened against the comparison database as follows: the sample gene data corresponding to the species gene data are extracted from the comparison database, and the species gene data are de-duplicated against the sample gene data to obtain the screened cross-species orthologous genes.
In addition, each item of orthologous-gene keyword data stored in the non-redundant database is unique.
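Steps 21 to 23 can be sketched as follows. This is an illustration under assumptions, not the patented procedure: the record layout, the group identifier `HG-1`, the function names and the sample records are hypothetical, and the matching criterion used is the identical symbol or full name mentioned in the implementation paragraph.

```python
def build_nonredundant(genes, homology_groups):
    """Build {group id: sorted unique keywords} from gene records and ortholog groups."""
    by_id = {}
    for g in genes:                       # de-duplicate: first record per gene ID wins
        by_id.setdefault(g["gene_id"], g)
    db = {}
    for group_id, member_ids in homology_groups.items():
        members = [by_id[i] for i in sorted(member_ids) if i in by_id]
        if len({m["species"] for m in members}) < 2:
            continue                      # keep only groups spanning several species
        symbols = {m["symbol"].lower() for m in members}
        full_names = {m["full_name"].lower() for m in members}
        keywords = set()
        for g in by_id.values():          # expansion matching on symbol or full name
            if g["symbol"].lower() in symbols or g["full_name"].lower() in full_names:
                keywords.update([g["gene_id"], g["symbol"], g["full_name"], *g["aliases"]])
        db[group_id] = sorted(keywords)   # a set guarantees each keyword appears once
    return db


def lookup(db, keyword):
    """Step 23: return the keyword set of the first group containing the keyword."""
    kw = keyword.strip().lower()
    for keywords in db.values():
        if any(kw == k.lower() for k in keywords):
            return keywords
    return []


# Illustrative sample records, including one verbatim duplicate.
human = {"species": "human", "gene_id": "7157", "symbol": "TP53",
         "full_name": "tumor protein p53", "aliases": ["p53"]}
mouse = {"species": "mouse", "gene_id": "22059", "symbol": "Trp53",
         "full_name": "transformation related protein 53", "aliases": ["p53"]}
other = {"species": "human", "gene_id": "999001", "symbol": "OTHER1",
         "full_name": "unrelated example gene", "aliases": []}
nr_db = build_nonredundant([human, dict(human), mouse, other], {"HG-1": {"7157", "22059"}})
print(lookup(nr_db, "trp53"))
```

Searching for the mouse symbol returns the names of both the mouse and the human gene, which is the cross-species expansion the second mode is meant to provide.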
The invention provides a search method for gene products, comprising: constructing a homologous gene database; obtaining a keyword to be searched; determining the unique feature label corresponding to the keyword; expanding the keyword according to the unique feature label to obtain an expanded keyword; and performing a network search with the expanded keyword. The unique feature label is obtained from the keyword to be searched, the keyword is expanded on the basis of that label, and a web-wide search is then performed with the resulting expanded keyword. Because the expanded keyword carries multiple constraints corresponding to the keyword to be searched, the search can find the Internet resources most relevant to the keyword, and interference from unrelated resources in the search results is reduced.
The serial numbers in the above embodiment are for description only and do not represent the order in which the components are assembled or used.
The above is only an embodiment of the invention and does not limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (9)

1. A search method for gene products, characterized in that the search method comprises:
constructing a homologous gene database from gene numbers, gene symbols, gene full names and aliases;
obtaining a keyword to be searched, and determining from the homologous gene database the unique feature label corresponding to the keyword;
expanding the keyword according to the unique feature label, in combination with the gene number, gene symbol, gene full name and aliases, to obtain an expanded keyword;
performing a network search with the expanded keyword and outputting the search results;
wherein obtaining the expanded keyword further comprises:
obtaining species gene data from a gene sub-database and screening the species gene data against a comparison database to obtain cross-species orthologous (directly homologous) genes;
based on the cross-species orthologous genes, performing expansion matching in the gene sub-database, using an identical gene full name or gene number as the criterion, to obtain an orthologous-gene keyword data set, and building a non-redundant database from that data set;
selecting from the non-redundant database the expanded keyword that matches the keyword.
2. The search method for gene products according to claim 1, characterized in that the search method further comprises:
constructing a search database containing gene literature and gene products, the search database holding a unique feature label for each item of gene literature and each gene product.
3. The search method for gene products according to claim 2, characterized in that the search method further comprises:
selecting from the search database the search results, including gene literature and/or gene products, that correspond to the unique feature label;
outputting the search results.
4. The search method for gene products according to claim 1, characterized in that expanding the keyword according to the unique feature label, in combination with the gene number, gene symbol, gene full name and aliases, to obtain the expanded keyword comprises:
determining, according to the unique feature label, the target gene number, target gene symbol, target gene full name and aliases corresponding to the unique feature label;
starting from the keyword, combining the target gene number, the target gene symbol, the target gene full name and the aliases through OR logic to obtain the expanded keyword.
5. The search method for gene products according to claim 1, characterized in that:
the unique feature label is a character string containing a sequence byte and a check byte.
6. The search method for gene products according to claim 1, characterized in that the homologous gene database holds labels corresponding to the gene number, gene symbol, gene full name and aliases.
7. The search method for gene products according to claim 1 or 5, characterized in that the expanded keyword is a character string that contains at least the gene number, gene symbol, gene full name and aliases.
8. The search method for gene products according to claim 1, characterized in that screening the species gene data against the comparison database to obtain the cross-species orthologous genes comprises:
extracting from the comparison database the sample gene data corresponding to the species gene data, and de-duplicating the species gene data against the sample gene data to obtain the screened cross-species orthologous genes.
9. The search method for gene products according to claim 1, characterized in that each item of orthologous-gene keyword data stored in the non-redundant database is unique.
CN201610687440.8A 2016-08-18 2016-08-18 Search method for gene prod Active CN106295252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610687440.8A CN106295252B (en) 2016-08-18 2016-08-18 Search method for gene prod

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610687440.8A CN106295252B (en) 2016-08-18 2016-08-18 Search method for gene prod

Publications (2)

Publication Number Publication Date
CN106295252A CN106295252A (en) 2017-01-04
CN106295252B true CN106295252B (en) 2019-05-07

Family

ID=57661318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610687440.8A Active CN106295252B (en) 2016-08-18 2016-08-18 Search method for gene prod

Country Status (1)

Country Link
CN (1) CN106295252B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428137A (en) * 2017-02-14 2018-08-21 阿里巴巴集团控股有限公司 Generate the method and device of abbreviation, verification electronic banking rightness of business
CN110349632B (en) * 2019-06-28 2020-06-16 南方医科大学 Method for screening gene keywords from PubMed literature
CN111540472B (en) * 2020-05-18 2023-06-20 霓蝶(上海)医疗科技有限公司 Intelligent risk assessment system and method for health activities
CN111739585B (en) * 2020-06-24 2022-10-18 胡嘉欣 Information extraction method based on NCBI database and related equipment thereof

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1744080A (en) * 2005-09-27 2006-03-08 南方医科大学 Specific function-related gene information searching system and method for building database of searching workds thereof
CN101201847A (en) * 2007-12-26 2008-06-18 北京东方灵盾科技有限公司 System and method for searching conventional medicament patent information
CN101266601A (en) * 2007-03-14 2008-09-17 沈诗昊 Gene chip data search engine
CN101539916A (en) * 2008-03-17 2009-09-23 亿维讯软件(北京)有限公司 Initial patent retrieving device, secondary patent retrieving device and patent retrieving system
CN101738196A (en) * 2009-12-10 2010-06-16 东软集团股份有限公司 Method and device of navigation equipment for information retrieval
CN102043812A (en) * 2009-10-13 2011-05-04 北京大学 Method and system for retrieving medical information
CN104090890A (en) * 2013-12-12 2014-10-08 深圳市腾讯计算机系统有限公司 Method, device and server for obtaining similarity of key words
CN105589936A (en) * 2015-12-11 2016-05-18 航天恒星科技有限公司 Data query method and system
CN105630813A (en) * 2014-10-30 2016-06-01 苏宁云商集团股份有限公司 Keyword recommendation method and system based on user-defined template
CN105740243A (en) * 2014-12-08 2016-07-06 深圳华大基因研究院 Method and device for constructing biological information database

Also Published As

Publication number Publication date
CN106295252A (en) 2017-01-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant