CN117033735A - Gene data retrieval method, device, computer equipment and storage medium - Google Patents

Gene data retrieval method, device, computer equipment and storage medium

Info

Publication number
CN117033735A
Authority
CN
China
Prior art keywords
gene
information
target gene
searched
field information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311286697.9A
Other languages
Chinese (zh)
Other versions
CN117033735B (en)
Inventor
谢彬
郭同坤
陈高祥
宋敏芳
唐进
王威
关欢
梁启
李科
张振华
李敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202311286697.9A priority Critical patent/CN117033735B/en
Publication of CN117033735A publication Critical patent/CN117033735A/en
Application granted granted Critical
Publication of CN117033735B publication Critical patent/CN117033735B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a gene data retrieval method, a device, a computer device and a storage medium. The gene data retrieval method comprises the following steps: acquiring gene information to be retrieved that is input by a user; matching target gene entries among the gene retrieval entries according to the gene information to be retrieved; acquiring a selection instruction input by the user based on the target gene entries; and extracting corresponding gene association data from the gene association database according to the selection instruction and the target gene entries. Using the gene information to be retrieved, the matched target gene entries and the selection instruction input by the user, the gene association data corresponding to the gene information to be retrieved can be accurately extracted from the gene association database, which improves the accuracy of gene data retrieval.

Description

Gene data retrieval method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data retrieval technology, and in particular, to a method, an apparatus, a computer device, and a storage medium for retrieving genetic data.
Background
With the development of biomedical technology, biological data continues to grow explosively. Gene research in particular has produced a large amount of research data, and gene data retrieval is essential for researchers carrying out related studies.
In the related art, gene data is generally associated with a unique ID, and gene-related data is obtained by searching for that unique ID. In practice, however, researchers prefer to search with gene information they already know, such as gene names. Because research keeps deepening and different scholars use different definitions, gene information has diverged to some extent, so a retrieval system cannot accurately match the best search result for the input gene information, and the accuracy of gene data retrieval is therefore low.
No effective solution has yet been proposed for the low accuracy of gene data retrieval in the related art.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device, and a storage medium for retrieving gene data.
In a first aspect, the present application provides a method of retrieving genetic data, the method comprising:
acquiring gene information to be searched, which is input by a user;
matching a target gene item in the gene retrieval item according to the gene information to be retrieved;
acquiring a selection instruction input by a user based on a target gene item;
and extracting corresponding gene association data from a gene association database according to the selection instruction and the target gene entry.
In one embodiment, the gene retrieval entries include a plurality of gene entries, each of the gene entries includes a plurality of pieces of field information, and matching the target gene entry among the gene retrieval entries according to the gene information to be retrieved includes:
determining field information to be searched corresponding to the gene information to be searched according to the gene information to be searched;
matching a plurality of target gene items in the gene retrieval items according to the field information to be retrieved;
and sorting the plurality of target gene entries, and displaying the sorted target gene entries.
In one embodiment, the plurality of pieces of field information include: unique identification ID, gene ID, market name, official name, alias, full name, and species; the gene information to be retrieved contains at least two characters; and determining, according to the gene information to be retrieved, the field information to be retrieved corresponding to the gene information to be retrieved includes:
if the gene information to be retrieved is a positive integer, determining that the field information to be retrieved comprises the gene ID;
if the gene information to be retrieved is an English word, determining that the field information to be retrieved comprises the full name;
and if the gene information to be retrieved is neither a positive integer nor an English word, determining that the field information to be retrieved comprises the market name, the official name, the alias and the full name.
In one embodiment, matching a plurality of target gene entries in the gene search entries according to the field information to be searched includes:
acquiring species information input by a user;
and matching a plurality of target gene items in the gene retrieval items according to the species information and the field information to be retrieved.
In one embodiment, the sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries includes:
and sorting the target gene items according to the gene IDs from small to large, and displaying the sorted target gene items.
In one embodiment, if the gene information to be retrieved is neither a positive integer nor an English word, sorting the plurality of target gene entries and displaying the sorted target gene entries includes:
acquiring hit field information of the plurality of target gene entries and weight information corresponding to the hit field information;
sorting the plurality of target gene entries according to the weight information corresponding to the hit field information;
and de-duplicating the sorted target gene entries, and displaying the de-duplicated target gene entries.
In one embodiment, sorting the plurality of target gene entries according to the weight information corresponding to the hit field information includes:
acquiring a first character length of the gene information to be searched;
acquiring a second character length corresponding to the hit field information;
determining the score value of the corresponding target gene item according to the first character length, the second character length and the weight information corresponding to the hit field information;
and sorting a plurality of target gene items according to the scoring values.
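The scoring and de-duplication steps above can be sketched as follows. The text does not give the scoring formula, so the one used here (field weight multiplied by the ratio of the first character length to the second character length) is purely an assumption. The weight values and the names `FIELD_WEIGHTS`, `score_hit` and `rank_entries` are likewise hypothetical.

```python
# Hypothetical weights per hit field; the source does not specify values.
FIELD_WEIGHTS = {
    "market_name": 4.0,
    "official_name": 3.0,
    "alias": 2.0,
    "full_name": 1.0,
}


def score_hit(query: str, hit_field: str, hit_value: str) -> float:
    """Score one hit from the two character lengths and the field weight.

    Assumed formula: weight * len(query) / len(hit_value), so a hit that the
    query covers more completely scores higher.
    """
    first_length = len(query)       # length of the gene information to be retrieved
    second_length = len(hit_value)  # length of the hit field information
    return FIELD_WEIGHTS[hit_field] * first_length / second_length


def rank_entries(query, hits):
    """Sort (gene_id, hit_field, hit_value) tuples by descending score,
    then keep only the best-scoring hit per gene ID (de-duplication)."""
    ordered = sorted(hits, key=lambda h: score_hit(query, h[1], h[2]), reverse=True)
    seen, ranked = set(), []
    for gene_id, hit_field, hit_value in ordered:
        if gene_id not in seen:
            seen.add(gene_id)
            ranked.append((gene_id, hit_field, hit_value))
    return ranked
```

Under this assumed formula, a short field that the query almost fully covers outranks a long full name that merely contains the query, even when both belong to the same gene.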
In one embodiment, the displaying the sorted plurality of target gene entries includes:
displaying the preset field information in the sorted target gene items; the preset field information includes: gene ID, official name, hit field information.
In a second aspect, the present application also provides a genetic data retrieval device, the device comprising:
the acquisition module is used for acquiring the gene information to be searched, which is input by a user;
the matching module is used for matching target gene items in the gene retrieval items according to the gene information to be retrieved;
the selection module is used for acquiring a selection instruction input by a user based on the target gene item;
and the extraction module is used for extracting corresponding gene association data from a gene association database according to the selection instruction and the target gene entry.
In a third aspect, the present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method according to any one of the embodiments of the first aspect, when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method described in any of the embodiments of the first aspect above.
In the above gene data retrieval method, device, computer device and storage medium, the gene information to be retrieved that is input by a user is first acquired; target gene entries are then matched among the gene retrieval entries according to the gene information to be retrieved; next, a selection instruction input by the user based on the target gene entries is acquired; and finally, corresponding gene association data is extracted from the gene association database according to the selection instruction and the target gene entries. Matching target gene entries according to the gene information to be retrieved widens the scope and applicability of gene retrieval; and, according to the selection instruction input by the user and the target gene entries, the gene association data corresponding to the gene information to be retrieved can be accurately extracted from the gene association database, which improves the accuracy of gene data retrieval.
Drawings
The drawings described herein are designed to provide a further understanding of the application. The illustrative embodiments of the application and their description form part of this application and are not intended to limit the application in any way. In the drawings:
FIG. 1 is a diagram of an application environment for a method of gene data retrieval in one embodiment;
FIG. 2 is a flow chart of a method for retrieving genetic data in one embodiment;
FIG. 3 is a flow chart of a method for retrieving genetic data according to another embodiment;
FIG. 4 is a flow chart of a method for retrieving genetic data according to another embodiment;
FIG. 5 is a block diagram showing the structure of a gene data retrieval device according to one embodiment;
FIG. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application do not limit quantity and may be singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used herein, are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the listed steps or modules (units), but may include other steps or modules (units) that are not listed or that are inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association between associated objects and covers three cases; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. Typically, the character "/" indicates an "or" relationship between the associated objects before and after it. The terms "first," "second," "third," and the like in this application merely distinguish similar objects and do not represent a particular ordering of the objects.
The gene data retrieval method provided by the embodiments of the application can be applied in the application environment shown in FIG. 1, which is a diagram of an application environment of the gene data retrieval method in one embodiment. The terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device, or portable wearable device. The Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, FIG. 2 shows a flow chart of a gene data retrieval method. The method is described as applied to a terminal by way of illustration; it is understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the two. In this embodiment, the method includes the following steps:
Step S201, obtaining the gene information to be searched, which is input by a user.
The gene information to be retrieved is the gene-related text that the user enters as the search query, and it may be any combination of characters. Illustratively, the gene information to be retrieved may be at least a positive integer, an English word, or a name. For example, when the user inputs "64", the gene information to be retrieved is a positive integer; when the user inputs "fast", the gene information to be retrieved is an English word.
Step S202, matching the target gene item in the gene retrieval item according to the gene information to be retrieved.
The gene retrieval entries are the gene index data. They comprise a plurality of gene entries, and each gene entry comprises the fields of a gene and the corresponding field information. The gene information to be retrieved is matched against the field information of each field in each gene entry; if the field information contains the gene information to be retrieved, the corresponding gene entry is taken as a target gene entry. A target gene entry is thus a gene entry that corresponds to, that is, contains, the gene information to be retrieved.
Specifically, according to the gene information to be retrieved input by the user, at least one target gene entry corresponding to it is matched among the gene retrieval entries, that is, the gene index data, using case-insensitive regular-expression matching. For example, when the user inputs "64", every matched target gene entry contains "64".
Case-insensitive regular-expression matching is only one embodiment of the matching method; any other matching method that can accurately match the corresponding target gene entries according to the gene information to be retrieved input by the user may also be used.
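As an illustration of the case-insensitive regular-expression matching described above, the following minimal sketch uses Python's standard `re` module. The function name `match_entries`, the dictionary layout of an entry, and the sample data are all hypothetical; the source does not describe how the index is stored.

```python
import re


def match_entries(query, entries, fields):
    """Return entries where any of the given fields contains the query,
    ignoring case. re.escape keeps characters such as '.' or '(' literal."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    matched = []
    for entry in entries:
        if any(pattern.search(str(entry.get(field, ""))) for field in fields):
            matched.append(entry)
    return matched


# Made-up index entries for illustration only; not real gene records.
entries = [
    {"gene_id": "640", "official_name": "DemoA1"},
    {"gene_id": "7", "official_name": "demoa64"},
    {"gene_id": "9", "official_name": "Other"},
]
by_number = match_entries("64", entries, ["gene_id", "official_name"])
by_name = match_entries("DEMOA", entries, ["official_name"])
```

Escaping the query is a design choice made here so that user input is always treated as literal text rather than as a pattern.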
Step S203, a selection instruction input by the user based on the target gene entry is acquired.
After the target gene entry is acquired, the corresponding target gene entry can be displayed through a display device. The user can select the target gene item expected by the user from the displayed target gene items by using external equipment, and a selection instruction is generated. The selection instruction is used for screening target gene entries, i.e. for selecting target gene entries desired by the user.
Specifically, once at least one target gene entry has been matched according to the gene information to be retrieved, the selection instruction input by the user is acquired, and the target gene entry corresponding to the selection instruction, that is, the target gene entry desired by the user, is selected from all the target gene entries.
Step S204, extracting corresponding gene association data from the gene association database according to the selection instruction and the target gene entry.
The gene association database stores gene association data and is associated with the target gene entries through the unique identification ID; that is, the corresponding gene association data can be extracted from the gene association database through the unique identification ID of a target gene entry. Gene association data includes, but is not limited to, biological information, public information, and experimental data.
Specifically, screening target gene items according to the selection instruction to obtain target gene items corresponding to the selection instruction; and extracting gene association data corresponding to the selected target gene entry from a gene association database based on the selected target gene entry for user reference.
The gene data retrieval method can be used for retrieving different gene information to be retrieved, so that the range and applicability of gene data retrieval are enlarged; based on the gene information to be searched input by the user, at least one target gene item can be accurately matched; the target gene item expected by the user can be screened out by utilizing the selection instruction; based on the selected target gene item, the corresponding gene association data can be accurately extracted from the gene association database, and the accuracy of gene data retrieval is further improved.
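Steps S201 to S204 can be sketched end to end on in-memory data. This is only an illustrative outline under stated assumptions: the index is a list of dictionaries, the gene association database is a dictionary keyed by unique ID, the user's selection instruction is modelled as a callback, and the unique ID itself is excluded from matching as a simplification. All names and sample records are hypothetical.

```python
def retrieve_gene_data(query, gene_index, association_db, choose):
    """Run the four steps S201-S204 on in-memory data.

    gene_index: list of dicts, each with a "unique_id" plus searchable fields.
    association_db: dict mapping unique_id -> gene association data.
    choose: callback standing in for the user's selection instruction.
    """
    # S201: the query is the gene information to be retrieved.
    needle = query.lower()
    # S202: keep every entry in which some searchable field contains the query.
    targets = [
        entry for entry in gene_index
        if any(needle in str(value).lower()
               for key, value in entry.items() if key != "unique_id")
    ]
    if not targets:
        return None
    # S203: the user picks one of the displayed target entries.
    selected = choose(targets)
    # S204: the unique ID links the selected entry to its association data.
    return association_db.get(selected["unique_id"])


# Made-up records for illustration only.
gene_index = [
    {"unique_id": "u1", "gene_id": "640", "official_name": "DEMOA"},
    {"unique_id": "u2", "gene_id": "12", "official_name": "DEMOB"},
]
association_db = {"u1": {"notes": "data for u1"}, "u2": {"notes": "data for u2"}}
result = retrieve_gene_data("64", gene_index, association_db, choose=lambda t: t[0])
```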
In one embodiment, the gene retrieval entries include a plurality of gene entries, each gene entry includes a plurality of pieces of field information, and matching the target gene entry among the gene retrieval entries according to the gene information to be retrieved includes the following steps:
Step 1, determining, according to the gene information to be retrieved, the field information to be retrieved corresponding to the gene information to be retrieved.
Step 2, matching a plurality of target gene entries among the gene retrieval entries according to the field information to be retrieved.
Step 3, sorting the plurality of target gene entries, and displaying the sorted target gene entries.
The gene retrieval items comprise a plurality of gene items and are used for obtaining a plurality of target gene items corresponding to the gene information to be retrieved in a matching mode; each genetic entry comprises a plurality of field information which is used for being matched with the genetic information to be searched to obtain the field information to be searched.
Specifically, when the gene information to be retrieved is a positive integer, the corresponding field information to be retrieved includes that positive integer; all target gene entries containing it are matched among the gene retrieval entries, sorted, and displayed. The same applies when the gene information to be retrieved is an English word or a name: the field information to be retrieved includes that English word or name, and all target gene entries containing it are matched, sorted, and displayed. In the sorted target gene entries, the hit information that is identical to the gene information to be retrieved is highlighted; the highlighting may be, but is not limited to, bold, italic, underline, font color, or background color.
In the embodiment, the corresponding field information to be searched is obtained through the gene information to be searched, and the field information to be searched is further utilized to match a plurality of target gene entries containing the field information to be searched, so that the accuracy of matching the target gene entries is improved; and the plurality of target gene items are sequenced and displayed, so that a user can conveniently and quickly browse and acquire the target gene items corresponding to the gene information to be searched.
In one embodiment, the plurality of pieces of field information include: unique identification ID, gene ID, market name, official name, alias, full name, and species; the gene information to be retrieved contains at least two characters; and determining, according to the gene information to be retrieved, the field information to be retrieved corresponding to the gene information to be retrieved includes:
if the gene information to be searched is a positive integer, determining that the field information to be searched comprises a gene ID;
if the gene information to be searched is an English word, determining that the field information to be searched comprises a full name;
if the gene information to be searched is not a positive integer and is not an English word, determining that the field information to be searched comprises a market name, an official name, an alias and a full name.
The plurality of pieces of field information further include a history gene ID, which is a gene ID that was used in the past and has since been abandoned as research progressed. The unique identification ID is the index under which a gene entry is stored in the gene retrieval entries, and it is used to help establish the association between the gene association data in the gene association database and the gene entry. The market name is the gene name popularly and commonly used by researchers in the biomedical industry; it is saved in the gene retrieval entries after manual checking and entry, which ensures its accuracy. It should be noted that not all genes have market names, and that a market name may differ from the official name.
Specifically, the gene information to be retrieved contains at least two characters. If it were a single character, too many target gene entries would be matched among the gene retrieval entries, making the search results too numerous and the sorting complicated; in practice it would also be difficult to show the target gene entry desired by the user on the first page of a limited display. Therefore, this embodiment does not perform retrieval for single-character gene information.
Further, the corresponding field information to be retrieved can be determined according to the gene information to be retrieved, which may be at least a positive integer, an English word, or a name. If the gene information to be retrieved is a positive integer, the field information to be retrieved is determined to include the gene ID; for example, when the gene information to be retrieved is "51", which is a positive integer, the field information to be retrieved is determined to be the gene ID and the history gene ID, that is, the gene ID or the history gene ID contains "51". If the gene information to be retrieved is an English word, the field information to be retrieved is determined to include the full name; for example, when the gene information to be retrieved is "programmed", which is an English word, the field information to be retrieved is determined to be the full name, that is, the full name contains "programmed". If the gene information to be retrieved is neither a positive integer nor an English word, the field information to be retrieved is determined to include the market name, the official name, the alias and the full name; for example, when the gene information to be retrieved is "program" and is judged to be neither a positive integer nor an English word, the field information to be retrieved is determined to include the market name, the official name, the alias and the full name, that is, the market name, the official name, the alias, or the full name contains "program".
It can be understood that, for the gene information to be retrieved input by the user, it is preferably first judged whether it contains at least two characters; if not, no retrieval is performed. If so, it is determined whether the gene information to be retrieved is a positive integer. If it is a positive integer, the field information to be retrieved is determined to include the gene ID and the history gene ID; otherwise, it is determined whether the gene information to be retrieved is an English word. If it is an English word, the field information to be retrieved is determined to include the full name; otherwise, the gene information to be retrieved is treated as a name, that is, it is neither a positive integer nor an English word, and the field information to be retrieved is determined to include the market name, the official name, the alias and the full name.
In this embodiment, requiring the gene information to be retrieved to contain at least two characters avoids the overly numerous results and complicated sorting caused by single-character input; and the corresponding field information to be retrieved can be accurately determined from the gene information to be retrieved, which further improves the accuracy of gene data retrieval.
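The decision order described in this embodiment can be sketched as a small function. The source does not say how an English word is recognized, so `ENGLISH_WORDS` below is a hypothetical stand-in for a real dictionary lookup; the function name and the field identifiers are also assumptions chosen for illustration.

```python
# Hypothetical stand-in for a real English dictionary lookup.
ENGLISH_WORDS = {"programmed", "cell", "death"}


def fields_to_search(query, english_words=ENGLISH_WORDS):
    """Map the query to the list of fields that should be searched.

    Returns an empty list when the query is shorter than two characters,
    meaning no retrieval is performed.
    """
    if len(query) < 2:
        return []
    # isascii() excludes non-ASCII digits that isdigit() alone would accept.
    if query.isascii() and query.isdigit() and int(query) > 0:
        return ["gene_id", "history_gene_id"]
    if query.lower() in english_words:
        return ["full_name"]
    return ["market_name", "official_name", "alias", "full_name"]
```

With this particular word list, "51" maps to the two ID fields, "programmed" maps to the full name only, and "program" falls through to the four name fields; a different dictionary would classify inputs differently.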
In one embodiment, matching a plurality of target gene entries among the gene retrieval entries according to the field information to be retrieved further includes the following steps:
Step 1, acquiring species information input by a user.
Step 2, matching a plurality of target gene entries among the gene retrieval entries according to the species information and the field information to be retrieved.
The gene information to be retrieved also includes species information, which is used to further screen the plurality of target gene entries and discard those that do not meet the user's requirements, thereby reducing the number of target gene entries and narrowing the range of gene data retrieval results.
Specifically, a plurality of target gene entries are matched among the gene retrieval entries according to the field information to be retrieved; the matched target gene entries are then further screened according to the input species information, and target gene entries that do not contain the species information are discarded. For example, suppose the user inputs "51" and stops typing. The gene information to be retrieved, "51", contains at least two characters and is a positive integer, so the field information to be retrieved is determined to be the gene ID and the history gene ID, that is, the gene ID or the history gene ID contains "51". Case-insensitive regular-expression matching yields 6 target gene entries; part of their field information is shown in Table 1. The gene IDs of the 6 target gene entries are "5133", "512", "645560", "513", "5131" and "25133"; the target gene entry with gene ID "645560" is matched because its history gene ID contains "51". Further, the species information input by the user is acquired. If the species information is "Homo sapiens", the target gene entry with gene ID "25133" is discarded because its species is "Rattus norvegicus". Finally, 5 target gene entries are matched, with gene IDs "5133", "512", "645560", "513" and "5131".
TABLE 1 Partial field information of the 6 target gene entries matched according to the gene information to be retrieved "51"
It should be noted that, in this embodiment, whether to trigger the gene data retrieval process is determined from the user's operation after the gene information to be retrieved has been input; the operation may be, but is not limited to, stopping input, pressing the Enter key, or clicking a submit button on a web page.
In this embodiment, by acquiring the species information input by the user, the number of target gene entries can be further reduced, and target gene entries which do not meet the expectations of the user can be accurately discarded, so that the range of the gene data retrieval result is reduced, and the accuracy of the gene data retrieval is improved.
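The matching and species screening described in this embodiment can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `GeneEntry` class, the function names and the in-memory list are hypothetical, and the records only reproduce the gene IDs, the one historical gene ID and the species stated in the "51" example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class GeneEntry:
    gene_id: str
    species: str
    hist_ids: list = field(default_factory=list)  # historical gene IDs


# Illustrative records only, taken from the "51" example in the text.
ENTRIES = [
    GeneEntry("5133", "Homo sapiens"),
    GeneEntry("512", "Homo sapiens"),
    GeneEntry("645560", "Homo sapiens", hist_ids=["512"]),
    GeneEntry("513", "Homo sapiens"),
    GeneEntry("5131", "Homo sapiens"),
    GeneEntry("25133", "Rattus norvegicus"),
]


def match_by_id(query, entries):
    """Keep entries whose gene ID or historical gene ID contains the query."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [
        e for e in entries
        if pattern.search(e.gene_id) or any(pattern.search(h) for h in e.hist_ids)
    ]


def filter_by_species(entries, species):
    """Discard entries whose species differs from the user's species input."""
    return [e for e in entries if e.species.lower() == species.lower()]


matched = match_by_id("51", ENTRIES)
kept = filter_by_species(matched, "Homo sapiens")
print([e.gene_id for e in matched])  # 6 matched entries
print([e.gene_id for e in kept])     # 5 entries remain after species screening
```

Escaping the query with `re.escape` is a design choice of this sketch so that characters typed by the user are never interpreted as regular-expression syntax.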
In one embodiment, sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries includes:
and sorting the plurality of target gene entries from small to large according to the gene IDs, and displaying the sorted plurality of target gene entries.
For example, when the gene information to be retrieved is "51", it has at least two characters and is a positive integer, so the field information to be retrieved is determined to be the gene ID and the historical gene ID; that is, the gene ID or the historical gene ID must contain "51". Using case-insensitive regular-expression matching, 6 target gene entries are matched, each of whose gene ID or historical gene ID contains "51". Sorted from small to large by gene ID, the gene IDs of the 6 target gene entries are "512", "513", "5131", "5133", "25133" and "645560". The species information input by the user is then acquired; if the species information is "Homo sapiens", the target gene entry with gene ID "25133" is discarded because its species information is "Rattus norvegicus". Finally, 5 target gene entries remain; sorted from small to large by gene ID, their gene IDs are "512", "513", "5131", "5133" and "645560".
Further, the sorted target gene entries are displayed for selection by the user. Since the gene information to be retrieved input by the user is "51" and the species information is "Homo sapiens", the displayed field information preferably includes the gene ID and the official name; in addition, since the historical gene ID of the target gene entry with gene ID "645560" contains "51", the historical gene ID of that entry also needs to be displayed. The 5 target gene entries finally displayed are ATP5CL2 (ID: 512), ATP5F1D (ID: 513), ATP5F1D (ID: 5131), PDCD1 (ID: 5133) and ATP5F1CP1 (ID: 645560, HistID: 512), where "ATP5CL2", "ATP5F1D", "PDCD1" and "ATP5F1CP1" are the official names of the corresponding target gene entries, "ID:" introduces the gene ID of a target gene entry, and "HistID:" introduces its historical gene ID. It should be noted that hit information identical to the gene information to be retrieved needs to be highlighted so that the user can review it easily; the highlighting may be, but is not limited to, bold, ground color, font color, italic, underline, or background color.
In this embodiment, the sorting is performed from small to large according to the gene IDs, and the sorted target gene entries are displayed, so that a user can conveniently and quickly obtain the desired target gene entry, and further the retrieval efficiency of the gene data is improved.
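A small sketch of the ascending gene ID ordering and of hit highlighting follows. It assumes gene IDs are stored as digit strings and therefore sorts them by numeric value; the function names are hypothetical, and square brackets stand in for whatever highlighting (bold, color and so on) the display layer applies.

```python
import re


def sort_by_gene_id(gene_ids):
    """Sort gene ID strings by numeric value, smallest first."""
    return sorted(gene_ids, key=int)


def highlight(text, query):
    """Wrap each case-insensitive occurrence of the query in square brackets."""
    return re.sub(
        re.escape(query),
        lambda m: "[" + m.group(0) + "]",
        text,
        flags=re.IGNORECASE,
    )


ids = ["5133", "512", "645560", "513", "5131", "25133"]
ordered = sort_by_gene_id(ids)
print(ordered)          # ['512', '513', '5131', '5133', '25133', '645560']
print(sorted(ids))      # plain string order would put '25133' first
print([highlight(g, "51") for g in ordered])
```

Sorting with `key=int` matters here: a plain string sort would place "25133" before "512", which does not match the order given in the example.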
In one embodiment, sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries includes: if the gene information to be searched is not a positive integer and is not an English word, the method comprises the following steps:
Step 1, acquiring hit field information of the plurality of target gene entries and weight information corresponding to the hit field information.
Step 2, sorting the plurality of target gene entries according to the weight information corresponding to the hit field information.
Step 3, deduplicating the sorted target gene entries, and displaying the deduplicated target gene entries.
Wherein, hit field information refers to the information of the field to be searched for hitting the information of the gene to be searched. When the gene information to be searched is not a positive integer and is not an English word, the field information to be searched comprises a market name, an official name, an alias and a full name. At this time, the hit field information may be one or more of a market name, an official name, an alias, and a full name; i.e. market name, official name, alias and full name including the genetic information to be retrieved. The weight information is used for sorting the matched target gene items; the weight information is a weight value, and is marked as w; the size of the weight value w can be adjusted according to the cognition of the user on hit field information; the weight value of the market name is marked as w1, the weight value of the official name is marked as w2, and the weight value of the alias is marked as w3; since the awareness of the user to the market name, the official name and the alias decreases in order, the weight value w1 of the market name is larger than the weight value w2 of the official name, and the weight value w2 of the official name is larger than the weight value w3 of the alias, that is, w1> w2> w3. In addition, the target gene entry is deduplicated based on the unique identification ID, that is, the target gene information having the same unique identification ID is deduplicated.
Specifically, suppose the user desires to retrieve the target gene entry with gene ID "5133" but inputs "pd" as the gene information to be retrieved. This input is neither a positive integer nor an English word; that is, the gene information to be retrieved is a name. Using case-insensitive regular-expression matching, 6 target gene entries are obtained; part of their field information is shown in Table 2. Sorted from small to large by gene ID, the gene IDs of the 6 target gene entries are "3952", "5133", "6622", "18566", "45913" and "817329". Based on the species information "Homo sapiens" input by the user, the target gene entries with gene IDs "18566", "45913" and "817329" are discarded, leaving 3 target gene entries. These are displayed as LEP (ID: 3952, Alias: LEPD), PDCD1 (ID: 5133, MktSym: PD1), SNCA (ID: 6622, Alias: PD1), where the displayed field information includes the official name, the gene ID, the market name and the alias; "MktSym:" introduces the market name of a target gene entry, and "Alias:" introduces its alias. It should be noted that some target gene entries may be matched multiple times because several fields among the field information to be retrieved, such as the market name and the aliases, contain the gene information to be retrieved; that is, several items of hit field information contain it. For example, the target gene entry with gene ID "5133" is matched by each of the aliases "PD1", "PD-1" and "hPD-1"; in this case, the first matched target gene entry is taken.
In this case, the target gene entry displayed in the first position is not the entry with gene ID "5133" that the user desires to retrieve. Weight information may therefore be assigned to the hit field information of the plurality of target gene entries, and the target gene entries may then be sorted and deduplicated accordingly.
TABLE 2 partial field information Table of 6 target Gene entries matched according to Gene information "pd" to be retrieved
Specifically, when the user desires to retrieve the target gene entry with gene ID "5133", inputs "pd" as the gene information to be retrieved and inputs "Homo sapiens" as the species information, 7 target gene entries whose hit field information contains "pd" are obtained by case-insensitive regular-expression matching. The user's familiarity with the hit field information differs: it decreases in order from the market name to the official name to the alias. The weight information, that is, the weight value, corresponding to the hit field information of each target gene entry therefore differs as well. The 7 target gene entries are sorted according to the weight information corresponding to the hit field information; partial fields of the sorted target gene entries and the weight information corresponding to the hit field information are shown in Table 3, where the hit value is the value of the hit field information. The hit field information of the sorted target gene entries is "market name", "official name" and "alias"; the corresponding hit values are "PD1", "PDCD1", "PD-1", "hPD-1", "PD1" and "LEPD"; the corresponding weight information is, in order, w1, w2, w3 and w3; and the corresponding gene IDs are "5133", "6622" and "3952". The target gene entry with gene ID "5133" is matched repeatedly and must be deduplicated by its unique identification ID, retaining only the match with the highest weight value. Finally, the sorted and deduplicated target gene entries are displayed as PDCD1 (ID: 5133, MktSym: PD1), LEP (ID: 3952, Alias: LEPD), SNCA (ID: 6622, Alias: PD1), where the displayed field information includes the official name, the gene ID, the market name and the alias.
At this time, the target gene entry displayed in the first position, PDCD1 (ID: 5133, MktSym: PD1), is the target gene entry desired by the user.
Table 3 weight information table corresponding to the sorted target gene entry partial field and hit field information
In this embodiment, by acquiring hit field information of a plurality of target gene entries and weight information corresponding to the hit field information, a target search entry desired by a user can be accurately displayed in the first position; and the duplicate target gene item matched is subjected to duplicate removal treatment, so that the accuracy of gene data retrieval is further improved.
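The weight-based sorting and deduplication of this embodiment can be sketched as below. The text only requires w1 > w2 > w3, so the numeric weights 3, 2 and 1 are placeholders chosen for illustration; the tuple layout and function name are likewise hypothetical. The hits follow the "pd" example after species screening.

```python
# Placeholder weights; the text only requires market name > official name > alias.
FIELD_WEIGHT = {"market_name": 3, "official_name": 2, "alias": 1}

# (gene ID, official name, hit field, hit value), listed by ascending gene ID.
hits = [
    ("3952", "LEP", "alias", "LEPD"),
    ("5133", "PDCD1", "market_name", "PD1"),
    ("5133", "PDCD1", "official_name", "PDCD1"),
    ("5133", "PDCD1", "alias", "PD-1"),
    ("5133", "PDCD1", "alias", "hPD-1"),
    ("6622", "SNCA", "alias", "PD1"),
]


def rank_and_dedupe(hits):
    """Sort hits by field weight (highest first), then keep one hit per gene ID."""
    # sorted() is stable, so hits with equal weight keep their input order.
    ranked = sorted(hits, key=lambda h: -FIELD_WEIGHT[h[2]])
    seen, result = set(), []
    for hit in ranked:
        if hit[0] not in seen:  # first occurrence has the highest weight
            seen.add(hit[0])
            result.append(hit)
    return result


result = rank_and_dedupe(hits)
for gene_id, name, hit_field, value in result:
    print(name, gene_id, hit_field, value)
```

Because deduplication runs after sorting, the retained hit for each gene ID is automatically the one with the highest weight, which is the behavior the embodiment describes.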
In one embodiment, sorting the plurality of target gene entries according to weight information corresponding to the hit field information includes:
Step 1, obtaining a first character length of the gene information to be retrieved.
Step 2, obtaining a second character length corresponding to the hit field information.
Step 3, determining the score value of the corresponding target gene entry according to the first character length, the second character length and the weight information corresponding to the hit field information.
Step 4, sorting the plurality of target gene entries according to the score values.
The weight information is a weight value, which is marked as w, the weight value of the market name is marked as w1, the weight value of the official name is marked as w2, the weight value of the alias is marked as w3, the weight value of the market name is larger than the weight value of the official name, and the weight value of the official name is larger than the weight value of the alias, namely w1> w2> w3.
It should be noted that, acquiring the first character length of the gene information to be searched refers to acquiring the first character length of the gene information to be searched excluding the special character; acquiring a second character length corresponding to the hit field information, namely acquiring the second character length corresponding to the hit field information without special characters; for example, when the gene information to be searched is "pd/1", wherein "/" is a special character, the "/" needs to be removed to obtain the gene information to be searched "pd1" with the special character removed.
Specifically, the first character length of the gene information to be retrieved is denoted $L_1$, and the second character length corresponding to the hit field information is denoted $L_2$. The score value of the corresponding target gene entry, denoted $R$, is determined from the first character length $L_1$, the second character length $L_2$ and the weight information $w$ corresponding to the hit field information: $R = w \cdot L_1 / L_2$. For the market name, the weight value is $w_1$ and the score value is $R_1 = w_1 \cdot L_1 / L_2$; for the official name, the weight value is $w_2$ and the score value is $R_2 = w_2 \cdot L_1 / L_2$; for the alias, the weight value is $w_3$ and the score value is $R_3 = w_3 \cdot L_1 / L_2$. Further, the plurality of target gene entries are sorted according to the score values.
For example, when the user desires to retrieve the target gene entry with gene ID "5133", inputs "pd" as the gene information to be retrieved and inputs "Homo sapiens" as the species information, 7 target gene entries are obtained by case-insensitive regular-expression matching, and the hit field information of each contains "pd". The score value of each target gene entry is obtained from the first character length L1 of the gene information to be retrieved, the second character length L2 corresponding to the hit field information, and the weight information w corresponding to the hit field information. Suppose the weight value w1 of the market name is 10, the weight value w2 of the official name is 5, and the weight value w3 of the alias is 2; the target gene entries are then sorted by score value from high to low. Partial field information and the score values of the 7 sorted target gene entries are shown in Table 4. The first character length L1 of the gene information to be retrieved is 2. The hit values corresponding to the hit field information of the sorted target gene entries are "PD1", "PDCD1", "PD-1", "hPD-1", "PD1" and "LEPD"; after special characters are removed from the hit values, the corresponding second character lengths L2 are 3, 5, 3, 4, 3 and 4; the corresponding weight values are 10, 5, 2, 2, 2 and 2; and the score values of the corresponding target gene entries are 6.66, 2, 1.33, 1, 1.33 and 1. The sorted target gene entries are then deduplicated: for the target gene entry with gene ID "5133", only the match with the highest score value, 6.66, is retained and the rest are discarded.
Finally, the target gene entries obtained after sorting and deduplication are displayed as PDCD1 (ID: 5133, MktSym: PD1), SNCA (ID: 6622, Alias: PD1) and LEP (ID: 3952, Alias: LEPD). The target gene entry with gene ID "5133" is placed first because it has the highest score value; the target gene entry with gene ID "6622" is placed before the one with gene ID "3952" because its alias is closer to the gene information to be retrieved input by the user.
TABLE 4 partial field information and score table of 7 target Gene entries that are matched after ordering
In this embodiment, the score values of the corresponding target gene entries are determined through the weight information corresponding to the first character length, the second character length and the hit field information, and then the plurality of target gene entries are ordered based on the score values, so that the target gene entries expected by the user can be accurately arranged in the first position, and the accuracy of gene data retrieval is further improved.
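The score-based ordering can be sketched as follows, assuming the score takes the form R = w * L1 / L2; this form is an inference that reproduces the values 6.66, 2, 1.33 and 1 given in the example (w1 = 10, w2 = 5, w3 = 2, L1 = 2). Treating every character other than an ASCII letter or digit as a "special character" is also an assumption of this sketch, and all names are hypothetical.

```python
import re

# Weight values taken from the worked example in the text.
WEIGHTS = {"market_name": 10, "official_name": 5, "alias": 2}


def strip_special(text):
    """Remove special characters (assumed: anything but ASCII letters/digits)."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


def score(query, hit_value, hit_field):
    """Assumed score R = w * L1 / L2, with lengths taken after stripping."""
    l1 = len(strip_special(query))
    l2 = len(strip_special(hit_value))
    return WEIGHTS[hit_field] * l1 / l2


# (gene ID, hit field, hit value) for the "pd" example after species screening.
hits = [
    ("3952", "alias", "LEPD"),
    ("5133", "market_name", "PD1"),
    ("5133", "official_name", "PDCD1"),
    ("5133", "alias", "PD-1"),
    ("5133", "alias", "hPD-1"),
    ("6622", "alias", "PD1"),
]

ranked = sorted(hits, key=lambda h: -score("pd", h[2], h[1]))
seen, final = set(), []
for gene_id, hit_field, value in ranked:
    if gene_id not in seen:  # keep only the best-scoring hit per gene ID
        seen.add(gene_id)
        final.append((gene_id, hit_field, value))

print([g for g, _, _ in final])  # ['5133', '6622', '3952']
```

Dividing by the hit length rewards hit values that are close in length to the query, which is why the alias "PD1" of gene ID "6622" outranks the alias "LEPD" of gene ID "3952" even though both carry the same weight.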
In one embodiment, displaying the ordered plurality of target gene entries includes:
displaying preset field information in the sorted target gene items; the preset field information includes: gene ID, official name, hit field information.
For example, if the gene information to be retrieved is "pd", the target gene entries are displayed after sorting and deduplication, and the display results are PDCD1 (ID: 5133, MktSym: PD1), SNCA (ID: 6622, Alias: PD1), LEP (ID: 3952, Alias: LEPD), where "PDCD1", "SNCA" and "LEP" are official names; "ID:" introduces the gene ID of a target gene entry; "MktSym:" introduces its market name; "Alias:" introduces its alias; and the market name "PD1", the alias "PD1" and the alias "LEPD" are the hit field information together with the hit values.
In this embodiment, by displaying the gene ID, the official name and the hit field information of the target gene entry, the user can be assisted to quickly look up the desired target gene entry, and further according to the target gene entry, the corresponding gene association data can be quickly obtained, thereby improving the efficiency of gene data retrieval.
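As a rough illustration of how the preset fields might be rendered, the sketch below builds the display strings used in the examples. The label mapping mirrors the "ID:", "MktSym:", "Alias:", "Name:" and "HistID:" prefixes that appear in the text; the function name and the internal field keys are hypothetical.

```python
# Display prefixes as they appear in the examples; the dictionary keys are assumed names.
LABELS = {
    "market_name": "MktSym",
    "alias": "Alias",
    "full_name": "Name",
    "hist_id": "HistID",
}


def format_entry(official_name, gene_id, hit_field=None, hit_value=None):
    """Render 'OFFICIAL (ID: ..., Label: value)'; the hit part is optional."""
    parts = [f"ID: {gene_id}"]
    if hit_field in LABELS:
        parts.append(f"{LABELS[hit_field]}: {hit_value}")
    return f"{official_name} ({', '.join(parts)})"


print(format_entry("PDCD1", "5133", "market_name", "PD1"))
print(format_entry("SNCA", "6622", "alias", "PD1"))
print(format_entry("ATP5F1D", "513"))
```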
In another embodiment, referring to fig. 3, fig. 3 is a flow chart of a method for retrieving genetic data in another embodiment, which includes the following steps:
Step S301, obtaining the gene information to be retrieved input by the user.
Step S302, judging whether the gene information to be searched is more than or equal to two characters.
Wherein, the two characters or more of the gene information to be searched are the gene data searching conditions.
Specifically, if the gene information to be retrieved input by the user has two or more characters, that is, it meets the gene data retrieval condition, step S303 is executed; otherwise, step S308 is executed and no retrieval is performed. For example, when the gene information to be retrieved is "5", it is a single character, and no retrieval is performed for it.
Step S303, judging whether the gene information to be searched is a positive integer.
Specifically, if the to-be-retrieved genetic information is a positive integer, step S304 is executed to determine that the to-be-retrieved field information includes a genetic ID and a historical genetic ID; otherwise, step S309 is performed to further determine the type of the genetic information to be retrieved.
In step S304, the field information to be retrieved includes a gene ID and a history gene ID.
Specifically, when the gene information to be searched is a positive integer, it is preferable to determine that the corresponding field information to be searched is a gene ID and a history gene ID according to the gene information to be searched.
Step S305, determining whether there is a corresponding target gene entry.
Specifically, based on the field information to be searched, if there is a corresponding gene entry, that is, if there is a target gene entry, step S306 is executed; otherwise, step S309 is performed to further determine the type of the genetic information to be retrieved.
In step S306, the precisely hit gene IDs are ranked first, and the rest are ranked from small to large according to the gene IDs.
Step S307, displaying the sorted target gene items.
Wherein the displayed field information at least comprises an official name, a gene ID and a historical gene ID; hit information identical to the gene information to be retrieved is displayed in a highlighted manner, which may be, but is not limited to, bolded, ground color, font color, italic, underlined, background color.
In step S308, no search is performed.
Step S309, further determines the type of the gene information to be retrieved.
The type of the gene information to be searched at least comprises English words and names.
For example, the gene information to be retrieved input by the user is "513". It has 3 characters, which is more than 2, and therefore meets the gene data retrieval condition. It is then determined that the field information to be retrieved includes the gene ID and the historical gene ID. The field information to be retrieved containing "513" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the gene IDs and historical gene IDs containing "513" are traversed, and 4 target gene entries are obtained. The exactly hit gene ID is ranked first and the rest are sorted from small to large by gene ID, so the gene IDs of the sorted target gene entries are "513", "5131", "5133" and "25133". Partial field information of the sorted target gene entries is then displayed for the user to select and review; the display results are ATP5F1D (ID: 513), PDB1 (ID: 5131), PDCD1 (ID: 5133) and Pcdhb12 (ID: 25133), where the displayed field information includes the official name and the gene ID. Further, a selection instruction input by the user is acquired, and the target gene entry desired by the user is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
For example, the gene information to be retrieved input by the user is "5133". It has 4 characters, which is more than 2, and therefore meets the gene data retrieval condition. It is then determined that the field information to be retrieved includes the gene ID and the historical gene ID. The field information to be retrieved containing "5133" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the gene IDs and historical gene IDs containing "5133" are traversed, and 2 target gene entries are obtained. The exactly hit gene ID is ranked first, so the gene IDs of the sorted target gene entries are "5133" and "25133". Partial field information of the sorted target gene entries is then displayed; the display results are PDCD1 (ID: 5133) and Pcdhb12 (ID: 25133), where the displayed field information includes the official name and the gene ID. Further, a selection instruction input by the user is acquired, and the target gene entry desired by the user is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
For example, suppose the user desires to retrieve the target gene entry with gene ID "5133" but remembers only "279" from its alias "CD279". The gene information to be retrieved input by the user is then "279"; it has 3 characters, which is more than 2, and therefore meets the gene data retrieval condition. It is then determined that the field information to be retrieved includes the gene ID and the historical gene ID. The field information to be retrieved containing "279" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the gene IDs and historical gene IDs containing "279" are traversed, and 0 target gene entries are obtained. The type of the gene information to be retrieved therefore needs to be determined further. In this embodiment, the gene information to be retrieved "279" is further determined to be a name, and the field information to be retrieved is determined to include the market name, the official name, the alias and the full name. The field information to be retrieved containing "279" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the market names, official names, aliases and full names containing "279" are traversed, and 1 target gene entry is obtained. The target gene entry is displayed as PDCD1 (ID: 5133, Alias: CD279), where the displayed fields include the official name, the gene ID and the alias. Further, a selection instruction input by the user is acquired, and the target gene entry desired by the user is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
In this embodiment, different situations when the gene information to be searched is a positive integer have corresponding processing modes, so that the target gene entry corresponding to the gene information to be searched and expected by the user can be accurately matched, and the accuracy of gene data search is further improved.
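The positive-integer branch of the flow (steps S302 to S309) can be sketched as follows. This is a simplified illustration: the dictionary of gene IDs and historical gene IDs is hypothetical test data drawn from the examples, plain substring tests stand in for the regular-expression matching (for digit-only queries the two are equivalent), and returning `None` is merely this sketch's way of signalling that the flow falls through to step S309.

```python
import re

# gene ID -> historical gene IDs (illustrative data from the examples)
GENE_IDS = {
    "512": [], "513": [], "5131": [], "5133": [],
    "25133": [], "645560": ["512"],
}


def is_positive_integer(query):
    """True for digit-only strings whose value is greater than zero."""
    return re.fullmatch(r"[0-9]+", query) is not None and int(query) > 0


def retrieve_by_id(query, gene_ids=GENE_IDS):
    """Return sorted gene IDs, [] if no retrieval is run, None to fall through."""
    if len(query) < 2:
        return []                      # S308: fewer than two characters
    if not is_positive_integer(query):
        return None                    # S309: determine the type further
    hits = [
        g for g, hist in gene_ids.items()
        if query in g or any(query in h for h in hist)
    ]
    if not hits:
        return None                    # S305 -> S309: no ID hit, try names
    # S306: exact hit first (False sorts before True), then ascending gene ID.
    return sorted(hits, key=lambda g: (g != query, int(g)))


print(retrieve_by_id("513"))   # ['513', '5131', '5133', '25133']
print(retrieve_by_id("5133"))  # ['5133', '25133']
print(retrieve_by_id("279"))   # None
print(retrieve_by_id("5"))     # []
```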
In another embodiment, referring to fig. 4, fig. 4 is a flow chart of a method for retrieving genetic data in another embodiment, which includes the following steps:
Step S401, obtaining the gene information to be retrieved input by the user.
Step S402, determining that the gene information to be retrieved has two or more characters and is not a positive integer.
Step S403, judging whether the gene information to be searched is English word.
Specifically, if the genetic information to be searched is an english word, step S404 is executed; otherwise, step S408 is performed to further determine the type of the genetic information to be retrieved.
In step S404, the field information to be retrieved includes a full name.
Step S405, determining whether there is a corresponding target gene entry.
Specifically, based on the field information to be retrieved, if there is a corresponding target gene entry, step S406 is executed; otherwise, step S408 is performed to further determine the type of the genetic information to be retrieved.
Step S406, sorting from small to large according to the gene IDs.
Step S407, displaying the sorted target gene items.
Wherein the displayed field information at least comprises an official name, a gene ID and a full name; hit information identical to the gene information to be retrieved is displayed in a highlighted manner, which may be, but is not limited to, bolded, ground color, font color, italic, underlined, background color.
For example, the gene information to be retrieved input by the user is "program"; it has more than 2 characters and is not a positive integer. In this example, "program" is not matched in the English dictionary, that is, it is not treated as an English word, so the type of the gene information to be retrieved needs to be determined further. In this embodiment, the gene information to be retrieved "program" is further determined to be a name, and the field information to be retrieved is determined to include the market name, the official name, the alias and the full name. The field information to be retrieved containing "program" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the market names, official names, aliases and full names containing "program" are traversed, and the corresponding target gene entries are obtained. Further, a selection instruction input by the user is acquired, and the target gene entry desired by the user is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
For example, the gene information to be retrieved input by the user is "programmed"; it has more than 2 characters and is not a positive integer. The gene information to be retrieved "programmed" can be matched in the English dictionary, that is, it is an English word. Accordingly, the field information to be retrieved is determined to include the full name. The field information to be retrieved containing "programmed" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the full names containing "programmed" are traversed, and 2 target gene entries are obtained; part of their field information is shown in Table 5. Sorted from small to large by gene ID, the gene IDs of the target gene entries are "5133" and "18566". The target gene entries are then screened using the species information "Homo sapiens" input by the user, and the target gene entry with gene ID "18566" is discarded because its species information is "Mus musculus". The retained target gene entry is displayed as PDCD1 (ID: 5133, Name: programmed cell death 1), where the displayed field information includes the official name, the gene ID and the full name, and "Name:" introduces the full name of a target gene entry. Further, a selection instruction input by the user is acquired, and the target gene entry desired by the user is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
TABLE 5 partial field information Table of 2 target Gene entries according to the Gene information "programmed" to be retrieved
For example, the gene information to be retrieved input by the user is "cell death"; it has more than 2 characters and is not a positive integer. The gene information to be retrieved "cell death" can be matched in the English dictionary, that is, it consists of English words. Accordingly, the field information to be retrieved is determined to include the full name. Since "cell death" contains a space, it is split into "cell" and "death". The field information to be retrieved containing "cell death", "cell" or "death" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the full names containing "cell death", "cell" or "death" are traversed, and 3 target gene entries are obtained; part of their field information is shown in Table 6. Sorted from small to large by gene ID, the gene IDs of the target gene entries are "355", "5133" and "18566". The target gene entries are displayed as FAS (ID: 355, Name: Fas cell surface death receptor), PDCD1 (ID: 5133, Name: programmed cell death 1) and Pdcd1 (ID: 18566, Name: programmed cell death 1), where the displayed field information includes the official name, the gene ID and the full name. Further, a selection instruction input by the user is acquired, and the corresponding target gene entry is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
TABLE 6 Partial field information of the 3 target gene entries matched according to the gene information to be retrieved "cell death"
For example, the gene information to be retrieved input by the user is "fast"; it has more than 2 characters and is not a positive integer. The gene information to be retrieved "fast" can be matched in the English dictionary, that is, it is an English word. Accordingly, the field information to be retrieved is determined to include the full name. The field information to be retrieved containing "fast" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the full names containing "fast" are traversed, and 0 target gene entries are obtained. The type of the gene information to be retrieved therefore needs to be determined further. In this embodiment, the gene information to be retrieved "fast" is further determined to be a name, and the field information to be retrieved is determined to include the market name, the official name, the alias and the full name. The field information to be retrieved containing "fast" is matched among the gene retrieval entries by case-insensitive regular-expression matching, that is, the market names, official names, aliases and full names containing "fast" are traversed, and 1 target gene entry whose alias contains "fast" is obtained. Its gene ID is "355", and it is displayed as FAS (ID: 355, Alias: FASTM), where the displayed field information includes the official name, the gene ID and the alias. Further, a selection instruction input by the user is acquired, and the target gene entry desired by the user is determined according to the selection instruction, so that the corresponding gene association data is extracted from the gene association database.
In this embodiment, different situations when the gene information to be searched is an english word have corresponding processing modes, so that the target gene entry corresponding to the gene information to be searched and expected by the user can be accurately matched, thereby further improving the accuracy of gene data search.
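The fallback described in the "fast" example can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory `GENE_ENTRIES` list, the field keys, and the function names are assumptions introduced here, and only the FAS entry (gene ID 355, alias FASTM) is taken from the example above.

```python
import re

# Hypothetical in-memory gene retrieval entries. Only FAS (ID 355, alias
# FASTM) comes from the example above; the other values are illustrative.
GENE_ENTRIES = [
    {"gene_id": 355, "market_name": "", "official_name": "FAS",
     "alias": "FASTM", "full_name": "Fas cell surface death receptor"},
    {"gene_id": 5133, "market_name": "PD1", "official_name": "PDCD1",
     "alias": "", "full_name": "programmed cell death 1"},
]

NAME_FIELDS = ("market_name", "official_name", "alias", "full_name")


def match_entries(query, fields, entries):
    """Return entries in which any listed field contains query, ignoring case."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [e for e in entries if any(pattern.search(e[f]) for f in fields)]


def search_english_word(query, entries):
    """Search full names first; fall back to all name fields on zero hits."""
    hits = match_entries(query, ("full_name",), entries)
    if not hits:
        # No full name contains the word, so treat the input as a name.
        hits = match_entries(query, NAME_FIELDS, entries)
    return hits
```

With this sample data, `search_english_word("fast", GENE_ENTRIES)` finds nothing in the full names and falls back to the name fields, where the alias "FASTM" matches.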
In another embodiment, when the gene information to be retrieved is not a positive integer and is not an English word, that is, the gene information to be retrieved is a name, differentiated ranking may also be applied if the target gene entry desired by the user is not displayed in the first position.
Here, differentiated ranking means giving priority in the ordering to target gene entries whose hit field information is the market name. The market name is the gene name conventionally used by researchers in the biomedical industry and is therefore the most widely recognized.
For example, suppose the user wants to retrieve the target gene entry whose gene ID is "5133" but inputs "pd" as the gene information to be retrieved. This input is not a positive integer and is not an English word, so it is treated as a name. Case-insensitive regular-expression matching yields 6 target gene entries, which are sorted in ascending order of gene ID: "3952", "5133", "6622", "18566", "45913", and "817329". Based on the species information "Homo sapiens" input by the user, the target gene entries with gene IDs "18566", "45913", and "817329" are discarded, leaving 3 target gene entries, which are displayed as LEP (ID: 3952, alias: LEPD), PDCD1 (ID: 5133, MKTSym: PD1), SNCA (ID: 6622, alias: PD1). The entry displayed in the first position is not the entry with gene ID "5133" that the user wants, so differentiated ranking is applied. Because the hit field information of the entry with gene ID "5133" is the market name "PD1", that entry is moved to the front. The reordered target gene entries are displayed as PDCD1 (ID: 5133, MKTSym: PD1), LEP (ID: 3952, alias: LEPD), SNCA (ID: 6622, alias: PD1), and the entry in the first position, PDCD1 (ID: 5133, MKTSym: PD1), is the one the user wants.
In this embodiment, differentiated ranking places the target gene entry desired by the user in the first position, which helps the user select it quickly and further improves the efficiency of gene data retrieval.
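The "pd" example can be sketched as a species filter followed by a sort that puts market-name hits first. The record layout and the function name are assumptions made for illustration; the gene IDs and hit fields of the three human entries come from the example, while the species and hit fields of the three discarded entries are placeholders because the text does not state them.

```python
# Matched entries for the query "pd", already in ascending gene-ID order.
# Species and hit fields of the last three entries are placeholders.
HITS = [
    {"gene_id": 3952, "species": "Homo sapiens", "hit_field": "alias"},
    {"gene_id": 5133, "species": "Homo sapiens", "hit_field": "market_name"},
    {"gene_id": 6622, "species": "Homo sapiens", "hit_field": "alias"},
    {"gene_id": 18566, "species": "other (placeholder)", "hit_field": "alias"},
    {"gene_id": 45913, "species": "other (placeholder)", "hit_field": "alias"},
    {"gene_id": 817329, "species": "other (placeholder)", "hit_field": "alias"},
]


def differentiated_ranking(hits, species=None):
    """Filter by species if given, then list market-name hits first.

    Within each group the entries stay in ascending gene-ID order.
    """
    if species is not None:
        hits = [h for h in hits if h["species"] == species]
    # False sorts before True, so market-name hits come first.
    return sorted(hits, key=lambda h: (h["hit_field"] != "market_name",
                                       h["gene_id"]))
```

Calling `differentiated_ranking(HITS, "Homo sapiens")` returns the entries in the order 5133, 3952, 6622, matching the adjusted display order in the example.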
In another embodiment, when the gene information to be retrieved is a name, special characters need to be removed from it.
For example, when the gene information to be retrieved is "pd/1", where "/" is a special character, the "/" is removed to obtain "pd1". The cleaned gene information "pd1" is not a positive integer and is not an English word, so it is determined to be a name, and the field information to be retrieved includes the market name, the official name, the alias, and the full name. The corresponding target gene entries are then matched in the gene retrieval entries by case-insensitive regular-expression matching.
In this embodiment, removing special characters converts gene information that could not otherwise be recognized into gene information that can be, which further broadens the range and applicability of gene data retrieval.
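A minimal sketch of the special-character removal follows. The text does not define which characters count as special, so this sketch assumes that everything other than ASCII letters and digits is removed; the function name is hypothetical.

```python
import re


def strip_special_characters(query: str) -> str:
    """Remove every character that is not an ASCII letter or digit.

    Assumption: the set of "special characters" is not specified in the
    text, so anything outside A-Z, a-z, and 0-9 is treated as special.
    """
    return re.sub(r"[^A-Za-z0-9]", "", query)
```

A real system might keep some characters, such as hyphens, that carry meaning in particular gene names; that choice is outside what the text specifies.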
In one embodiment, when the gene information to be retrieved is a name, target gene entries whose hit field information is the market name, the official name, or the alias are matched preferentially. If no target gene entry is hit on the market name, the official name, or the alias, or if the number of such entries is insufficient, target gene entries whose hit field information is the full name may be displayed as a supplement.
For example, suppose the display page can show at most 5 target gene entries, but only 3 target gene entries are hit on the market name, the official name, or the alias. In that case, among the target gene entries hit on the full name, the first 2 in ascending order of gene ID may be displayed as a supplement.
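The supplement rule can be sketched as below. The function name, the `page_size` parameter, and the choice to skip full-name hits that are already shown are assumptions for illustration; the text only states that full-name hits fill the remaining slots in ascending gene-ID order.

```python
def fill_display_page(name_hits, full_name_hits, page_size=5):
    """Show name-field hits first, then top up with full-name hits.

    name_hits: entries hit on market name, official name, or alias,
               already in display order.
    full_name_hits: entries hit on the full name only.
    """
    shown = list(name_hits[:page_size])
    remaining = page_size - len(shown)
    if remaining > 0:
        shown_ids = {e["gene_id"] for e in shown}
        # Assumption: an entry already shown is not repeated as a supplement.
        extras = sorted(
            (e for e in full_name_hits if e["gene_id"] not in shown_ids),
            key=lambda e: e["gene_id"],
        )
        shown.extend(extras[:remaining])
    return shown
```

With 3 name-field hits and a page size of 5, the two full-name hits with the smallest gene IDs are appended.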
In one embodiment, a selection instruction input by the user is acquired, and the target gene entry desired by the user is selected according to the selection instruction. Based on the unique identification ID of the selected target gene entry, the corresponding gene association data can be extracted from the gene association database, which further improves the accuracy of gene data retrieval.
The gene association data includes, but is not limited to, biological information, public information, and experimental data. Biological information includes, but is not limited to, the gene type, chromosomal location, sequence, introns, exons, interactions, biological pathways, proteins, and homology; it is biological data that exists in the organism itself and is progressively revealed by science. By way of example, scRNA-seq (single-cell RNA sequencing, i.e., single-cell transcriptome sequencing) is currently the most widely used single-cell sequencing technique. It reverse transcribes, amplifies, and high-throughput sequences the mRNA of single cells and, based on similarity between transcriptional profiles, can distinguish different cell types and even reveal new ones. scRNA-seq data is typically presented as a gene expression matrix whose rows are genes and whose columns are cells. Retrieving gene data gives the expression of a gene in each cell, which helps identify cell types and study their mechanisms in individual tissues and diseases.
Public information includes, but is not limited to, publications such as gene descriptions, papers, patents, conferences, monographs, news, blogs, and research reports. Experimental data includes, but is not limited to, data obtained from experimental assays of transcriptomes, proteomes, genomes, epigenomes, and the like. For example, innovation in the biomedical industry often relies on the discovery of new drug targets, namely genes, and on sustained research into patentable drugs for those targets. Although what a pharmaceutical company contributes to the value chain is a patent on the final small-molecule compound or macromolecular biological drug rather than on the target (the gene) itself, information about the target is usually recorded in the patent, so the current research status and development history of the target can be obtained by retrieving gene data. For example, a simple search for the PD1 gene returns several tens of thousands of related patents, covering at least compounds and derivatives, compositions, antibodies, preparation methods, and indications. By combining gene data retrieval with other keywords, technical classifications, applicants, countries, regions, and other information, the user can obtain the research status of a more finely subdivided field of interest or prepare measures against infringement.
With the above gene data retrieval method, in a first aspect, the type of the gene information to be retrieved input by the user can be determined accurately, that is, whether it is a positive integer, an English word, or a name, which broadens the range and applicability of gene data retrieval. In a second aspect, the matched target gene entries are sorted and de-duplicated, so the user can quickly obtain the desired target gene entry, which improves the efficiency of gene data retrieval. In a third aspect, the gene association data corresponding to the gene information to be retrieved can be accurately extracted from the gene association database according to the gene information to be retrieved, the target gene entry, and the selection instruction, which further improves the accuracy of gene data retrieval.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time and may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiments of the application also provide a gene data retrieval device for implementing the gene data retrieval method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more gene data retrieval device embodiments provided below, reference may be made to the limitations of the gene data retrieval method above, which are not repeated here.
In one embodiment, as shown in fig. 5, which is a block diagram of a gene data retrieval device according to one embodiment, the device includes an acquisition module 501, a matching module 502, a selection module 503, and an extraction module 504, wherein:
the acquisition module 501 is configured to acquire the gene information to be retrieved input by a user;
the matching module 502 is configured to match a target gene entry in the gene retrieval entries according to the gene information to be retrieved;
the selection module 503 is configured to acquire a selection instruction input by the user based on the target gene entry;
and the extraction module 504 is configured to extract corresponding gene association data from a gene association database according to the selection instruction and the target gene entry.
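The four modules can be pictured as four methods of one class, as in the sketch below. Everything here is an illustrative assumption rather than the device's actual structure: the class and method names, the dictionary-based association database keyed by a hypothetical unique ID, the selection instruction modelled as a list index, and the simplified matching over the official name and alias only.

```python
import re

# Hypothetical sample data; the unique ID format is invented for illustration.
SAMPLE_ENTRIES = [
    {"unique_id": "u-355", "gene_id": 355, "official_name": "FAS", "alias": "FASTM"},
]
SAMPLE_DB = {"u-355": {"public_information": ["placeholder record"]}}


class GeneDataRetrievalDevice:
    def __init__(self, gene_entries, association_db):
        self.gene_entries = gene_entries
        self.association_db = association_db

    def acquire(self, raw_input):
        """Acquisition module 501: take the user's gene information."""
        return raw_input.strip()

    def match(self, query):
        """Matching module 502: case-insensitive match, sorted by gene ID."""
        pattern = re.compile(re.escape(query), re.IGNORECASE)
        hits = [e for e in self.gene_entries
                if any(pattern.search(e[f]) for f in ("official_name", "alias"))]
        return sorted(hits, key=lambda e: e["gene_id"])

    def select(self, targets, instruction):
        """Selection module 503: the instruction is modelled as a list index."""
        return targets[instruction]

    def extract(self, target):
        """Extraction module 504: look up data by the entry's unique ID."""
        return self.association_db.get(target["unique_id"], {})
```

A call sequence of `acquire`, `match`, `select`, and `extract` then mirrors the order of the four modules.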
In one embodiment, the matching module 502 is further configured to:
determining field information to be searched corresponding to the gene information to be searched according to the gene information to be searched;
matching a plurality of target gene items in the gene retrieval items according to the field information to be retrieved;
and sorting the target gene entries, and displaying the sorted target gene entries.
In one embodiment, the matching module 502 is further configured to determine that the field information to be retrieved includes a gene ID if the gene information to be retrieved is a positive integer;
if the gene information to be searched is an English word, determining that the field information to be searched comprises a full name;
if the gene information to be searched is not a positive integer and is not an English word, determining that the field information to be searched comprises a market name, an official name, an alias and a full name.
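The three-way decision above can be sketched as a single function. The small `ENGLISH_WORDS` set stands in for a real English dictionary, and the function and field names are assumptions; the two-character minimum follows the length requirement stated for the method.

```python
# Stand-in for an English dictionary; a real system would use a full word list.
ENGLISH_WORDS = {"fast", "cell", "death"}


def fields_to_search(query, dictionary=ENGLISH_WORDS):
    """Decide which fields to search from the type of the query."""
    if len(query) < 2:
        raise ValueError("gene information to be retrieved needs at least 2 characters")
    # Positive integer: search by gene ID.
    if query.isascii() and query.isdigit() and int(query) > 0:
        return ["gene_id"]
    # English word: search the full name.
    if query.lower() in dictionary:
        return ["full_name"]
    # Otherwise the query is treated as a name.
    return ["market_name", "official_name", "alias", "full_name"]
```

For instance, "355" maps to the gene ID field, "fast" to the full name, and "pd" to the four name fields.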
In one embodiment, the matching module 502 is further configured to:
acquiring species information input by a user;
and matching a plurality of target gene entries in the gene retrieval entries according to the species information and the field information to be retrieved.
In one embodiment, when sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries, the matching module 502 is further configured to perform:
sorting the plurality of target gene entries in ascending order of gene ID, and displaying the sorted plurality of target gene entries.
In one embodiment, when sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries, the matching module 502 is specifically configured to perform the following:
when the gene information to be retrieved is not a positive integer and is not an English word:
acquiring hit field information of a plurality of target gene entries and weight information corresponding to the hit field information;
sorting the plurality of target gene items according to the weight information corresponding to the hit field information;
and de-duplicating the sorted target gene entries, and displaying the de-duplicated target gene entries.
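The weight-based sort and de-duplication can be sketched as follows. The numeric weights are hypothetical; this passage only says that each hit field has weight information, and the ordering used here (market name highest, full name lowest) is an assumption consistent with the priority given to market names elsewhere in the description.

```python
# Hypothetical weights per hit field; the actual values are not given here.
FIELD_WEIGHTS = {"market_name": 4, "official_name": 3, "alias": 2, "full_name": 1}


def sort_and_deduplicate(hits):
    """Sort hits by field weight (highest first), then keep one hit per gene.

    hits: list of dicts with "gene_id" and "hit_field". The same gene may
    appear several times if it was hit on more than one field.
    """
    ordered = sorted(hits, key=lambda h: (-FIELD_WEIGHTS[h["hit_field"]],
                                          h["gene_id"]))
    seen = set()
    unique = []
    for hit in ordered:
        if hit["gene_id"] not in seen:
            seen.add(hit["gene_id"])
            unique.append(hit)  # the highest-weight hit for this gene
    return unique
```

Because de-duplication runs after sorting, the copy that survives for each gene is the one with the highest-weight hit field.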
In one embodiment, the matching module 502 is further configured to sort the plurality of target gene entries according to weight information corresponding to the hit field information, including:
acquiring a first character length of the gene information to be searched;
acquiring a second character length corresponding to the hit field information;
determining the score value of the corresponding target gene item according to the first character length, the second character length and the weight information corresponding to the hit field information;
The plurality of target gene entries are ordered according to the score values.
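One way to combine the three inputs into a score is sketched below. The formula is an assumption made purely for illustration: this passage names the inputs (first character length, second character length, weight) but not how they are combined. Here the weight is multiplied by the fraction of the hit field that the query covers, so a closer match in a higher-weight field scores higher. The weights are hypothetical as well.

```python
# Hypothetical weights per hit field.
FIELD_WEIGHTS = {"market_name": 4.0, "official_name": 3.0,
                 "alias": 2.0, "full_name": 1.0}


def score(query, hit_value, hit_field):
    """Assumed scoring rule: weight * (query length / hit field length)."""
    first_length = len(query)       # first character length
    second_length = len(hit_value)  # second character length
    return FIELD_WEIGHTS[hit_field] * first_length / second_length


def sort_by_score(query, hits):
    """Order hits from highest to lowest score."""
    return sorted(
        hits,
        key=lambda h: score(query, h["hit_value"], h["hit_field"]),
        reverse=True,
    )
```

Under this assumed rule, the query "pd" scores higher against the market name "PD1" than against the alias "PD1", and higher against the alias "PD1" than against the longer alias "LEPD".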
In one embodiment, when displaying the sorted plurality of target gene entries, the matching module 502 is further configured to perform:
displaying preset field information in the sorted target gene items; the preset field information includes: gene ID, official name, hit field information.
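The display strings used in the examples of this description, such as FAS (ID: 355, alias: FASTM), can be produced by a small formatter. The labels "alias" and "MKTSym" are taken from those examples; the "full name" label, the record layout, and the choice to omit the hit field when it is the official name itself are assumptions.

```python
# Display labels per hit field. "alias" and "MKTSym" appear in the examples;
# "full name" is a hypothetical label.
FIELD_LABELS = {"market_name": "MKTSym", "alias": "alias", "full_name": "full name"}


def format_entry(entry):
    """Render official name, gene ID, and the hit field for display."""
    text = f'{entry["official_name"]} (ID: {entry["gene_id"]}'
    # Assumption: a hit on the official name is not repeated in parentheses.
    if entry["hit_field"] != "official_name":
        label = FIELD_LABELS[entry["hit_field"]]
        text += f', {label}: {entry["hit_value"]}'
    return text + ")"
```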
Each module in the above gene data retrieval device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 6, and fig. 6 is an internal structure diagram of the computer device in one embodiment. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of gene data retrieval. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that all or part of the methods described above may be accomplished by a computer program stored on a non-transitory computer-readable storage medium; when executed, the computer program may carry out the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (11)

1. A method of retrieving genetic data, the method comprising:
acquiring gene information to be searched, which is input by a user;
matching a target gene item in the gene retrieval item according to the gene information to be retrieved;
acquiring a selection instruction input by a user based on a target gene item;
and extracting corresponding gene association data from a gene association database according to the selection instruction and the target gene entry.
2. The method according to claim 1, wherein said matching a target gene entry in a gene search entry according to the gene information to be searched comprises: the gene search entry includes a plurality of gene entries; each of the genetic entries includes a plurality of field information;
determining field information to be searched corresponding to the gene information to be searched according to the gene information to be searched;
matching a plurality of target gene items in the gene retrieval items according to the field information to be retrieved;
and sorting the target gene items, and displaying the sorted target gene items.
3. The method according to claim 2, wherein determining, according to the to-be-retrieved genetic information, to-be-retrieved field information corresponding to the to-be-retrieved genetic information includes: the plurality of field information includes: unique identification ID, gene ID, market name, official name, alias, full name, and species; the gene information to be searched is more than or equal to two characters;
if the gene information to be searched is a positive integer, determining that the field information to be searched comprises a gene ID;
if the gene information to be searched is an English word, determining that the field information to be searched comprises a full name;
and if the gene information to be searched is not a positive integer and is not an English word, determining that the field information to be searched comprises a market name, an official name, an alias and a full name.
4. The method of claim 2, wherein matching a plurality of target gene entries in a gene search entry according to the field information to be searched comprises:
acquiring species information input by a user;
and matching a plurality of target gene items in the gene retrieval items according to the species information and the field information to be retrieved.
5. The method of claim 2, wherein sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries comprises:
and sorting the target gene items according to the gene IDs from small to large, and displaying the sorted target gene items.
6. The method of claim 2, wherein sorting the plurality of target gene entries and displaying the sorted plurality of target gene entries comprises: if the gene information to be searched is not a positive integer and is not an English word;
acquiring hit field information of a plurality of target gene entries and weight information corresponding to the hit field information;
sorting a plurality of target gene entries according to the weight information corresponding to the hit field information;
and de-duplicating the sorted target gene items, and displaying the de-duplicated target gene items.
7. The method of claim 6, wherein the sorting the plurality of target gene entries according to the weight information corresponding to hit field information comprises:
acquiring a first character length of the gene information to be searched;
acquiring a second character length corresponding to the hit field information;
determining the score value of the corresponding target gene item according to the first character length, the second character length and the weight information corresponding to the hit field information;
and sorting a plurality of target gene items according to the scoring values.
8. The method of claim 2, wherein displaying the ordered plurality of target gene entries comprises:
displaying the preset field information in the sorted target gene items; the preset field information includes: gene ID, official name, hit field information.
9. A genetic data retrieval device, the device comprising:
the acquisition module is used for acquiring the gene information to be searched, which is input by a user;
the matching module is used for matching target gene items in the gene retrieval items according to the gene information to be retrieved;
the selection module is used for acquiring a selection instruction input by a user based on the target gene item;
and the extraction module is used for extracting corresponding gene association data from a gene association database according to the selection instruction and the target gene entry.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202311286697.9A 2023-10-08 2023-10-08 Gene data retrieval method, device, computer equipment and storage medium Active CN117033735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311286697.9A CN117033735B (en) 2023-10-08 2023-10-08 Gene data retrieval method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117033735A true CN117033735A (en) 2023-11-10
CN117033735B CN117033735B (en) 2024-01-16

Family

ID=88632187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311286697.9A Active CN117033735B (en) 2023-10-08 2023-10-08 Gene data retrieval method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117033735B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040051748A (en) * 2002-12-11 2004-06-19 바이오인포메틱스 주식회사 Apparatus and method for performing genome sequence analysis and data management
US20060064413A1 (en) * 2003-07-31 2006-03-23 Yuichi Uzawa Data retrieval method and device
JP2007299039A (en) * 2006-04-27 2007-11-15 Kanebo Cosmetics Inc Method for searching gene information
US20170091245A1 (en) * 2015-09-28 2017-03-30 International Business Machines Corporation Index management
CN110866091A (en) * 2019-11-19 2020-03-06 杭州数梦工场科技有限公司 Data retrieval method and device
CN113658644A (en) * 2021-07-05 2021-11-16 深圳大学 Gene database system
CN113901006A (en) * 2021-10-13 2022-01-07 国家计算机网络与信息安全管理中心 Large-scale gene sequencing data storage and query system
CN115762632A (en) * 2022-11-23 2023-03-07 中山大学 Construction method of gene information query system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李淑芝;王峰;: "USGENE数据库基因序列检索方法及比较研究", 情报科学, no. 08 *
王栋, 梁蜀忠, 孙金立, 李广德: "抑郁症相关基因数据库的构建", 中国临床康复, no. 32 *

Also Published As

Publication number Publication date
CN117033735B (en) 2024-01-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant