CN112567345A - Data mining method and electronic equipment - Google Patents

Publication number
CN112567345A
Authority
CN
China
Prior art keywords
entity
topic
subject
entities
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201980001876.9A
Other languages
Chinese (zh)
Other versions
CN112567345B (en)
Inventor
牛钢
季小阳
张中海
冯震东
张春明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Phil Rivers Technology Ltd
Original Assignee
Phil Rivers Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Phil Rivers Technology Ltd
Publication of CN112567345A
Application granted
Publication of CN112567345B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data mining method comprises the following steps: obtaining a literature entity database related to a specified topic (S11); identifying genes that are precisely related to the topic based on the literature entity database (S12); and performing data mining based on the literature entity database and the genes that are precisely related to the topic (S13). Because the genes precisely related to the specified topic are identified from the literature entity database and then serve as the precondition for data mining, the subsequent data mining is biologically functional and accurate, and the mined data are easier for users to understand and use.

Description

Data mining method and electronic equipment
Technical Field
The application relates to a data mining method and electronic equipment.
Background
Literature mining is an automated method of analyzing text data and draws on several research fields, including data mining, text mining, and natural language processing. Techniques commonly used in mining biomedical literature include information extraction, text classification, entity identification, and entity association analysis. Information extraction generally relies on a query information acquisition system and a document information acquisition system. Entity identification commonly uses dictionary-based, rule-based, and statistics-based methods. Entity association analysis generally relies on entity co-occurrence frequency and on methods based on natural language processing. A knowledge-driven automatic literature mining method built from these techniques can effectively extract information of interest from scientific literature, discover interactions among entities that are hidden in the literature, and reduce the burden and cost of information extraction. In summary, the goal of literature mining is to provide people with easy-to-read descriptive text about a particular topic, and that text must match human reading habits and comprehension.
Existing literature mining methods in the biological field have the following defects:
In a first aspect, topic specificity is lacking. Literature mining generally extracts directly from the content of a topic literature data set, so the result includes all gene and biological entity information related to the "topic". The main problems are that (1) part of the extracted knowledge is only weakly related to the topic, and (2) too much knowledge is extracted, which hinders human understanding. Both add noise to the extracted entities that are precisely related to the topic.
In a second aspect, functional information about the entities (other than gene/protein entities) is lacking. Literature mining generally extracts biological entities directly with dictionary-based, rule-based, and similar methods. The extracted entities (other than genes/proteins) cannot describe the biological or medical function relevant to the topic, and therefore cannot help people accurately understand the biological or medical significance of knowledge in the topic field.
In a third aspect, the knowledge lacks systematic organization. Literature mining usually just combines techniques such as entity identification and entity association analysis. It rarely meets needs that aid human understanding, such as marking the entities precisely related to a topic, classifying entities in a standardized way, and performing signaling pathway function analysis on entity categories. It is therefore difficult to build a systematic knowledge database from topic-related data.
Technical problem
The application aims to provide a new data mining method which can facilitate users to understand and use data obtained by mining.
Technical solution
One aspect of the present application provides a data mining method, which is executed by an electronic device, where the electronic device includes: a memory, a processor, and a program stored in the memory, the processor configured to implement, when executing the program:
obtaining a document entity database related to a specified topic;
identifying genes that are precisely related to the topic based on the literature entity database; and
performing data mining based on the literature entity database and the genes that are precisely related to the topic.
In one embodiment, the data mining based on the literature entity database and genes that are precisely related to the topic comprises:
identifying entities that are precisely related to the topic based on the literature entity database and the genes that are precisely related to the topic; and
performing data mining based on the entities that are precisely related to the topic.
In one embodiment, the data mining based on entities that are precisely related to the topic comprises:
classifying entities that are precisely related to the topic into a predetermined number of topic entity categories.
In one embodiment, the data mining based on entities that are precisely related to the topic comprises: obtaining the strength of association between the genes that are precisely related to the topic and an entity or a topic entity category.
In one embodiment, the data mining based on entities that are precisely related to the topic further comprises: performing signaling pathway analysis on the topic entity categories.
In one embodiment, the data mining based on entities that are precisely related to the topic comprises: obtaining a set of topic documents that are significantly related to a topic entity category.
Another aspect of the present application provides an electronic device, including: a memory, a processor, and a program stored in the memory, the program configured to be executed by the processor, the processor when executing the program implementing: the data mining method as described above.
Advantageous effects
In some embodiments of the present application, genes that are precisely related to a specified topic are identified based on a document entity database related to the specified topic, and the genes that are precisely related to the specified topic are used as a precondition for data mining, so that subsequent data mining has biological functionality and accuracy, and a user can conveniently understand and use data obtained by mining.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart illustrating a data mining method according to a first embodiment of the present application;
Figs. 2A to 2C are schematic flow charts of a data mining method according to a second embodiment of the present application;
Figs. 3 to 7 are schematic flow charts of data mining methods according to third to seventh embodiments of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the application.
Modes for carrying out the invention
To help those skilled in the art better understand the technical solutions, the technical solutions in the embodiments of the present application are described clearly below with reference to the drawings in those embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The term "comprising" and any variations thereof in the description, the claims, and the above drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may also include steps or elements that are not listed or that are inherent to the process, method, article, or apparatus. Further, the terms "first", "second", "third", and so on are used to distinguish different objects, not to describe a particular order.
Example one
Fig. 1 is a schematic flow chart of a data mining method according to an embodiment of the present application, which may be executed by an electronic device, where the electronic device includes: a memory, a processor, and a program stored in the memory, the processor configured to implement the following steps when executing the program:
S11, obtaining a literature entity database related to the specified topic;
S12, identifying genes precisely related to the topic based on the literature entity database; and
S13, performing data mining based on the literature entity database and the genes precisely related to the topic.
In this embodiment, the genes precisely related to the specified topic are identified based on the document entity database related to the specified topic, and the genes precisely related to the specified topic are used as the precondition for data mining, so that the subsequent data mining has biological functionality and accuracy, and the user can understand and use the data obtained by mining conveniently.
Preferably, the data mining of S13 based on the literature entity database and genes precisely related to the topic comprises:
S131, identifying entities precisely related to the topic based on the literature entity database and the genes precisely related to the topic; and
S132, performing data mining based on the entities precisely related to the topic.
In this embodiment, related biological entities are mined with the genes as the precondition, so the extracted biological entities are precisely related to the genes. The extracted entities are therefore functional and can further explain the biological significance to be clarified.
Example two
Fig. 2A to 2C are schematic flowcharts illustrating a data mining method according to another embodiment of the present application, and this embodiment provides a specific implementation of the data mining method according to the first embodiment, where:
Referring to Fig. 2A, obtaining the literature entity database related to the specified topic at S11 in Fig. 1 may further include:
S211, performing literature retrieval in a specified literature database with a specified topic term as the keyword to obtain a topic literature data set;
A literature database in this application refers to a collection of computer-readable literature information, including public databases, commercial databases, and public electronic publications.
Specifically, a specified topic term input by a user may be detected. After the input is detected, literature retrieval is performed in the specified literature database with the specified topic term as the keyword to obtain the documents that include the term, and specified information of each retrieved document is downloaded to obtain the topic literature data set. The topic literature data set comprises at least the identifier of each retrieved document, the content of each document, and the correspondence between document identifiers and document contents. The document content may include an abstract and/or a full text.
The specified topic term relates to the specified biological topic to be studied, and there may be one or more specified topic terms.
The specified information includes, but is not limited to, the abstract and/or full text of the document, and may also include, for example, the title, the authors, and/or their affiliations.
S212, performing text splitting on the content of each document in the topic literature data set to obtain a topic entity library (TED). The topic entity library comprises at least the identifier of each document, the entities obtained by text splitting of each document's content, and the correspondence between each document identifier and the entities of the corresponding document content.
In medicine and biology, among other fields, the most valuable phrases in the literature are the terms used to describe the field of research, such as genes, proteins, drug biomolecules, diseases, names of people, affiliations, and experimental methods. These are collectively referred to as entities.
In one embodiment, after text splitting, the resulting vocabulary may be normalized to a standard form.
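The structure of the entity library produced by S212 can be sketched as follows. This is a minimal illustration that assumes a toy dictionary-based splitter; the function names `split_entities` and `build_entity_library`, the sample documents, and the vocabulary are all hypothetical and are not taken from the application.

```python
import re

def split_entities(text, vocabulary):
    """Return the vocabulary terms found in the text.

    Matching is done on lower-cased tokens, which also serves as a very
    simple normalization step (an assumption of this sketch).
    """
    tokens = set(re.findall(r"[A-Za-z0-9\-]+", text.lower()))
    return {term for term in vocabulary if term.lower() in tokens}

def build_entity_library(dataset, vocabulary):
    """Map each document identifier to the entities split out of its content."""
    return {doc_id: split_entities(content, vocabulary)
            for doc_id, content in dataset.items()}

# Hypothetical topic literature data set: identifier -> abstract text.
topic_dataset = {
    "doc1": "TP53 mutations are frequent in glioma.",
    "doc2": "EGFR signalling and apoptosis in glioma cells.",
}
vocabulary = {"TP53", "EGFR", "glioma", "apoptosis"}
ted = build_entity_library(topic_dataset, vocabulary)
```

The resulting dictionary keeps the correspondence between each document identifier and its entities, which is the minimum content the text requires of the library.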
S213, performing literature retrieval in the specified literature database with a specified reference term as the keyword to obtain a reference literature data set;
Specifically, a specified reference term input by a user may be detected. After the input is detected, literature retrieval is performed in the specified literature database with the reference term as the keyword to obtain the documents that include the term, and at least the abstract and/or full text of each retrieved document is downloaded to obtain the reference literature data set. The reference literature data set comprises at least the identifier of each retrieved document, the content of each document, and the correspondence between document identifiers and document contents.
To further improve the accuracy of the entities effectively associated with the specified topic term, the present application may use a literature data set retrieved with the specified reference term as the keyword as a random corpus for comparison. In one embodiment, "public health" is specified as the reference term, because the "public health" literature data set contains a wide range of proteins, genes, and related biomedical entities. Each extraction from it corresponds to statistical sampling from the overall "public health" literature data set, and each extracted literature data set represents a random database of biomedical entities.
In one embodiment, all or part of the documents retrieved with the specified reference term as the keyword may be used as the reference literature data set (also referred to as the first reference literature data set).
In another embodiment, a reference literature data set (also referred to as the second reference literature data set) may be randomly extracted from the documents retrieved with the specified reference term as the keyword, such that it contains the same number of documents as the topic literature data set. This eliminates interference caused by differing data set sizes.
S214, performing text splitting on the content of each document in the reference literature data set to obtain a reference entity library. The reference entity library comprises at least the identifier of each document, the entities obtained by text splitting of each document's content, and the correspondence between each document identifier and the entities of the corresponding document content.
The reference entity library obtained by performing steps S213 and S214 with "public health" as the reference term may also be referred to as the Public-Health Entity library (PED).
S215, comparing the topic entity library with the reference entity library to obtain a topic-specific entity library (TPE) and a topic-shared entity library (THE). The topic-shared entity library comprises the entities shared by the topic entity library and the reference entity library, and the topic-specific entity library comprises the entities that remain after the entities in the topic-shared entity library are removed from the topic entity library.
To ensure that the extracted biological entities are functional and accurate, the present application identifies gene entities from the topic entity library. Referring to Fig. 2B, identifying genes precisely related to the topic based on the literature entity database at S12 in Fig. 1 includes:
S221, comparing the entities in the topic entity library with the genes in a specified gene library, and extracting the genes shared by the two to form a primary list of gene entities.
The specified gene library may be the gene library of a specified species, for example the official human gene library, which may be the HUGO Gene Nomenclature Committee (HGNC) database.
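The library comparisons in S215 and S221 amount to set operations. The sketch below assumes each entity library is held as a dictionary from document identifier to a set of entities; the function names and the toy data are hypothetical.

```python
def all_entities(entity_library):
    """Flatten a {doc_id: set_of_entities} library into one set of entities."""
    entities = set()
    for doc_entities in entity_library.values():
        entities |= doc_entities
    return entities

def compare_libraries(topic_library, reference_library):
    """Return (topic-specific entities, topic-shared entities) as in S215."""
    topic = all_entities(topic_library)
    reference = all_entities(reference_library)
    shared = topic & reference   # entities present in both libraries
    specific = topic - shared    # what remains of the topic library
    return specific, shared

def primary_gene_list(topic_library, gene_symbols):
    """Return the entities that are also symbols in the gene library (S221)."""
    return all_entities(topic_library) & set(gene_symbols)

# Hypothetical libraries and gene symbols, for illustration only.
ted = {"d1": {"TP53", "glioma"}, "d2": {"EGFR", "health"}}
ped = {"r1": {"health", "TP53"}, "r2": {"policy"}}
tpe, the = compare_libraries(ted, ped)
genes = primary_gene_list(ted, ["TP53", "EGFR", "BRCA1"])
```

A real gene library would also need alias and case handling, which this sketch leaves out.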
S222, determining the gene entities precisely related to the topic in the primary list of gene entities according to the topic literature data set, the reference literature data set, and a preset screening criterion for genes precisely related to the topic.
Statistical investigation of literature entities indicates that the more frequently an entity occurs in the literature, the more likely it is a widely used biological entity precisely related to the topic. In this example, the screening criterion for genes precisely related to the topic may be: the gene entity appears in the topic literature data set, and the odds ratio of the gene entity appearing in the topic literature data set versus the reference literature data set is greater than a predetermined value.
In one embodiment, screening can be based on the number of documents in which each gene entity in the primary list of gene entities appears. Specifically, the gene entities in the primary list are compared with the topic literature data set and with the reference literature data set, and the odds ratio of each gene entity's occurrence in the two data sets is calculated to obtain the list of genes precisely related to the topic. Formula (1) shows how the odds ratio is calculated.
$$OR_i = \frac{f_{A_i} / f_{\bar{A}_i}}{f_{B_i} / f_{\bar{B}_i}} \qquad (1)$$
(The formula image is not reproduced in this text; the expression is reconstructed from the symbol definitions that follow.)
In formula (1), $i$ denotes the $i$-th gene entity in the primary list of gene entities, $A$ denotes the topic literature data set containing the gene entity, $\bar{A}$ denotes the topic literature data set that does not contain the gene entity, and $f$ denotes the number of documents (also referred to as the frequency) in the corresponding data set whose content does or does not contain the gene entity. $B$ denotes the reference literature data set containing the gene entity, and $\bar{B}$ denotes the reference literature data set that does not contain it. In this example, the screening criterion for genes precisely related to the topic is specifically: the gene entity occurs at least once in the topic literature data set, and its odds ratio between the topic literature data set and the reference literature data set is greater than 5, i.e. $f_{A_i} \geq 1$ and $OR_i > 5$.
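The gene screening in S222 can be sketched as below. The sketch assumes the standard odds ratio, that is, the odds that a topic document contains the gene divided by the odds that a reference document contains it; the function names and the handling of zero counts are assumptions of this illustration, not details stated in the text.

```python
def odds_ratio(topic_with, topic_without, ref_with, ref_without):
    """Odds ratio of a gene appearing in topic versus reference documents.

    Arguments are document counts: topic documents with / without the gene,
    then reference documents with / without the gene.
    """
    numerator = topic_with * ref_without
    denominator = topic_without * ref_with
    if denominator == 0:
        # Assumption: treat a zero denominator as an unbounded ratio when
        # the gene occurs in the topic set, and as 0.0 otherwise.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

def is_precisely_related_gene(topic_with, topic_without, ref_with, ref_without,
                              threshold=5.0):
    """Apply the example criterion: at least one topic document and OR > 5."""
    ratio = odds_ratio(topic_with, topic_without, ref_with, ref_without)
    return topic_with >= 1 and ratio > threshold

# Hypothetical counts: 30 of 100 topic documents and 5 of 100 reference
# documents mention the gene, giving (30 * 95) / (70 * 5).
example_ratio = odds_ratio(30, 70, 5, 95)
```

With these hypothetical counts the ratio is about 8.1, so the gene would pass the example threshold of 5.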
Referring to Fig. 2C, identifying the entities precisely related to the topic based on the literature entity database and the gene entities precisely related to the topic at S131 comprises:
S231, obtaining the documents in the topic literature data set that contain a gene entity precisely related to the topic to form a topic literature set α;
S232, forming a candidate entity library from the entities of all document contents in the topic literature set α;
S233, obtaining a first partial topic-related candidate entity library according to the candidate entity library and the topic-shared entity library;
S234, obtaining a second partial topic-related candidate entity library according to the candidate entity library and the topic-specific entity library;
S235, combining the entities in the first partial topic-related candidate entity library with the entities in the second partial topic-related candidate entity library to form a topic candidate entity library.
S233 may specifically include:
S2331, comparing the entities in the topic-shared entity library with the candidate entities in the topic literature set α, and counting, for each entity HEj in the topic-shared entity library, the number of documents (also called the frequency) in the topic literature set α in which HEj appears.
S2332, randomly extracting from the first reference literature data set as many documents as there are in the topic literature set α to form a second reference literature data set, repeating the random extraction N1 times (for example, 100 times), comparing each entity HEj with the entities in each randomly extracted second reference literature data set, and counting the number of documents (the frequency) in which HEj appears in each of them.
S2333, calculating the standard score of each entity HEj from its frequencies in the topic literature set α and in the second reference literature data sets, judging according to a first predetermined condition whether each entity HEj is a first topic-related candidate entity, and forming the first partial topic-related candidate entity library from the entities HEj for which the judgment is yes. The first predetermined condition may be that the standard score of the frequency of HEj in the topic literature set α relative to the second reference literature data set reaches a predetermined value.
The formula for calculating the standard score is shown in formula (2):
$$Z_j = \frac{f_{C_j} - \mu_{D_j}}{\sigma_{D_j}} \qquad (2)$$
(The formula image is not reproduced in this text; the expression is reconstructed from the symbol definitions that follow.)
In formula (2), $j$ denotes the $j$-th entity in the topic-shared entity library, $f$ denotes frequency, $C$ denotes the topic literature set α, $D$ denotes the randomly extracted second reference literature data set, and $\mu_{D_j}$ and $\sigma_{D_j}$ denote, respectively, the mean and standard deviation of $f_{D_j}$ over the N1 random simulations of the $j$-th entity against the second reference literature data set. In this example, the screening criterion for the first partial topic-related candidate entities is: the standard score of the frequency of the entity in the topic literature set α relative to the second reference literature data set is greater than or equal to 6, i.e. $Z_j \geq 6$.
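The resampling in S2332 and the standard score in S2333 can be sketched as follows. Documents are modelled as sets of entities; the function names, the fixed random seed, the use of the population standard deviation, and the handling of a zero standard deviation are assumptions of this sketch.

```python
import random
import statistics

def document_frequency(entity, documents):
    """Number of documents (each a set of entities) that contain the entity."""
    return sum(1 for doc in documents if entity in doc)

def standard_score(entity, topic_docs, reference_docs, n_draws=100, seed=0):
    """Z-score of the entity's topic frequency against random reference draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample_size = min(len(topic_docs), len(reference_docs))
    # Draw n_draws random reference sets of the same size as the topic set
    # and record how many documents in each draw contain the entity.
    simulated = [
        document_frequency(entity, rng.sample(reference_docs, sample_size))
        for _ in range(n_draws)
    ]
    mean = statistics.mean(simulated)
    spread = statistics.pstdev(simulated)
    observed = document_frequency(entity, topic_docs)
    if spread == 0:
        # Assumption: with no variation in the draws, report an unbounded
        # score only when the topic frequency exceeds the reference mean.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / spread

# Hypothetical data: "x" is in every topic document but in only 10 of 200
# reference documents.
topic_docs = [{"x"}] * 20
reference_docs = [{"x"}] * 10 + [{"y"}] * 190
z_x = standard_score("x", topic_docs, reference_docs)
```

An entity would be kept as a first-part candidate when its score meets the chosen threshold, which is 6 in the example criterion.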
S234 may specifically include: comparing the entities in the topic-specific entity library with the candidate entities in the topic literature set α; counting, for each entity PEk in the topic-specific entity library, the number of documents (the frequency) in the topic literature set α in which PEk appears; judging from that frequency and a second predetermined condition whether each entity PEk is a second topic-related candidate entity; and forming the second partial topic-related candidate entity library from the entities PEk for which the judgment is yes. In one specific implementation, the second predetermined condition may be that the frequency of the entity in the topic literature set α reaches a predetermined value (e.g., greater than 3).
After the topic candidate entity library is obtained in S235, the following process may be performed to obtain the entities precisely related to the topic.
S236, performing entity preprocessing on the topic candidate entity library to obtain a topic candidate literature set.
In the context of document content, some entity nouns occur in both singular and plural forms and have multiple synonyms. In the present application, entity preprocessing may be performed on the extracted topic candidate entity library as follows:
First, the entities in the topic candidate entity library are divided into two classes. Entities that exist in only one form are called Single Form Entities (SFE), and together they form a single-form entity library. Entities that have singular and plural forms and synonyms are called Complex Form Entities (CFE), and together they form a complex-form entity library.
In one implementation, an entity identifier is assigned to each entity in the topic candidate entity library. Each single-form entity has a unique entity identifier, and the singular form, plural form, and synonyms of the same entity share one entity identifier, which facilitates subsequent classification and screening.
Next, the topic single-form entities in the single-form entity library are determined according to the topic literature data set, the reference literature data set, and a preset screening criterion for topic single-form entities. Each single-form entity may be compared with the topic literature data set and with the reference literature data set, and the ratio of its frequencies in the two data sets is calculated to obtain the topic single-form entities. The screening criterion may be: the single-form entity appears at least a preset number of times in the topic literature data set, and the ratio of its frequency in the topic literature data set to its frequency in the reference literature data set reaches a preset value. Formula (3) shows how this entity odds ratio is calculated.
$$OR_{ii} = \frac{f_{E_{ii}} / S_E}{f_{F_{ii}} / S_F} \qquad (3)$$
(The formula image is not reproduced in this text; the expression is reconstructed from the symbol definitions that follow.)
In formula (3), $ii$ denotes a single-form entity, $f$ denotes a frequency, $E$ denotes the topic literature data set, $F$ denotes the reference literature data set, and $S$ denotes the number of documents in the corresponding literature data set. In this embodiment, the screening criterion for topic single-form entities is: the single-form entity appears at least 20 times in the topic literature data set, and the ratio of its frequency in the topic literature data set to that in the reference literature data set is greater than or equal to 5, i.e. $f_{E_{ii}} \geq 20$ and $OR_{ii} \geq 5$.
Likewise, the topic complex-form entities in the complex-form entity library are determined according to the topic literature data set, the reference literature data set, and a preset screening criterion for topic complex-form entities. Each complex-form entity may be compared with the topic literature data set and with the reference literature data set, and the ratio of its frequencies in the two data sets is calculated to obtain the topic complex-form entities. The screening criterion may be: the ratio of the frequency of the complex-form entity in the topic literature data set to its frequency in the reference literature data set reaches a preset value. Formula (4) shows how this entity odds ratio is calculated.
$$OR_{jj} = \frac{f_{E_{jj}} / S_E}{f_{F_{jj}} / S_F} \qquad (4)$$
(The formula image is not reproduced in this text; the expression is reconstructed from the symbol definitions that follow.)
In formula (4), $jj$ denotes a complex-form entity, $f$ denotes a frequency, $E$ denotes the topic literature data set, $F$ denotes the reference literature data set, and $S$ denotes the number of documents in the corresponding literature data set. In this embodiment, the screening criterion for topic complex-form entities is: the ratio of the frequency of the complex-form entity in the topic literature data set to that in the reference literature data set is greater than or equal to 5, i.e. $OR_{jj} \geq 5$.
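The two screening rules above can be sketched as follows. The sketch assumes the ratio is the entity's per-document frequency in the topic data set divided by its per-document frequency in the reference data set. That reading of the size-normalized ratio is an assumption, as are the function names and the handling of a zero reference frequency.

```python
def frequency_ratio(topic_count, topic_size, ref_count, ref_size):
    """Ratio of per-document frequencies, topic data set over reference data set."""
    topic_rate = topic_count / topic_size
    ref_rate = ref_count / ref_size
    if ref_rate == 0:
        # Assumption: an entity absent from the reference set has an
        # unbounded ratio if it occurs in the topic set at all.
        return float("inf") if topic_rate > 0 else 0.0
    return topic_rate / ref_rate

def keep_single_form(topic_count, topic_size, ref_count, ref_size,
                     min_count=20, min_ratio=5.0):
    """Single-form rule: at least min_count occurrences and ratio >= min_ratio."""
    ratio = frequency_ratio(topic_count, topic_size, ref_count, ref_size)
    return topic_count >= min_count and ratio >= min_ratio

def keep_complex_form(topic_count, topic_size, ref_count, ref_size,
                      min_ratio=5.0):
    """Complex-form rule: ratio >= min_ratio. Counts pool all forms of the entity."""
    return frequency_ratio(topic_count, topic_size, ref_count, ref_size) >= min_ratio

# Hypothetical counts: 50 occurrences in 1000 topic documents against
# 10 occurrences in 2000 reference documents.
example = frequency_ratio(50, 1000, 10, 2000)
```

For a complex-form entity, the counts passed in would be the totals over every form that shares one entity identifier.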
Then, each topic single-form entity and each topic complex-form entity is compared with the topic literature data set, and the documents containing these entities are selected to form the topic candidate literature set.
S237, filtering the candidate entities in the topic candidate literature set according to a preset filtering condition to obtain the entities precisely related to the topic.
In a specific implementation, the preset filtering condition is as follows: the candidate documents in the topic candidate literature set that each contain at least a predetermined number (for example, 10) of topic candidate entities form a candidate set, and a topic candidate entity that appears in the candidate set at least a predetermined number of times (for example, once) is an entity precisely related to the topic.
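The filtering condition in S237 can be sketched as follows, assuming each candidate document is represented by the set of entities it contains. The function name and the toy data are hypothetical, and the example lowers the per-document threshold to 2 only to keep the data small.

```python
def precisely_related_entities(candidate_docs, candidate_entities,
                               min_entities_per_doc=10, min_occurrences=1):
    """Keep candidate entities that occur in entity-dense candidate documents.

    candidate_docs maps a document identifier to the set of entities in it.
    """
    candidates = set(candidate_entities)
    counts = {}
    for doc_entities in candidate_docs.values():
        found = doc_entities & candidates
        # Only documents with enough candidate entities join the candidate set.
        if len(found) < min_entities_per_doc:
            continue
        for entity in found:
            counts[entity] = counts.get(entity, 0) + 1
    return {entity for entity, n in counts.items() if n >= min_occurrences}

docs = {"d1": {"a", "b", "c"}, "d2": {"a", "z"}, "d3": {"d"}}
kept = precisely_related_entities(docs, {"a", "b", "c", "d"},
                                  min_entities_per_doc=2)
```

Here only `d1` is dense enough to join the candidate set, so entity `d` is dropped even though it is a candidate.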
Example three
Fig. 3 is a schematic flow chart of a data mining method according to another embodiment of the present application. On the basis of the first embodiment, this embodiment provides a specific implementation of S132, data mining based on entities precisely related to the topic. In this embodiment, S132 specifically includes:
S24, clustering and classifying the entities precisely related to the topic.
After the entities precisely related to the topic are classified in a standardized way, biological entities with close relationship distances are gathered into one category, and the biological entities contained in each topic entity category jointly illustrate the knowledge content of a certain aspect of the topic. This makes it easier for researchers to further collate and mine knowledge about the biological characteristics of interest.
In the present application, clustering can be performed according to the relationship distance between each pair of entities, so that closely related entities are clustered into one category.
In one implementation, the entities precisely related to the topic may be compared with the candidate set mentioned in S236, which is composed of candidate documents each containing at least a predetermined number (e.g., 10) of topic candidate entities, and the distance between each pair of entities may be calculated for clustering. In particular, bisecting clustering can be employed: the entities precisely related to the topic are first separated into two categories (also referred to as clusters); the cluster containing more data is then bisected again, and so on, until the number of clusters equals the specified number, at which point the entities precisely related to the topic have been divided into the specified number of categories.
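The control loop of the bisecting procedure (split, then split the largest cluster again until the specified number is reached) can be sketched as below. The names bisecting_clusters and split_by_mean are hypothetical, and the toy splitter on plain numbers is only a stand-in for whatever two-way split is actually used on the entity distance data.

```python
def bisecting_clusters(items, split_in_two, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(items)]
    while len(clusters) < k:
        # Choose the cluster holding the most members for the next split.
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break  # nothing left that can be split
        left, right = split_in_two(largest)
        if not left or not right:
            break  # the splitter could not separate this cluster
        clusters.remove(largest)
        clusters.extend([left, right])
    return clusters


def split_by_mean(cluster):
    """Toy two-way split for numbers: below the mean versus the rest."""
    mean = sum(cluster) / len(cluster)
    return ([x for x in cluster if x < mean],
            [x for x in cluster if x >= mean])


result = bisecting_clusters([1, 2, 3, 10, 11, 12, 50, 51], split_by_mean, k=3)
```

Keeping the splitter as a parameter separates the stopping rule (number of clusters) from the choice of distance and linkage, which the text specifies separately.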
In one particular implementation, a single bisection of the entities precisely related to the topic may be performed with the pdist, linkage, and cluster functions.
The pdist function calculates the pairwise distances between the rows of a matrix. Here, X is the matrix relating the topic candidate document set to the precisely related entities, so the distance between each pair of entities is obtained by calculating the distance between each pair of row vectors in X. The distance parameter specifies the distance calculation method; because the aim is mainly to measure the degree of linear correlation between two vectors in X, the correlation distance ('correlation') can be adopted. The calling format of the pdist function is as follows:
D=pdist(X,distance)
The linkage function defines the linkage between variables. The 'method' parameter specifies the algorithm used to compute the hierarchical cluster tree, and D is the distance vector returned by the pdist function. The hierarchical cluster tree can be computed by the minimum variance method (ward), and the clustering result can be visualized with the dendrogram function. The calling format of the linkage function is as follows:
Z=linkage(D,'method')
The cluster function creates the clusters: the classification is created from the output Z of the linkage function, and c is the threshold for cutting Z into clusters. The calling format of the cluster function is as follows:
T=cluster(Z,c)
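To make the first step concrete, the correlation distance between two row vectors is one minus their Pearson correlation. The following standard-library Python sketch computes it for every pair of rows and returns a condensed vector in the pair order (0,1), (0,2), ..., which is intended to mirror the kind of distance vector D described above. It is an illustration only and does not reproduce the linkage or cluster steps; the function names and the toy matrix are hypothetical.

```python
import math
from itertools import combinations


def correlation_distance(u, v):
    """Return 1 minus the Pearson correlation of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if den == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - num / den


def pairwise_distances(rows):
    """Condensed distance vector over all row pairs (i, j) with i < j."""
    return [correlation_distance(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)]


# Each row stands for one entity, each column for one candidate document.
X = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],   # perfectly correlated with row 0
    [4, 3, 2, 1],   # perfectly anti-correlated with row 0
]
D = pairwise_distances(X)
```

A distance of 0 means two entities vary together across documents, and a distance of 2 means they vary in exactly opposite directions.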
Example four
Fig. 4 is a schematic flow chart of a data mining method according to another embodiment of the present application. On the basis of the first embodiment, this embodiment provides another specific implementation of S132, data mining based on entities precisely related to the topic. In this embodiment, the strength of association between the genes and the entities that are precisely related to the subject may be further analyzed. In one implementation, this strength of association can be analyzed by determining a Relational Distance (RD) between each gene and each entity precisely related to the subject, including:
S441, for each gene Gi precisely related to the subject, selecting from the subject literature set alpha a literature set AGi containing the gene and a literature set NAGi not containing the gene;
S442, counting the average document length of the entity Ej precisely related to the subject in the literature set AGi containing the gene Gi and the average document length of the entity Ej in the literature set NAGi not containing the gene;
S443, calculating the difference between the average document lengths of the entity Ej in the two literature sets AGi and NAGi to obtain the first relationship distance RD1_ij between the gene Gi and the entity Ej precisely related to the subject.
To obtain more meaningful results and to filter out possible false positives, the significance of the relationship distance can be screened by means of standard scores.
S444, generating N2 (for example, 100) random topic literature sets based on the topic literature set alpha so as to perform random simulation. The topic literature set alpha can be represented by a matrix that comprises the identifier of each document, the content of each document, and the correspondence between document identifiers and document contents. In this example, each random topic literature set is obtained by randomly reordering the document identifiers in the topic literature set alpha, so that the correspondence between document identifiers and document contents changes randomly.
S445, for each gene Gi precisely related to the subject, selecting from the random topic literature set generated by each random simulation a literature set SAGi containing the gene and a literature set SNAGi not containing the gene. In this example, the correspondence between the genes precisely related to the topic and the document identifiers is unchanged, so randomly reordering the document identifiers in the topic literature set alpha also randomly changes the documents corresponding to the gene Gi.
S446, counting the average document length of the entity Ej precisely related to the subject in the literature set SAGi containing the gene Gi and the average document length of the entity Ej in the literature set SNAGi not containing the gene.
S447, calculating the difference between the average document lengths of the entity Ej in the two literature sets SAGi and SNAGi to obtain the second relationship distance RD2_ij between the gene Gi and the entity Ej for the random simulation.
S448, calculating the standard score from RD1_ij and RD2_ij to obtain the relationship distance between the gene Gi and the entity Ej. The smaller the calculated relationship distance, the more closely the gene Gi is related to the entity Ej.
Further, after obtaining the association strength between the genes and the entities precisely related to the topic, in order to more intuitively display the association strength between the genes and the topic entities, the standard scores may be ranked to obtain a relationship matrix of the ranked genes and the entities precisely related to the topic.
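Steps S441 to S448 amount to a permutation-style significance test, which can be sketched as follows. The sketch rests on several assumptions that the text does not fix: each document contributes one number for the entity Ej (the per-document quantity that is averaged), the difference is taken as "with gene" minus "without gene", and the standard score is the usual (observed - mean) / standard deviation over the simulated values. All function and variable names are hypothetical.

```python
import random
import statistics


def relationship_distance(values, has_gene):
    """Mean of values over documents with the gene minus the mean over the rest."""
    with_gene = [v for v, g in zip(values, has_gene) if g]
    without_gene = [v for v, g in zip(values, has_gene) if not g]
    return statistics.mean(with_gene) - statistics.mean(without_gene)


def simulated_distances(values, has_gene, n_sim=100, seed=0):
    """Shuffle document content against fixed gene labels, n_sim times."""
    rng = random.Random(seed)
    shuffled = list(values)
    distances = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        distances.append(relationship_distance(shuffled, has_gene))
    return distances


def standard_score(observed, simulated):
    """z-score of the observed value against the simulated distribution."""
    return (observed - statistics.mean(simulated)) / statistics.stdev(simulated)


# Toy data: one value per document for entity Ej, one gene flag per document.
values = [4, 5, 6, 1, 0, 2]
has_gene = [True, True, True, False, False, False]
rd1 = relationship_distance(values, has_gene)   # first relationship distance
rd2 = simulated_distances(values, has_gene)     # one value per simulation
z = standard_score(rd1, rd2)
```

Depending on the sign convention chosen for the difference, the score may need its sign flipped so that a smaller value indicates a closer relationship, as stated in S448.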
Example five
Fig. 5 is a schematic flow chart of a data mining method according to another embodiment of the present application. On the basis of the third embodiment, this embodiment provides another specific implementation of S132, data mining based on entities precisely related to the topic. After the entities precisely related to the topic are classified in a standardized way, biological entities with close relationship distances are gathered into one category, and the biological entities contained in each topic entity category jointly illustrate the knowledge content of a certain aspect of the topic. In this embodiment, the genes precisely related to the topic are associated with the topic entity categories, and the association strength between them is further analyzed. This can reveal the mechanism underlying each topic entity category from the perspective of the topic's biological function and present the knowledge structure of the topic more clearly and specifically, which makes it convenient for researchers to screen and study that knowledge structure.
In one specific implementation, the correlation strength between the genes precisely related to the subject and the subject entity category may be analyzed by determining a Relational Distance (RD) between the genes precisely related to the subject and the subject entity category, including:
S541, obtaining the entities precisely related to the topic that are contained in each topic entity category Ci;
S542, extracting, from the relationship matrix of the genes precisely related to the subject, the relationship distance between each entity of the subject entity category Ci and each gene precisely related to the subject, to form the relationship matrix of the subject entity category Ci and the genes precisely related to the subject.
S543, if the subject entity category Ci comprises a plurality of entities precisely related to the subject, calculating, for each gene Gi precisely related to the subject, the average relationship distance between the plurality of subject entities in the subject entity category Ci and the gene Gi, and marking it as RD3.
S544, performing random simulation a preset number of times (for example, 3000 times). Each random simulation randomly reassigns the category of each topic-related entity, for example by randomly distributing the category identifiers; the average relationship distance between the entities randomly assigned to the category Ci and the gene Gi is then calculated and marked as RD4.
S545, calculating the standard score from RD3 and RD4 to obtain the relationship distance between the subject entity category Ci and the gene Gi.
After the relationship distance between each topic entity category and each gene precisely related to the topic is obtained, an incidence matrix between the topic gene and the topic entity category can be obtained.
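For a single gene, steps S543 to S545 can be sketched as below. The sketch assumes that the random reassignment is a shuffle of the existing category identifiers (so category sizes are preserved) and that the standard score is (RD3 - mean(RD4)) / standard deviation(RD4); neither detail is stated in the text, and the entity names, category names, and function names are hypothetical.

```python
import random
import statistics


def category_mean_distance(rd_by_entity, labels, category):
    """Average entity-to-gene distance over the entities labelled category."""
    members = [rd_by_entity[e] for e, c in labels.items() if c == category]
    return statistics.mean(members)


def category_gene_score(rd_by_entity, labels, category, n_sim=3000, seed=0):
    """Standard score of RD3 against RD4 values from shuffled category labels."""
    rng = random.Random(seed)
    rd3 = category_mean_distance(rd_by_entity, labels, category)
    entities = list(labels)
    shuffled = [labels[e] for e in entities]
    rd4 = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # random reassignment of category identifiers
        random_labels = dict(zip(entities, shuffled))
        rd4.append(category_mean_distance(rd_by_entity, random_labels, category))
    return (rd3 - statistics.mean(rd4)) / statistics.stdev(rd4)


# Toy relationship distances between six entities and one gene.
rd_by_entity = {"e1": -2.0, "e2": -1.5, "e3": 0.5,
                "e4": 1.0, "e5": 0.0, "e6": 2.0}
labels = {"e1": "C1", "e2": "C1", "e3": "C2",
          "e4": "C2", "e5": "C2", "e6": "C2"}
rd3 = category_mean_distance(rd_by_entity, labels, "C1")
score = category_gene_score(rd_by_entity, labels, "C1", n_sim=200)
```

Repeating the calculation for every gene and every category would fill in the incidence matrix between topic genes and topic entity categories mentioned above.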
Example six
Fig. 6 is a schematic flow chart of a data mining method according to another embodiment of the present application, and this embodiment provides another specific implementation of S132 performing data mining based on an entity precisely related to the topic on the basis of the third embodiment. In this embodiment, after performing cluster analysis on entities precisely related to a topic, signal path analysis of the topic entity category may be performed.
A biological signal pathway is a series of complex signal transmissions that induce the expression of related genes, regulate cell division, and determine the fate of cells. Most biological functions of cells are coordinated through signal pathways and regulation mechanisms, so that cells respond to changes in the external environment and thereby adapt to it. Studying signal paths is therefore an important means of understanding complex life processes. In this embodiment, the subject entity categories may be combined with signal path function analysis, so as to reveal the functional mechanism corresponding to each subject entity category at the level of the signal path.
In one implementation, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database can be used. KEGG integrates genomic, chemical, and system functional information, associates gene catalogs with system functions at the cellular, species, and ecosystem levels, and includes pathway information for cellular biochemical processes such as metabolism, membrane transport, signaling, and the cell cycle.
Specifically, taking the KEGG signal path database as an example, the analysis of the signal path of the subject entity category may be implemented by the following method:
S641, downloading the KEGG signal path database to obtain the information of each signal path and the gene set contained in it.
The information provided in the signal pathway database includes the name of the gene contained in each pathway and the status of the gene in that pathway, e.g., up-or down-regulated.
S642, determining the subject entity category to be analyzed, and simultaneously mapping the subject genes contained in the subject entity category and the relationship distance between the subject entity category and the subject genes to the data set of the signal path according to the incidence matrix between the subject genes and the subject entity category.
S643, comparing the subject genes contained in the subject entity category with the genes contained in each signal path respectively to obtain intersection genes of the subject entity category and each signal path.
S644, extracting the corresponding activation state of the intersection genes in the signal path, and obtaining the relation distance between the intersection genes and the subject entity category according to the incidence matrix between the subject genes and the subject entity category.
S645, respectively calculating an Average value (Up Average, UA) of the relationship distance between the Up-regulated intersection gene and the subject entity category and an Average value (Down Average, DA) of the relationship distance between the Down-regulated intersection gene and the subject entity category in the signal path.
S646, calculating the difference between UA and DA for each signal path to obtain the path signature corresponding to the subject entity category.
In order to show the association strength between the subject entity category and the gene more intuitively, the sign of the path signature can be converted so that the larger the value, the more readily the signal path is activated in the corresponding subject entity category.
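Steps S643 to S646 reduce to a set intersection followed by two averages, as the following sketch shows for one signal path and one subject entity category. The input layout (a dictionary from gene to "up" or "down" for the path, and a dictionary from gene to relationship distance for the category), the gene names, and the function name are hypothetical; returning None when either group is empty is a choice made for this sketch.

```python
import statistics


def path_signature(path_genes, category_rd):
    """Return UA - DA for one signal path, or None if a group is empty.

    path_genes: dict gene -> "up" or "down" (state of the gene in the path).
    category_rd: dict gene -> relationship distance to the entity category.
    """
    shared = path_genes.keys() & category_rd.keys()  # intersection genes
    up = [category_rd[g] for g in shared if path_genes[g] == "up"]
    down = [category_rd[g] for g in shared if path_genes[g] == "down"]
    if not up or not down:
        return None
    return statistics.mean(up) - statistics.mean(down)


path = {"A": "up", "B": "up", "C": "down", "D": "down"}
rd = {"A": -2.0, "B": -1.0, "C": 1.0, "E": 0.5}
signature = path_signature(path, rd)   # UA = -1.5, DA = 1.0
converted = -signature                 # sign conversion: larger means more activated
```

In this toy case the up-regulated intersection genes lie closer to the category than the down-regulated one, so the converted value is positive.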
In addition, further data processing may be performed based on the strength of association between the subject entity category and the signal path. For example, the subject entity category which can activate the signal path and the subject entity category which does not activate the signal path can be obtained by taking the signal path as a standard; alternatively, the signal path activated by the subject entity category and the signal path not activated by the subject entity category may also be obtained by using the subject entity category as a standard.
Example seven
Fig. 7 is a flowchart illustrating a data mining method according to another embodiment of the present application, and this embodiment provides, on the basis of the third embodiment, another specific implementation of S14 for data mining based on an entity precisely related to the topic. In this embodiment, after the entities precisely related to the topic are classified in a standardized manner, the topic literature sets significantly related to the topic entity categories can be further obtained.
After the subject entities are classified in a standardized way, each subject entity category comprises a varying number of entities precisely related to the subject. If a topic entity category comprises dozens of topic entities, its contents cannot be summarized quickly and accurately by hand. In this embodiment, the main content of an entity category can be learned by finding the topic literature set significantly related to that topic entity category.
In one specific implementation, the following method may be employed to obtain a topic literature set that is significantly related to a certain topic entity category.
S741, comparing the entities contained in the subject entity category to be analyzed with the entities in each document of the subject entity library TED, and counting the average frequency (Average Frequency, AF) with which the entities contained in the subject entity category appear in each document, denoted AF1.
In this example, the average frequency with which the entities included in a subject entity category appear in a document refers to the average of the numbers of occurrences in that document of the entities included in the subject entity category.
S742, performing random simulation M times (for example, 3000 times). Each random simulation randomly reassigns the category of each topic-related entity, for example by randomly distributing the category identifiers; the entities corresponding to the identifier of the topic entity category to be analyzed are then compared with the entities of each document of the topic entity library TED, and the average frequency with which the entities contained in the category appear in each document is calculated and defined as AF2.
S743, for each topic entity category and each topic abstract document, calculating a standard score from AF1 and AF2, evaluating the relevance of each topic entity category to each topic abstract document according to the obtained standard score, and obtaining the topic document set most relevant to the topic entity category.
For more intuitive presentation, the sign of the standard scores can be converted, for example so that the values become positive and a larger value means the topic entity category is more relevant to the corresponding document. The N (e.g., 10) topic documents with the largest values may be selected to form the topic document set most relevant to the topic entity category.
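The per-document average frequency of S741 and the final top-N selection can be sketched as follows. The sketch assumes that each document is available as a list of entity mentions and that the standard scores have already been computed and sign-converted; the document identifiers, entity names, score values, and function names are hypothetical illustration data.

```python
import heapq


def average_frequency(doc_mentions, category_entities):
    """Mean number of occurrences per category entity in one document."""
    members = list(category_entities)
    return sum(doc_mentions.count(e) for e in members) / len(members)


def top_documents(scores, n=10):
    """Ids of the n documents with the largest standard scores."""
    return heapq.nlargest(n, scores, key=scores.get)


doc = ["TP53", "apoptosis", "TP53", "hypoxia"]
af1 = average_frequency(doc, ["TP53", "apoptosis"])   # (2 + 1) / 2

scores = {"d1": 2.1, "d2": 0.3, "d3": 3.4}            # illustrative values only
best = top_documents(scores, n=2)
```

AF2 would be obtained by calling average_frequency with the entities that a random reassignment places in the category, once per simulation.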
Fig. 8 illustrates an electronic device 40 of an embodiment of the present application, including a memory 42, a processor 44, and a program 46 stored in the memory 42. The program 46 is configured to be executed by the processor 44, and the processor 44 implements the aforementioned data mining method when executing the program.
The present application also provides a storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned method of data mining.
In some embodiments, the electronic device may be a user terminal device, a server, a network device, or the like, for example a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a navigation device, an in-vehicle device, a digital TV, or a desktop computer, or a single web server, a server group composed of a plurality of web servers, or a cloud based on cloud computing composed of a large number of hosts or web servers.
The memory includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. The memory stores an operating system installed in the service node device, various application software, data, and the like.
The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
All or part of the flow of the methods of the embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice. The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

  1. A data mining method, performed by an electronic device, the electronic device comprising: a memory, a processor, and a program stored in the memory, the processor configured to implement, when executing the program:
    obtaining a document entity database related to a specified topic;
    identifying genes that are precisely related to the topic based on the literature entity database; and
    performing data mining based on the literature entity database and the genes that are precisely related to the topic.
  2. The method of claim 1, wherein obtaining a document entity database related to a specified topic comprises:
    performing document retrieval in a specified document database by taking specified subject terms as keywords to obtain a subject document data set;
    performing text splitting on document contents in the theme document data set to obtain a theme entity library;
    searching documents in the specified document database by taking the specified reference words as keywords to obtain a reference document data set;
    performing text splitting on the document contents in the reference document data set to obtain a reference entity library; and
    comparing the subject entity library with a reference entity library to obtain a subject specific entity library and a subject shared entity library; the theme shared entity library comprises entities shared by the theme entity library and the reference entity library, and the theme specific entity library comprises entities left after the entities in the theme shared entity library are removed from the theme entity library.
  3. The method of claim 2, wherein the identifying genes that are precisely related to the topic based on the document entity database comprises:
    comparing the entities in the subject entity library with the genes in the specified gene library, and extracting the genes shared by the subject entity library and the specified gene library to form a primary gene entity list; and
    determining the gene entities precisely related to the subject in the primary gene entity list according to the subject literature data set, the reference literature data set and a preset screening standard for genes precisely related to the subject.
  4. The method of any one of claims 1 to 3, wherein the data mining based on the literature entity database and genes that are precisely related to the topic comprises:
    identifying entities that are precisely related to the subject based on the literature entity database and genes that are precisely related to the subject; and
    performing data mining based on the entities that are precisely related to the topic.
  5. The method of claim 4, wherein the identifying entities that are precisely related to the topic based on the literature entity database and genes that are precisely related to the topic comprises:
    extracting a summary data set containing gene entities precisely related to the subject in the subject literature data set to form a subject literature set;
    forming entities of each document content in the subject document set into candidate entities;
    obtaining a theme candidate entity library according to the candidate entity, the theme sharing entity library and the theme specific entity library;
    carrying out entity preprocessing on the topic candidate entity library to obtain a topic candidate document set; and
    filtering the candidate entities in the topic candidate document set according to a preset filtering condition to obtain entities precisely related to the topic.
  6. The method of claim 5, wherein the data mining based on entities that are precisely related to the topic comprises:
    classifying entities that are precisely related to the topic into a predetermined number of topic entity categories.
  7. The method of claim 6, wherein the data mining based on entities that are precisely related to the topic comprises: obtaining the strength of association between genes that are precisely related to the subject and the entity or subject entity class.
  8. The method of claim 6, wherein the data mining based on entities that are precisely related to the topic further comprises: performing signal path analysis of the subject entity category.
  9. The method of claim 6, wherein the data mining based on entities that are precisely related to the topic comprises: obtaining a set of topic documents that is significantly related to the topic entity category.
  10. An electronic device, comprising: a memory, a processor, and a program stored in the memory, the program configured to be executed by the processor, the processor when executing the program implementing:
    a method of data mining as claimed in any one of claims 1 to 9.
CN201980001876.9A 2019-06-28 2019-06-28 Data mining method and electronic equipment Active CN112567345B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/093713 WO2020258254A1 (en) 2019-06-28 2019-06-28 Data mining method and electronic device

Publications (2)

Publication Number Publication Date
CN112567345A (en) 2021-03-26
CN112567345B (en) 2024-06-04

Family

ID=74060708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980001876.9A Active CN112567345B (en) 2019-06-28 2019-06-28 Data mining method and electronic equipment

Country Status (2)

Country Link
CN (1) CN112567345B (en)
WO (1) WO2020258254A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8131472B2 (en) * 2004-09-28 2012-03-06 International Business Machines Corporation Methods for hierarchical organization of data associated with medical events in databases
CN102622346A (en) * 2011-01-26 2012-08-01 中国科学院上海生命科学研究院 Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database
CN103279690A (en) * 2013-06-16 2013-09-04 中国医学科学院医学信息研究所 Method for ordering medical information
CN103310129A (en) * 2013-06-13 2013-09-18 浙江大学 Evidence-based method for screening gastric cancer prognosis molecular markers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
崔雷;刘伟;闫雷;张晗;侯跃芳;黄莹娜;张浩;: "文献数据库中书目信息共现挖掘系统的开发", 现代图书情报技术, no. 08, 25 August 2008 (2008-08-25) *
李俊;周宇葵;: "数据挖掘在生物医学工程文献检索中的应用", 图书馆学研究, no. 01, 10 January 2008 (2008-01-10) *

Also Published As

Publication number Publication date
WO2020258254A1 (en) 2020-12-30
CN112567345B (en) 2024-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant