CN112100320A - Method and device for generating terms and storage medium - Google Patents

Method and device for generating terms and storage medium

Info

Publication number
CN112100320A
CN112100320A
Authority
CN
China
Prior art keywords
term
information
text
model
sample
Prior art date
Legal status
Granted
Application number
CN202010716035.0A
Other languages
Chinese (zh)
Other versions
CN112100320B (en)
Inventor
张小波
Current Assignee
Anhui Zhengnuo Intelligent Technology Co ltd
Original Assignee
Anhui Zhengnuo Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Anhui Zhengnuo Intelligent Technology Co ltd filed Critical Anhui Zhengnuo Intelligent Technology Co ltd
Priority to CN202010716035.0A priority Critical patent/CN112100320B/en
Publication of CN112100320A publication Critical patent/CN112100320A/en
Application granted granted Critical
Publication of CN112100320B publication Critical patent/CN112100320B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical



Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application discloses a method for generating a term, comprising the following steps: receiving a text to be processed; acquiring word information and gene information of the text to be processed; and generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information, wherein the term generation model is trained on the association relationships among terms, gene information and word information. Implementation of the invention can improve the accuracy of terms and reduce the subjectivity introduced by manual definition, making the generated terms better suited to wide application and popularization and promoting the development of biology and medicine.

Description

Method and device for generating terms and storage medium
Technical Field
The present application relates to the field of computers, and in particular, to a method, an apparatus, and a storage medium for generating terms.
Background
In many business fields, professional teams construct standardized terms so that practitioners in the field share a uniform understanding, which facilitates the learning and popularization of the technology. For example, the Gene Ontology in the fields of biology and chemistry provides a working platform of term definitions and word-sense interpretations representing normalized characteristics of genes and gene products, facilitating technical study and popularization for people in the biochemistry field. However, terms are mostly defined manually and organized by experts, which is inefficient and labor-intensive, and different experts may use different expressions to describe the same concept, possibly resulting in inconsistent term naming.
Disclosure of Invention
An object of the embodiments of the present specification is to provide a method, an apparatus, and a storage medium for generating a term, so as to automatically generate a term from a text to be processed, improve the accuracy of the term, and promote popularization and application in the field of biology. In one aspect, the present invention provides a method for generating a term, including:
receiving a text to be processed;
acquiring word information and gene information of the text to be processed;
generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is trained on the association relationships among terms, gene information and word information.
Further, the term generation model comprises a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
Further, the term generation model is constructed by adopting the following method:
collecting a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set;
learning on the term heterogeneous graph by using a graph convolutional neural network algorithm to construct the term coding sub-model;
and training the term decoding sub-model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text.
Further, the term decoding submodel decodes the term encoding information by using a copy mechanism to obtain a target term corresponding to the text to be processed.
Further, the constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set includes:
the nodes in the term heterogeneous graph are the term names, gene information or abstract information of each sample text in the sample data set, and the edges in the term heterogeneous graph carry word normalization values or gene-term values, wherein a word normalization value represents the normalized count of a word in a sample text, and a gene-term value represents the similarity between a gene and a term in a sample text.
In another aspect, the present invention provides a method for constructing a term generation model, including:
constructing a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
learning on the term heterogeneous graph by using a graph convolutional neural network algorithm to construct a term coding sub-model in the term generation model;
and training the term decoding sub-model in the term generation model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text.
In another aspect, the present invention provides a term generation apparatus, including:
the text receiving module is used for receiving the text to be processed;
the information acquisition module is used for acquiring word information and gene information of the text to be processed;
the term generation module is used for generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and gene information of the text to be processed; wherein the term generation model is trained on the association relationships among terms, gene information and word information.
Further, the term generation model comprises a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
In another aspect, the present invention provides a term generation apparatus, including:
the data set construction module is used for constructing a sample data set, and the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
the heterogeneous graph construction module is used for constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
the coding sub-model building module is used for learning on the term heterogeneous graph by using a graph convolutional neural network algorithm to build a term coding sub-model in the term generation model;
and the decoding sub-model building module is used for training the term decoding sub-model in the term generation model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text.
In still another aspect, the present invention provides a term generation processing apparatus, including: at least one processor and a memory for storing processor-executable instructions, the processor implementing the method described above when executing the instructions.
The term generation method, device and storage medium provided by the embodiment of the application have the following technical effects:
the term generation method, the term generation device and the storage medium provided by the disclosure can acquire corresponding word information and gene information according to a text to be processed provided by a user, and generate a target term corresponding to the text to be processed by using a pre-constructed term generation model, so that the term accuracy can be improved, the objectivity of term generation due to manual definition is reduced, the term generation method and the term generation device are more suitable for wide application and popularization, and the development of biology and medicine is promoted.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic representation of the term named "regulating cell growth" and its associated genes, with their aliases and descriptions, as provided in the examples of the present application;
fig. 2 is a flowchart of a method for generating terms according to an embodiment of the present application;
FIG. 3 is a diagram of a framework of a term generation model provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for constructing a term generation model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a term generation apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another term generation apparatus provided in an embodiment of the present application;
fig. 7 is a block diagram of a hardware structure of a server in a term generation method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Gene Ontology (GO) is a widely used biological ontology that contains a large number of terms describing gene function in terms of molecular function, biological process, and cellular component. These terms are organized hierarchically, like a tree, and can be used to annotate genes; FIG. 1 shows the term named "regulating cell growth" provided in the examples of this application, along with a schematic representation of the relevant genes with their aliases and descriptions. The Gene Ontology has been widely studied in the biomedical and biological research fields because of its great application value in protein function analysis, disease association prediction, and the like.
The terms in the Gene Ontology are widely used in biology and biomedicine. Most previous studies have focused on inferring new GO terms, while the names of terms, which reflect gene functions, are still assigned by experts.
One of the main concerns of GO is its construction, including the discovery, naming and organization of terms. In early studies, terms were manually defined and organized by experts in a particular field of biology; given the large amount of biological literature published each year, this manner of naming is inefficient. In addition, different experts may use different expressions to describe the same biological concept, resulting in inconsistent nomenclature and in mismatches between material and names across different publications.
Recently, many researchers have turned to developing methods for constructing the GO structure. A network-extracted ontology has been proposed that hierarchically clusters genes based on their connectivity in the molecular network and recovers about 40% of the vocabulary based on the alignment between the network-extracted ontology and GO. To further improve performance, gene clusters regarded as terms were identified in a complete biological network. Although these methods automatically infer new terms and their relationships based on structured networks, new terms are still named manually by experts, which remains prone to inefficiency and inconsistency.
In order to automatically acquire term names and thereby facilitate term construction, the present invention proposes a new method of generating term names based on the text information of the relevant genes. An example of this work is shown in FIG. 1: the term with the label 0001, consisting of the genes IGFBP3, OGFR and BAP1, is named "regulating cell growth". Because there is some overlap between term names and gene text (aliases and descriptions), the goal of embodiments of the present specification is to generate term names based on gene text.
Therefore, the invention provides a method and a device for generating terms and a storage medium; that is, a method and a device that generate term names for GO and establish a large-scale reference data set. In addition, a graph-based generative model is proposed that integrates the relationships between genes, words and terms for the generation of term names. Fig. 2 is a flowchart of a term generation method provided in an embodiment of the present application. As shown in fig. 2:
s102, receiving a text to be processed.
In a specific implementation process, the text to be processed may be the gene composition of a term, the gene-library serial number corresponding to the genes, or the like. It can be understood that the text to be processed characterizes the genes of the term in some form of expression. Illustratively, the text to be processed may be the gene label 0001, the gene composition IGFBP3, OGFR and BAP1, or an expert-assigned gene name.
The text to be processed may include one text subset to be processed, or may include a plurality of text subsets to be processed, which may be specifically set according to actual needs, and embodiments of the present specification are not specifically limited.
And S104, acquiring word information and gene information of the text to be processed.
In a specific implementation process, the associated word information and gene information may be obtained according to the text to be processed. It can be understood that, since the gene compositions, descriptions and codes are already stored in the data set and associated with each other, the method can obtain all the associated information from the text to be processed, where that information at least includes word information and gene information. It should be noted that the data set may be used for information acquisition before the GO term name is generated.
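The lookup in S104 can be sketched as a simple association query against a pre-built data set. The data set contents, field names and helper function below are illustrative assumptions for the FIG. 1 example, not the patent's actual schema:

```python
# Hypothetical sketch of S104: given a text to be processed (here a term
# label), look up the associated gene information and word information from
# a pre-built data set. All records and names are illustrative.
DATASET = {
    "0001": {
        "genes": ["IGFBP3", "OGFR", "BAP1"],
        "descriptions": {
            "IGFBP3": "insulin-like growth factor binding protein 3",
            "OGFR": "opioid growth factor receptor",
            "BAP1": "binding protein 1",
        },
    },
}

def get_word_and_gene_info(text_to_process):
    """Return (gene_info, word_info) for a term label in the data set."""
    record = DATASET[text_to_process]
    gene_info = record["genes"]
    # Word information: tokens drawn from the alias/description text of each gene.
    word_info = []
    for gene in gene_info:
        word_info.extend(record["descriptions"][gene].split())
    return gene_info, word_info

genes, words = get_word_and_gene_info("0001")
print(genes)  # → ['IGFBP3', 'OGFR', 'BAP1']
```

Both pieces of information are then passed to the term generation model in S106.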
S106, generating a target term corresponding to the text to be processed by utilizing a pre-constructed term generation model according to word information and gene information of the text to be processed; wherein the term generation model is obtained based on the training of the incidence relation among the terms, the gene information and the word information.
In a specific implementation, a pre-constructed term generation model is used, which integrates the potential relationships between genes, words and terms to generate term names; extensive experiments demonstrate the effectiveness of this model. The target term corresponding to the text to be processed is therefore generated by the pre-constructed term generation model.
The term generation method, device and storage medium provided by the present disclosure can acquire the corresponding word information and gene information from a text to be processed provided by a user, and generate the target term corresponding to that text by using a pre-constructed term generation model. This improves term accuracy, reduces the subjectivity introduced by manual definition, makes the generated terms better suited to wide application and popularization, and promotes the development of biology and medicine.
On the basis of the above embodiments, in an embodiment of this specification, the term generation model includes a term coding sub-model and a term decoding sub-model, the term coding sub-model is used to generate term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used to decode the term coding information to obtain a target term corresponding to the text to be processed.
In a specific implementation process, the term generation model includes a term coding sub-model and a term decoding sub-model, wherein the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
Illustratively, when the text to be processed is the gene label 0001, the related gene composition (gene information: IGFBP3, OGFR and BAP1) and the corresponding explanations (word information: IGFBP3 has the alias insulin-like growth factor binding protein 3 and the description of altering the interaction with cell surface receptors; OGFR has the alias and description opioid growth factor receptor; BAP1 has the alias binding protein 1 and the description of participating in regulation of the cell cycle) can be obtained from the gene label. The gene composition (gene information) and the corresponding explanations (word information) are analyzed by the pre-constructed term generation model to obtain the term coding information of the text to be processed, and the term coding information is then decoded to obtain the target term corresponding to the text to be processed. It can thus be concluded that the term name for the gene set with label 0001 may be "regulation of cell growth".
Through the arrangement of the term coding sub-model and the term decoding sub-model, the term generation method provided by the embodiments of the specification can accelerate term generation and reduce the amount of information processed by a single model; separating the model that generates the term coding information from the model that decodes it can also increase the confidentiality of the invention.
On the basis of the above embodiments, in an embodiment of the present specification, the term generation model is constructed by the following method:
collecting a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set;
learning on the term heterogeneous graph by using a graph convolutional neural network algorithm to construct the term coding sub-model;
and training the term decoding sub-model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text.
On the basis of the above embodiments, in an embodiment of the present specification, the term decoding submodel decodes the term encoding information by using a copy mechanism to obtain a target term corresponding to the text to be processed.
On the basis of the foregoing embodiment, in an embodiment of the present specification, the constructing a term heterogeneous graph according to the term name, the gene information, and the abstract information of each sample text in the sample data set includes:
the nodes in the term heterogeneous graph are the term names, gene information or abstract information of each sample text in the sample data set, and the edges in the term heterogeneous graph carry word normalization values or gene-term values, wherein a word normalization value represents the normalized count of a word in a sample text, and a gene-term value represents the similarity between a gene and a term in a sample text.
In a specific implementation, the sample data set is a large-scale data set that contains the ontology terms for Homo sapiens (human). The sample data set may be built by collecting the term id, term name and corresponding gene ids from the Gene Ontology Consortium. In addition, the gene aliases and descriptions may come from GeneCards, which contains information from the Universal Protein Resource (UniProt).
Exemplarily, a sample data set is established: the sample data set can be constructed by collecting samples with defined term names using big data. The sample data set may comprise a plurality of samples, each containing a term id, a term name and the associated genes with their aliases and descriptions, as shown in fig. 1.
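One such sample can be sketched as follows; the record layout and gene text are illustrative assumptions modeled on the FIG. 1 example, and the snippet also shows the word overlap between a term name and its gene text that the copy mechanism later exploits:

```python
# Illustrative structure of one sample: term id, expert-assigned term name,
# and associated genes with alias and description text. Field names are
# assumptions, not the patent's actual schema.
sample = {
    "term_id": "0001",
    "term_name": "regulation of cell growth",
    "genes": [
        {"symbol": "IGFBP3", "alias": "insulin-like growth factor binding protein 3",
         "description": "alters the interaction with cell surface receptors"},
        {"symbol": "OGFR", "alias": "opioid growth factor receptor",
         "description": "opioid growth factor receptor"},
        {"symbol": "BAP1", "alias": "binding protein 1",
         "description": "participates in regulation of the cell cycle"},
    ],
}

def gene_text(sample):
    """Concatenate the alias and description text of all genes in a sample."""
    parts = []
    for g in sample["genes"]:
        parts.append(g["alias"])
        parts.append(g["description"])
    return " ".join(parts)

# A notable fraction of term-name words also appears in the gene text.
overlap = set(sample["term_name"].split()) & set(gene_text(sample).split())
print(sorted(overlap))  # → ['cell', 'growth', 'of', 'regulation']
```

In this toy sample every term-name word happens to occur in the gene text; in practice the overlap is partial, which is why the model mixes copying with generation from the vocabulary.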
On the other hand, the present invention provides a method for constructing a term generation model, as shown in fig. 4, fig. 4 is a method for constructing a term generation model provided in an embodiment of the present application, and the method includes:
s402, constructing a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
s404, constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
s406, learning on the term heterogeneous graph by using a graph convolutional neural network algorithm to construct a term coding sub-model in the term generation model;
s408, training the term decoding sub-model in the term generation model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text.
In a specific implementation, analysis of the samples shows that a relatively high proportion of words is shared between a term name and its associated genes, indicating the possibility of generating term names from the textual information of the genes. Moreover, patterns such as "regulation of" often appear in term names.
The overall architecture of the graph-based generative model of the embodiments of the present specification is shown in fig. 3, which is a framework diagram of the graph-based generative model provided by an embodiment of the present application. It is composed of two parts: a GCN (graph convolutional network) based encoder on the left side of fig. 3 and a graph-attention based decoder on the right side of fig. 3.
First, a heterogeneous graph is constructed based on the data set, and then representation learning is performed using a graph convolutional network; the GCN-based encoder aims to encode the relationships between genes, words and terms to facilitate the generation of term names.
The nodes in the term heterogeneous graph are words, genes and terms, and the edges reflect the relationships between them. Note that the words, genes and terms can all be derived from the gene text. There are two types of edges: gene-word edges and term-gene edges. The value of a gene-word edge is the normalized count of the word in the gene text, while the value of a term-gene edge is 1 if the gene can be annotated with the term.
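Under these edge definitions, the adjacency matrix of a toy heterogeneous graph can be sketched as follows; the node ordering, input format and single-gene example are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the term heterogeneous graph's adjacency matrix.
# Nodes are ordered [term | genes | words]; a gene-word edge carries the
# normalized count of the word in that gene's text, and a term-gene edge
# is 1 when the gene is annotated with the term.
def build_adjacency(gene_texts, term_genes, vocab):
    genes = sorted(gene_texts)
    n = 1 + len(genes) + len(vocab)
    A = np.zeros((n, n))
    for gi, g in enumerate(genes):
        node_g = 1 + gi
        if g in term_genes:                    # term-gene edge, value 1
            A[0, node_g] = A[node_g, 0] = 1.0
        tokens = gene_texts[g].split()
        for wi, w in enumerate(vocab):         # gene-word edges
            count = tokens.count(w)
            if count:
                weight = count / len(tokens)   # normalized word count
                node_w = 1 + len(genes) + wi
                A[node_g, node_w] = A[node_w, node_g] = weight
    return A

A = build_adjacency(
    gene_texts={"OGFR": "opioid growth factor receptor"},
    term_genes={"OGFR"},
    vocab=["growth", "receptor"],
)
print(A.shape)  # → (4, 4)
```

The symmetric matrix produced here is the A that enters the GCN encoder described below in the patent text.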
The initial representation of a word node is the embedding of the word.
As shown in fig. 3, for gene nodes, the aliases and descriptions of the genes, encoded by a GRU (Gated Recurrent Unit) neural network model, are used as the initial representations. For term nodes, pooling of the related gene node representations is adopted as the initial representation. The nodes are then updated through a GCN model, whose modeling of the structural information is given by the following formula:
H = Â · ReLU(Â X W(0)) · W(1)

wherein

Â = D̃^(-1/2) (A + I) D̃^(-1/2)

A is the adjacency matrix of the graph, I is the identity matrix, and D̃ is the degree matrix of A + I. X is the matrix of initial node representations, denoted X = (t, g1, ..., gm, w1, ..., wn), where gi, wi and t represent the initial representations of the i-th gene, the i-th word and the term, respectively. W(0) and W(1) are the weight matrices of the first and second layers of the GCN.
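A minimal numerical sketch of the two-layer GCN propagation follows, using the standard symmetrically normalized adjacency; since the patent's own formula survives only as an image, this follows the common GCN formulation and may differ from the patent's variant in details:

```python
import numpy as np

# Sketch of a two-layer GCN forward pass. X stacks the initial node
# representations (t, g1..gm, w1..wn); A is the graph adjacency matrix.
def gcn_forward(A, X, W0, W1):
    A_tilde = A + np.eye(A.shape[0])             # add self-loops: A + I
    D = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(D ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    H = np.maximum(A_hat @ X @ W0, 0.0)          # first layer + ReLU
    return A_hat @ H @ W1                        # second layer

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])           # toy 2-node graph
X = rng.standard_normal((2, 4))
Z = gcn_forward(A, X, rng.standard_normal((4, 8)), rng.standard_normal((8, 3)))
print(Z.shape)  # → (2, 3)
```

The rows of Z are the updated node representations (term, gene and word nodes) that the decoder later attends over.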
Based on the proven effectiveness of attention mechanisms in generation, the present invention employs a graph-attention based decoder (the term decoding sub-model) to generate term names. The decoder attends over the word node representations produced by the GCN, computing a context vector:

at = Σj aj w′j

aj = softmax(vT tanh(Wa[ht-1; w′j]))

where ht-1 is the previous hidden state, w′j is the word node representation from the GCN, v is a parameter vector, and Wa is a parameter matrix.
In view of the word overlap between the gene text and the term names, the embodiments of the present specification use a copy mechanism for decoding, which makes it possible to generate words either from the vocabulary of the training set or from the current gene text. The initial hidden state h0 is the term node representation (t′) obtained by the GCN, and the hidden state update is:

ht = f([ht-1; wt-1; at; w′SR])

where f is the RNN function, wt-1 is the embedding of the previously generated word, and w′SR is the Selective Read (SR) vector of CopyNet. When the previously generated word appears in the gene text, the next word may also come from the gene text, so w′SR is that word's node representation; otherwise it is a zero vector.
The probability of generating a target word yt is calculated as a mixture of the probabilities of the generation mode and the copy mode:

p(yt) = p(yt, g) + p(yt, c)

p(yt, g) = (1/Z) e^ψg(yt), for yt ∈ V

p(yt, c) = (1/Z) Σ j: sj = yt e^ψc(sj)

where ψg and ψc are the scoring functions of the generation mode and the copy mode, respectively, Z is the shared normalization term, V represents the vocabulary of the training set, and S represents the source words in the gene text. It should be noted that many fixed patterns occur in the term names mentioned above; therefore, the frequent two-word and three-word patterns are extracted and treated as new words when generating terms.
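The mixture of generation and copy probabilities can be sketched as follows, in the CopyNet style of a normalizer shared between vocabulary and source words; since the patent's scoring functions appear only as images, the scores here are arbitrary placeholder values:

```python
import numpy as np

# Sketch of the copy-mechanism output distribution: the probability of a
# target word mixes a generation score over the vocabulary V with a copy
# score over the source words S, normalized jointly (CopyNet-style).
def mixture_probs(gen_scores, copy_scores, vocab, source_words):
    exp_gen = {w: np.exp(s) for w, s in zip(vocab, gen_scores)}
    exp_copy = {}
    for w, s in zip(source_words, copy_scores):
        exp_copy[w] = exp_copy.get(w, 0.0) + np.exp(s)  # sum copies of w
    Z = sum(exp_gen.values()) + sum(exp_copy.values())   # shared normalizer
    words = set(vocab) | set(source_words)
    return {w: (exp_gen.get(w, 0.0) + exp_copy.get(w, 0.0)) / Z
            for w in words}

p = mixture_probs(
    gen_scores=[0.2, 1.0, -0.5],
    copy_scores=[0.7, 0.3],
    vocab=["regulation", "of", "cell"],
    source_words=["cell", "growth"],  # "growth" is out of vocabulary:
)                                     # it can only be copied
print(abs(sum(p.values()) - 1.0) < 1e-9)  # → True
```

Note how "growth", absent from the vocabulary, still receives probability mass through the copy mode, which is exactly what allows gene-text words to appear in generated term names.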
The sample data set may be divided into a training set, a validation set and a test set in the ratio 8:1:1. The embodiments of the present specification may adopt the bilingual evaluation understudy (BLEU) and ROUGE-1, ROUGE-2 and ROUGE-L evaluation indexes to assess the generation task. Word embeddings are initialized from N(0,1) with a dimension of 300 and are updated during training. The dimension of the hidden units of the GRU and GCN is 300. Xavier initialization is adopted for the parameters, and the dropout rate is set to 0.5. Training uses the Adam optimizer (2014) with a learning rate of 1e-3.
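The 8:1:1 split can be sketched as follows; the shuffling seed and sample contents are illustrative:

```python
import random

# Sketch of splitting the sample data set into training, validation and
# test sets in the ratio 8:1:1. The seed is an arbitrary choice made here
# so the split is reproducible.
def split_dataset(samples, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # → 80 10 10
```

Metrics such as BLEU and ROUGE are then computed on the held-out test portion only.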
To evaluate the effectiveness of the model proposed by the examples herein, the examples compare against advanced baselines in two categories: (1) TF-IDF; (2) LexRank; (3) Seq2Seq; (4) HRNNLM; (5) Transformer. The first two are extractive models that extract words from the gene text as term names; the latter three are generative models that generate words from the vocabulary space as term names.
The results show that the generative models outperform the extractive models in the language probability of the generated text, so the generated term names are more coherent. Moreover, the extractive models usually extract keywords independently, which makes it difficult to form a complete and concise term name. By integrating the relationships between genes, words and terms into the generation process, the graph-based generative model of the embodiments of the present specification achieves the best results in all cases.
Other generative models carry unnecessary information across multiple gene sequences, which may adversely affect the generation of term names. Through research, the embodiments of the present specification find that when frequent patterns are treated as new words during generation and then restored afterwards, performance can be further improved. In addition, the copy mechanism helps to improve generation performance; in particular, the BLEU score confirms the effectiveness of generating term names using words shared between genes and terms.
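The frequent-pattern trick described above (merge a frequent multi-word pattern into one token, generate, then restore) can be sketched as follows; the underscore-joined merge token and the min_count threshold are illustrative assumptions:

```python
from collections import Counter

def find_frequent_patterns(token_lists, min_count=2):
    """Count two- and three-word patterns across tokenized names and
    keep those occurring at least `min_count` times."""
    counts = Counter()
    for toks in token_lists:
        for n in (2, 3):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return {p for p, c in counts.items() if c >= min_count}

def merge_patterns(tokens, patterns):
    """Greedily replace frequent patterns (longest first) with a
    single underscore-joined token, so the generator treats the
    pattern as one new word."""
    out, i = [], 0
    while i < len(tokens):
        for n in (3, 2):  # prefer the longer pattern
            if tuple(tokens[i:i + n]) in patterns:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

names = [["positive", "regulation", "of", "transcription"],
         ["negative", "regulation", "of", "transcription"]]
patterns = find_frequent_patterns(names, min_count=2)
merged = merge_patterns(names[0], patterns)
# Restoration: splitting on '_' recovers the original words.
restored = [w for tok in merged for w in tok.split("_")]
```

Here "regulation of transcription" recurs across names, so it is merged into one token during generation and split back afterwards.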
The embodiments of the present specification provide a GO-based automatic term generation method, a constructed sample data set, and the term coding and term decoding sub-models. Experimental results show that, by modeling the relationships among genes, words, and terms, the term generation model provided by the embodiments of the present specification outperforms other strong models.
The conventional generation model uses only the sequential information of the source text when generating sentences, and ignores latent structures in the text. To solve this problem, the embodiments of the present specification construct a heterogeneous graph with words, genes, and terms as nodes, and generate term names using a graph-based generation model.
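A minimal sketch of such a heterogeneous graph construction follows; the TF-IDF-style word weights and the Jaccard gene-term similarity are assumptions standing in for the word normalization values and gene-term values the embodiments describe, and the sample fields are hypothetical:

```python
import math
from collections import defaultdict

def build_hetero_graph(samples):
    """Build a heterogeneous graph over gene, word, and term nodes.

    `samples` is a list of dicts with keys "gene", "term", and
    "words" (the summary tokens).  Gene-word and term-word edges
    carry a TF-IDF-style word normalization value; gene-term edges
    carry a word-overlap (Jaccard) similarity.
    """
    # Document frequency of each word over the samples.
    df = defaultdict(int)
    for s in samples:
        for w in set(s["words"]):
            df[w] += 1
    n_docs = len(samples)
    edges = {}  # (node_a, node_b) -> weight
    for s in samples:
        gene, term, words = s["gene"], s["term"], s["words"]
        for w in set(words):
            tfidf = (words.count(w) / len(words)) * \
                math.log((n_docs + 1) / (df[w] + 1)) + 1e-9
            edges[(gene, "w:" + w)] = tfidf
            edges[(term, "w:" + w)] = tfidf
        # Gene-term edge: overlap between term words and summary words.
        term_words = set(term.lower().split())
        sim = len(term_words & set(words)) / max(len(term_words | set(words)), 1)
        edges[(gene, term)] = sim
    return edges

edges = build_hetero_graph([
    {"gene": "BRCA1", "term": "DNA repair",
     "words": ["dna", "repair", "pathway"]},
])
```

The resulting weighted edge map is the kind of structure a graph encoder can consume as an adjacency.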
On the other hand, an embodiment of the present specification provides a term generation device, as shown in fig. 5, fig. 5 is a schematic structural diagram of a term generation device provided in an embodiment of the present application, and includes:
a text receiving module 510, configured to receive a text to be processed;
an information obtaining module 520, configured to obtain word information and gene information of the text to be processed;
a term generating module 530, configured to generate a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and gene information of the text to be processed; wherein the term generation model is obtained by training based on the association relation among the terms, the gene information, and the word information.
On the basis of the above embodiments, in an embodiment of this specification, the term generation model includes a term coding sub-model and a term decoding sub-model, the term coding sub-model is used to generate term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used to decode the term coding information to obtain a target term corresponding to the text to be processed.
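The term coding sub-model above learns over the heterogeneous graph with a convolutional neural network. A single graph-convolution layer in the style of Kipf and Welling can be sketched in pure Python as below; the toy path-graph adjacency and identity weight matrix are purely illustrative:

```python
import math

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W), where
    A_hat is the symmetrically normalized adjacency with self-loops.
    Pure-Python lists of lists stand in for tensors."""
    n = len(adj)
    # Add self-loops and compute node degrees.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    a_hat = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    # Aggregate neighbor features: A_hat @ H.
    agg = [[sum(a_hat[i][k] * features[k][j] for k in range(n))
            for j in range(len(features[0]))] for i in range(n)]
    # Linear transform and ReLU: (A_hat @ H) @ W.
    out_dim = len(weight[0])
    return [[max(0.0, sum(agg[i][k] * weight[k][j]
                          for k in range(len(weight))))
             for j in range(out_dim)] for i in range(n)]

# Tiny example: 3 nodes (a gene, a word, a term) on a path graph.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weight for illustration
h1 = gcn_layer(adj, feats, w)
```

Stacking such layers lets each node's representation mix in information from its word, gene, and term neighbors, which is then handed to the decoder as term coding information.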
In another aspect, an embodiment of the present specification provides a term generation apparatus, including:
a data set constructing module 610, configured to construct a sample data set, where the sample data set includes a plurality of sample texts and the term names, gene information, and abstract information corresponding to the sample texts;
the heterogeneous graph construction module 620 is configured to construct a term heterogeneous graph according to terms, gene information, and word information in each sample text in the sample data set;
the coding sub-model building module 630 is configured to learn the term heterogeneous graph by using a convolutional neural network algorithm, and build a term coding sub-model in the term generation model;
a decoding submodel constructing module 640, configured to train and construct the term decoding submodel in the term generation model according to the term coding information of each sample text in the sample data set by the term coding submodel and the term name of each sample text in the sample data set.
The device is based on the same conception as the method described above and is not described in detail herein.
In another aspect, an embodiment of the present specification provides a term generation processing apparatus, including: at least one processor and a memory for storing processor-executable instructions, the processor implementing the method described above when executing the instructions.
The memory may be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required by functions, and the like; the data storage area may store data created according to the use of the apparatus, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
Since the technical effects of the term generation apparatus and the processing device are the same as those of the term generation method, the description thereof is omitted.
The method provided by the embodiments of the present application may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Taking running on a server as an example, fig. 7 is a hardware structure block diagram of a server for the term generation method provided in the embodiments of the present application. As shown in fig. 7, the server 700 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 710 (the processor 710 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 730 for storing data, and one or more storage media 720 (e.g., one or more mass storage devices) for storing an application 723 or data 722. The memory 730 and the storage medium 720 may be transient storage or persistent storage. The program stored in the storage medium 720 may include one or more modules, each of which may include a series of instruction operations for the server. Still further, the central processor 710 may be configured to communicate with the storage medium 720 and execute on the server 700 the series of instruction operations in the storage medium 720. The server 700 may also include one or more power supplies 760, one or more wired or wireless network interfaces 750, one or more input/output interfaces 740, and/or one or more operating systems 721, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The input/output interface 740 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the server 700. In one example, the input/output Interface 740 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the input/output interface 740 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 7 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 700 may also include more or fewer components than shown in FIG. 7, or have a different configuration than shown in FIG. 7.
Embodiments of the present application further provide a storage medium, which may be disposed in a server to store at least one instruction or at least one program for implementing a term generation method in the method embodiments, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the term generation method provided in the method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, device and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware to implement the above embodiments, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for generating terms, the method comprising:
receiving a text to be processed;
acquiring word information and gene information of the text to be processed;
generating a target term corresponding to the text to be processed by utilizing a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training based on the association relation among the terms, the gene information and the word information.
2. The method according to claim 1, wherein the term generation model includes a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
3. The method of claim 2, wherein the term generation model is constructed using the following method:
collecting a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct the term coding sub-model;
and training and constructing the term decoding submodel according to the term coding information of each sample text in the sample data set and the term name of each sample text in the sample data set by the term coding submodel.
4. The method of claim 2, wherein the term decoding submodel decodes the term encoding information using a copy mechanism to obtain a target term corresponding to the text to be processed.
5. The method according to claim 3, wherein the constructing a term heterogeneous graph according to the term names, the gene information and the abstract information of each sample text in the sample data set comprises:
the nodes in the term heterogeneous graph are the term names, gene information or abstract information of each sample text in the sample data set, and the edges in the term heterogeneous graph are word normalization values or gene-term values, wherein the word normalization values represent normalization values of words in the sample text, and the gene-term values are used for representing the similarity between genes and terms in the sample text.
6. A method for constructing a term generation model is characterized by comprising the following steps:
constructing a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to terms, gene information and word information in each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct a term coding sub-model in the term generation model;
and training and constructing the term decoding submodel in the term generation model according to the term coding information of each sample text in the sample data set and the term name of each sample text in the sample data set by the term coding submodel.
7. A term generation apparatus, comprising:
the text receiving module is used for receiving the text to be processed;
the information acquisition module is used for acquiring word information and gene information of the text to be processed;
the term generation module is used for generating a target term corresponding to the text to be processed by utilizing a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training based on the association relation among the terms, the gene information and the word information.
8. The apparatus of claim 7, wherein the term generation model comprises a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
9. A term generation apparatus, comprising:
the data set construction module is used for constructing a sample data set, and the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
the heterogeneous graph construction module is used for constructing a term heterogeneous graph according to terms, gene information and word information in each sample text in the sample data set;
the coding sub-model building module is used for learning the term heterogeneous graph by using a convolutional neural network algorithm to build a term coding sub-model in the term generation model;
and the decoding sub-model building module is used for training and building the term decoding sub-model in the term generation model according to the term coding information of each sample text in the sample data set and the term name of each sample text in the sample data set by the term coding sub-model.
10. A term generation processing device, comprising: at least one processor and a memory for storing processor-executable instructions, the processor implementing the method of any one of claims 1-6 when executing the instructions.
CN202010716035.0A 2020-07-23 2020-07-23 Term generating method, device and storage medium Active CN112100320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010716035.0A CN112100320B (en) 2020-07-23 2020-07-23 Term generating method, device and storage medium


Publications (2)

Publication Number Publication Date
CN112100320A true CN112100320A (en) 2020-12-18
CN112100320B CN112100320B (en) 2023-09-26

Family

ID=73750036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010716035.0A Active CN112100320B (en) 2020-07-23 2020-07-23 Term generating method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112100320B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004310688A (en) * 2003-04-10 2004-11-04 Genaris Inc Gene structure identification method of prokaryote and estimation method of microorganism from which dna fragment derives
US20130218849A1 (en) * 2012-01-31 2013-08-22 Tata Consultancy Services Limited Automated dictionary creation for scientific terms
CN106919689A (en) * 2017-03-03 2017-07-04 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
US20180102062A1 (en) * 2016-10-07 2018-04-12 Itay Livni Learning Map Methods and Systems
CN109325226A (en) * 2018-09-10 2019-02-12 广州杰赛科技股份有限公司 Term extraction method, apparatus and storage medium based on deep learning network
US20190122145A1 (en) * 2017-10-23 2019-04-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for extracting information


Also Published As

Publication number Publication date
CN112100320B (en) 2023-09-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: F2-2, Building 2, Science and Technology Business Incubator, Huainan Hi tech Zone, Anhui Province 232000
Applicant after: Anhui Midu Intelligent Technology Co.,Ltd.
Address before: 232000 1st floor, Building 3, Science and Technology Business Incubator, High tech Zone, Huainan City, Anhui Province
Applicant before: Anhui zhengnuo Intelligent Technology Co.,Ltd.
GR01 Patent grant