CN112100320A - Method and device for generating terms and storage medium - Google Patents
- Publication number
- CN112100320A (application number CN202010716035.0A)
- Authority
- CN
- China
- Prior art keywords
- term
- information
- text
- model
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application discloses a method for generating a term, which comprises the following steps: receiving a text to be processed; acquiring word information and gene information of the text to be processed; and generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed, wherein the term generation model is obtained by training on the association relationships among terms, gene information and word information. Implementing the invention can improve the accuracy of terms, reduce the subjectivity introduced by manual definition, make the generated terms more suitable for wide application and popularization, and promote the development of biology and medicine.
Description
Technical Field
The present application relates to the field of computers, and in particular, to a method, an apparatus, and a storage medium for generating terms.
Background
In many business fields, professional teams construct standardized terms so that practitioners in the field can communicate consistently, which facilitates the learning and popularization of technology. For example, the Gene Ontology in the fields of biology and chemistry provides a working platform of normalized terms and word-sense interpretations describing gene and gene product characteristics, facilitating technical study and popularization in the biochemistry field. However, such terms are mostly defined and organized manually by experts, which is inefficient and labor-intensive, and different experts may use different expressions to describe the same concept, possibly leading to inconsistent term naming.
Disclosure of Invention
An object of the embodiments of the present specification is to provide a method, an apparatus, and a storage medium for generating a term, so as to automatically generate a term from a text to be processed, improve the accuracy of the term, and promote popularization and application in the field of biology. In one aspect, the present invention provides a method for generating a term, including:
receiving a text to be processed;
acquiring word information and gene information of the text to be processed;
generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training on the association relationships among terms, gene information and word information.
Further, the term generation model comprises a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
Further, the term generation model is constructed by adopting the following method:
collecting a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct the term coding sub-model;
and training and constructing the term decoding sub-model according to the term coding information, generated by the term coding sub-model, of each sample text in the sample data set and the term name of each sample text in the sample data set.
Further, the term decoding submodel decodes the term encoding information by using a copy mechanism to obtain a target term corresponding to the text to be processed.
Further, the constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set includes:
the nodes in the term heterogeneous graph are the term names, gene information or abstract information of each sample text in the sample data set, and the edges in the term heterogeneous graph are word normalization values or gene-term values, wherein a word normalization value represents the normalized value of a word in the sample text, and a gene-term value represents the similarity between a gene and a term in the sample text.
In another aspect, the present invention provides a method for constructing a term generation model, including:
constructing a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct a term coding sub-model in the term generation model;
and training and constructing the term decoding sub-model in the term generation model according to the term coding information, generated by the term coding sub-model, of each sample text in the sample data set and the term name of each sample text in the sample data set.
In another aspect, the present invention provides a term generation apparatus, including:
the text receiving module is used for receiving the text to be processed;
the information acquisition module is used for acquiring word information and gene information of the text to be processed;
the term generation module is used for generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training on the association relationships among terms, gene information and word information.
Further, the term generation model comprises a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
In another aspect, the present invention provides a term generation apparatus, including:
the data set construction module is used for constructing a sample data set, and the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
the heterogeneous graph construction module is used for constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
the coding sub-model building module is used for learning the term heterogeneous graph by using a convolutional neural network algorithm to build a term coding sub-model in the term generation model;
and the decoding sub-model building module is used for training and building the term decoding sub-model in the term generation model according to the term coding information, generated by the term coding sub-model, of each sample text in the sample data set and the term name of each sample text in the sample data set.
In still another aspect, the present invention provides a term generation processing apparatus, including: at least one processor and a memory for storing processor-executable instructions, the processor implementing the method described above when executing the instructions.
The term generation method, device and storage medium provided by the embodiment of the application have the following technical effects:
the term generation method, the term generation device and the storage medium provided by the disclosure can acquire corresponding word information and gene information according to a text to be processed provided by a user, and generate a target term corresponding to the text to be processed by using a pre-constructed term generation model, so that the term accuracy can be improved, the objectivity of term generation due to manual definition is reduced, the term generation method and the term generation device are more suitable for wide application and popularization, and the development of biology and medicine is promoted.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of a term named "regulating cell growth" and its associated genes with alias names and descriptions, provided in an embodiment of the present application;
fig. 2 is a flowchart of a method for generating terms according to an embodiment of the present application;
FIG. 3 is a diagram of a framework of a term generation model provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for constructing a term generation model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a term generation apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another term generation apparatus provided in an embodiment of the present application;
fig. 7 is a block diagram of a hardware structure of a server in a term generation method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Gene Ontology (GO) is a widely used biological Ontology that contains a number of terms that describe Gene function in terms of molecular function, biological processes, and cellular components. These terms are hierarchically organized like a tree and can be used to annotate genes, as shown in FIG. 1, which is a term named "regulating cell growth" provided in the examples of this application, along with a schematic representation of the relevant genes with their alias names and descriptions. The gene ontology has been widely studied in the biomedical and biological research fields due to its great application value in the aspects of protein function analysis, disease association prediction, and the like.
The terms in gene ontology are widely used in biology and biomedicine, most of the previous studies have focused on inferring new GO terms, while the names of terms reflecting gene functions are still named by experts.
One of the main concerns of GO is its construction, including the discovery, naming and organization of terms. In early studies, these terms were manually defined and organized by experts in particular fields of biology; given the large volume of biological literature published each year, this manual approach is inefficient. In addition, different experts may use different expressions to describe the same biological concept, resulting in inconsistent nomenclature across different publications.
Recently, many researchers have turned to methods for automatically building the GO structure. A net-abstracted ontology has been proposed that hierarchically clusters genes based on their connectivity in the molecular network and recovers about 40% of the vocabulary based on the alignment between the net-abstracted ontology and GO. To further improve performance, gene clusters treated as a single term in a complete biological network were identified. Although these methods automatically infer new terms and their relationships based on structured networks, the new terms are still named manually by experts, which remains prone to inefficiency and inconsistency.
To automatically acquire term names and thereby facilitate term construction, the present invention proposes a new method of generating term names based on the text information of related genes. An example of this task is shown in FIG. 1: the term numbered 0001, consisting of the genes IGFBP3, OGFR and BAP1, is named "regulating cell growth". Because there is some overlap between term names and gene text (aliases and descriptions), the goal of the embodiments of the present specification is to generate term names based on gene text.
Therefore, the invention provides a method, a device and a storage medium for generating terms, namely for generating term names for GO, and establishes a large-scale reference data set. In addition, a graph-based generative model is proposed that integrates the relationships between genes, words and terms for the generation of term names. Fig. 2 is a flowchart of a term generation method provided in an embodiment of the present application; as shown in fig. 2, the method includes the following steps.
s102, receiving a text to be processed.
In a specific implementation process, the text to be processed may be the gene composition of a term, a gene library serial number corresponding to the genes, and the like. It can be understood that the text to be processed characterizes the genes of the term in some expressed form. Illustratively, the text to be processed may be the gene set numbered 0001, the genes consisting of IGFBP3, OGFR and BAP1, or an expert-named gene name.
The text to be processed may include one text subset to be processed, or may include a plurality of text subsets to be processed, which may be specifically set according to actual needs, and embodiments of the present specification are not specifically limited.
And S104, acquiring word information and gene information of the text to be processed.
In a specific implementation process, the associated word information and gene information may be obtained according to the text to be processed. It can be understood that, since the compositions, descriptions and codes related to genes are already stored in the data set and associated with each other, the method can obtain all associated information according to the text to be processed, where the information at least includes word information and gene information. It should be noted that the data set may be used for information acquisition before generation of the GO term name.
S106, generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training on the association relationships among terms, gene information and word information.
In a specific implementation, a pre-constructed term generation model is used, where the term generation model integrates the potential relationships between genes, words and terms to generate term names. A large number of experiments demonstrate the effectiveness of this model. The target term corresponding to the text to be processed is therefore generated by using the pre-constructed term generation model.
The term generation method, device and storage medium provided by the disclosure can acquire corresponding word information and gene information according to a text to be processed provided by a user, and generate a target term corresponding to the text to be processed by using a pre-constructed term generation model, improving term accuracy, reducing the subjectivity introduced by manual definition, making the generated terms more suitable for wide application and popularization, and promoting the development of biology and medicine.
On the basis of the above embodiments, in an embodiment of this specification, the term generation model includes a term coding sub-model and a term decoding sub-model, the term coding sub-model is used to generate term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used to decode the term coding information to obtain a target term corresponding to the text to be processed.
In a specific implementation process, the term generation model includes a term coding sub-model and a term decoding sub-model, wherein the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
Illustratively, when the text to be processed is the gene label 0001, the related gene composition (gene information: IGFBP3, OGFR and BAP1) and the corresponding explanations (word information: IGFBP3 has the alias insulin-like growth factor binding protein 3 and the description of altering the interaction with cell surface receptors; OGFR has the alias opioid growth factor receptor; BAP1 has the alias binding protein 1 and the description of participating in the regulation of the cell cycle) can be obtained according to the gene label. The pre-constructed term generation model analyzes the gene composition (gene information) and the corresponding explanations (word information) to obtain the term encoding information of the text to be processed, and then decodes the term encoding information to obtain the target term corresponding to the text to be processed. In this example, it can be concluded that the term name of the gene set under reference numeral 0001 may be "regulation of cell growth".
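The lookup step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and index names are assumptions, and the toy data mirrors the example for gene label 0001.

```python
# Hypothetical sketch: given a term id, gather the associated gene
# information and word information that the term generation model consumes.
def gather_model_inputs(term_id, gene_index, text_index):
    """Collect gene info and word info (aliases/descriptions) for a term id."""
    genes = gene_index[term_id]              # gene information
    words = [text_index[g] for g in genes]   # word information per gene
    return genes, words

# Toy data mirroring the example for term id "0001" (illustrative only).
gene_index = {"0001": ["IGFBP3", "OGFR", "BAP1"]}
text_index = {
    "IGFBP3": "insulin-like growth factor binding protein 3",
    "OGFR": "opioid growth factor receptor",
    "BAP1": "binding protein 1",
}

genes, words = gather_model_inputs("0001", gene_index, text_index)
```

The two indexes stand in for the data set in which gene compositions, descriptions and codes are stored and associated with each other.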
The term generation method provided by the embodiments of the specification can accelerate term generation and reduce the information processing load of a single model through the arrangement of the term coding sub-model and the term decoding sub-model, and separating the model that generates the term encoding information from the model that decodes it can increase the confidentiality of the invention.
On the basis of the above embodiments, in an embodiment of the present specification, the term generation model is constructed by the following method:
collecting a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct the term coding sub-model;
and training and constructing the term decoding sub-model according to the term coding information, generated by the term coding sub-model, of each sample text in the sample data set and the term name of each sample text in the sample data set.
On the basis of the above embodiments, in an embodiment of the present specification, the term decoding submodel decodes the term encoding information by using a copy mechanism to obtain a target term corresponding to the text to be processed.
On the basis of the foregoing embodiments, in an embodiment of the present specification, the constructing a term heterogeneous graph according to the term name, the gene information, and the abstract information of each sample text in the sample data set includes:
the nodes in the term heterogeneous graph are the term names, gene information or abstract information of each sample text in the sample data set, and the edges in the term heterogeneous graph are word normalization values or gene-term values, wherein a word normalization value represents the normalized value of a word in the sample text, and a gene-term value represents the similarity between a gene and a term in the sample text.
In a specific implementation, the sample data set is a large-scale data set that contains the ontology terms for Homo sapiens (human). The sample data set may collect the term id, term name and corresponding gene ids from the Gene Ontology Consortium. In addition, the gene alias names and descriptions may come from GeneCards, which incorporates information from the Universal Protein Resource (UniProt).
Exemplarily, a sample data set is established: the sample data set can be constructed by collecting samples with defined term names. The sample data set may comprise a plurality of samples, each sample containing a term id, a term name and the associated genes with alias names and descriptions, as shown in fig. 1.
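A minimal sketch of one such sample follows: a term id, a term name, and the associated genes with alias names and descriptions. The field and class names are assumptions for illustration, not taken from the application.

```python
# Hypothetical data layout for one sample in the data set described above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Gene:
    gene_id: str       # gene identifier
    alias: str         # alias name
    description: str   # textual description

@dataclass
class Sample:
    term_id: str                                  # term id
    term_name: str                                # expert-assigned term name
    genes: List[Gene] = field(default_factory=list)  # associated genes

sample = Sample(
    term_id="0001",
    term_name="regulating cell growth",
    genes=[Gene("IGFBP3", "insulin-like growth factor binding protein 3",
                "alters the interaction with cell surface receptors")],
)
```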
On the other hand, the present invention provides a method for constructing a term generation model, as shown in fig. 4, fig. 4 is a method for constructing a term generation model provided in an embodiment of the present application, and the method includes:
s402, constructing a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
S404, constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
S406, learning the term heterogeneous graph by using a convolutional neural network algorithm to construct a term coding sub-model in the term generation model;
S408, training and constructing the term decoding sub-model in the term generation model according to the term coding information, generated by the term coding sub-model, of each sample text in the sample data set and the term name of each sample text in the sample data set.
In a specific implementation, analysis of the samples shows that a relatively high proportion of words is shared between the term names and the associated genes, indicating the feasibility of generating term names from the textual information of genes. Moreover, patterns such as "regulation of" often appear in term names.
The overall architecture of the graph-based generative model of the embodiments of the present specification is shown in fig. 3. Fig. 3 is a framework diagram of the graph-based generative model provided by an embodiment of the present application; it is composed of two parts: a GCN (graph convolutional network) based encoder on the left side of fig. 3 and a graph-attention based decoder on the right side of fig. 3.
First, a heterogeneous graph is constructed based on the data set, and then representation learning is performed using graph convolutional networks; the GCN-based encoder aims to encode the relationships between genes, words and terms to facilitate the generation of term names.
The nodes in the term heterogeneous graph are words, genes and terms, and the edges reflect the relationships between them. Notably, the words, genes and terms can all be derived from the gene text. There are two types of edges: gene-word edges and term-gene edges. The value of a gene-word edge is the normalized count of the word in the gene text, while the value of a term-gene edge is 1 if the gene can be annotated with the term.
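The edge weights described above can be sketched as an adjacency matrix, under the stated assumptions: gene-word edges carry the normalized count of the word in that gene's text, and term-gene edges are 1 when the gene is annotated with the term. The node ordering (one term, then genes, then words) and the toy data are illustrative only.

```python
# Hypothetical construction of the heterogeneous-graph adjacency matrix.
import numpy as np

def build_adjacency(gene_texts, term_genes, vocab):
    """gene_texts: {gene: [tokens]}; term_genes: genes annotated with the term."""
    genes = list(gene_texts)
    n = 1 + len(genes) + len(vocab)          # term node + gene nodes + word nodes
    A = np.zeros((n, n))
    g_off, w_off = 1, 1 + len(genes)
    for gi, g in enumerate(genes):
        if g in term_genes:                  # term-gene edge, value 1
            A[0, g_off + gi] = A[g_off + gi, 0] = 1.0
        tokens = gene_texts[g]
        for wi, w in enumerate(vocab):       # gene-word edge: normalized count
            c = tokens.count(w) / len(tokens)
            A[g_off + gi, w_off + wi] = A[w_off + wi, g_off + gi] = c
    return A

gene_texts = {"IGFBP3": ["growth", "factor", "binding", "protein"],
              "OGFR": ["growth", "factor", "receptor"]}
vocab = ["growth", "factor", "binding", "protein", "receptor"]
A = build_adjacency(gene_texts, {"IGFBP3", "OGFR"}, vocab)
```

The matrix is kept symmetric since the graph is undirected.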
The initial representation of a word node is the embedding of the word.
As shown in fig. 3, for gene nodes, the alias names and descriptions of genes encoded by a GRU (Gated Recurrent Unit) neural network model are used as initial representations. For term nodes, pooling over the related gene node representations is adopted as the initial representation. The nodes are then updated through a GCN model, whose effectiveness in modeling the structural information is established through the following formula:
Z = Â σ(Â X W^(0)) W^(1)

where Â is the normalized form of A + I, A is the adjacency matrix of the graph, and I is the identity matrix. X is the initial representation of the nodes, denoted X = (t, g_1, ..., g_m, w_1, ..., w_n), where g_i and w_i represent the initial representations of the i-th gene and the i-th word respectively, and t represents the initial representation of the term. W^(0) and W^(1) represent the weight matrices of the first and second layers of the GCN.
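The two-layer GCN update can be sketched in plain numpy. This is a minimal illustration assuming the standard symmetric normalization of A + I and a ReLU nonlinearity; the application's exact choices are not specified, and the dimensions below are arbitrary.

```python
# Hypothetical two-layer GCN forward pass over the heterogeneous graph.
import numpy as np

def gcn_forward(A, X, W0, W1):
    A_hat = A + np.eye(A.shape[0])               # add self-loops: A + I
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetric normalization
    H = np.maximum(0.0, A_norm @ X @ W0)         # first layer with ReLU
    return A_norm @ H @ W1                       # second layer

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])           # toy 2-node graph
X = rng.standard_normal((2, 4))                  # initial node representations
Z = gcn_forward(A, X, rng.standard_normal((4, 8)), rng.standard_normal((8, 3)))
```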
Based on the effectiveness of attention mechanisms in generation, the present invention employs a graph-attention based decoder (the term decoding sub-model) to generate term names. The decoder attends over the word node representations produced by the GCN, expressed as:
a_j = softmax(v^T tanh(W_a [h_{t-1}; w'_j]))
where h_{t-1} is the previous hidden state, w'_j is the word node representation from the GCN, v is a parameter vector, and W_a is a parameter matrix.
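The attention score above can be computed over all word nodes at once, as in the following sketch. Dimensions are illustrative assumptions.

```python
# Hypothetical attention: a_j = softmax(v^T tanh(W_a [h_{t-1}; w'_j])).
import numpy as np

def attention_scores(h_prev, word_nodes, W_a, v):
    """h_prev: (d,); word_nodes: (n, d); W_a: (d, 2d); v: (d,)."""
    n = word_nodes.shape[0]
    concat = np.hstack([np.tile(h_prev, (n, 1)), word_nodes])  # [h_{t-1}; w'_j]
    scores = np.tanh(concat @ W_a.T) @ v                       # v^T tanh(...)
    e = np.exp(scores - scores.max())                          # stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
d, n = 4, 5
a = attention_scores(rng.standard_normal(d), rng.standard_normal((n, d)),
                     rng.standard_normal((d, 2 * d)), rng.standard_normal(d))
```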
In view of the word overlap between the gene text and the term names, the embodiments of the present specification utilize a copy mechanism for decoding, making it possible to generate words either from the vocabulary of the training set or from the current gene text. The initial hidden state h_0 is the term node representation (t') obtained by the GCN, and the hidden state update is:
h_t = f([h_{t-1}; w_{t-1}; a_t; w'_SR])
where f represents the RNN function, w_{t-1} is the embedding of the previously generated word, and w'_SR is the selective read (SR) vector in CopyNet. When the previously generated word appears in the gene text, the next word may also come from it, so w'_SR is the node representation of the previous word; otherwise it is a zero vector.
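One decoder step can be sketched as below. This is an assumption-laden illustration: the RNN cell f is stubbed with a single tanh layer, and all names and dimensions are hypothetical; only the selective-read branching follows the description above.

```python
# Hypothetical decoder state update h_t = f([h_{t-1}; w_{t-1}; a_t; w'_SR]).
import numpy as np

def decoder_step(h_prev, w_prev_emb, attn_ctx, prev_word, source_words,
                 word_node_reps, W):
    d = h_prev.shape[0]
    if prev_word in source_words:            # previous word occurs in gene text
        w_sr = word_node_reps[prev_word]     # selective read vector (CopyNet)
    else:
        w_sr = np.zeros(d)                   # otherwise a zero vector
    x = np.concatenate([h_prev, w_prev_emb, attn_ctx, w_sr])
    return np.tanh(W @ x)                    # stand-in for the RNN cell f

rng = np.random.default_rng(2)
d = 4
word_node_reps = {"growth": rng.standard_normal(d)}
W = rng.standard_normal((d, 4 * d))
h = decoder_step(rng.standard_normal(d), rng.standard_normal(d),
                 rng.standard_normal(d), "growth", {"growth", "factor"},
                 word_node_reps, W)
```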
The probability of generating a target word y_t is calculated as a mixture of the probabilities of the generation mode and the copy mode, i.e., p(y_t) = p_gen(y_t) + p_copy(y_t), where the two modes are scored by the scoring functions for the generation mode and the copy mode respectively, V represents the vocabulary in the training set, and S represents the source words in the gene text. It should be noted that there are many fixed patterns in the term names mentioned above; therefore, frequent two-word and three-word patterns are extracted and treated as new words when generating terms.
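The mixing of the two modes can be sketched as a joint softmax over the vocabulary V and the source words S, summing the probability mass for a word that appears in both. The scoring functions themselves are stubbed with given score arrays; this is an illustrative assumption, not the patented formulation.

```python
# Hypothetical mixture of generation-mode and copy-mode probabilities.
import numpy as np

def mixed_probability(vocab, gen_scores, source, copy_scores, target):
    words = list(vocab) + list(source)               # V entries then S entries
    scores = np.concatenate([gen_scores, copy_scores])
    e = np.exp(scores - scores.max())
    p = e / e.sum()                                  # joint softmax over modes
    return sum(pi for w, pi in zip(words, p) if w == target)

vocab = ["regulation", "of", "cell"]                 # training-set vocabulary V
source = ["cell", "growth"]                          # source words S (gene text)
gen_scores = np.array([1.0, 0.5, 0.2])
copy_scores = np.array([0.8, 0.3])
p = mixed_probability(vocab, gen_scores, source, copy_scores, "cell")
total = sum(mixed_probability(vocab, gen_scores, source, copy_scores, w)
            for w in set(vocab) | set(source))
```

A word like "cell" that occurs in both V and S accumulates mass from both modes.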
The sample data set may be divided into a training set, a validation set, and a test set in a ratio of 8:1:1. The embodiments of the present specification may adopt BLEU (bilingual evaluation understudy) and Rouge-1, Rouge-2 and Rouge-L as evaluation indexes for the generation task. Word embeddings are initialized from N(0,1) with a dimension of 300 and are updated during training. The dimension of the hidden units of the GRU and the GCN is 300. Xavier initialization is adopted for the other parameters, and the dropout rate is set to 0.5. Training adopts the Adam optimizer with a learning rate of 1e-3.
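The 8:1:1 split can be sketched as follows; the fixed shuffle seed is an assumption added for reproducibility.

```python
# Hypothetical 8:1:1 train/validation/test split of the sample data set.
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)     # deterministic shuffle (assumed seed)
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train, val, test = split_dataset(range(1000))
```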
To evaluate the effectiveness of the proposed model, the embodiments herein compare against advanced baselines in two categories: extractive models, (1) TF-IDF and (2) LexRank, which extract words from the gene text as term names; and generative models, (3) Seq2Seq, (4) HRNNLM and (5) Transformer, which generate words from the vocabulary space as term names.
The results show that generative models outperform extractive models in terms of generation language probability, so the generated term names are more coherent. Moreover, extractive models usually extract keywords independently, which makes it difficult to form complete and concise term names. The graph-based generative model of the embodiments of the present specification thus achieves the best results in all cases by integrating the relationships between genes, words and terms into the generation.
Other generative models carry unnecessary information from the multiple gene sequences, which may adversely affect the generation of term names. Through experiments, the embodiments of the present specification find that treating frequent patterns as new words during generation and then recovering them can further improve performance. In addition, the copy mechanism helps to improve generation performance; in particular, the BLEU score demonstrates the effectiveness of generating term names using the words shared between genes and terms.
The present application provides a GO-based automatic term generation method, a sample data set constructed by the embodiments of the specification, and a term coding sub-model and a term decoding sub-model. Experimental results show that, by modeling the relationships among genes, words and terms, the term generation model provided by the embodiments of the specification outperforms other strong models.
Conventional generative models only capture the sequential information of the source text when generating sentences, ignoring the latent structure in the text. To solve this problem, the embodiments of the present specification construct a heterogeneous graph with words, genes and terms as nodes, and generate term names using a graph-based generative model.
In another aspect, an embodiment of the present specification provides a term generation device. As shown in fig. 5, which is a schematic structural diagram of the term generation device provided in an embodiment of the present application, the device includes:
a text receiving module 510, configured to receive a text to be processed;
an information obtaining module 520, configured to obtain word information and gene information of the text to be processed;
a term generating module 530, configured to generate a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and gene information of the text to be processed; wherein the term generation model is obtained by training based on the association relationships among terms, gene information, and word information.
On the basis of the above embodiments, in an embodiment of this specification, the term generation model includes a term coding sub-model and a term decoding sub-model. The term coding sub-model is used to generate term coding information of the text to be processed according to the word information and gene information of the text to be processed, and the term decoding sub-model is used to decode the term coding information to obtain a target term corresponding to the text to be processed.
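The division of labor between the two sub-models can be sketched as a call sequence. The `encoder`, `decoder`, `tokenizer`, and `gene_tagger` arguments are placeholders for trained components; only the order of operations reflects the description above, not any specific implementation from the patent.

```python
def generate_term(text, encoder, decoder, tokenizer, gene_tagger, max_len=16):
    """End-to-end flow: extract word and gene information from the text,
    encode it with the term coding sub-model, then decode token by token
    with the term decoding sub-model until an end-of-sequence marker."""
    words = tokenizer(text)           # word information of the text
    genes = gene_tagger(text)         # gene information (e.g. gene mentions)
    encoding = encoder(words, genes)  # term coding sub-model
    term_tokens, state = [], None
    for _ in range(max_len):
        token, state = decoder(encoding, state)  # term decoding sub-model
        if token == "<eos>":
            break
        term_tokens.append(token)
    return " ".join(term_tokens)
```

With toy stand-ins (a whitespace tokenizer, an uppercase-word gene tagger, and a scripted decoder), the function simply threads the components together in the order the embodiment describes.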
In another aspect, an embodiment of the present specification provides a term generation apparatus, including:
a data set constructing module 610, configured to construct a sample data set, where the sample data set includes a plurality of sample texts and the term names, gene information, and abstract information corresponding to the sample texts;
a heterogeneous graph construction module 620, configured to construct a term heterogeneous graph according to the terms, gene information, and word information in each sample text in the sample data set;
a coding sub-model building module 630, configured to learn the term heterogeneous graph by using a convolutional neural network algorithm, so as to build the term coding sub-model in the term generation model;
a decoding sub-model building module 640, configured to train and build the term decoding sub-model in the term generation model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text.
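The graph-learning step performed by the coding sub-model building module can be illustrated with a single standard graph-convolution layer over the term heterogeneous graph's adjacency matrix. A trained sub-model would stack several such layers with learned weight matrices; the symmetric normalization shown here is one common choice, assumed for illustration rather than taken from the embodiments.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution step: add self-loops, symmetrically normalize
    the adjacency matrix, propagate node features to their neighbors, and
    apply a linear transform followed by a ReLU activation."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # D^{-1/2}
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)  # ReLU
```

For a two-node graph with one edge and identity features and weights, each output row becomes the average of the two nodes' features, showing the neighborhood mixing the layer performs.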
The device is based on the same conception as the method described above and is therefore not described in detail again here.
In another aspect, an embodiment of the present specification provides a term generation processing apparatus, including: at least one processor and a memory for storing processor-executable instructions, the processor implementing the method described above when executing the instructions.
The memory may be used to store software programs and modules, and the processor executes various functional applications and performs data processing by running the software programs and modules stored in the memory. The memory may mainly comprise a program storage area and a data storage area: the program storage area may store an operating system, application programs required by functions, and the like, while the data storage area may store data created according to use of the apparatus, and the like. Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
Since the technical effects of the term generation apparatus and the processing device are the same as those of the term generation method, the description thereof is omitted.
The method provided by the embodiment of the application can be executed on a mobile terminal, a computer terminal, a server, or a similar computing device. Taking operation on a server as an example, fig. 7 is a hardware structure block diagram of a server for the term generation method provided in the embodiment of the present application. As shown in fig. 7, the server 700 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 710 (the processor 710 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 730 for storing data, and one or more storage media 720 (e.g., one or more mass storage devices) for storing an application 723 or data 722. The memory 730 and the storage medium 720 may be transient storage or persistent storage. The program stored in the storage medium 720 may include one or more modules, each of which may include a series of instruction operations for the server. Still further, the central processor 710 may be configured to communicate with the storage medium 720 and execute, on the server 700, the series of instruction operations in the storage medium 720. The server 700 may also include one or more power supplies 760, one or more wired or wireless network interfaces 750, one or more input/output interfaces 740, and/or one or more operating systems 721, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The input/output interface 740 may be used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the server 700. In one example, the input/output interface 740 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In another example, the input/output interface 740 may be a Radio Frequency (RF) module, which is used for communicating with the internet wirelessly.
It will be understood by those skilled in the art that the structure shown in fig. 7 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 700 may also include more or fewer components than shown in FIG. 7, or have a different configuration than shown in FIG. 7.
Embodiments of the present application further provide a storage medium, which may be disposed in a server to store at least one instruction or at least one program for implementing a term generation method in the method embodiments, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the term generation method provided in the method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
It should be noted that the order of the embodiments of the present application is for description only and does not represent the relative merits of the embodiments. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, device and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware to implement the above embodiments, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. A method for generating terms, the method comprising:
receiving a text to be processed;
acquiring word information and gene information of the text to be processed;
generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training based on the association relationships among terms, gene information, and word information.
2. The method according to claim 1, wherein the term generation model includes a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
3. The method of claim 2, wherein the term generation model is constructed using the following method:
collecting a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the term name, the gene information and the abstract information of each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct the term coding sub-model;
and training and constructing the term decoding sub-model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text in the sample data set.
4. The method of claim 2, wherein the term decoding submodel decodes the term encoding information using a copy mechanism to obtain a target term corresponding to the text to be processed.
5. The method according to claim 3, wherein the constructing a term heterogeneous graph according to the term names, the gene information and the abstract information of each sample text in the sample data set comprises:
the nodes in the term heterogeneous graph are the term names, gene information or abstract information of each sample text in the sample data set, and the edges in the term heterogeneous graph carry word normalization values or gene-term values, wherein a word normalization value represents the normalized value of a word in the sample text, and a gene-term value is used for representing the similarity between a gene and a term in the sample text.
6. A method for constructing a term generation model is characterized by comprising the following steps:
constructing a sample data set, wherein the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
learning the term heterogeneous graph by using a convolutional neural network algorithm to construct a term coding sub-model in the term generation model;
and training and constructing the term decoding sub-model in the term generation model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text in the sample data set.
7. A term generation apparatus, comprising:
the text receiving module is used for receiving the text to be processed;
the information acquisition module is used for acquiring word information and gene information of the text to be processed;
the term generation module is used for generating a target term corresponding to the text to be processed by using a pre-constructed term generation model according to the word information and the gene information of the text to be processed; wherein the term generation model is obtained by training based on the association relationships among terms, gene information, and word information.
8. The apparatus of claim 7, wherein the term generation model comprises a term coding sub-model and a term decoding sub-model, the term coding sub-model is used for generating term coding information of the text to be processed according to word information and gene information of the text to be processed, and the term decoding sub-model is used for decoding the term coding information to obtain a target term corresponding to the text to be processed.
9. A term generation apparatus, comprising:
the data set construction module is used for constructing a sample data set, and the sample data set comprises a plurality of sample texts and term names, gene information and abstract information corresponding to the sample texts;
the heterogeneous graph construction module is used for constructing a term heterogeneous graph according to the terms, gene information and word information in each sample text in the sample data set;
the coding sub-model building module is used for learning the term heterogeneous graph by using a convolutional neural network algorithm to build a term coding sub-model in the term generation model;
and the decoding sub-model building module is used for training and building the term decoding sub-model in the term generation model according to the term coding information generated by the term coding sub-model for each sample text in the sample data set and the term name of each sample text in the sample data set.
10. A term generation processing device, comprising: at least one processor and a memory for storing processor-executable instructions, the processor implementing the method of any one of claims 1-6 when executing the instructions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010716035.0A CN112100320B (en) | 2020-07-23 | 2020-07-23 | Term generating method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112100320A true CN112100320A (en) | 2020-12-18 |
CN112100320B CN112100320B (en) | 2023-09-26 |
Family
ID=73750036
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010716035.0A Active CN112100320B (en) | 2020-07-23 | 2020-07-23 | Term generating method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112100320B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004310688A (en) * | 2003-04-10 | 2004-11-04 | Genaris Inc | Gene structure identification method of prokaryote and estimation method of microorganism from which dna fragment derives |
US20130218849A1 (en) * | 2012-01-31 | 2013-08-22 | Tata Consultancy Services Limited | Automated dictionary creation for scientific terms |
CN106919689A (en) * | 2017-03-03 | 2017-07-04 | 中国科学技术信息研究所 | Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge |
US20180102062A1 (en) * | 2016-10-07 | 2018-04-12 | Itay Livni | Learning Map Methods and Systems |
CN109325226A (en) * | 2018-09-10 | 2019-02-12 | 广州杰赛科技股份有限公司 | Term extraction method, apparatus and storage medium based on deep learning network |
US20190122145A1 (en) * | 2017-10-23 | 2019-04-25 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus and device for extracting information |
Also Published As
Publication number | Publication date |
---|---|
CN112100320B (en) | 2023-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20170161635A1 (en) | Generative machine learning systems for drug design | |
CN112256828B (en) | Medical entity relation extraction method, device, computer equipment and readable storage medium | |
CN110362684A (en) | A kind of file classification method, device and computer equipment | |
Zhang et al. | Seq3seq fingerprint: towards end-to-end semi-supervised deep drug discovery | |
CN113707235A (en) | Method, device and equipment for predicting properties of small drug molecules based on self-supervision learning | |
Dang et al. | Stochastic variational inference for Bayesian phylogenetics: a case of CAT model | |
CN111627494B (en) | Protein property prediction method and device based on multidimensional features and computing equipment | |
US20240055071A1 (en) | Artificial intelligence-based compound processing method and apparatus, device, storage medium, and computer program product | |
CN112151127A (en) | Unsupervised learning drug virtual screening method and system based on molecular semantic vector | |
CN114724623A (en) | Method for predicting drug-target affinity of protein multi-source feature fusion | |
CN114822717A (en) | Artificial intelligence-based drug molecule processing method, device, equipment and storage medium | |
WO2023284716A1 (en) | Neural network searching method and related device | |
CN114360644A (en) | Method and system for predicting combination of T cell receptor and epitope | |
Liu et al. | Simulated annealing for optimization of graphs and sequences | |
WO2022188653A1 (en) | Molecular scaffold hopping processing method and apparatus, medium, electronic device and computer program product | |
Abdine et al. | Prot2text: Multimodal protein’s function generation with GNNs and transformers | |
Chalumeau et al. | Qdax: A library for quality-diversity and population-based algorithms with hardware acceleration | |
CN113571125A (en) | Drug target interaction prediction method based on multilayer network and graph coding | |
CN112100320B (en) | Term generating method, device and storage medium | |
CN112086133A (en) | Drug target feature learning method and device based on text implicit information | |
CN116978464A (en) | Data processing method, device, equipment and medium | |
CN114420221A (en) | Knowledge graph-assisted multitask drug screening method and system | |
CN112686306B (en) | ICD operation classification automatic matching method and system based on graph neural network | |
Qu et al. | Hyperbolic neural networks for molecular generation | |
KR20230091156A (en) | Drug Optimization by Active Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: F2-2, Building 2, Science and Technology Business Incubator, Huainan Hi-tech Zone, Anhui Province 232000
Applicant after: Anhui Midu Intelligent Technology Co.,Ltd.
Address before: 1st floor, Building 3, Science and Technology Business Incubator, High-tech Zone, Huainan City, Anhui Province 232000
Applicant before: Anhui zhengnuo Intelligent Technology Co.,Ltd.
GR01 | Patent grant | ||