CN112906395A

CN112906395A - Drug relationship extraction method, device, equipment and storage medium

Info

Publication number: CN112906395A
Application number: CN202110322905.0A
Authority: CN
Inventors: 付桂振; 顾大中; 徐任翔
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2021-06-04
Anticipated expiration: 2041-03-26
Also published as: CN112906395B

Abstract

The invention relates to the field of data processing and discloses a drug relationship extraction method, device, equipment and storage medium, which address the technical problem that drug relationships extracted in the prior art are not accurate enough. The method comprises the following steps: extracting, from a document to be extracted, a target sentence containing at least two drug entities; inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to the drug entities; extracting an existing drug information text from a pre-established drug information base, establishing an existing drug relation graph, and inputting the existing drug relation graph into a preset second feature extraction model for feature extraction to obtain a second feature vector; and extracting the drug promotion relationship contained in the document to be extracted based on the combined feature vector obtained by combining the first feature vector and the second feature vector. In addition, the invention also relates to blockchain technology, and information related to the drug relationship extraction task can be stored in a blockchain.

Description

Drug relationship extraction method, device, equipment and storage medium

Technical Field

The invention relates to the field of data processing, in particular to a method, a device, equipment and a storage medium for extracting a medicine relation.

Background

In the medical field, different drugs cannot simply be used in combination, and the naive combined use of some drugs can have serious consequences; for example, the combined use of aspirin and Abelidol can increase the risk of hypertension. Other drugs promote each other, and using two different drugs at the same time can achieve a better treatment effect.

At present, when combining drugs, medical staff mainly decide on drug use by consulting historical usage information and historical drug promotion relationships. However, manually consulting medical literature is time-consuming and labor-intensive, and the drug relationships obtained in this way are not accurate enough, which greatly slows the research and discovery of clinical applications.

Disclosure of Invention

The invention mainly aims to solve the technical problem that the accuracy of the extracted medicine relationship in the prior art is not enough.

The invention provides a medicine relation extraction method in a first aspect, which comprises the following steps:

extracting a target statement in a document to be extracted, wherein the target statement is a statement at least containing two drug entities;

inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to a drug entity in the target sentence;

extracting an existing medicine information text in a pre-established medicine information base, and establishing an existing medicine relation graph based on the existing medicine information text;

inputting the existing medicine relation graph into a preset second feature extraction model for feature extraction to obtain a second feature vector related to the existing medicine information;

and combining the first characteristic vector and the second characteristic vector to obtain a combined characteristic vector, and extracting the medicine promotion relation contained in the document to be extracted based on the combined characteristic vector.

Optionally, in a first implementation manner of the first aspect of the present invention, the extracting a target sentence in a document to be extracted includes:

calling a character extraction algorithm to identify and extract characters in the document to be extracted to obtain text data of the document to be extracted;

inputting the text data into an entity extraction model established in advance based on a deep learning algorithm for recognition to obtain medicine entity words in the text data;

and finding out and storing sentences containing at least two drug entity words to obtain target sentences in the literature to be extracted.

Optionally, in a second implementation manner of the first aspect of the present invention, the entity extraction model includes a convolutional neural network layer, a bidirectional long and short term memory network layer, and a conditional random field layer, and the inputting the text data into an entity extraction model established in advance based on a deep learning algorithm for recognition to obtain the drug entity words in the text data includes:

inputting the text data into a convolutional neural network layer to encode words in the text data to obtain word encoding information;

inputting the word coding information into a bidirectional long and short term memory network layer, and identifying the part of speech of each word in the text data according to the context information of each word in the text data to obtain the part of speech tag probability of each word;

inputting the part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain the tag optimization probability of each word;

and judging the final label of each word according to the label optimization probability, and screening to obtain the drug entity word in the text data according to the final label.

Optionally, in a third implementation manner of the first aspect of the present invention, the first feature extraction model includes a vector embedding layer, a convolution layer, and a pooling layer, and the inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to a drug entity in the target sentence includes:

inputting the target statement into a vector embedding layer, and labeling words in the target statement by using vectors to obtain word labeling vectors;

inputting the word label vector into a convolutional layer for feature extraction to obtain a feature vector matrix corresponding to the word label vector;

and inputting the feature vector matrix into a pooling layer to extract the maximum features in the feature vector matrix to obtain a first feature vector.

Optionally, in a fourth implementation manner of the first aspect of the present invention, the second feature extraction model includes a sampling layer and a natural language processing layer, and the inputting the existing drug relation diagram into a preset second feature extraction model for feature extraction to obtain a second feature vector related to the existing drug information includes:

inputting the existing drug relationship graph into a sampling layer to sample the neighbor sequence of each node in the existing drug relationship graph to obtain a node sequence set;

and inputting the node sequence set into a natural language processing layer for vector embedding to obtain a second characteristic vector related to each medicine.

Optionally, in a fifth implementation manner of the first aspect of the present invention, before the extracting a target sentence in a document to be extracted, the method further includes:

acquiring a drug promotion relation graph with labels and an unoptimized graph convolution extraction model;

and forming a relation graph training set by using the labeled drug promotion relation graphs, and calling the relation graph training set to train the unoptimized graph convolution extraction model to obtain a second feature extraction model.

Optionally, in a sixth implementation manner of the first aspect of the present invention, the extracting, based on the combined feature vector, a medicine promotion relationship included in the document to be extracted includes:

calling a softmax function to carry out normalization processing on the combined feature vector to obtain the medicine related information probability;

and judging the drug relationship based on the drug-related information probability to obtain a drug relationship classification result, and storing the classification result as the drug information corresponding to the feature vector with the drug promotion relationship to obtain the drug promotion relationship.

A second aspect of the present invention provides a medication relation extraction apparatus, comprising:

the document extraction module is used for extracting a target statement in a document to be extracted, wherein the target statement is a statement at least containing two drug entities;

the first feature extraction module is used for inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to a drug entity in the target sentence;

the relation graph establishing module is used for extracting an existing medicine information text in a pre-established medicine information base and establishing an existing medicine relation graph based on the existing medicine information text;

the second feature extraction module is used for inputting the existing medicine relation graph into a preset second feature extraction model for feature extraction to obtain a second feature vector related to the existing medicine information;

and the promotion relation acquisition module is used for combining the first characteristic vector and the second characteristic vector to obtain a combined characteristic vector, and extracting the medicine promotion relation contained in the document to be extracted based on the combined characteristic vector.

Optionally, in a first implementation manner of the second aspect of the present invention, the document extraction module includes:

the document data capturing unit is used for calling a character extraction algorithm to identify and extract characters in a document to be extracted to obtain text data of the document to be extracted;

the entity relationship extraction unit is used for inputting the text data into an entity extraction model which is established in advance based on a deep learning algorithm for recognition to obtain medicine entity words in the text data;

and the sentence searching unit is used for searching and storing sentences containing at least two drug entity words to obtain target sentences in the documents to be extracted.

Optionally, in a second implementation manner of the second aspect of the present invention, the entity relationship extracting unit includes:

the convolutional neural network subunit is used for inputting the text data into a convolutional neural network layer to encode words in the text data to obtain word encoding information;

the bidirectional long and short term memory network subunit is used for inputting the word coding information into a bidirectional long and short term memory network layer, and identifying the part of speech of each word in the text data according to the context information of each word in the text data to obtain the part of speech tag probability of each word;

the conditional random field subunit is used for inputting the part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain the tag optimization probability of each word;

and the label screening subunit is used for judging the final label of each word according to the label optimization probability and screening the final label to obtain the drug entity word in the text data.

Optionally, in a third implementation manner of the second aspect of the present invention, the first feature extraction module includes:

the vector embedding unit is used for inputting the target statement into the vector embedding layer and labeling words in the target statement by using vectors to obtain word labeling vectors;

the convolution extraction unit is used for inputting the word label vector into a convolution layer for feature extraction to obtain a feature vector matrix corresponding to the word label vector;

and the pooling unit is used for inputting the characteristic vector matrix into a pooling layer to extract the maximum characteristic in the characteristic vector matrix to obtain a first characteristic vector.

Optionally, in a fourth implementation manner of the second aspect of the present invention, the second feature extraction module includes:

the sampling unit is used for inputting the existing drug relationship graph into a sampling layer to sample the neighbor sequence of each node in the existing drug relationship graph to obtain a node sequence set;

and the vector embedding unit is used for inputting the node sequence set into a natural language processing layer for vector embedding to obtain a second feature vector related to each medicine.

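Optionally, in a fifth implementation manner of the second aspect of the present invention, the drug relationship extraction apparatus further includes a second feature extraction model construction module, where the second feature extraction model construction module is specifically configured to:

acquire labeled drug promotion relation graphs and an unoptimized graph convolution extraction model;

and form a relation graph training set from the labeled drug promotion relation graphs, and call the relation graph training set to train the unoptimized graph convolution extraction model to obtain the second feature extraction model.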

Optionally, in a sixth implementation manner of the second aspect of the present invention, the promotion relation acquisition module includes:

the normalization unit is used for calling a softmax function to perform normalization processing on the combined feature vector to obtain the medicine related information probability;

and the classification unit is used for judging the medicine relation based on the medicine related information probability to obtain a medicine relation classification result, and storing the classification result as the medicine information corresponding to the feature vector with the medicine promotion relation to obtain the medicine promotion relation.

A third aspect of the present invention provides drug relationship extraction equipment comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the drug relationship extraction equipment to perform the steps of the drug relationship extraction method described above.

A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the drug relationship extraction method described above.

In the technical scheme provided by the invention, target sentences containing at least two drug entities are extracted from the document to be extracted; a pre-established first feature extraction model is called to perform text feature extraction on the target sentence to obtain a first feature vector related to the drug entities in the target sentence; an existing medicine information text is extracted from a pre-established medicine information base, and an existing medicine relation graph is established according to the content of the existing medicine information text; a pre-established second feature extraction model is called to perform feature extraction on the existing medicine relation graph to obtain a second feature vector related to the existing medicine information; and the first feature vector and the second feature vector are combined to obtain a combined feature vector, and the medicine promotion relation contained in the document to be extracted is extracted based on the combined feature vector. The medicine relation extraction method provided by the embodiment of the invention combines semantic analysis of the content of the medical literature with the information in the existing medicine information base, and thereby improves the accuracy with which medicine promotion relations are extracted.

Drawings

FIG. 1 is a schematic diagram of an embodiment of a method for extracting a drug relationship according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of another embodiment of a method for extracting a drug relationship according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of another embodiment of a method for extracting a drug relationship according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of data processing in the method for extracting a drug relationship according to the embodiment of the present invention;

FIG. 5 is a schematic diagram of another embodiment of a method for extracting a drug relationship according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an embodiment of a drug relationship extraction device in an embodiment of the invention;

FIG. 7 is a schematic diagram of another embodiment of a drug relationship extraction device in an embodiment of the invention;

FIG. 8 is a schematic diagram of an embodiment of a medicine relation extraction device in an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a medicine relation extraction method, a device, equipment and a storage medium. A target sentence containing at least two medicine entities is captured from a medical document, and a preset first feature extraction model is used to extract a first feature vector from the target sentence; existing medicine information is extracted from a pre-established medicine information base, and a preset second feature extraction model is used to extract a second feature vector from the existing medicine information; and the medicine promotion relation contained in the document to be extracted is extracted based on the combined feature vector obtained by combining the first feature vector and the second feature vector. The medicine relation extraction method provided by the embodiment of the invention combines semantic analysis of the content of the medical literature with the information in the existing medicine information base, and thereby improves the accuracy with which medicine promotion relations are extracted.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to FIG. 1, an embodiment of a method for extracting a drug relationship according to an embodiment of the present invention includes:

101. Extracting a target statement in a document to be extracted;

It is to be understood that the execution subject of the present invention may be a drug relation extraction device, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as the execution subject.

In this embodiment, a document to be extracted, on which drug promotion relationship extraction is to be performed, is first acquired; the document to be extracted may be a paper from a medical or pharmaceutical paper collection or from a medical journal. The text content of the acquired document is extracted, and named entity recognition is applied to the extracted text content to obtain the drug-related named entities. Because this embodiment specifically extracts promotion relationships between drugs, and a document describing a drug promotion relationship does so in sentences that contain at least two drug entities, such sentences are used as the target sentences in this step. Therefore, after the drug-related named entities are obtained, the sentences containing at least two drug named entities are screened from the text content of the document to be extracted to obtain the target sentences, and these target sentences are temporarily stored.

For example, a document to be extracted includes the following statement: "… … a drug trial of combined drug administration for gastric cancer was carried out, and when captopril and metformin were applied to … … patients among the 174 patients". The named entity recognition function recognizes the drug named entities captopril and metformin in the example sentence, and the example sentence is temporarily stored as a target sentence.
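As an illustrative sketch only, the screening of target sentences described above might look as follows in Python. The lexicon `DRUG_LEXICON` and all function names are hypothetical, and a simple dictionary lookup stands in for the trained named entity recognition function, which the embodiment does not specify at this level of detail.

```python
import re

# Hypothetical drug lexicon standing in for a trained entity recognition model.
DRUG_LEXICON = {"captopril", "metformin", "aspirin"}

def split_sentences(text):
    """Split text into sentences on ASCII and full-width end punctuation."""
    return [part.strip() for part in re.split(r"[.!?。！？]+", text) if part.strip()]

def find_drug_entities(sentence):
    """Return the distinct lexicon drugs mentioned in a sentence, in order of appearance."""
    found = []
    for token in re.findall(r"[A-Za-z]+", sentence.lower()):
        if token in DRUG_LEXICON and token not in found:
            found.append(token)
    return found

def extract_target_sentences(text):
    """Keep only the sentences that mention at least two distinct drug entities."""
    targets = []
    for sentence in split_sentences(text):
        drugs = find_drug_entities(sentence)
        if len(drugs) >= 2:
            targets.append((sentence, drugs))
    return targets
```

A sentence that mentions only one drug is discarded, since it cannot describe a relationship between two drugs.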

102. Inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to the drug entity in the target sentence;

In this step, a first feature extraction model is established based on a piecewise convolutional neural network (PCNN), and the first feature extraction model performs entity relationship extraction on the target sentence.

Specifically, during feature extraction, the target sentence obtained in the preceding step is taken as input and processed using natural language processing (NLP) technology. In this embodiment, the vector embedding (vector representation) function of natural language processing is used to embed the words contained in the target sentence; that is, each word is mapped to a vector according to its meaning, yielding a word vector for each word. Meanwhile, the position of each word in the target sentence is encoded to obtain the position vector of the word. The word vector and the position vector of each word are combined to obtain an input vector.

After the input vectors are obtained, the convolutional layer performs feature extraction on them, expressing a plurality of features for each input vector to obtain a feature matrix. Piecewise pooling is then performed on the feature matrix, with the target sentence divided into segments according to the positions of the drug entities, and the maximum feature of each segment is retained to obtain the first feature vector.
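The convolution and piecewise pooling described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the trained model of the embodiment: the position features are taken to be relative distances to the two drug entities, the window size is 3, the filter weights are supplied by the caller, and all function names are hypothetical.

```python
def position_features(n_tokens, e1, e2):
    """Relative distance of every token to the two drug entity positions e1 < e2."""
    return [[float(i - e1), float(i - e2)] for i in range(n_tokens)]

def conv1d(inputs, filters, window=3):
    """Slide each filter over the zero-padded token sequence, one output per token."""
    dim = len(inputs[0])
    pad = window // 2
    padded = [[0.0] * dim] * pad + inputs + [[0.0] * dim] * pad
    feature_rows = []
    for weights in filters:  # each filter holds window * dim weights
        row = []
        for i in range(len(inputs)):
            flat = [x for vec in padded[i:i + window] for x in vec]
            row.append(sum(w * x for w, x in zip(weights, flat)))
        feature_rows.append(row)
    return feature_rows

def piecewise_max_pool(feature_rows, e1, e2):
    """Max-pool each filter's outputs over the three segments cut by the entities."""
    pooled = []
    for row in feature_rows:
        segments = [row[:e1 + 1], row[e1 + 1:e2 + 1], row[e2 + 1:]]
        pooled.extend(max(seg) if seg else 0.0 for seg in segments)
    return pooled

def first_feature_vector(word_vectors, e1, e2, filters):
    """Combine word and position vectors, convolve, then pool piecewise."""
    pos = position_features(len(word_vectors), e1, e2)
    inputs = [w + p for w, p in zip(word_vectors, pos)]
    return piecewise_max_pool(conv1d(inputs, filters), e1, e2)
```

Each filter contributes three values, one per segment, so a sentence of any length yields a vector of fixed size, three times the number of filters.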

103. Extracting an existing medicine information text in a pre-established medicine information base, and establishing an existing medicine relation graph based on the existing medicine information text;

A pre-established medicine information base is obtained, and publicly available existing medicine information text is obtained from the medicine information base, where the existing medicine information text includes known medicine promotion relations.

In this embodiment, an existing drug database may be used as the drug information base, and an information crawler tool crawls the publicly available text content of the drug database to obtain existing drug information text with promotion relationships. The drug database is a comprehensive, freely accessible online database that contains information about drugs and drug targets.

In this embodiment, each drug that has a promotion relationship is treated as a node, and if a promotion relationship exists between two drugs, an edge is added between their nodes. Integrating the medicine promotion relations in the medicine information base according to this rule finally yields the medicine relation graph.
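The node-and-edge rule above can be illustrated with a short Python sketch. The function name is hypothetical, and the graph is assumed to be undirected and held as an adjacency map, which the embodiment does not prescribe.

```python
from collections import defaultdict

def build_relation_graph(promotion_pairs):
    """Build an undirected adjacency map: one node per drug, one edge per known pair."""
    graph = defaultdict(set)
    for drug_a, drug_b in promotion_pairs:
        if drug_a == drug_b:
            continue  # ignore self-loops
        graph[drug_a].add(drug_b)
        graph[drug_b].add(drug_a)
    return dict(graph)
```

For placeholder pairs `("drug_a", "drug_b")` and `("drug_b", "drug_c")`, the node `drug_b` ends up with two neighbors and the other two nodes with one each.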

104. Inputting the existing medicine relation diagram into a preset second feature extraction model for feature extraction to obtain a second feature vector related to the existing medicine information;

The obtained drug relation graph is input into a second feature extraction model established based on a graph convolutional neural network. The second feature extraction model first identifies each node in the drug relation graph, then extracts a plurality of node sequences from the graph using a random walk algorithm and combines the node sequences into a drug sequence set. The natural language processing layer in the second feature extraction model is then called to represent each node in the drug sequence set as a vector, obtaining the second feature vector.
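The random walk sampling of node sequences can be sketched as follows. This is a hedged illustration: the uniform choice of the next neighbor, the parameters `walks_per_node` and `walk_length`, and the function name are assumptions, and the subsequent vector embedding of the sequences (for example by a skip-gram style model) is not shown.

```python
import random

def sample_random_walks(graph, walks_per_node=2, walk_length=4, seed=0):
    """Sample fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)  # seeded so that the sampling is reproducible
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:
                    break  # isolated node: the walk ends early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Each walk plays the role of a sentence whose words are drug nodes, which is why a natural language processing layer can then embed the nodes.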

105. And combining the first characteristic vector and the second characteristic vector to obtain a combined characteristic vector, and extracting the medicine promotion relation contained in the document to be extracted based on the combined characteristic vector.

After the first feature vector has been obtained from the document to be extracted and the second feature vector from the medicine information base in the preceding steps, the first feature vector and the second feature vector are spliced to obtain a combined vector. For example, suppose that when the first feature extraction model extracts the first feature vector from a target sentence, the sentence contains the two medicines aspirin and captopril. The extracted first feature vector related to aspirin and captopril is then combined with the second feature vector related to aspirin and captopril, which the second feature extraction model extracted from the medicine information base, to obtain the combined vector.

Then, based on the obtained combined vector, a pre-established classifier judges whether the two drug entities contained in the currently acquired target sentence have a promotion relationship, and the drug entity pairs judged to have a promotion relationship are stored to obtain the drug promotion relationship.
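A minimal sketch of the splicing and classification follows, assuming that the classifier is a single linear layer followed by a softmax function. The weights, biases, label names and function names are hypothetical placeholders rather than the trained classifier of the embodiment.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pair(first_vec, second_vec, weights, biases, labels):
    """Splice the two feature vectors and classify the result."""
    combined = first_vec + second_vec  # list concatenation = vector splicing
    scores = [sum(w * x for w, x in zip(row, combined)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs
```

Only pairs whose most probable label is the promotion class would be stored as drug promotion relationships.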

The medicine relation extraction method in the embodiment of the invention performs semantic analysis on the content in the medical literature and integrates the information in the existing medicine information base, thereby improving the extraction accuracy of the medicine promotion relation by the medicine relation extraction technology.

Referring to FIG. 2, another embodiment of the method for extracting a drug relationship according to the embodiment of the present invention includes:

201. Calling a character extraction algorithm to identify and extract characters in the document to be extracted to obtain text data of the document to be extracted;

This embodiment is described with a server as the execution subject. The server calls an information crawler tool to crawl the publicly available website content of medical documents such as papers and journal articles. Specifically, academic literature websites such as CNKI (China National Knowledge Infrastructure) are stored in the server; information on the published medical documents is crawled from the stored academic literature websites, the URLs of the web pages carrying such information are stored, the medical document content contained in those web pages is downloaded to obtain the documents to be extracted, and the character extraction algorithm is used to extract the text data from the documents to be extracted.

Alternatively, the text data may be acquired by directly inputting the text content of an existing document to be extracted into the server of this embodiment in the form of an electronic document; the server then obtains the text data of the document to be extracted by recognizing the electronic document.

202. Inputting the text data into a convolutional neural network layer to encode words in the text data to obtain word encoding information;

After the text data is extracted, this step first performs vector embedding (vector representation), also called word embedding, on the acquired text data; specifically, words are represented in vector form by a neural network. The characters in the text data are segmented into words, the segmented text data is input into the convolutional neural network layer, and the trained convolutional neural network layer identifies the information in the text data. The convolutional neural network (CNN) in this step is a feed-forward neural network that includes convolution computation and has a deep structure, and is one of the representative algorithms of deep learning.

Specifically, in this step a convolutional neural network is trained in advance, and the trained convolutional neural network is called to perform feature extraction on the words in the acquired text data, preliminarily obtaining the content information represented by each word. This content information is then encoded as the representation of the word; that is, the row of a vector table corresponding to the index (or position) of the word to be converted is looked up, and the numbers in that row form the vector that represents the word. The word encoding information of the text data is thereby obtained.
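The table lookup described above, in which a word's index selects a row of a vector table, can be sketched as follows. The vocabulary construction, the `<unk>` row for unknown words, and the function names are assumptions introduced for illustration; the values in the table would come from training.

```python
def build_vocab(tokens):
    """Assign each distinct token an index; index 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode_words(words, vocab, embedding_table):
    """Replace every word by the row of the vector table selected by its index."""
    return [embedding_table[vocab.get(w, 0)] for w in words]
```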

203. Inputting word coding information into a bidirectional long and short term memory network layer, and identifying the part of speech of each word in text data according to the context information of each word in the text data to obtain the part of speech tag probability of each word;

In this embodiment, the word encoding information is input into the bidirectional long short-term memory network layer of the entity extraction model, and the context information of each word is identified based on the word encoding information. In this step, the bidirectional long short-term memory (Bi-directional Long Short-Term Memory, BiLSTM) layer is formed by combining a forward long short-term memory (LSTM) network and a backward LSTM network. The word encoding information is input into the bidirectional long short-term memory network to model the context of each word and establish associations between words; the part of speech of each word in the text data is identified according to this contextual association information, and the part-of-speech tag probability of each word is obtained.

204. Inputting the part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain the tag optimization probability of each word;

205. judging a final label of each word according to the label optimization probability, and screening to obtain drug entity words in the text data according to the final labels;

The part-of-speech tag probabilities output by the bidirectional long short-term memory network in the previous step are taken as the input of a Conditional Random Field (CRF) layer. The CRF layer collects the part-of-speech tag probabilities of each word in the text data and optimizes them according to information such as the order in which the words are connected in the sentence, yielding the optimized tag probability of each word, that is, the probability of the part of speech to which each word most likely belongs. The CRF used in this embodiment is a sequence labeling algorithm; a CRF layer capable of labeling drug entity words is obtained in advance by training on a labeled word training set.

The final label of each word is then determined from the optimized tag probabilities, the drug entity words in the text data are screened out according to the final labels, and the drug entity words are labeled in the text.
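One common way to combine per-word tag scores with label-order information in a CRF layer is Viterbi decoding. The following is a minimal sketch under that assumption; the label set `LABELS`, the score values, and the function names are hypothetical and are not specified by the embodiment.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]   : score of label y for word t (e.g. a log-probability)
    transitions[p][y] : score of label p being followed by label y
    """
    n = len(labels)
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + transitions[p][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[t][y])
        score = new_score
        back.append(pointers)
    y = max(range(n), key=lambda k: score[k])
    path = [y]
    for pointers in reversed(back):  # follow back-pointers from the last word
        y = pointers[y]
        path.append(y)
    path.reverse()
    return [labels[i] for i in path]


def drug_entities(words, tags):
    """Collect the words whose final label marks them as part of a drug entity."""
    entities, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-DRUG":
            if current:
                entities.append(" ".join(current))
            current = [word]
        elif tag == "I-DRUG" and current:
            current.append(word)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Hypothetical labels and scores for a three-word fragment.
LABELS = ["O", "B-DRUG", "I-DRUG"]
words = ["captopril", "and", "metformin"]
emissions = [[-2.0, -0.2, -3.0], [-0.1, -2.5, -1.0], [-1.5, -0.4, -1.2]]
transitions = [
    [0.0, 0.0, -1e4],  # from O: an entity cannot continue without a beginning
    [0.0, -1.0, 0.0],  # from B-DRUG
    [0.0, -1.0, 0.0],  # from I-DRUG
]
tags = viterbi(emissions, transitions, LABELS)
found = drug_entities(words, tags)
```

The large negative transition score stands in for a constraint that a trained CRF layer would learn from the labeled training set.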

206. Finding out sentences containing at least two drug entity words and storing the sentences to obtain target sentences in the literature to be extracted;

Since a sentence that expresses a drug relationship in the medical literature necessarily contains at least two drug entity words, in this step the sentences in the text data obtained in the previous step are screened according to the obtained drug entity words, and the sentences containing at least two drug entity words are found and stored as the target sentences.
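The screening in this step can be sketched as follows. The sentence splitter, the tokenizer, and the function name `find_target_sentences` are simplifying assumptions; the sketch also counts distinct drug words, which is one reading of "at least two drug entity words".

```python
import re


def find_target_sentences(text, drug_words):
    """Keep the sentences that mention at least two distinct drug entity words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    targets = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9\-]+", sentence.lower()))
        mentioned = {d for d in drug_words if d.lower() in tokens}
        if len(mentioned) >= 2:
            targets.append(sentence)
    return targets


# Illustrative text; the drug words would come from the entity extraction step.
text = (
    "Captopril was given first. "
    "Captopril and metformin were used for 174 patients. "
    "Metformin was given later."
)
targets = find_target_sentences(text, ["captopril", "metformin"])
```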

207. Inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to the drug entity in the target sentence;

208. extracting an existing medicine information text in a pre-established medicine information base, and establishing an existing medicine relation graph based on the existing medicine information text;

209. and inputting the existing medicine relation graph into a preset second feature extraction model for feature extraction to obtain a second feature vector related to the existing medicine information.

In this embodiment, the specific contents of steps 207, 208, and 209 are substantially the same as those of steps 102, 103, and 104 in the foregoing embodiment, and therefore the detailed description thereof is omitted here.

According to the medicine relation extraction method in the embodiment of the invention, the deep learning technology is utilized to carry out semantic analysis on the content in the medical literature, the target sentences containing the medicine entity relation in the medical literature are identified, the target sentences are analyzed, and the information in the existing medicine information base is integrated, so that the medicine relation in the medical literature is extracted, and the extraction accuracy of the medicine relation extraction technology on the medicine promotion relation is improved.

Referring to fig. 3 and fig. 4, another embodiment of the method for extracting a drug relationship according to the embodiment of the present invention includes:

301. extracting a target statement in a document to be extracted;

the specific content in this step is substantially the same as that in step 101 in the previous embodiment, and therefore, the detailed description thereof is omitted.

302. Inputting a target statement into a vector embedding layer, and labeling words in the target statement by using vectors to obtain word labeling vectors;

303. inputting the word label vector into the convolutional layer for feature extraction to obtain a feature vector matrix corresponding to the word label vector;

After the target sentence is obtained, it is input into a first feature extraction model established based on a Piecewise Convolutional Neural Network (PCNN) to perform entity relationship extraction. Specifically, the first feature extraction model in this embodiment includes a vector embedding layer, a convolutional layer, and a pooling layer.

In this embodiment, the vector embedding layer is first called to perform vector embedding on the words in the target sentence, obtaining a vector representation of the target sentence. Vector embedding, also called word embedding, represents words in vector form through a neural network; in this embodiment, the word2vec technique is used to embed the words in the target sentence and obtain their vector representations. The position vectors of the drug entity words in each target sentence are then represented and, as shown in fig. 4, the vector representations and the position vectors of the words are combined to obtain the word labeling vectors.
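As an illustration of combining a word's vector representation with position information, the sketch below appends to each word vector its relative distance to the two drug entity words. Using raw distances instead of learned position embeddings, and combining by concatenation, are assumptions of this sketch; `word_label_vectors` and the toy vectors are hypothetical.

```python
def word_label_vectors(word_vectors, entity1_idx, entity2_idx):
    """Append each word's distance to the two drug entities to its word vector."""
    labelled = []
    for i, vec in enumerate(word_vectors):
        position = [float(i - entity1_idx), float(i - entity2_idx)]
        labelled.append(list(vec) + position)
    return labelled


# Toy 2-dimensional word vectors for "captopril and metformin".
word_vectors = [[0.2, 0.7], [0.0, 0.1], [0.5, -0.2]]
labelled = word_label_vectors(word_vectors, entity1_idx=0, entity2_idx=2)
```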

In addition, after the word position vectors in the target sentence have been labeled, the sentence is segmented at the positions of the drug entities. For example, the sentence "… … we have conducted a drug test for a combination of gastric cancer, captopril and metformin are used for 174 patients … …" is segmented at the drug entity positions into "… … we have conducted a drug test for a combination of gastric cancer, captopril … …", "… … captopril and metformin … …", and "… … metformin is used for 174 patients … …". The word information vectors are segmented correspondingly.
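The segmentation in the example above, where each drug entity word appears in both of the segments it borders, can be sketched as follows. The function name `split_at_entities` and the shortened token list are illustrative assumptions.

```python
def split_at_entities(tokens, entity1_idx, entity2_idx):
    """Split a token list into three segments at the two drug entity positions.

    Each entity token is kept in both neighbouring segments, as in the example.
    """
    first, second = sorted((entity1_idx, entity2_idx))
    return (
        tokens[: first + 1],
        tokens[first : second + 1],
        tokens[second:],
    )


tokens = "we tested captopril and metformin in 174 patients".split()
left, middle, right = split_at_entities(tokens, 2, 4)
```

The same slicing can be applied to the list of word vectors so that text and vectors stay aligned.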

Further, as shown in fig. 4, after the word label vectors are obtained, they are input into a convolution layer (Convolution) for feature extraction. The features obtained by convolving each word label vector are used as one column of a matrix, and all of the obtained features together form the feature vector matrix.
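A minimal sketch of this convolution step follows, assuming a sliding window over consecutive word vectors with zero padding at the sentence boundaries. The window size, the filter values, and the function name `convolve` are hypothetical; a trained model would learn the filters.

```python
def convolve(vectors, filters, window=3):
    """Slide each filter over windows of word vectors.

    Returns matrix[k][i], the response of filter k centred on word i, so that
    column i holds all features of word i.
    """
    dim = len(vectors[0])
    pad = [[0.0] * dim for _ in range(window // 2)]
    padded = pad + [list(v) for v in vectors] + pad
    matrix = []
    for f in filters:
        row = []
        for i in range(len(vectors)):
            flat = [x for vec in padded[i : i + window] for x in vec]
            row.append(sum(w * x for w, x in zip(f, flat)))
        matrix.append(row)
    return matrix


# Toy 1-dimensional word vectors and two hypothetical filters.
vectors = [[1.0], [2.0], [3.0]]
filters = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
feature_matrix = convolve(vectors, filters)
```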

304. Inputting the feature vector matrix into a pooling layer, extracting the maximum features in the feature vector matrix, and obtaining a first feature vector;

After the feature vector matrix is obtained, it is input into a pooling layer to extract the maximum feature in each segment. The pooling layer in this step is a piecewise max pooling layer (Piecewise Max Pooling): during the pooling operation, the maximum value within each segment is returned instead of a single maximum over the whole feature. The vector containing these maximum features is obtained and stored as the first feature vector.
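The difference between piecewise and global max pooling can be seen in a short sketch. It assumes the three segments are delimited by the two drug entity positions, with the entity columns shared between neighbouring segments as in the segmentation example; the function name and feature values are hypothetical.

```python
def piecewise_max_pool(feature_matrix, entity1_idx, entity2_idx):
    """Return the maximum of each of the three segments of every feature row."""
    first, second = sorted((entity1_idx, entity2_idx))
    pooled = []
    for row in feature_matrix:
        segments = (row[: first + 1], row[first : second + 1], row[second:])
        pooled.extend(max(seg) for seg in segments)
    return pooled


# Two hypothetical filter responses over a six-word sentence,
# with the drug entities at word positions 1 and 4.
feature_matrix = [
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.4],
    [0.5, 0.2, 0.7, 0.1, 0.0, 0.6],
]
first_feature_vector = piecewise_max_pool(feature_matrix, 1, 4)
global_max = [max(row) for row in feature_matrix]
```

Global pooling keeps one value per filter, whereas the piecewise variant keeps three, so some information about where a feature occurred relative to the two entities is retained.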

305. Extracting an existing medicine information text in a pre-established medicine information base, and establishing an existing medicine relation graph based on the existing medicine information text;

the specific content in step 305 in this embodiment is substantially the same as that in step 103 in the previous embodiment, and therefore, the detailed description thereof is omitted here.

306. Inputting the existing drug relation graph into a sampling layer to sample the neighbor sequence of each node in the existing drug relation graph to obtain a node sequence set;

307. and inputting the node sequence set into a natural language processing layer for vector embedding to obtain a second feature vector related to each medicine.

The second feature extraction model in this embodiment includes a sampling layer and a natural language processing layer. The obtained drug relationship graph is input into the second feature extraction model, which is established based on a graph convolution neural network. The sampling layer is first called to identify each node in the drug relationship graph, and a random walk algorithm is then used to extract a plurality of node sequences from the graph; these node sequences are combined into a drug sequence set (the node sequence set). The natural language processing layer is then called to represent each node in the drug sequence set as a vector, obtaining the second feature vector.

Specifically, the second feature extraction model described in this step is described by taking Node2vec as an example, where Node2vec is a semi-supervised machine learning algorithm, which can be used to learn relationship features in a network graph, and the idea is to map Node information in the network graph into vectors, so that the vectors representing the nodes can fully represent information of the original network graph.

In this embodiment, when Node2vec is applied to the drug relationship graph obtained in the previous step, an objective function f(u) to be optimized is first established, where f(u) is a mapping function that maps node u to a vector. N(u) is defined as the set of neighboring nodes of node u sampled by sampling strategy S, and the goal of Node2vec is to maximize the probability of observing these neighboring nodes given each node u. Specifically, during the random walk, the neighborhood is explored in both Breadth First Search (BFS) and Depth First Search (DFS) fashion to perform neighborhood sampling, and a node sequence set is obtained. The natural language processing layer is then called to process each node sequence in the node sequence set and represent each node as a vector, giving a vector that represents the information of each drug node. Word2vec may be used for this processing; Word2vec consists of a shallow two-layer neural network and maps each word to a corresponding vector that can represent the relationships between words.
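The neighborhood sampling can be sketched as a second-order biased random walk of the kind used by Node2vec, in which a return parameter `p` and an in-out parameter `q` steer the walk between BFS-like and DFS-like exploration. The graph, node names, parameter values, and function names below are hypothetical; the embedding of the resulting sequences with a word2vec implementation is not shown.

```python
import random


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """Generate one walk; graph maps each node to the set of its neighbours."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(graph[current])
        if not neighbours:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in graph[previous]:
                weights.append(1.0)      # stay near the previous node (BFS-like)
            else:
                weights.append(1.0 / q)  # move further away (DFS-like)
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def sample_node_sequences(graph, walks_per_node=2, length=4, p=1.0, q=1.0, seed=0):
    """Start several walks from every node and collect them as the sequence set."""
    rng = random.Random(seed)
    return [
        biased_walk(graph, node, length, p, q, rng)
        for _ in range(walks_per_node)
        for node in sorted(graph)
    ]


# Placeholder drug nodes; an edge stands for a known promotion relationship.
GRAPH = {
    "drug_a": {"drug_b", "drug_c"},
    "drug_b": {"drug_a"},
    "drug_c": {"drug_a", "drug_d"},
    "drug_d": {"drug_c"},
}
sequences = sample_node_sequences(GRAPH, walks_per_node=2, length=4, seed=0)
```

Each sequence plays the role of a sentence and each node the role of a word when the set is passed to the natural language processing layer.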

308. And combining the first characteristic vector and the second characteristic vector to obtain a combined characteristic vector, and classifying the drug entity relationship based on the combined characteristic vector to obtain a drug promotion relationship.

The specific content in this step is substantially the same as that in step 105 in the previous embodiment, and is not described herein again.

According to the medicine relation extraction method in the embodiment of the invention, when semantic analysis is carried out on the content in the medical literature, the piecewise convolutional neural network is used to extract the first feature vector of the medicine entity relation contained in the literature, the second feature vector is obtained by constructing the relations between medicines from the existing medicine information base, and the medicine promotion relation is extracted by integrating the first feature vector and the second feature vector, which improves the extraction accuracy of the medicine relation extraction technology for the medicine promotion relation.

Referring to fig. 4 and 5, another embodiment of the method for extracting a drug relationship according to the embodiment of the present invention includes:

501. acquiring a drug promotion relation graph with labels and an unoptimized graph convolution extraction model;

502. forming a relation graph training set by the labeled drug promotion relation graphs, and calling the relation graph training set to train an unoptimized graph convolution extraction model to obtain a second feature extraction model;

Medicine information is obtained from the existing medicine information base. Based on this information, medicines are taken as nodes of an undirected graph and edges are added between medicines that have a promotion relation, giving a medicine promotion relation graph based on the existing medicine information. The acquired medicine promotion relation graph is split and labeled to obtain a plurality of medicine promotion relation graphs, which form a relation graph training set. The unoptimized graph convolution extraction model is trained with this training set, and its parameters are adjusted according to the training results and the labels, thereby obtaining the second feature extraction model.
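The construction of the undirected graph can be sketched with an adjacency-set representation. The drug names are placeholders and `build_promotion_graph` is a hypothetical helper; how promotion pairs are read from the information base is not shown.

```python
from collections import defaultdict


def build_promotion_graph(promoting_pairs):
    """Build an undirected graph: drugs are nodes, promotion relations are edges."""
    graph = defaultdict(set)
    for a, b in promoting_pairs:
        if a == b:
            continue  # ignore self-relations
        graph[a].add(b)
        graph[b].add(a)  # undirected, so store the edge in both directions
    return dict(graph)


# Placeholder pairs standing in for records from the drug information base.
pairs = [("drug_a", "drug_b"), ("drug_a", "drug_c"), ("drug_c", "drug_d")]
promotion_graph = build_promotion_graph(pairs)
```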

503. Calling a character extraction algorithm to identify and extract characters in the document to be extracted to obtain text data of the document to be extracted;

504. inputting the text data into a convolutional neural network layer to encode words in the text data to obtain word encoding information;

505. inputting word coding information into a bidirectional long and short term memory network layer, and identifying the part of speech of each word in text data according to the context information of each word in the text data to obtain the part of speech tag probability of each word;

506. inputting the part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain the tag optimization probability of each word;

507. judging a final label of each word according to the label optimization probability, and screening to obtain drug entity words in the text data according to the final labels;

508. finding out sentences containing at least two drug entity words and storing the sentences to obtain target sentences in the literature to be extracted;

the specific contents in steps 503-508 in this embodiment are substantially the same as those in steps 201-206 in the foregoing embodiment, and are not described herein again.

509. Inputting a target statement into a vector embedding layer, and labeling words in the target statement by using vectors to obtain word labeling vectors;

510. inputting the word label vector into the convolutional layer for feature extraction to obtain a feature vector matrix corresponding to the word label vector;

511. inputting the feature vector matrix into a pooling layer, extracting the maximum features in the feature vector matrix, and obtaining a first feature vector;

the specific contents of steps 509, 510, and 511 in this embodiment are substantially the same as those of steps 302, 303, and 304 in the foregoing embodiment, and are not described again here.

512. Extracting an existing medicine information text in a pre-established medicine information base, and establishing an existing medicine relation graph based on the existing medicine information text;

the specific content in step 512 in this embodiment is substantially the same as that in step 103 in the previous embodiment, and therefore, the detailed description thereof is omitted here.

513. Inputting the existing drug relation graph into a sampling layer to sample the neighbor sequence of each node in the existing drug relation graph to obtain a node sequence set;

514. inputting the node sequence set into a natural language processing layer for vector embedding to obtain a second characteristic vector related to each medicine;

the specific contents of steps 513 and 514 in this embodiment are substantially the same as those of steps 306 and 307 in the foregoing embodiment, and are not described herein again.

515. Combining the first feature vector and the second feature vector to obtain a combined feature vector;

516. Calling a softmax function to carry out normalization processing on the combined feature vector to obtain the medicine related information probability;

517. and judging the drug relationship based on the drug-related information probability to obtain a drug relationship classification result, and storing the drug information corresponding to the feature vector with the drug promotion relationship in the classification result to obtain the drug promotion relationship.

The drug promotion relation classifier is established in advance based on the softmax function. Softmax is the generalization of the logistic regression model to multi-class problems, in which the class label can take more than two values. The softmax function is called to normalize the combined feature vector and to calculate the drug-related information probability, that is, the probability that a promotion relationship exists between the two drugs.

The relationship between the drug entities is judged based on the calculated drug-related information probability to obtain a classification result. The drug information corresponding to the feature vectors classified as having a drug promotion relationship is stored, and the type information of the promotion relationship between those drugs is output, giving the extracted drug relationship.
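A minimal sketch of the classification step follows. It assumes that the two feature vectors are combined by concatenation and that a linear layer produces one score per class before softmax normalization; the weights, biases, class labels, and function names are hypothetical and would be learned or defined in a real system.

```python
import math


def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(first_vec, second_vec, weights, biases, labels):
    """Concatenate the two feature vectors, score each class, and pick the best."""
    combined = list(first_vec) + list(second_vec)
    scores = [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical two-class setup with made-up parameters.
labels = ["no_relation", "promotion"]
weights = [[-1.0, 0.5, -0.5], [1.0, -0.5, 0.5]]
biases = [0.0, 0.0]
label, probs = classify([0.9, 0.1], [0.4], weights, biases, labels)
```

With more than two rows in `weights` and more entries in `labels`, the same code covers the multi-class case in which the type of promotion relationship is also predicted.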

According to the medicine relation extraction method in the embodiment of the invention, the deep learning technology is utilized to carry out semantic analysis on the contents in the medical literature, the target sentences containing the medicine entity relation in the medical literature are identified and analyzed; and a second feature extraction model established based on the graph convolution network is used for extracting information in the existing medicine information base, so that the medicine relation in the medical literature is extracted and obtained, and the extraction accuracy of the medicine relation extraction technology on the medicine promotion relation is improved.

The method for extracting a drug relationship in the embodiments of the present invention is described above. A device for extracting a drug relationship in an embodiment of the present invention is described below with reference to fig. 6. An embodiment of the device includes:

the document extraction module 601 is configured to extract a target sentence in a document to be extracted, where the target sentence is a sentence including at least two drug entities;

a first feature extraction module 602, configured to input the target sentence into a preset first feature extraction model for text feature extraction, so as to obtain a first feature vector related to a drug entity in the target sentence;

the relationship graph establishing module 603 is configured to extract an existing drug information text in a pre-established drug information base, and establish an existing drug relationship graph based on the existing drug information text;

a second feature extraction module 604, configured to input the existing drug relationship graph into a preset second feature extraction model for feature extraction, so as to obtain a second feature vector related to the existing drug information;

a promotion relationship obtaining module 605, configured to combine the first feature vector and the second feature vector to obtain a combined feature vector, and extract a medicine promotion relationship included in the document to be extracted based on the combined feature vector.

Referring to fig. 7, another embodiment of the apparatus for extracting a pharmaceutical relationship according to the embodiment of the present invention includes:

Optionally, the document extraction module 601 includes:

a document data capturing unit 6011, configured to invoke a character extraction algorithm to identify and extract characters in a document to be extracted, so as to obtain text data of the document to be extracted;

an entity relationship extraction unit 6012, configured to input the text data into an entity extraction model established in advance based on a deep learning algorithm for recognition, so as to obtain a drug entity word in the text data;

the sentence searching unit 6013 is configured to search and store a sentence including at least two kinds of the drug entity terms, so as to obtain a target sentence in a document to be extracted.

Optionally, the entity relationship extracting unit 6012 includes:

Optionally, the first feature extraction module 602 includes:

the vector embedding unit 6021 is configured to input the target sentence into the vector embedding layer, label words in the target sentence with vectors, and obtain word label vectors;

a convolution extraction unit 6022, configured to input the term labeling vector into a convolution layer for feature extraction, so as to obtain a feature vector matrix corresponding to the term labeling vector;

a pooling unit 6023, configured to input the feature vector matrix into a pooling layer to extract a maximum feature in the feature vector matrix, so as to obtain a first feature vector.

Optionally, the second feature extraction module 604 includes:

a sampling unit 6041, configured to input the existing drug relationship graph into a sampling layer, and sample a neighbor sequence of each node in the existing drug relationship graph to obtain a node sequence set;

a vector embedding unit 6042, configured to input the node sequence set into the natural language processing layer for vector embedding, so as to obtain a second feature vector associated with each drug.

Optionally, the drug relationship extraction device further includes a second feature extraction model construction module, where the second feature extraction model construction module is specifically configured to:

Optionally, the promotion relationship obtaining module 605 includes:

Fig. 6 and 7 above describe the medicine relation extracting device in the embodiment of the present invention in detail from the perspective of modular functional entities; the medicine relation extracting device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 8 is a schematic structural diagram of a drug relationship extraction device according to an embodiment of the present invention. The drug relationship extraction device 800 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 810 (e.g., one or more processors), a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 833 or data 832. The memory 820 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the medication relation extraction device 800. Further, the processor 810 may be configured to communicate with the storage medium 830 and to execute, on the medication relation extraction device 800, the series of instruction operations stored in the storage medium 830.

The medication relationship extraction apparatus 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input-output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration of the medication relationship extraction device illustrated in fig. 8 does not constitute a limitation of the medication relationship extraction device and may include more or fewer components than illustrated, or some components in combination, or a different arrangement of components.

The present invention also provides a medication relation extraction device, which includes a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the medication relation extraction method in the above embodiments.

The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, which may also be a volatile computer readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the drug relationship extraction method.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A drug relationship extraction method is characterized by comprising the following steps:

2. The drug relationship extraction method according to claim 1, wherein the extracting of the target sentence in the document to be extracted includes:

3. The method for extracting drug relationship according to claim 2, wherein the entity extraction model comprises a convolutional neural network layer, a bidirectional long-short term memory network layer and a conditional random field layer, and the step of inputting the text data into an entity extraction model established in advance based on a deep learning algorithm for recognition to obtain the drug entity words in the text data comprises:

4. The method of claim 1, wherein the first feature extraction model comprises a vector embedding layer, a convolution layer and a pooling layer, and the inputting the target sentence into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to a drug entity in the target sentence comprises:

5. The method of claim 1, wherein the second feature extraction model comprises a sampling layer and a natural language processing layer, and the step of inputting the existing drug relationship diagram into a preset second feature extraction model for feature extraction to obtain a second feature vector related to existing drug information comprises:

6. The method for extracting pharmaceutical relationship according to claim 5, further comprising, before the extracting the target sentence in the document to be extracted:

7. The method for extracting drug relationship according to any one of claims 1 to 6, wherein the extracting of the drug promoting relationship contained in the document to be extracted based on the combined feature vector includes:

8. A medication relation extraction device, characterized in that it comprises:

9. A medication relation extraction device, characterized by comprising: a memory and at least one processor, the memory having instructions stored therein;

the at least one processor invokes the instructions in the memory to cause the one medication relationship extraction device to perform the steps of the medication relationship extraction method of any one of claims 1-7.

10. A computer readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of a method of drug relationship extraction as claimed in any one of claims 1-7.