CN112906395B

CN112906395B - Drug relation extraction method, device, equipment and storage medium

Info

Publication number: CN112906395B
Application number: CN202110322905.0A
Authority: CN
Inventors: 付桂振; 顾大中; 徐任翔
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2023-08-15
Anticipated expiration: 2041-03-26
Also published as: CN112906395A

Abstract

The invention relates to the field of data processing and discloses a drug relationship extraction method, device, equipment and storage medium, which address the insufficient accuracy of drug relationships extracted in the prior art. The method comprises: extracting target sentences containing at least two drug entities from a document to be extracted; inputting the target sentences into a preset first feature extraction model for text feature extraction to obtain a first feature vector related to the drug entities; extracting existing drug information text from a pre-established drug information base, building an existing drug relationship graph, and inputting the graph into a preset second feature extraction model for feature extraction to obtain a second feature vector; and extracting the drug promotion relationships contained in the document to be extracted based on a combined feature vector obtained by combining the first feature vector and the second feature vector. The invention also relates to blockchain technology: information related to the drug relationship extraction task can be stored in a blockchain.

Description

Drug relation extraction method, device, equipment and storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a drug relationship.

Background

In the medical field, different drugs often cannot simply be used interchangeably, and simply combining certain drugs can have serious consequences; for example, the combination of aspirin with abbe's tilobate may increase the risk of hypertension. Other drugs promote each other, so that using two different drugs together may achieve a better therapeutic effect.

At present, medical staff who use drugs in combination mainly decide on their use by consulting historical usage information and historical drug promotion relationships. However, manually consulting medical literature is time-consuming and labor-intensive, and the drug relationships obtained this way are not accurate enough, which greatly slows the pace at which research finds clinical application.

Disclosure of Invention

The main purpose of the invention is to solve the technical problem that the drug relationships extracted by the prior art are insufficiently accurate.

The first aspect of the present invention provides a method for extracting a drug relationship, comprising:

extracting target sentences in a document to be extracted, wherein the target sentences are sentences containing at least two pharmaceutical entities;

inputting the target sentence into a preset first feature extraction model to extract text features, and obtaining a first feature vector related to a pharmaceutical entity in the target sentence;

extracting existing drug information text from a pre-established drug information base, and building an existing drug relationship graph based on the existing drug information text;

inputting the existing medicine relation diagram into a preset second feature extraction model for feature extraction to obtain a second feature vector related to the existing medicine information;

and combining the first characteristic vector and the second characteristic vector to obtain a combined characteristic vector, and extracting the drug promotion relationship contained in the document to be extracted based on the combined characteristic vector.

Optionally, in a first implementation manner of the first aspect of the present invention, the extracting a target sentence in a document to be extracted includes:

invoking a text extraction algorithm to identify and extract the text in the document to be extracted to obtain text data of the document to be extracted;

inputting the text data into an entity extraction model established based on a deep learning algorithm in advance for recognition to obtain pharmaceutical entity words in the text data;

searching out sentences containing at least two types of the pharmaceutical entity words and storing the sentences to obtain target sentences in the to-be-extracted documents.

Optionally, in a second implementation manner of the first aspect of the present invention, the entity extraction model includes a convolutional neural network layer, a bidirectional long short-term memory network layer, and a conditional random field layer, and the inputting the text data into the entity extraction model established in advance based on a deep learning algorithm for recognition to obtain the pharmaceutical entity words in the text data includes:

inputting the text data into the convolutional neural network layer to encode the words in the text data, so as to obtain word encoding information;

inputting the word encoding information into the bidirectional long short-term memory network layer, and identifying the part of speech of each word in the text data according to the context information of each word to obtain the part-of-speech tag probability of each word;

inputting the part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain tag optimization probability of each word;

and judging the final label of each word according to the label optimization probability, and screening according to the final label to obtain the pharmaceutical entity words in the text data.

Optionally, in a third implementation manner of the first aspect of the present invention, the first feature extraction model includes a vector embedding layer, a convolution layer and a pooling layer, and the inputting the target sentence into a preset first feature extraction model to perform text feature extraction, so as to obtain a first feature vector related to a pharmaceutical entity in the target sentence includes:

inputting the target sentence into the vector embedding layer and labeling the words in the target sentence with vectors to obtain word labeling vectors;

inputting the word labeling vectors into the convolution layer for feature extraction to obtain a feature vector matrix corresponding to the word labeling vectors;

and inputting the feature vector matrix into a pooling layer to extract the maximum feature in the feature vector matrix, so as to obtain a first feature vector.

Optionally, in a fourth implementation manner of the first aspect of the present invention, the second feature extraction model includes a sampling layer and a natural language processing layer, and inputting the existing drug relation graph into a preset second feature extraction model to perform feature extraction, and obtaining a second feature vector related to the existing drug information includes:

inputting the existing medicine relation graph into a sampling layer to sample the neighbor sequence of each node in the existing medicine relation graph to obtain a node sequence set;

and inputting the node sequence set into a natural language processing layer to perform vector embedding, so as to obtain a second characteristic vector related to each drug.

Optionally, in a fifth implementation manner of the first aspect of the present invention, before the extracting a target sentence in a document to be extracted, the method further includes:

obtaining a labeled drug promotion relationship graph and an unoptimized graph convolution extraction model;

forming a relationship graph training set from the labeled drug promotion relationship graph, and using the relationship graph training set to train the unoptimized graph convolution extraction model to obtain the second feature extraction model.

Optionally, in a sixth implementation manner of the first aspect of the present invention, the extracting, based on the combined feature vector, a drug promotion relationship included in the document to be extracted includes:

calling a softmax function to normalize the combined feature vector to obtain the drug related information probability;

determining the drug relationship based on the drug-related information probability to obtain a drug relationship classification result, and storing the drug information corresponding to the feature vectors whose classification result is a drug promotion relationship, so as to obtain the drug promotion relationship.

A second aspect of the present invention provides a drug relationship extraction device comprising:

the document extraction module is used for extracting target sentences in the document to be extracted, wherein the target sentences are sentences containing at least two pharmaceutical entities;

the first feature extraction module is used for inputting the target sentence into a preset first feature extraction model to extract text features, so as to obtain a first feature vector related to a pharmaceutical entity in the target sentence;

The relation diagram establishing module is used for extracting the existing medicine information text in the pre-established medicine information base and establishing an existing medicine relation diagram based on the existing medicine information text;

the second feature extraction module is used for inputting the existing medicine relation diagram into a preset second feature extraction model to perform feature extraction to obtain a second feature vector related to the existing medicine information;

and the promotion relation acquisition module is used for combining the first characteristic vector and the second characteristic vector to obtain a combined characteristic vector, and extracting the medicine promotion relation contained in the document to be extracted based on the combined characteristic vector.

Optionally, in a first implementation manner of the second aspect of the present invention, the document extraction module includes:

the document data grabbing unit is used for calling a text extraction algorithm to identify and extract the text in the document to be extracted to obtain text data of the document to be extracted;

the entity relation extraction unit is used for inputting the text data into an entity extraction model established in advance based on a deep learning algorithm for recognition, so as to obtain a pharmaceutical entity word in the text data;

and the sentence searching unit is used for searching sentences containing at least two medicinal entity words and storing the sentences to obtain target sentences in the to-be-extracted document.

Optionally, in a second implementation manner of the second aspect of the present invention, the entity relationship extraction unit includes:

the convolutional neural network subunit is used for inputting the text data into a convolutional neural network layer to encode words in the text data so as to obtain word encoding information;

the bidirectional long short-term memory network subunit is used for inputting the word encoding information into the bidirectional long short-term memory network layer, and identifying the part of speech of each word in the text data according to the context information of each word to obtain the part-of-speech tag probability of each word;

the conditional random field subunit is used for inputting the part-of-speech tag probability of each word into the conditional random field layer for optimization to obtain the tag optimization probability of each word;

and the label screening subunit is used for judging the final label of each word according to the label optimization probability, and screening and obtaining the pharmaceutical entity words in the text data according to the final label.

Optionally, in a third implementation manner of the second aspect of the present invention, the first feature extraction module includes:

the vector embedding unit is used for inputting the target sentence into a vector embedding layer and labeling words in the target sentence by adopting vectors to obtain word labeling vectors;

The convolution extraction unit is used for inputting the word annotation vector into a convolution layer to perform feature extraction so as to obtain a feature vector matrix corresponding to the word annotation vector;

and the pooling unit is used for inputting the feature vector matrix into a pooling layer to extract the maximum feature in the feature vector matrix so as to obtain a first feature vector.

Optionally, in a fourth implementation manner of the second aspect of the present invention, the second feature extraction module includes:

the sampling unit is used for inputting the existing medicine relation graph into a sampling layer to sample the neighbor sequence of each node in the existing medicine relation graph to obtain a node sequence set;

and the vector embedding unit is used for inputting the node sequence set into a natural language processing layer to perform vector embedding so as to obtain a second characteristic vector related to each medicament.

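Optionally, in a fifth implementation manner of the second aspect of the present invention, the drug relationship extraction device further includes a second feature extraction model building module, where the second feature extraction model building module is specifically configured to:

obtain a labeled drug promotion relationship graph and an unoptimized graph convolution extraction model;

form a relationship graph training set from the labeled drug promotion relationship graph, and use the relationship graph training set to train the unoptimized graph convolution extraction model to obtain the second feature extraction model.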

Optionally, in a sixth implementation manner of the second aspect of the present invention, the promotion relationship acquisition module includes:

the normalization unit is used for calling a softmax function to normalize the combined feature vector so as to obtain the drug related information probability;

and the classification unit is used for judging the drug relation based on the drug related information probability to obtain a drug relation classification result, and storing the drug information corresponding to the characteristic vector with the drug promotion relation as the classification result to obtain the drug promotion relation.

A third aspect of the present invention provides a drug relationship extraction apparatus comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the drug relationship extraction device to perform the steps of the drug relationship extraction method described above.

A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the above-described drug relationship extraction method.

In the technical solution provided by the invention, target sentences containing at least two drug entities are extracted from the document to be extracted; a pre-established first feature extraction model is invoked to perform text feature extraction on the target sentences, obtaining a first feature vector related to the drug entities in the target sentences; existing drug information text is extracted from a pre-established drug information base, and an existing drug relationship graph is built from its content; a pre-established second feature extraction model is invoked to perform feature extraction on the existing drug relationship graph, obtaining a second feature vector related to the existing drug information; and the first feature vector and the second feature vector are combined into a combined feature vector, based on which the drug promotion relationships contained in the document to be extracted are extracted. By combining semantic analysis of the medical literature with the information in an existing drug information base, the drug relationship extraction method provided by the embodiment of the invention improves the accuracy with which drug promotion relationships are extracted.

Drawings

FIG. 1 is a schematic diagram of an embodiment of a drug relationship extraction method according to the present invention;

FIG. 2 is a schematic diagram of another embodiment of a drug relationship extraction method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of another embodiment of a drug relationship extraction method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of data processing in a drug relationship extraction method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of another embodiment of a drug relationship extraction method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an embodiment of a drug relationship extraction device according to the present invention;

FIG. 7 is a schematic diagram of another embodiment of a drug relationship extraction device according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of an embodiment of a drug relationship extraction apparatus according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a drug relationship extraction method, device, equipment and storage medium. Target sentences containing at least two drug entities are captured from medical literature, and a preset first feature extraction model extracts a first feature vector from them; existing drug information is extracted from a pre-established drug information base, and a preset second feature extraction model extracts a second feature vector from it; the drug promotion relationships contained in the document to be extracted are then extracted based on the combined feature vector obtained by combining the first feature vector and the second feature vector. By combining semantic analysis of the medical literature with the information in an existing drug information base, the method improves the accuracy with which drug promotion relationships are extracted.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, the following describes a specific flow of an embodiment of the present invention, referring to fig. 1, and one embodiment of a method for extracting a drug relationship in an embodiment of the present invention includes:

101. extracting target sentences in a document to be extracted;

It can be understood that the execution subject of the present invention may be a drug relationship extraction device, or a terminal or a server, which is not limited herein. The embodiment of the invention is described with a server as the execution subject.

In this embodiment, a document to be extracted is first obtained; it may be an article in a medical journal or a paper discussing medical drugs. The text content of the document is extracted, and a named entity recognition function is used to identify the drug-related named entities in that text. This embodiment specifically targets the promotion relationships between drugs. A document that describes such a relationship contains sentences mentioning at least two drug entities, and these sentences serve as the target sentences of this step. Therefore, once the drug-related named entities have been obtained, the sentences containing at least two drug named entities are screened from the text content of the document to obtain the target sentences, which are temporarily stored.

For example, suppose a document to be extracted contains the following sentence: according to … …, a drug trial of one-time combined drug administration for gastric cancer was carried out, and captopril and metformin were used for 174 patients … …. The named entity recognition function identifies the drug named entities captopril and metformin in this example sentence. Because the sentence contains two drug named entities, it is temporarily stored as a target sentence.
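
The screening in step 101 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the hypothetical `DRUG_LEXICON` set stands in for the named entity recognition function, and the sentence splitter and function names are assumptions introduced only for this example.

```python
import re

# Hypothetical lexicon standing in for the named entity recognition step.
DRUG_LEXICON = {"captopril", "metformin", "aspirin"}

def split_sentences(text):
    """Split text into sentences after ., !, ? (or full-width forms) plus whitespace."""
    parts = re.split(r"(?<=[.!?。！？])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def drug_entities(sentence, lexicon=DRUG_LEXICON):
    """Return the distinct lexicon drugs mentioned in a sentence, sorted by name."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    return sorted(set(tokens) & lexicon)

def extract_target_sentences(text, lexicon=DRUG_LEXICON):
    """Keep only the sentences that mention at least two distinct drugs."""
    targets = []
    for sentence in split_sentences(text):
        drugs = drug_entities(sentence, lexicon)
        if len(drugs) >= 2:
            targets.append((sentence, drugs))
    return targets

example = "Captopril and metformin were given to 174 patients. Aspirin was not used."
targets = extract_target_sentences(example)
```

Only the first sentence of `example` is kept, because the second mentions a single drug. A real system would replace the lexicon lookup with the trained entity extraction model.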

102. Inputting the target sentence into a preset first feature extraction model to extract text features, and obtaining a first feature vector related to a pharmaceutical entity in the target sentence;

In this step, the first feature extraction model is built on a piecewise convolutional neural network (Piecewise Convolutional Neural Networks, PCNN), and it performs entity relationship extraction on the target sentence.

Specifically, the target sentence obtained in the preceding step is taken as input and is first processed with natural language processing (Natural Language Processing, NLP) techniques. In this embodiment, a vector embedding (Vector Representation) function is used to embed the words of the target sentence, that is, each word is labeled with a vector based on the meaning it expresses, which yields a word vector related to the meaning of each word. At the same time, the position of each word in the target sentence is encoded to obtain a position vector for the word. The word vector and the position vector of each word are then combined to obtain an input vector.
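
One way to picture the input vector is the sketch below. It assumes, as is common for PCNN-style models but not stated in the text, that position is encoded as the clipped signed distance from each word to each of the two drug entities; the tables `word_table` and `pos_table`, the `<unk>` entry and all function names are hypothetical.

```python
def relative_positions(num_tokens, entity_index):
    """Signed distance of every token to one entity token."""
    return [i - entity_index for i in range(num_tokens)]

def build_input_vectors(tokens, word_table, pos_table, e1_index, e2_index, max_dist=3):
    """Concatenate a word vector with two position vectors for each token."""
    def pos_vec(distance):
        # Clip the distance so it always indexes into pos_table.
        clipped = max(-max_dist, min(max_dist, distance))
        return pos_table[clipped + max_dist]

    unk = word_table["<unk>"]
    d1 = relative_positions(len(tokens), e1_index)
    d2 = relative_positions(len(tokens), e2_index)
    rows = []
    for tok, a, b in zip(tokens, d1, d2):
        rows.append(word_table.get(tok, unk) + pos_vec(a) + pos_vec(b))
    return rows

tokens = ["captopril", "and", "metformin"]
# Toy tables; real tables would be learned during training.
word_table = {"<unk>": [0.0, 0.0], "captopril": [0.1, 0.2], "metformin": [0.3, 0.4]}
pos_table = [[float(i)] for i in range(7)]  # 2 * max_dist + 1 one-dimensional vectors
input_vectors = build_input_vectors(tokens, word_table, pos_table, 0, 2)
```

Each row of `input_vectors` has the word dimensions first, followed by one position vector per entity.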

After the input vectors are obtained, the convolution layer performs feature extraction on them, producing several feature representations for each input vector and thereby a feature matrix. The feature matrix is then pooled piecewise, with the target sentence divided into segments according to the positions of the pharmaceutical entities, and the maximum feature of each segment is kept to obtain the first feature vector.
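
The piecewise pooling described above can be sketched as follows, assuming the usual PCNN convention of three segments split at the two entity positions. The function name and the handling of empty segments are assumptions of this sketch.

```python
def piecewise_max_pool(feature_matrix, e1_index, e2_index):
    """Max-pool each filter over three segments split at the two entity positions.

    feature_matrix[t][f] is the response of filter f at token position t.
    """
    first, second = sorted((e1_index, e2_index))
    segments = [
        feature_matrix[: first + 1],             # up to and including the first entity
        feature_matrix[first + 1 : second + 1],  # up to and including the second entity
        feature_matrix[second + 1 :],            # everything after the second entity
    ]
    num_filters = len(feature_matrix[0])
    pooled = []
    for f in range(num_filters):
        for seg in segments:
            # An empty segment contributes 0.0 (an assumption of this sketch).
            pooled.append(max((row[f] for row in seg), default=0.0))
    return pooled

features = [[1.0, -1.0], [0.5, 2.0], [3.0, 0.0], [0.2, 0.1], [-1.0, 4.0]]
first_feature_vector = piecewise_max_pool(features, 1, 3)
```

With two filters and three segments the result has six entries, so the pooled vector keeps some information about where in the sentence each feature fired.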

103. Extracting an existing drug information text in a pre-established drug information library, and establishing an existing drug relation diagram based on the existing drug information text;

A pre-established drug information base is acquired, and publicly available existing drug information text is obtained from it, where the existing drug information text includes known drug promotion relationships.

In this embodiment, the existing DrugBank database may be used as the drug information base, and a crawler tool is used to crawl the publicly available content of the DrugBank database to obtain existing drug information text that records promotion relationships. DrugBank is a comprehensive, freely accessible online database containing information on drugs and drug targets.

In this embodiment, each drug having a promoting relationship is used as a node, and if there is a promoting relationship between two drugs, an edge is added between the nodes of the two drugs. And integrating the drug promotion relations in the drug information base based on the logic, and finally obtaining a drug relation diagram.
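
The graph construction just described can be sketched as an adjacency map. The drug names below are placeholders rather than real promotion relationships, and `build_drug_graph` is a hypothetical helper name.

```python
from collections import defaultdict

def build_drug_graph(promotion_pairs):
    """Build an undirected adjacency map from (drug_a, drug_b) promotion pairs."""
    graph = defaultdict(set)
    for a, b in promotion_pairs:
        if a == b:
            continue  # a drug does not get an edge to itself
        graph[a].add(b)
        graph[b].add(a)
    # Sorted neighbor lists make later sampling reproducible.
    return {node: sorted(neighbors) for node, neighbors in graph.items()}

# Placeholder pairs; a real run would take them from the drug information base.
pairs = [("drug_a", "drug_b"), ("drug_b", "drug_c"), ("drug_b", "drug_a")]
drug_graph = build_drug_graph(pairs)
```

Duplicate pairs collapse into a single edge because neighbors are stored in sets before being sorted.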

104. Inputting the existing medicine relation diagram into a preset second feature extraction model to perform feature extraction to obtain a second feature vector related to the existing medicine information;

The obtained drug relationship graph is input into a second feature extraction model built on a graph convolutional neural network. The second feature extraction model first identifies each node in the drug relationship graph, then uses a random walk algorithm to extract multiple node sequences from the graph, and assembles the node sequences into a drug sequence set. The natural language processing layer of the second feature extraction model is then called to represent each node in the drug sequence set as a vector, yielding the second feature vector.
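
The random walk sampling can be sketched as below. The walk length, the number of walks per node and the fixed seed are assumptions chosen for illustration; the text does not specify them. The resulting sequences would then be passed to an embedding layer, which is not shown here.

```python
import random

def random_walks(graph, walk_length=4, walks_per_node=2, seed=0):
    """Sample uniform random walks that start from every node of the graph."""
    rng = random.Random(seed)
    sequences = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            sequences.append(walk)
    return sequences

# Placeholder graph with hypothetical drug names.
drug_graph = {
    "drug_a": ["drug_b"],
    "drug_b": ["drug_a", "drug_c"],
    "drug_c": ["drug_b"],
}
node_sequences = random_walks(drug_graph)
```

Treating each walk as a "sentence" of drug names is what lets a natural language processing layer embed the nodes in the same way it would embed words.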

105. And combining the first feature vector and the second feature vector to obtain a combined feature vector, and extracting the drug promotion relationship contained in the document to be extracted based on the combined feature vector.

After the first feature vector has been obtained from the document to be extracted and the second feature vector from the drug information base, the two are concatenated to obtain a combined vector. For example, suppose the first feature extraction model extracts the first feature vector from a target sentence that contains the two drugs aspirin and captopril. The first feature vector related to aspirin and captopril is then combined with the second feature vector for aspirin and captopril that the second feature extraction model extracted from the drug information base, giving the combined vector.

A pre-established classifier then determines, based on the combined vector, whether the two drug entities in the current target sentence have a promotion relationship, and the pairs of drug entities judged to have one are stored to obtain the drug promotion relationships.
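
A minimal sketch of the concatenation and classification in step 105 follows. It assumes a single linear layer followed by softmax; the weights, bias and label names are hypothetical and would be learned in a real classifier.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pair(first_vec, second_vec, weights, bias, labels):
    """Concatenate the two feature vectors, score each label, and normalize."""
    combined = list(first_vec) + list(second_vec)
    scores = [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical parameters; a trained classifier would supply these.
labels = ["promotion", "no_promotion"]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
label, probs = classify_pair([1.0, 0.0], [0.5], weights, bias, labels)
```

Drug pairs whose predicted label is the promotion class would be the ones stored as extracted drug promotion relationships.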

According to the medicine relation extraction method provided by the embodiment of the invention, the content in the medical literature is subjected to semantic analysis and the information in the existing medicine information base is integrated, so that the extraction accuracy of the medicine promotion relation by the medicine relation extraction technology is improved.

Referring to fig. 2, another embodiment of the drug relationship extraction method according to the embodiment of the present invention includes:

201. invoking a text extraction algorithm to identify and extract the text in the document to be extracted to obtain text data of the document to be extracted;

This embodiment is described with a server as the execution subject. In this embodiment, the server calls a crawler tool to crawl websites that publish medical literature such as papers and journal articles. Specifically, the addresses of academic literature websites such as CNKI (China National Knowledge Infrastructure) are first stored on the server. A text extraction algorithm is called to crawl information about published medical literature from the stored websites, the URLs (uniform resource locators) of web pages carrying medical literature information are saved, and the medical literature content at those URLs is downloaded to obtain the document to be extracted. The text extraction algorithm is then used to extract the text data from the document to be extracted.

Alternatively, the text content of an existing document to be extracted may be input directly into the server as an electronic document, and the server obtains the text data of the document to be extracted by recognizing the electronic document.

202. Inputting the text data into a convolutional neural network layer to encode words in the text data, and obtaining word encoding information;

After the text data has been extracted, this step first applies vector embedding to it. Vector embedding (Vector Representation), also called word embedding, represents words as vectors by means of a neural network. The text in the text data is segmented into words, the segmented text is input into the convolutional neural network layer, and the trained convolutional neural network layer identifies the information in the text data. The convolutional neural network (Convolutional Neural Networks, CNN) used in this step is a feedforward neural network (Feedforward Neural Networks) that involves convolution computations and has a deep structure; it is one of the representative algorithms of deep learning.

Specifically, in this step the convolutional neural network is trained in advance. The trained network is called to perform feature extraction on the words in the acquired text data, giving a preliminary view of the content each word expresses, and that content is encoded into a representation of the word. That is, the numbers in a specific row of a vector table are looked up according to the index (or position) of the word to be converted and are combined into a vector that represents the word, which yields the word encoding information of the text data.
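
The table lookup at the end of this step can be sketched as follows. The vocabulary construction, the `<unk>` fallback row and the toy vector values are assumptions made for illustration; a trained model would supply the table.

```python
def build_vocab(tokens):
    """Assign each distinct token a row index; index 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, vector_table):
    """Look up the vector-table row for each token by its index."""
    return [vector_table[vocab.get(tok, 0)] for tok in tokens]

vocab = build_vocab(["captopril", "and", "metformin"])
# Toy table with one row per vocabulary entry.
vector_table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
encoded = encode(["metformin", "aspirin"], vocab, vector_table)
```

A word missing from the vocabulary, such as "aspirin" here, falls back to the row reserved for unknown words.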

203. Inputting word coding information into a two-way long-short-term memory network layer, and identifying the part of speech of each word in text data according to the context information of each word in the text data to obtain part of speech tag probability of each word;

In this embodiment, the word encoding information is input into the bidirectional long short-term memory network layer of the entity extraction model, which identifies the context information of each word from the word encoding information. The Bi-directional Long Short-Term Memory (BiLSTM) layer in this step combines a forward Long Short-Term Memory (LSTM) layer and a backward LSTM layer. The word encoding information is input into the BiLSTM layer, which models the context of each word in both directions and establishes associations among the words. The part of speech of each word in the text data is then identified from this contextual association information, giving the part-of-speech tag probability of each word.

204. Inputting part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain tag optimization probability of each word;

205. Judging the final label of each word according to the label optimization probability, and screening to obtain the pharmaceutical entity words in the text data according to the final label;

The part-of-speech tag probabilities output by the bidirectional long short-term memory network in the preceding step are taken as input to a conditional random field (Conditional Random Field, CRF) layer. The layer collects the part-of-speech tag probability of each word in the text data and optimizes it using information such as the order in which the words are connected in the sentence, producing an optimized tag probability for each word, that is, the probability of the part of speech to which each word most likely belongs. The conditional random field used in this embodiment is a sequence labeling algorithm (Sequence Labeling Algorithm); it is trained in advance on a training set of labeled words to obtain a conditional random field layer capable of labeling drug entity words.

And judging the final label of each word according to the optimized label probability, screening the pharmaceutical entity words in the text data according to the final label, and labeling the pharmaceutical entity words in the text.
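Steps 204 and 205 can be illustrated with a minimal sketch of decoding in a linear-chain CRF. The sketch assumes a BIO-style tag set ("O", "B-DRUG", "I-DRUG"), per-word scores standing in for the BiLSTM output, and a hand-written transition score table; none of these names or values come from this embodiment. It uses the Viterbi algorithm, a common way to choose the best label sequence in a CRF, although the embodiment does not state which decoding procedure is used.

```python
from typing import Dict, List, Tuple

TAGS = ["O", "B-DRUG", "I-DRUG"]  # hypothetical tag set, not from the source


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag] is the per-word score (stand-in for the BiLSTM output);
    transitions[(prev, cur)] is the score of moving from tag prev to tag cur.
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any tag path that ends in `tag` at word t
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag to recover the whole path.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_entities(words: List[str], tags: List[str]) -> List[str]:
    """Collect the words whose final labels mark a drug entity."""
    entities, current = [], []
    for word, tag in zip(words, tags):
        if tag == "I-DRUG" and current:
            current.append(word)
            continue
        if current:
            entities.append(" ".join(current))
        current = [word] if tag == "B-DRUG" else []
    if current:
        entities.append(" ".join(current))
    return entities
```

The transition table is what lets the sequence-level step overrule a per-word score: a tag that looks likely for a single word can still be rejected if it cannot plausibly follow the previous tag.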

206. Searching and storing sentences containing at least two pharmaceutical entity words to obtain target sentences in the document to be extracted;

Since a sentence in a medical document that describes a drug relation necessarily contains at least two drug entity words, in this step the sentences in the previously obtained text data are screened according to the obtained drug entity words, and the sentences that contain at least two drug entity words are found and stored to obtain the target sentences.
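The screening in step 206 can be sketched as follows. This is a simplified illustration: it assumes single-token drug entity words, naive sentence splitting on end punctuation, and a hypothetical function name `find_target_sentences`; the embodiment itself does not specify how sentences are delimited.

```python
import re
from typing import Iterable, List, Set


def find_target_sentences(text: str, drug_entities: Iterable[str]) -> List[str]:
    """Keep the sentences that mention at least two distinct drug entity words."""
    entities: Set[str] = {e.lower() for e in drug_entities}
    # Naive split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    targets = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9\-]+", sentence.lower()))
        if len(entities & tokens) >= 2:
            targets.append(sentence)
    return targets
```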

207. Inputting the target sentence into a preset first feature extraction model to extract text features, and obtaining a first feature vector related to a pharmaceutical entity in the target sentence;

208. extracting an existing drug information text in a pre-established drug information library, and establishing an existing drug relation diagram based on the existing drug information text;

209. and inputting the existing medicine relation diagram into a preset second feature extraction model to perform feature extraction, so as to obtain a second feature vector related to the existing medicine information.

In this embodiment, the specific contents in steps 207, 208 and 209 are substantially the same as those in steps 102, 103 and 104 in the previous embodiment, so that the detailed description is omitted here.

According to the drug relation extraction method provided by this embodiment of the invention, deep learning is used to perform semantic analysis on the content of a medical document: the target sentences containing drug entity relations in the medical document are identified and analyzed, and the information in the existing drug information library is combined with them, so that the drug relations in the medical document are extracted and the accuracy with which the drug relation extraction technique extracts drug promotion relations is improved.

Referring to fig. 3 and 4, another embodiment of the method for extracting a drug relationship according to the present invention includes:

301. extracting target sentences in a document to be extracted;

the specific content in this step is substantially the same as that in step 101 in the foregoing embodiment, so that the description thereof will not be repeated here.

302. Inputting the target sentence into a vector embedding layer, and labeling the words in the target sentence with vectors to obtain word labeling vectors;

303. inputting the word labeling vectors into a convolution layer for feature extraction to obtain a feature vector matrix corresponding to the word labeling vectors;

After the target sentence is acquired, it is input into the first feature extraction model of this embodiment, which is established based on a piecewise convolutional neural network (Piecewise Convolutional Neural Networks, PCNN) and used for entity relation extraction. Specifically, the first feature extraction model in this embodiment includes a vector embedding layer, a convolution layer and a pooling layer.

In this embodiment, the vector embedding layer is called to perform vector embedding (Vector Representation) on the words in the target sentence to obtain a vector representation of the target sentence. Vector embedding, also called word embedding, represents words in the form of vectors by means of a neural network; in this embodiment, word2vec is used to embed the words in the target sentence and obtain their vector representations. The position of each drug entity word within its target sentence is then represented as a position vector, as shown in fig. 4, and the vector representation of each word is combined with the position vector to obtain the word labeling vector.
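As a rough illustration of how a word vector and position information might be combined into a word labeling vector, the sketch below appends each word's signed distance to the two drug entity words to its embedding. The embedding table, the function name `word_labeling_vectors` and the use of raw distances (rather than learned position embeddings) are assumptions made for the example, not details given in this embodiment.

```python
from typing import Dict, List


def word_labeling_vectors(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    entity1_idx: int,
    entity2_idx: int,
) -> List[List[float]]:
    """Concatenate each word vector with its offsets to the two drug entities."""
    dim = len(next(iter(embeddings.values())))
    unknown = [0.0] * dim  # fallback vector for words missing from the table
    vectors = []
    for i, token in enumerate(tokens):
        word_vec = embeddings.get(token, unknown)
        # Signed distance from this word to each drug entity word.
        position = [float(i - entity1_idx), float(i - entity2_idx)]
        vectors.append(list(word_vec) + position)
    return vectors
```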

In addition, after the word position vectors in the target sentence have been labeled, the sentence is segmented at the positions of the drug entities in the sentence. For example, the sentence "… … we conducted a drug trial of a primary drug combination for gastric cancer, in which captopril and metformin were used in 174 patients … …" is segmented at the drug entity positions into "… … we conducted a drug trial of a primary drug combination for gastric cancer, in which captopril … …", "… … captopril and metformin … …" and "… … metformin were used in 174 patients … …", and the word vectors are segmented together with it.

Further, as shown in fig. 4, after the word labeling vectors are obtained, they are input into a convolution layer (Convolution) for feature extraction; the features obtained by convolving each word labeling vector are used as one column of a matrix, and all the obtained features together form the feature vector matrix.
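The segmentation described above can be sketched in a few lines. Following the example sentence, each segment in this sketch keeps the drug entity word at its boundary, so the two entity words each appear in two segments; the function name `split_by_entities` is hypothetical.

```python
from typing import List, Tuple


def split_by_entities(
    tokens: List[str], entity1_idx: int, entity2_idx: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split a token list into three segments at the two drug entity positions."""
    first, second = sorted((entity1_idx, entity2_idx))
    left = tokens[: first + 1]          # start of sentence up to the first entity
    middle = tokens[first : second + 1]  # first entity through second entity
    right = tokens[second:]              # second entity to the end of the sentence
    return left, middle, right
```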
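A minimal sketch of the convolution step is given below, assuming a one-dimensional convolution with an odd window size, zero padding at the sentence boundaries, and filters supplied as flat weight lists. These choices and the name `conv1d` are illustrative only; the embodiment does not specify the filter shapes. Each column of the returned matrix holds the features computed at one word position.

```python
from typing import List


def conv1d(
    vectors: List[List[float]], filters: List[List[float]], window: int = 3
) -> List[List[float]]:
    """Slide each filter over windows of consecutive word labeling vectors.

    Returns a matrix with one row per filter and one column per word.
    Each filter is a flat list of window * dim weights; window must be odd.
    """
    dim = len(vectors[0])
    pad = [[0.0] * dim for _ in range(window // 2)]
    padded = pad + vectors + pad  # zero padding keeps one output per word
    feature_matrix = []
    for weights in filters:
        row = []
        for i in range(len(vectors)):
            flat = [x for vec in padded[i : i + window] for x in vec]
            row.append(sum(w * x for w, x in zip(weights, flat)))
        feature_matrix.append(row)
    return feature_matrix
```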

304. Inputting the feature vector matrix into a pooling layer to extract the maximum feature in the feature vector matrix, so as to obtain a first feature vector;

After the feature vector matrix is obtained, it is input into a pooling layer to extract the maximum feature in each segment. The pooling layer in this step is a piecewise max pooling layer (Piecewise Max Pooling): when the pooling operation is carried out, the maximum value in each segment is returned instead of the single maximum over the whole feature, giving a vector containing the maximum features, which is stored as the first feature vector.
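Piecewise max pooling can be sketched as below, under the assumption that the feature vector matrix has one row per convolution filter and one column per word, and that the segments are delimited by the two drug entity positions as in the segmentation example above. The function name `piecewise_max_pool` is hypothetical.

```python
from typing import List


def piecewise_max_pool(
    feature_matrix: List[List[float]], entity1_idx: int, entity2_idx: int
) -> List[float]:
    """Take the maximum of each of the three segments of every filter row."""
    first, second = sorted((entity1_idx, entity2_idx))
    pooled: List[float] = []
    for row in feature_matrix:
        segments = (row[: first + 1], row[first : second + 1], row[second:])
        # Each segment contains an entity position, so none is ever empty.
        pooled.extend(max(segment) for segment in segments)
    return pooled
```

Compared with a single maximum per filter, this keeps three values per filter, so some information about where a strong feature occurs relative to the two drug entities is preserved.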

305. Extracting an existing drug information text in a pre-established drug information library, and establishing an existing drug relation diagram based on the existing drug information text;

the specific content in step 305 in this embodiment is substantially the same as that in step 103 in the previous embodiment, so that the description is omitted here.

306. Inputting the existing drug relation graph into a sampling layer to sample the neighbor sequence of each node in the existing drug relation graph to obtain a node sequence set;

307. and inputting the node sequence set into a natural language processing layer to perform vector embedding, so as to obtain a second characteristic vector related to each drug.

The second feature extraction model in this embodiment comprises a sampling layer and a natural language processing layer. The obtained drug relation graph is input into the second feature extraction model, which is established based on a graph convolutional neural network. The sampling layer in the second feature extraction model is first called to identify each node in the drug relation graph, and a random walk algorithm is then used to extract a plurality of node sequences from the graph; these node sequences form a drug sequence set. The natural language processing layer in the second feature extraction model is then called to represent each node in the drug sequence set as a vector, giving the second feature vector.

Specifically, the second feature extraction model in this step is described by taking Node2vec as an example. Node2vec is a semi-supervised machine learning algorithm that can be used to learn relational features in a network graph; its idea is to map the node information in the network graph into vectors such that the vectors representing the nodes fully reflect the information of the original network graph.

In this embodiment, when Node2vec is called on the drug relation graph obtained in the previous step, an objective function f(u) to be optimized is first established, where f(u) is a mapping function that maps a node u to a word vector. With N(u) defined as the set of neighboring nodes of node u sampled by a sampling strategy S, the goal of Node2vec in this step is to maximize, for each given node u, the probability that its neighboring nodes occur. Specifically, during the random walk, neighbor sampling is performed by exploring the neighbors in both a breadth-first search (Breadth First Search, BFS) and a depth-first search (Depth First Search, DFS) manner, giving a node sequence set. The natural language processing layer is then called to process each node sequence in the node sequence set and represent each node as a vector, giving a vector that represents the information of each drug node. Word2vec can be used for this processing; it consists of a two-layer shallow neural network and maps each word to a corresponding vector so that the vectors can represent the relationships among the words.
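The neighbor sampling performed by the sampling layer can be sketched as a biased random walk in the style of Node2vec. The return parameter `p` and in-out parameter `q` below are the usual Node2vec controls for trading off BFS-like and DFS-like exploration; they are not named in this embodiment, and the function names, default values and graph representation (a dictionary mapping each drug to the set of its neighbors) are assumptions for the example.

```python
import random
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]  # undirected graph: node -> set of neighbor nodes


def node2vec_walk(
    graph: Graph, start: str, length: int,
    p: float = 1.0, q: float = 1.0, rng: Optional[random.Random] = None,
) -> List[str]:
    """Generate one biased random walk of at most `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # isolated node: the walk cannot continue
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay near the previous node (BFS-like)
            else:
                weights.append(1.0 / q)  # move further away (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def sample_walks(
    graph: Graph, walks_per_node: int, length: int,
    p: float = 1.0, q: float = 1.0, seed: int = 0,
) -> List[List[str]]:
    """Build the node sequence set by starting several walks from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(graph, node, length, p, q, rng))
    return walks
```

Each walk is a sequence of drug nodes that can be treated like a sentence of words, which is why a Word2vec-style model can then be trained on the node sequence set to obtain one vector per drug.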

308. And combining the first feature vector and the second feature vector to obtain a combined feature vector, and classifying the drug entity relationship based on the combined feature vector to obtain a drug promotion relationship.

The details of this step are substantially the same as those of step 105 in the previous embodiment, and will not be described here again.

According to the drug relation extraction method of this embodiment, when semantic analysis is carried out on the content of a medical document, the piecewise convolutional neural network is used to extract the first feature vector of the drug entity relations contained in the document, the second feature vector is obtained by constructing the relations among drugs from the existing drug information library, and the first and second feature vectors are combined to extract the drug promotion relation, which improves the accuracy with which the drug relation extraction technique extracts drug promotion relations.

Referring to fig. 4 and 5, another embodiment of the method for extracting a drug relationship according to the present invention includes:

501. Obtaining a labeled drug promotion relation graph and an unoptimized graph convolution extraction model;

502. Forming a relation graph training set from the labeled drug promotion relation graph, and calling the relation graph training set to train the unoptimized graph convolution extraction model to obtain a second feature extraction model;

The drug information in the existing drug information library is acquired; according to this information, drugs are taken as nodes in an undirected graph and edges are added between drugs that have a promotion relation, giving a drug promotion relation graph based on the existing drug information. The acquired drug promotion relation graph is split and labeled to obtain a plurality of drug promotion relation graphs, which form a relation graph training set. The unoptimized graph convolution extraction model is trained with the relation graph training set, and the training results are compared with the labels so as to adjust the parameters of the unoptimized graph convolution extraction model, giving the second feature extraction model.
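The construction of the undirected drug promotion relation graph can be sketched as follows, assuming the drug information library has already been reduced to a list of drug pairs known to have a promotion relation. The input format and the function name `build_promotion_graph` are assumptions; how the pairs are obtained from the library text is not covered here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_promotion_graph(
    drug_pairs: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Build an undirected graph with drugs as nodes and promotion relations as edges."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for drug_a, drug_b in drug_pairs:
        if drug_a == drug_b:
            continue  # ignore self-relations
        # Undirected edge: record the relation in both directions.
        graph[drug_a].add(drug_b)
        graph[drug_b].add(drug_a)
    return dict(graph)
```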

503. Invoking a text extraction algorithm to identify and extract the text in the document to be extracted to obtain text data of the document to be extracted;

504. inputting the text data into a convolutional neural network layer to encode words in the text data, and obtaining word encoding information;

505. inputting word coding information into a two-way long-short-term memory network layer, and identifying the part of speech of each word in text data according to the context information of each word in the text data to obtain part of speech tag probability of each word;

506. Inputting part-of-speech tag probability of each word into a conditional random field layer for optimization to obtain tag optimization probability of each word;

507. judging the final label of each word according to the label optimization probability, and screening to obtain the pharmaceutical entity words in the text data according to the final label;

508. searching and storing sentences containing at least two pharmaceutical entity words to obtain target sentences in the document to be extracted;

the specific contents of steps 503-508 in this embodiment are substantially the same as those of steps 201-206 in the previous embodiment, and will not be described here again.

509. Inputting the target sentence into a vector embedding layer, and labeling the words in the target sentence with vectors to obtain word labeling vectors;

510. inputting the word labeling vectors into a convolution layer for feature extraction to obtain a feature vector matrix corresponding to the word labeling vectors;

511. inputting the feature vector matrix into a pooling layer to extract the maximum feature in the feature vector matrix, so as to obtain a first feature vector;

the specific contents of steps 509, 510 and 511 in this embodiment are substantially the same as those of steps 302, 303 and 304 in the previous embodiment, and will not be described here again.

512. Extracting an existing drug information text in a pre-established drug information library, and establishing an existing drug relation diagram based on the existing drug information text;

The specific content in step 512 in this embodiment is substantially the same as that in step 103 in the previous embodiment, so that the description is omitted here.

513. Inputting the existing drug relation graph into a sampling layer to sample the neighbor sequence of each node in the existing drug relation graph to obtain a node sequence set;

514. inputting the node sequence set into a natural language processing layer for vector embedding to obtain a second characteristic vector related to each drug;

the details of steps 513 and 514 in this embodiment are substantially the same as those of steps 306 and 307 in the previous embodiment, and will not be described again here.

515. Combining the first feature vector and the second feature vector to obtain a combined feature vector;

516. Calling a softmax function to normalize the combined feature vector so as to obtain the probability of the drug related information;

517. judging the drug relation based on the drug related information probability to obtain a drug relation classification result, and storing the drug information corresponding to the feature vector with the drug promotion relation as the classification result to obtain the drug promotion relation.

A drug promotion relation classifier is established in advance based on the softmax function. Softmax is the generalization of the logistic regression model to multi-class problems, in which the class label can take more than two values. The softmax function is called to normalize the combined feature vector and calculate the drug related information probability, that is, the probability that a promotion relation exists between the two drugs.

The relations among the drug entities are judged based on the calculated drug related information probability to obtain a classification result; the drug information corresponding to the feature vectors whose classification result is a drug promotion relation is stored, and the type information of the promotion relation between the drugs found to have a promotion relation is output, giving the extracted drug relation.
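Steps 515 to 517 can be illustrated with the sketch below. It assumes that the two feature vectors are combined by concatenation and that a linear layer (weights and biases, which would be learned in practice) maps the combined vector to one score per class before softmax is applied; the embodiment states only that the vectors are combined and normalized with softmax, so the linear layer, the class labels and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into probabilities (shifted by the max for stability)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(
    first_vec: Sequence[float],
    second_vec: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Concatenate the two feature vectors and pick the most probable class."""
    combined = list(first_vec) + list(second_vec)
    # One linear score per class (hypothetical layer, not from the source).
    scores = [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, biases)
    ]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=probabilities.__getitem__)
    return labels[best], probabilities
```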

According to the drug relation extraction method, the content in the medical document is subjected to semantic analysis by using a deep learning technology, a target sentence containing a drug entity relation in the medical document is identified, and the target sentence is analyzed; and the information in the existing drug information library is extracted by using a second characteristic extraction model established based on the graph convolution network, so that the drug relationship in the medical literature is extracted, and the extraction accuracy of the drug promotion relationship by the drug relationship extraction technology is improved.

The drug relation extraction method in the embodiments of the present invention has been described above; the drug relation extraction device in the embodiments of the present invention is described below. Referring to fig. 6, one embodiment of the drug relation extraction device includes:

a document extraction module 601, configured to extract a target sentence in a document to be extracted, where the target sentence is a sentence containing at least two pharmaceutical entities;

the first feature extraction module 602 is configured to input the target sentence into a preset first feature extraction model to perform text feature extraction, so as to obtain a first feature vector related to a pharmaceutical entity in the target sentence;

a relationship diagram establishing module 603, configured to extract an existing drug information text in a pre-established drug information library, and establish an existing drug relationship diagram based on the existing drug information text;

a second feature extraction module 604, configured to input the existing drug relationship diagram into a preset second feature extraction model to perform feature extraction, so as to obtain a second feature vector related to the existing drug information;

and the promotion relation acquisition module 605 is configured to combine the first feature vector and the second feature vector to obtain a combined feature vector, and extract a drug promotion relation included in the document to be extracted based on the combined feature vector.

Referring to fig. 7, another embodiment of the drug relationship extraction device according to the present invention includes:

Optionally, the document extraction module 601 includes:

a document data grabbing unit 6011, configured to invoke a text extraction algorithm to identify and extract text in a document to be extracted, so as to obtain text data of the document to be extracted;

the entity relation extracting unit 6012 is configured to input the text data into an entity extraction model that is built in advance based on a deep learning algorithm for recognition, so as to obtain a pharmaceutical entity word in the text data;

and the sentence searching unit 6013 is used for searching and storing sentences containing at least two medicinal entity words to obtain target sentences in the document to be extracted.

Optionally, the entity relationship extraction unit 6012 includes:

Optionally, the first feature extraction module 602 includes:

the vector embedding unit 6021 is configured to input the target sentence into a vector embedding layer, label the word in the target sentence with a vector, and obtain a word labeling vector;

the convolution extraction unit 6022 is configured to input the word labeling vector into a convolution layer to perform feature extraction, so as to obtain a feature vector matrix corresponding to the word labeling vector;

and a pooling unit 6023, configured to input the feature vector matrix to a pooling layer to extract the maximum feature in the feature vector matrix, so as to obtain a first feature vector.

Optionally, the second feature extraction module 604 includes:

the sampling unit 6041 is configured to input the existing drug relationship graph into a sampling layer to sample a neighbor sequence of each node in the existing drug relationship graph, so as to obtain a node sequence set;

The vector embedding unit 6042 is configured to input the node sequence set into a natural language processing layer for vector embedding, so as to obtain a second feature vector related to each drug.

Optionally, the drug relation extraction device further includes a second feature extraction model building module, where the second feature extraction model building module is specifically configured to:

Optionally, the facilitation relationship acquisition module 605 includes:

The drug relation extracting apparatus in the embodiment of the present invention is described in detail from the point of view of the modularized functional entity in fig. 6 and fig. 7 above, and the drug relation extracting device in the embodiment of the present invention is described in detail from the point of view of hardware processing below.

Fig. 8 is a schematic diagram of a drug relationship extraction device according to an embodiment of the present invention, where the drug relationship extraction device 800 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 810 (e.g., one or more processors) and memory 820, one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 833 or data 832. Wherein memory 820 and storage medium 830 can be transitory or persistent. The program stored on the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the drug relationship extraction device 800. Still further, the processor 810 may be arranged to communicate with the storage medium 830 and execute a series of instruction operations in the storage medium 830 on the drug relationship extraction device 800.

The drug relationship extraction device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the structure illustrated in fig. 8 does not limit the drug relationship extraction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

The present invention also provides a drug relationship extraction apparatus comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the drug relationship extraction method in the above embodiments.

The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, or a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the drug relationship extraction method.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A drug relationship extraction method, characterized in that the drug relationship extraction method comprises:

Combining the first feature vector and the second feature vector to obtain a combined feature vector, and extracting a drug promotion relationship contained in the document to be extracted based on the combined feature vector;

the extracting the target sentence in the document to be extracted comprises the following steps:

searching and storing sentences containing at least two medicinal entity words to obtain target sentences in the document to be extracted;

the entity extraction model comprises a convolutional neural network layer, a two-way long-short-term memory network layer and a conditional random field layer, the text data is input into the entity extraction model established based on a deep learning algorithm in advance for recognition, and the obtaining of the pharmaceutical entity words in the text data comprises the following steps:

judging the final label of each word according to the label optimization probability, and screening to obtain the pharmaceutical entity words in the text data according to the final label;

the first feature extraction model comprises a vector embedding layer, a convolution layer and a pooling layer, the target sentence is input into a preset first feature extraction model for text feature extraction, and the obtaining of the first feature vector related to the pharmaceutical entity in the target sentence comprises the following steps:

inputting the feature vector matrix into a pooling layer to extract the maximum feature in the feature vector matrix to obtain a first feature vector;

The second feature extraction model includes a sampling layer and a natural language processing layer, and the inputting the existing drug relation graph into a preset second feature extraction model to perform feature extraction, and the obtaining the second feature vector related to the existing drug information includes:

2. The method of extracting a pharmaceutical relationship according to claim 1, further comprising, before the extracting the target sentence in the document to be extracted:

3. The drug-relationship extraction method according to claim 1 or 2, wherein the extracting the drug-promotion relationship contained in the document to be extracted based on the combined feature vector includes:

4. A drug relation extracting apparatus that performs the steps of the drug relation extracting method according to any one of claims 1 to 3, the drug relation extracting apparatus comprising:

5. A drug relationship extraction apparatus, characterized in that the drug relationship extraction apparatus comprises: a memory and at least one processor, the memory having instructions stored therein;

the at least one processor invokes the instructions in the memory to cause the drug relationship extraction device to perform the steps of the drug relationship extraction method of any one of claims 1-3.

6. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the steps of a method of drug relationship extraction as claimed in any one of claims 1 to 3.