WO2020074017A1 - Deep learning-based method and device for screening for keywords in medical document

Info

Publication number: WO2020074017A1
Application number: PCT/CN2019/118858
Authority: WO
Inventors: 赵荣生; 宋再伟; 刘爽; 周旻
Original assignee: 北京大学第三医院; 北京诺道认知医学科技有限公司
Priority date: 2018-10-12
Filing date: 2019-11-15
Publication date: 2020-04-16
Also published as: CN109359300A

Abstract

Embodiments of the present application disclose a deep learning-based method and device for screening for keywords in a medical document, capable of enhancing the accuracy of keyword screening in medical documents. The method comprises: performing sentence segmentation on a medical document to be processed, performing word segmentation on the component sentences, labeling and encoding the component words according to their order of appearance in the medical document to be processed, and generating a word vector matrix for the component sentences (S1); and inputting the word vector matrix of the component sentences into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed (S2).

Description

Method and device for keyword selection in medical literature based on deep learning

Cross reference

This application cites Chinese patent application No. 2018111880516, titled “Keyword screening method and device in medical literature based on deep learning” and filed on October 12, 2018, which is incorporated into this application by reference in its entirety.

Technical field

Embodiments of the present application relate to the field of computers, and in particular to a method and device for keyword selection in medical literature based on deep learning.

Background technique

Keyword extraction refers to the use of computer technology to select, according to certain requirements, words or terms that reflect the topic of a report or document. The keywords provide a brief summary of the document, enabling readers to grasp its important information and core content in a short time. Because keywords are highly condensed, they can also be used to measure text similarity at a small computational cost. Keyword extraction therefore has important applications in literature retrieval, automatic summarization, text classification, text clustering, and similar tasks.

Existing keyword extraction methods fall into three main categories: (1) Methods based on statistical features determine the weights of candidate words from word frequency or position and select those with larger weights as keywords. Although simple to operate, such methods overlook words that occur rarely in the text and in out-of-the-way positions but are nonetheless key to the article. (2) Methods based on word networks compute the criticality of words. They mainly use the co-occurrence relationships of high-frequency words to construct a word network, and likewise cannot extract words that are important to the document but infrequent. (3) Semantic-based methods judge the importance of words from a semantic perspective and extract keywords accordingly. At present, however, such methods only match synonyms and near-synonyms, whereas most keywords that express the same topic are neither; most words sharing a topic are therefore treated as semantically unrelated, and the method cannot play its intended role.

Summary of the invention

In view of the deficiencies and defects existing in the prior art, embodiments of the present application provide a method and device for keyword selection in medical literature based on deep learning.

In one aspect, the embodiments of the present application provide a method for keyword selection in medical literature based on deep learning, including:

S1. Split the medical document to be processed into clauses, segment the clauses into words, assign an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generate the word vector matrix of the clauses;

S2. Input the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

In another aspect, the embodiments of the present application provide a keyword selection device in medical literature based on deep learning, including:

The generating unit is configured to split the medical document to be processed into clauses, segment the clauses into words, assign an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generate the word vector matrix of the clauses;

The input unit is configured to input the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;

Wherein, the processor and the memory communicate with each other through the bus;

The above method is implemented when the processor executes the computer program.

According to a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium, where a computer program is stored on the storage medium, and the computer program is executed by a processor to implement the foregoing method.

The method and device for keyword selection in medical literature based on deep learning provided by the embodiments of the present application use a trained deep learning-based Bilstm-CRF model to screen keywords in medical literature. Because the constructed Bilstm-CRF model can combine contextual semantics and capture the local relevance of a document, this solution can improve the accuracy of keyword selection in medical documents compared with existing technologies.

Brief description of the drawings

FIG. 1 is a schematic flowchart of an embodiment of a method for keyword selection in medical literature based on deep learning in this application;

FIG. 2 is a schematic structural diagram of an embodiment of a keyword selection device in medical literature based on deep learning of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly below in conjunction with the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the embodiments of the present application.

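Referring to FIG. 1, this embodiment discloses a method for keyword selection in medical literature based on deep learning, including:

S1. Split the medical document to be processed into clauses, segment the clauses into words, assign an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generate the word vector matrix of the clauses;

S2. Input the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.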

The keyword selection method in medical literature based on deep learning provided by the embodiments of the present application uses a trained deep learning-based Bilstm-CRF model to screen keywords in medical literature. Because the constructed Bilstm-CRF model can combine contextual semantics and capture the local relevance of a document, the solution can improve the accuracy of keyword selection in medical literature compared with the prior art.

Based on the foregoing method embodiments, the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.
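As a rough illustration of how data moves through the layers named above, the following sketch concatenates stand-in forward and backward hidden states and applies a linear map to obtain per-label scores. It is a minimal sketch under stated assumptions: the LSTM computations themselves are replaced by random numbers, and the helper names `concat_states` and `linear` and the toy sizes are hypothetical, not taken from the application.

```python
import random

def concat_states(forward, backward):
    """Position-wise concatenation of forward and backward hidden states."""
    return [f + b for f, b in zip(forward, backward)]

def linear(h, weight, bias):
    """Map one 2n-dimensional state to k label scores: W h + b."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]

# Toy sizes chosen for readability (assumption); the application uses k = 4 labels.
n, k, max_len = 3, 4, 5
rng = random.Random(0)

# Random stand-ins for the outputs of the forward and backward LSTM passes.
forward = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(max_len)]
backward = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(max_len)]
weight = [[rng.uniform(-1, 1) for _ in range(2 * n)] for _ in range(k)]
bias = [0.0] * k

hidden = concat_states(forward, backward)            # max_len x 2n
scores = [linear(h, weight, bias) for h in hidden]   # max_len x k, passed to the CRF layer
print(len(hidden), len(hidden[0]), len(scores), len(scores[0]))  # 5 6 5 4
```

In a real implementation the two state sequences would come from a trained bidirectional LSTM rather than a random generator; only the shapes are meant to carry over.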

In this embodiment, before using the Bilstm-CRF model for keyword selection, the Bilstm-CRF model needs to be constructed, and the training data is used to train the Bilstm-CRF model. Specifically, the training process of the Bilstm-CRF model is as follows:

(1) The word vector sequence $(x_1, x_2, \ldots, x_{max\_len})$ formed from the words of a clause in the training sample is used as the input of the bidirectional LSTM, one vector per time step.

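(2) The second layer of the model is a bidirectional LSTM layer, which automatically extracts word features. The hidden state sequence output by the forward LSTM,

$$(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_{max\_len}}),$$

and the hidden state sequence output by the backward LSTM,

$$(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_{max\_len}}),$$

are concatenated position by position to obtain the complete hidden state sequence:

$$(h_1, h_2, \ldots, h_{max\_len}) \in \mathbb{R}^{max\_len \times 2n},$$

where

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}].$$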

(3) A linear layer follows, which maps each hidden state vector from 2n dimensions to k dimensions, where k = 4 is the number of label categories for the words. Let the output matrix be $P = (p_1, p_2, \ldots, p_{max\_len})$; each component $p_{ij}$ of $p_i$ is the score for assigning the word $x_i$ to the j-th label.

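(4) The fourth layer of the model is the CRF layer, which has a state transition matrix $A$ of size $(k + 2) \times (k + 2)$, where $A_{ij}$ is the transition score from the i-th label to the j-th label. This matrix means that, when a word in a clause is labeled, the labels already assigned to the preceding words are taken into account. If the target label sequence of a clause is $y = (y_1, y_2, \ldots, y_{max\_len})$, then the score that the model assigns to labeling clause $x$ with $y$ is:

$$score(x, y) = \sum_{i=1}^{max\_len} p_{i, y_i} + \sum_{i=1}^{max\_len + 1} A_{y_{i-1}, y_i},$$

where $y_0$ and $y_{max\_len + 1}$ are taken to be the start and end labels, which account for the two extra rows and columns of $A$.

The log-likelihood function of the model is defined as:

$$\log P(y \mid x) = score(x, y) - \log \sum_{y' \in Y_x} \exp\big(score(x, y')\big)$$

In the formula, $Y_x$ is the set of all possible label sequences for clause $x$.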

(5) Through multiple rounds of iterative training and parameter adjustment, find the optimal parameters and state transition probabilities that maximize the objective function.
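The scoring and log-likelihood computation of the CRF layer in step (4) can be sketched in plain Python as follows. This is an illustrative sketch, not the applicants' implementation: it assumes that the two extra rows and columns of the transition matrix correspond to a start label and an end label, uses random numbers in place of trained parameters, and computes the normalizing term with the standard forward algorithm. All function names are hypothetical.

```python
import math
import random
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(P, A, y, start, end):
    """Emission scores plus transition scores, including the transitions
    out of the start label and into the end label."""
    labels = [start] + list(y) + [end]
    emit = sum(P[i][label] for i, label in enumerate(y))
    trans = sum(A[a][b] for a, b in zip(labels, labels[1:]))
    return emit + trans

def log_partition(P, A, k, start, end):
    """Log of the summed exp(score) over all label sequences (forward algorithm)."""
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])

def log_likelihood(P, A, y, k, start, end):
    return path_score(P, A, y, start, end) - log_partition(P, A, k, start, end)

k = 4                      # number of labels, as in the application
START, END = k, k + 1      # assumed meaning of the two extra labels
rng = random.Random(0)
P = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(3)]          # 3 words x k labels
A = [[rng.uniform(-1, 1) for _ in range(k + 2)] for _ in range(k + 2)]  # (k+2) x (k+2)
y = [0, 1, 3]

ll = log_likelihood(P, A, y, k, START, END)
# Brute-force check of the normalizer over all k**3 label sequences.
brute = logsumexp([path_score(P, A, seq, START, END)
                   for seq in product(range(k), repeat=len(P))])
print(round(ll, 4), abs(log_partition(P, A, k, START, END) - brute) < 1e-9)
```

Training as described in step (5) would adjust `P` (through the network weights) and `A` to increase this log-likelihood over the training samples; the optimizer itself is outside the scope of the sketch.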

Before the model is trained, a word vector matrix must be generated for the training sample data. The process is as follows:

(1) Each word of a clause is id-encoded according to the order in which it appears in the document. The encoding starts at 1 and ends at the vocabulary size N of the document.

(2) Let max_len be the largest number of words in any clause. Each id-encoded clause is then padded with zeros until its length reaches max_len, so the number of zeros added is (max_len - number of words in the clause).

(3) Randomly initialize the word vector matrix. Each row of the matrix is a word vector; from top to bottom, the rows correspond to the codes 0 to N in order. The number of columns of the matrix is the word vector length n = 300.

(4) Look up the word vector corresponding to each id-encoded word in a clause. If the number of training samples is m, construct a three-dimensional matrix of size [m, max_len, 300] as the input of the model.
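The four preparation steps above can be sketched as follows. This is a minimal sketch under assumptions: repeated words are assumed to share the id given at their first appearance (which is what makes the last id equal the vocabulary size), the initialization range of the random vectors is arbitrary, and all function names are hypothetical.

```python
import random

def build_vocab(clauses):
    """Assign ids 1..N to words in order of first appearance; 0 is reserved for padding."""
    vocab = {}
    for clause in clauses:
        for word in clause:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode_and_pad(clauses, vocab):
    """Replace words by ids and right-pad every clause with zeros to max_len."""
    max_len = max(len(c) for c in clauses)
    ids = [[vocab[w] for w in c] + [0] * (max_len - len(c)) for c in clauses]
    return ids, max_len

def init_embeddings(vocab_size, dim, seed=0):
    """Random word vector matrix with one row per code 0..N."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size + 1)]

def lookup(ids, embeddings):
    """Build the [m, max_len, dim] model input by row lookup."""
    return [[embeddings[i] for i in row] for row in ids]

clauses = [["methotrexate", "treatment", "acute", "leukemia"], ["search", "database"]]
vocab = build_vocab(clauses)
ids, max_len = encode_and_pad(clauses, vocab)
emb = init_embeddings(len(vocab), dim=300)
batch = lookup(ids, emb)

print(ids)                                           # [[1, 2, 3, 4], [5, 6, 0, 0]]
print(len(batch), len(batch[0]), len(batch[0][0]))   # 2 4 300
```

The second clause has two words, so it receives max_len - 2 = 2 padding zeros, matching the count given in step (2).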

Note that, for training, the target output of the model must also be constructed from the training data. Specifically, every word in a clause is labeled according to the PICO index matrix: if the word appears in the index matrix, its label is set to P, I-C, or O according to the corresponding entry; if the word does not appear in the index matrix, its label is N. The entire label sequence is taken as the target value of the model.
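A possible way to build the target label sequence is sketched below. The application does not specify how the PICO index matrix is stored, so the sketch assumes a simple mapping from each label to a set of terms; the terms and their label assignments are invented for illustration only.

```python
def label_clause(words, pico_index):
    """Label each word P, I-C or O if it occurs in the index, otherwise N."""
    term_to_label = {term: label
                     for label, terms in pico_index.items()
                     for term in terms}
    return [term_to_label.get(word, "N") for word in words]

# Hypothetical index content; the real index would come from annotated literature.
pico_index = {
    "P": {"acute lymphocytic leukemia"},
    "I-C": {"methotrexate"},
    "O": {"toxic side effects"},
}
words = ["evaluate", "methotrexate", "treatment",
         "acute lymphocytic leukemia", "toxic side effects"]

print(label_clause(words, pico_index))
# ['N', 'I-C', 'N', 'P', 'O']
```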

The model constructed in this application can combine the contextual semantics of each word and, by computing state transition probabilities that reflect the internal relationships within the label set, restrict the output of unreasonable label sequences.
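To illustrate how transition scores can suppress an unreasonable label sequence, the sketch below decodes the best label sequence with the Viterbi algorithm. The application does not name its decoding procedure, so Viterbi is an assumption here, as are the start and end label indices and the toy scores; the penalized transition from P to O is purely illustrative.

```python
def viterbi(P, A, k, start, end):
    """Return the highest-scoring label sequence given emission scores P
    (one row per word) and transition scores A."""
    score = [A[start][j] + P[0][j] for j in range(k)]
    back = []
    for i in range(1, len(P)):
        step, new_score = [], []
        for j in range(k):
            best = max(range(k), key=lambda a: score[a] + A[a][j])
            step.append(best)
            new_score.append(score[best] + A[best][j] + P[i][j])
        back.append(step)
        score = new_score
    last = max(range(k), key=lambda j: score[j] + A[j][end])
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    return path[::-1]

labels = ["P", "I-C", "O", "N"]
k, START, END = 4, 4, 5
P = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.9],
     [0.0, 0.0, 0.0, 1.0]]

# With neutral transitions, each word simply takes its best-scoring label.
A_neutral = [[0.0] * (k + 2) for _ in range(k + 2)]
print([labels[j] for j in viterbi(P, A_neutral, k, START, END)])  # ['P', 'O', 'N']

# Penalizing the transition P -> O changes the decoded sequence.
A = [row[:] for row in A_neutral]
A[0][2] = -5.0
print([labels[j] for j in viterbi(P, A, k, START, END)])          # ['P', 'N', 'N']
```

In the second call the second word would prefer label O on its own score, but the learned-style penalty on the P to O transition makes the sequence P, N, N score higher overall.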

On the basis of the foregoing method embodiments, splitting the medical document to be processed into clauses and segmenting the clauses into words includes:

The medical document to be processed is split into clauses at punctuation marks, and the clauses are segmented into words based on a word segmentation algorithm and a medical lexicon.

In this embodiment, the segmentation process is exemplified as follows:

Take the example text: Objective To evaluate the correlation of methylenetetrahydrofolate reductase gene polymorphism in the side effects of methotrexate in the treatment of acute lymphocytic leukemia. Methods Relevant databases at home and abroad were searched by computer: EMBASE, CNKI, Weipu Chinese scientific journal database and Wanfang database, ... First, the text is split into clauses at punctuation marks. The result is:

(1) Objective To evaluate the correlation of methylenetetrahydrofolate reductase gene polymorphism in the side effects of methotrexate in the treatment of acute lymphocytic leukemia;

(2) Methods Relevant databases at home and abroad were searched by computer: EMBASE, CNKI, Weipu Chinese scientific journal database and Wanfang database.
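Splitting at punctuation marks, as in the example above, might look like the following. This is a simple sketch: the application does not list which punctuation marks end a clause, so the character set below (ASCII and full-width sentence-ending marks) is an assumption, and the function name is hypothetical.

```python
import re

# Assumed set of clause-ending punctuation marks (ASCII and full-width forms).
CLAUSE_END = r"[.!?;。！？；]"

def split_clauses(text):
    """Split text at clause-ending punctuation and drop empty pieces."""
    parts = re.split(CLAUSE_END, text)
    return [p.strip() for p in parts if p.strip()]

text = ("Objective To evaluate the correlation. "
        "Methods Relevant databases were searched by computer.")
print(split_clauses(text))
# ['Objective To evaluate the correlation', 'Methods Relevant databases were searched by computer']
```

A plain split on "." would also break decimals and abbreviations, so a production splitter would likely need additional rules.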

Then the word segmentation algorithm is used to segment each clause into words. The result is:

1) ['Purpose', 'Evaluation', 'Sub', 'Methyl', 'Tetrahydrofolate', 'Reductase', 'Gene', 'Polymorphism', 'In', 'Methotrexate', 'Treatment', 'Acute', 'Lymph', 'Cell', 'Leukemia', 'Process', 'Medium', 'Poison Side', 'Reaction', 'The', 'Relevance'];

2) ['Method', 'Via', 'Computer', 'Retrieve', 'Home and Abroad', 'Related', 'Database', 'EMBASE', 'CNKI', 'Weipu', 'Chinese', 'Technology', 'Journal', 'Database', 'and', 'Wanfang', 'Database'].

Finally, some of the words are merged using the medical lexicon. For the word list 1) of the first clause (1), 'Sub', 'Methyl', 'Tetrahydrofolate' and 'Reductase' are merged into the complete medical term "methylenetetrahydrofolate reductase", 'Lymph' and 'Cell' are merged into the complete medical term "lymphocyte", and 'Poison Side' and 'Reaction' are merged into the complete medical term "toxic side effects". The merged result is:

a) ['Purpose', 'Evaluation', 'Methylenetetrahydrofolate reductase', 'Gene', 'Polymorphism', 'In', 'Methotrexate', 'Treatment', 'Acute', 'Lymphocyte', 'Leukemia', 'Process', 'Medium', 'Toxic Side Effects', 'The', 'Relevance'];

b) ['Method', 'Via', 'Computer', 'Retrieve', 'Home and Abroad', 'Related', 'Database', 'EMBASE', 'CNKI', 'Weipu', 'Chinese', 'Technology', 'Journal', 'Database', 'and', 'Wanfang', 'Database'].
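The merging step can be pictured as a greedy longest-match pass over adjacent words. The sketch below is one plausible realization, not the method stated in the application: the matching strategy, the `max_span` limit, and the English example (where words are joined with a space) are all assumptions. For Chinese text the joiner would presumably be an empty string.

```python
def merge_with_lexicon(tokens, lexicon, max_span=4, joiner=" "):
    """Greedy longest-match merge: join adjacent tokens whose concatenation
    is a term in the lexicon; leave all other tokens unchanged."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = joiner.join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical lexicon and tokens, an English analogue of the example above.
lexicon = {"toxic side effects", "acute lymphocytic leukemia"}
tokens = ["evaluation", "toxic", "side", "effects", "acute", "lymphocytic", "leukemia"]

print(merge_with_lexicon(tokens, lexicon))
# ['evaluation', 'toxic side effects', 'acute lymphocytic leukemia']
```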

On the basis of the foregoing method embodiments, assigning identifier codes to the words according to the order in which they appear in the medical document to be processed and generating the word vector matrix of the clauses includes:

The words of each clause are id-encoded according to the order in which they appear in the medical document to be processed, and each id-encoded clause is padded with zeros so that its number of elements equals the number of words in the longest clause;

The word vector matrix is generated based on the zero-padded result.

The process of generating the word vector matrix for test data in this embodiment is the same as the process of generating the word vector matrix for training samples during model training described above, and is not repeated here.

In this embodiment, when the word vector matrix of the clauses is generated, each word of a clause is first id-encoded according to the order in which it appears in the document. The encoding starts at 1 and ends at the vocabulary size of the document. Then, with max_sentence_len denoting the largest number of words in any clause, each id-encoded clause is padded with zeros until its length reaches max_sentence_len. The result is the word vector of the clause, in which the number of zeros equals max_sentence_len minus the number of words in the clause.

Referring to FIG. 2, this embodiment discloses a device for keyword selection in medical literature based on deep learning, including:

The generating unit 1 is configured to split the medical document to be processed into clauses, segment the clauses into words, assign an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generate the word vector matrix of the clauses;

The input unit 2 is configured to input the word vector matrix of the clause into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical literature to be processed.

Specifically, the generating unit 1 splits the medical document to be processed into clauses, segments the clauses into words, assigns an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generates the word vector matrix of the clauses; the input unit 2 inputs the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

The keyword selection device for medical literature based on deep learning provided by the embodiments of the present application uses a trained deep learning-based Bilstm-CRF model to screen keywords in medical literature. Because the constructed Bilstm-CRF model can combine contextual semantics and capture the local relevance of a document, the solution can improve the accuracy of keyword selection in medical literature compared with the prior art.
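The division of work between the two units can be sketched as two small classes. This is only a structural illustration under assumptions: the class and method names are hypothetical, the trained Bilstm-CRF model is replaced by a stand-in tagger function, and the id encoding and word vector lookup performed by the generating unit are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GeneratingUnit:
    split_clauses: Callable[[str], List[str]]
    segment_words: Callable[[str], List[str]]

    def generate(self, document: str) -> List[List[str]]:
        """Return the document as a list of word-segmented clauses."""
        return [self.segment_words(c) for c in self.split_clauses(document)]

@dataclass
class InputUnit:
    tagger: Callable[[List[str]], List[str]]  # stand-in for the trained model

    def keywords(self, clauses: List[List[str]]) -> List[str]:
        """Keep every word whose predicted label is not 'N', without repeats."""
        found = []
        for words in clauses:
            for word, label in zip(words, self.tagger(words)):
                if label != "N" and word not in found:
                    found.append(word)
        return found

# Toy components (assumptions): split on ".", segment on whitespace,
# and tag one known word as I-C.
gen = GeneratingUnit(
    split_clauses=lambda d: [s.strip() for s in d.split(".") if s.strip()],
    segment_words=str.split,
)
inp = InputUnit(tagger=lambda words: ["I-C" if w == "methotrexate" else "N" for w in words])

clauses = gen.generate("We evaluate methotrexate. Databases were searched.")
print(inp.keywords(clauses))  # ['methotrexate']
```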

Based on the foregoing device embodiments, the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.

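Based on the foregoing device embodiments, the generating unit is specifically configured to: split the medical document to be processed into clauses at punctuation marks and segment the clauses into words based on a word segmentation algorithm and a medical lexicon; id-encode the words of each clause according to the order in which they appear in the medical document to be processed and pad each id-encoded clause with zeros so that its number of elements equals the number of words in the longest clause; and generate the word vector matrix based on the zero-padded result.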

The keyword selection device in the medical literature based on deep learning of this embodiment may be used to execute the technical solutions of the foregoing method embodiments, and its implementation principles and technical effects are similar, and will not be repeated here.

FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present application. As shown in FIG. 3, the electronic device may include: a processor 11, a memory 12, a bus 13, and a computer program stored on the memory 12 and executable on the processor 11;

Wherein, the processor 11 and the memory 12 communicate with each other through the bus 13;

When the processor 11 executes the computer program, the method provided by each of the above method embodiments is implemented, for example, including: splitting a medical document to be processed into clauses, segmenting the clauses into words, assigning an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generating the word vector matrix of the clauses; and inputting the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

Embodiments of the present application provide a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method provided by the foregoing method embodiments is implemented, for example, including: splitting a medical document to be processed into clauses, segmenting the clauses into words, assigning an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generating the word vector matrix of the clauses; and inputting the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further restriction, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element. Terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings; they are used only to facilitate and simplify the description of this application, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present application. Unless otherwise clearly specified and defined, the terms "installed", "connected", and "connection" should be understood in a broad sense: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two components. Those of ordinary skill in the art can understand the specific meanings of the above terms in this application according to the specific situation.

In the description of this application, a large number of specific details are explained. However, it can be understood that the embodiments of the present application can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. Similarly, it should be understood that, in order to streamline the disclosure of the present application and to help understand one or more of the various inventive aspects, in the above description of exemplary embodiments of the present application, various features of the present application are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than those explicitly recited in each claim. Rather, as the claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Therefore, the claims that follow the detailed description are hereby expressly incorporated into the detailed description, where each claim itself serves as a separate embodiment of the present application. It should be noted that the embodiments in the present application and the features in the embodiments can be combined with each other where no conflict arises. This application is not limited to any single aspect, nor to any single embodiment, nor to any combination and/or substitution of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present application may be used alone or in combination with one or more other aspects and/or embodiments thereof.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application, and they should be covered by the scope of the claims and the description of this application.

Claims

1. A method for keyword selection in medical literature based on deep learning, characterized by including:

S1. Split the medical document to be processed into clauses, segment the clauses into words, assign an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generate the word vector matrix of the clauses;

S2. Input the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

2. The method according to claim 1, wherein the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.

3. The method according to claim 2, characterized in that splitting the medical document to be processed into clauses and segmenting the clauses into words includes:

The medical document to be processed is split into clauses at punctuation marks, and the clauses are segmented into words based on a word segmentation algorithm and a medical lexicon.

4. The method according to claim 3, characterized in that assigning identifier codes to the words according to the order in which they appear in the medical document to be processed and generating the word vector matrix of the clauses includes:

The words of each clause are id-encoded according to the order in which they appear in the medical document to be processed, and each id-encoded clause is padded with zeros so that its number of elements equals the number of words in the longest clause;

The word vector matrix is generated based on the zero-padded result.

5. A device for keyword selection in medical literature based on deep learning, characterized by including:

The generating unit is configured to split the medical document to be processed into clauses, segment the clauses into words, assign an identifier code to each word according to the order in which the words appear in the medical document to be processed, and generate the word vector matrix of the clauses;

The input unit is configured to input the word vector matrix of the clauses into a pre-trained Bilstm-CRF model based on deep learning to obtain keywords in the medical document to be processed.

6. The device according to claim 5, wherein the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.

7. The device according to claim 6, wherein the generating unit is specifically configured to:

Split the medical document to be processed into clauses at punctuation marks, and segment the clauses into words based on a word segmentation algorithm and a medical lexicon.

8. The device according to claim 7, wherein the generating unit is specifically configured to:

Id-encode the words of each clause according to the order in which they appear in the medical document to be processed, and pad each id-encoded clause with zeros so that its number of elements equals the number of words in the longest clause;

Generate the word vector matrix based on the zero-padded result.

9. An electronic device, characterized in that it includes: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;

Wherein, the processor and the memory communicate with each other through the bus;

When the processor executes the computer program, the method according to any one of claims 1-4 is implemented.

10. A non-transitory computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1-4 is implemented.