WO2020074023A1

WO2020074023A1 - Deep learning-based method and device for screening for key sentences in medical document

Info

Publication number: WO2020074023A1
Application number: PCT/CN2019/124561
Authority: WO
Inventors: 赵荣生; 宋再伟; 黄振城; 王则远; 周旻
Original assignee: 北京大学第三医院; 北京诺道认知医学科技有限公司
Priority date: 2018-10-12
Filing date: 2019-12-11
Publication date: 2020-04-16
Also published as: CN109472021A

Abstract

A deep learning-based method and device for screening for key sentences in a medical document, capable of enhancing the accuracy of key sentence screening in medical documents. The method comprises: S1, performing sentence segmentation on a medical document to be processed, performing word segmentation on the component sentences, labeling and encoding the component words according to the order of appearance of the component words in the medical document to be processed, and generating word vectors for the component sentences; S2, inputting the word vectors of the component sentences into a pre-trained deep learning-based convolutional neural network model, and obtaining the key sentences in the medical document to be processed.

Description

Method and device for screening key sentences in medical literature based on deep learning

Cross-reference of related applications

This application claims priority to the Chinese patent application filed on October 12, 2018, with application number 2018111880412 and titled "Method and Device for Screening Key Sentences in Medical Literature Based on Deep Learning", which is incorporated into this disclosure by reference in its entirety.

Technical field

Embodiments of the present disclosure relate to the field of computers, and in particular to a method and device for screening key sentences in medical literature based on deep learning.

Background

The main content of a text is often carried by a set of key sentences, which clearly express the content characteristics of the text (such as its domain category, theme, and central meaning). Based on this understanding, identifying and screening the key sentences that represent the main content of a text is a very important step in information retrieval, information extraction, and knowledge extraction, and it is important for revealing the subject of a document and the knowledge implicit in the text. Put simply, key sentence screening identifies and extracts sentences containing useful information according to a given purpose, so as to condense the text and obtain rich information from a small amount of data.

Traditional key sentence screening methods are generally statistical: they use statistical information such as position and frequency to find the sentences that best represent the subject of the article. Depending on how the structure of the article is used, they can be divided into unstructured and structured screening analysis. The former calculates a weight for each sentence of the article and takes the sentences with the highest weights as key sentences. The latter first analyzes the semantic structure of the article to find its topic structure, and then extracts sentences from each topic to form the key sentences. However, statistical screening based on structure or weight tends to ignore the content of the sentence itself in practice: key sentences that are scattered through the text but contain subject-word content are filtered out, and the redundancy of the result is high. In natural language processing, widely used deep learning algorithms focus on the content of the sentence itself and learn sample features automatically by simulating the neural network structure of the human brain, so as to screen out sentences containing key information in preparation for further analysis. So far, however, such algorithms have been limited to analyzing isolated sentences; the constraints and influence that the context between sentences imposes on a given sentence have not been studied systematically.

Summary of the invention

In view of the deficiencies and defects existing in the prior art, embodiments of the present disclosure provide a method and device for screening key sentences in medical literature based on deep learning.

In a first aspect, an embodiment of the present disclosure proposes a method for screening key sentences in medical literature based on deep learning, including:

S1. Splitting the medical document to be processed into sentences, segmenting each sentence into words, and generating a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed;

S2. Inputting the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.

In a second aspect, an embodiment of the present disclosure proposes a device for screening key sentences in medical literature based on deep learning, including:

A generating unit, configured to split the medical document to be processed into sentences, segment each sentence into words, and generate a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed;

An input unit, configured to input the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;

Wherein, the processor and the memory communicate with each other through the bus;

The above method is implemented when the processor executes the computer program.

According to a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium, where a computer program is stored on the storage medium, and the computer program is executed by a processor to implement the above method.

The method and device for screening key sentences in medical literature based on deep learning provided by embodiments of the present disclosure use a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can combine contextual semantics and capture the local correlations of the document, this solution can improve the accuracy of screening key sentences in medical literature compared with the prior art.

Brief description of the drawings

FIG. 1 is a schematic flowchart of an embodiment of a method for screening key sentences in medical literature based on deep learning of the present disclosure;

FIG. 2 is a schematic structural diagram of an embodiment of a device for screening key sentences in medical literature based on deep learning of the present disclosure;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly below in conjunction with the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the embodiments of the present disclosure.

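Referring to FIG. 1, this embodiment discloses a method for screening key sentences in medical literature based on deep learning, including:

S1. Splitting the medical document to be processed into sentences, segmenting each sentence into words, and generating a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed;

S2. Inputting the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.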

The method for screening key sentences in medical literature based on deep learning provided by an embodiment of the present disclosure uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can combine contextual semantics and capture the local correlations of the document, this solution can improve the accuracy of screening key sentences in medical literature compared with the prior art.

On the basis of the foregoing method embodiment, the convolutional layer of the convolutional neural network model uses filter windows of multiple widths, each window corresponding to a first number of filters; the input to each convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.

In this embodiment, before the convolutional neural network model is used for key sentence screening, the model must be constructed and trained on training data. Specifically, the convolutional layer of the model uses multiple filter windows with widths of 3, 4, and 5, each window corresponding to 100 filters, and slides each window across the sentence so that every word is traversed. After the convolution, each filter yields a feature map, computed as follows:

$$C_i = f(w \cdot x_{i:i+h-1} + b),$$

where $x_{i:i+h-1}$ denotes the vector obtained by concatenating the word vectors in the window of size $h$ starting at the $i$-th word $x_i$, $w$ is the filter matrix corresponding to this window, $b$ is the bias term, $f$ is a nonlinear function, and $C_i$ is the new feature produced. Then, corresponding to $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$, the feature map can be expressed as:

$$C = [C_1, C_2, \ldots, C_{n-h+1}].$$
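The feature-map formula above can be illustrated with a minimal pure-Python sketch. It assumes toy 2-dimensional word vectors, a single filter of width h = 3, and tanh as the nonlinear function f; none of these values come from the disclosure, and the names `feature_map`, `w`, and `b` are illustrative only.

```python
import math

def feature_map(word_vectors, w, b, h, f=math.tanh):
    """Compute C = [C_1, ..., C_{n-h+1}] with C_i = f(w . x_{i:i+h-1} + b)."""
    n = len(word_vectors)
    features = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors in the window starting at position i.
        window = [value for vec in word_vectors[i:i + h] for value in vec]
        # Dot product of the flattened filter with the concatenated window.
        activation = sum(wi * xi for wi, xi in zip(w, window)) + b
        features.append(f(activation))
    return features

# Toy sentence of n = 5 words, each a 2-dimensional vector (assumed values).
sentence = [[0.1, 0.2], [0.0, 0.5], [0.3, 0.1], [0.4, 0.4], [0.2, 0.0]]
w = [0.5, -0.5, 0.25, 0.25, 1.0, -1.0]  # one filter, flattened to h * d = 6 weights
C = feature_map(sentence, w, b=0.0, h=3)
print(len(C))  # n - h + 1 = 3
```

In the embodiment, this computation would be repeated for each of the 100 filters of each window width, giving one feature map per filter.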

The pooling layer of the model uses max-over-time pooling: for the feature map generated by each filter in each filter window, the maximum value is taken as its representative feature. In this way, the features from sliding windows of different sizes are reduced to a fixed length and concatenated into a feature vector of length 3 × 100 = 300. The last layer of the model is a fully connected softmax layer, which outputs the probability of each category. The optimal model parameters are found through multiple rounds of iterative training and parameter tuning.
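As a rough sketch of the pooling and output stages, the following pure-Python example reduces feature maps of different lengths to one value per filter, concatenates them, and applies a fully connected softmax layer. The feature-map values, the sentence length n = 10, the two output classes, and the weights are placeholders chosen for illustration, not values from the disclosure.

```python
import math

def max_over_time(feature_maps):
    """Keep the maximum of each filter's feature map (one value per filter)."""
    return [max(fm) for fm in feature_maps]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Feature maps from filters of widths 3, 4 and 5 have different lengths
# (n - h + 1), but pooling reduces each one to a single number.
n = 10
feature_maps = {
    3: [[0.1 * i for i in range(n - 3 + 1)] for _ in range(100)],
    4: [[0.2 * i for i in range(n - 4 + 1)] for _ in range(100)],
    5: [[0.3 * i for i in range(n - 5 + 1)] for _ in range(100)],
}
pooled = []
for width in (3, 4, 5):
    pooled.extend(max_over_time(feature_maps[width]))
print(len(pooled))  # 3 * 100 = 300

# Fully connected layer with two outputs (non-key, key); placeholder weights.
weights = [[0.0] * len(pooled), [0.01] * len(pooled)]
logits = [sum(wi * pi for wi, pi in zip(row, pooled)) for row in weights]
probs = softmax(logits)
```

Because every filter contributes exactly one pooled value, the length of the concatenated vector depends only on the number of filters, not on the sentence length.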

It should be noted that, when training the model, the target output must be constructed for the training data. Specifically, each sentence is checked against the PICO index matrix. If a sentence contains none of the matrix elements, it contains no key information worth studying in the relevant field, so its target value is set to 0. If a sentence contains one or more elements of the matrix, it may contain important information; to avoid missing key information, it needs to be screened out for subsequent in-depth study, so its target value is set to 1.
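The labeling rule above can be sketched as follows. The disclosure does not list the contents of the PICO index matrix, so the terms and the row layout below are hypothetical, as is the helper name `build_target`.

```python
def build_target(tokens, pico_terms):
    """Return 1 if the sentence contains any PICO index term, else 0."""
    return 1 if any(token in pico_terms for token in tokens) else 0

# Hypothetical PICO index matrix, flattened into a set for membership tests.
pico_matrix = [
    ["acute lymphocytic leukemia"],               # population (assumed)
    ["methotrexate"],                             # intervention (assumed)
    ["toxic side effects", "gene polymorphism"],  # outcome (assumed)
]
pico_terms = {term for row in pico_matrix for term in row}

sentences = [
    ["objective", "evaluate", "methotrexate", "toxic side effects"],
    ["methods", "computer", "search", "database"],
]
targets = [build_target(s, pico_terms) for s in sentences]
print(targets)  # [1, 0]
```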

On the basis of the foregoing method embodiment, splitting the medical document to be processed into sentences and segmenting the sentences into words includes:

Splitting the medical document to be processed into sentences according to punctuation marks, and segmenting the sentences into words based on a word segmentation algorithm and a medical lexicon.

In this embodiment, the word segmentation process is exemplified as follows:

Take the following example text: "Objective To evaluate the correlation of methylenetetrahydrofolate reductase gene polymorphism in the side effects of methotrexate in the treatment of acute lymphocytic leukemia. Methods Relevant databases at home and abroad were searched by computer: EMBASE, CNKI, Weipu Chinese scientific journal database and Wanfang database, ..." First, the text is split into sentences according to punctuation marks. The result is:

(1) Objective To evaluate the correlation of methylenetetrahydrofolate reductase gene polymorphism in the side effects of methotrexate in the treatment of acute lymphocytic leukemia;

(2) Methods Relevant databases at home and abroad were searched by computer: EMBASE, CNKI, Weipu Chinese scientific journal database and Wanfang database.

The sentences are then segmented into words using the word segmentation algorithm, and the result is:

1) ['Purpose', 'Evaluation', 'Sub', 'Methyl', 'Tetrahydrofolate', 'Reductase', 'Gene', 'Polymorphism', 'In', 'Methotrexate', 'Treatment', 'Acute', 'Lymph', 'Cell', 'Leukemia', 'Process', 'Medium', 'Poison Side', 'Reaction', 'The', 'Relevance'];

2) ['Method', 'Via', 'Computer', 'Retrieve', 'Home and Abroad', 'Related', 'Database', 'EMBASE', 'CNKI', 'Vipu', 'Chinese', 'Technology', 'Journal', 'Database', 'and', 'Wanfang', 'Database'].

Finally, some of the words are merged using the medical lexicon. For the word list 1) of the first sentence (1), "Sub", "Methyl", "Tetrahydrofolate", and "Reductase" need to be merged into the complete medical term "methylenetetrahydrofolate reductase", "Lymph" and "Cell" into the complete medical term "lymphocyte", and "Poison Side" and "Reaction" into the complete medical term "toxic side effects". The merged result is:

a) ['Purpose', 'Evaluation', 'Methylenetetrahydrofolate reductase', 'Gene', 'Polymorphism', 'In', 'Methotrexate', 'Treatment', 'Acute', 'Lymphocyte', 'Leukemia', 'Process', 'Medium', 'Toxic Side Effects', 'The', 'Relevance'];

b) ['Method', 'Via', 'Computer', 'Retrieve', 'Home and Abroad', 'Related', 'Database', 'EMBASE', 'CNKI', 'Vipu', 'Chinese', 'Technology', 'Journal', 'Database', 'and', 'Wanfang', 'Database'].
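The sentence splitting and lexicon-based merging illustrated above can be sketched in Python as follows. The word segmentation algorithm itself is not named in the disclosure and is omitted here; the punctuation set, the greedy longest-match strategy, the `max_span` limit, and the lexicon format (tuples of adjacent tokens mapped to a complete term) are all assumptions made for this illustration.

```python
import re

def split_sentences(text):
    """Split text into sentences at sentence-ending punctuation marks."""
    parts = re.split(r"[。！？；.!?;]", text)
    return [part.strip() for part in parts if part.strip()]

def merge_with_lexicon(tokens, lexicon, max_span=4):
    """Greedily merge adjacent tokens that together form a lexicon term."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest span first so that longer terms win.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + span])
            if key in lexicon:
                merged.append(lexicon[key])
                i += span
                break
        else:
            # No lexicon term starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical lexicon: tuples of adjacent tokens -> complete medical term.
medical_lexicon = {
    ("Sub", "Methyl", "Tetrahydrofolate", "Reductase"): "Methylenetetrahydrofolate reductase",
    ("Lymph", "Cell"): "Lymphocyte",
    ("Poison Side", "Reaction"): "Toxic Side Effects",
}

tokens = ["Purpose", "Evaluation", "Sub", "Methyl", "Tetrahydrofolate", "Reductase",
          "Gene", "Polymorphism", "In", "Methotrexate", "Treatment", "Acute",
          "Lymph", "Cell", "Leukemia", "Process", "Medium", "Poison Side",
          "Reaction", "The", "Relevance"]
merged = merge_with_lexicon(tokens, medical_lexicon)
print(merged)
```

Note that splitting on every period is a simplification: it would also split decimal numbers and abbreviations, which a production sentence splitter would need to handle.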

On the basis of the foregoing method embodiment, generating the word vector of each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed includes:

The words of each sentence are identifier-encoded according to the order in which they appear in the medical document to be processed, and each identifier-encoded sentence is then extended by zero-padding so that the number of elements in the padded sentence equals the number of words in the longest sentence; the zero-padded sentence is used as the word vector of the corresponding sentence.

In this embodiment, when generating the word vectors of the sentences, the words of each sentence are first identifier-encoded (id encoding) according to the order in which they appear in the document. The encoding starts at 1 and ends at the vocabulary size of the document. The largest number of words in any sentence is then recorded as max_sentence_len, and each id-encoded sentence is padded with 0s until its length reaches max_sentence_len. The result is the word vector of the sentence, in which the number of 0s equals max_sentence_len minus the number of words in the sentence.
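A minimal sketch of the id encoding and zero-padding described above is given below. It assumes that a repeated word reuses the id it received on first appearance, which is consistent with the encoding ending at the vocabulary size; the function name `encode_and_pad` and the sample sentences are illustrative.

```python
def encode_and_pad(sentences):
    """Assign ids in order of first appearance (starting at 1) and zero-pad."""
    vocab = {}
    encoded = []
    for tokens in sentences:
        ids = []
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # ids run from 1 to vocabulary size
            ids.append(vocab[token])
        encoded.append(ids)
    # Length of the longest sentence, measured in words.
    max_sentence_len = max(len(ids) for ids in encoded)
    # Append max_sentence_len - len(ids) zeros to each sentence.
    padded = [ids + [0] * (max_sentence_len - len(ids)) for ids in encoded]
    return padded, vocab, max_sentence_len

sentences = [
    ["Method", "Via", "Computer", "Retrieve", "Database"],
    ["Database", "EMBASE", "CNKI"],
]
padded, vocab, max_sentence_len = encode_and_pad(sentences)
print(padded)  # [[1, 2, 3, 4, 5], [5, 6, 7, 0, 0]]
```

Reserving 0 for padding keeps it distinct from every real word id, since the encoding starts at 1.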

Referring to FIG. 2, this embodiment discloses a device for screening key sentences in medical literature based on deep learning, including:

A generating unit 1, configured to split the medical document to be processed into sentences, segment each sentence into words, and generate a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed;

An input unit 2, configured to input the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.

Specifically, the generating unit 1 splits the medical document to be processed into sentences, segments each sentence into words, and generates a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed; the input unit 2 inputs the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.

The device for screening key sentences in medical literature based on deep learning provided by the embodiments of the present disclosure uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can combine contextual semantics and capture the local correlations of the document, this solution can improve the accuracy of screening key sentences in medical literature compared with the prior art.

On the basis of the foregoing device embodiment, the convolutional layer of the convolutional neural network model uses filter windows of multiple widths, each window corresponding to a first number of filters; the input to each convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.

On the basis of the foregoing device embodiment, the generating unit is specifically configured to:

Identifier-encode the words of each sentence according to the order in which they appear in the medical document to be processed, and extend each identifier-encoded sentence by zero-padding so that the number of elements in the padded sentence equals the number of words in the longest sentence; the zero-padded sentence is used as the word vector of the corresponding sentence.

The device for screening key sentences in medical literature based on deep learning of this embodiment may be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.

FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 3, the electronic device may include: a processor 11, a memory 12, a bus 13, and a computer program stored on the memory 12 and executable on the processor 11;

Wherein, the processor 11 and the memory 12 communicate with each other through the bus 13;

When the processor 11 executes the computer program, the method provided by each of the above method embodiments is implemented, for example, including: splitting the medical document to be processed into sentences, segmenting each sentence into words, and generating a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed; and inputting the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.

Embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method provided by the foregoing method embodiments is implemented, for example, including: splitting the medical document to be processed into sentences, segmenting each sentence into words, and generating a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed; and inputting the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.

Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element. Terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings; they are used only to facilitate and simplify the description of the present disclosure, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present disclosure. Unless otherwise clearly specified and defined, the terms "installation", "connected", and "connection" should be understood in a broad sense: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal between two components. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present disclosure according to the specific situation.

The specification of the present disclosure sets out many specific details. However, it can be understood that the embodiments of the present disclosure can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed disclosure requires more features than those expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Therefore, the claims following a specific embodiment are hereby expressly incorporated into that specific embodiment, with each claim itself serving as a separate embodiment of the present disclosure. It should be noted that the embodiments in the present application and the features in the embodiments can be combined with each other where there is no conflict. The present disclosure is not limited to any single aspect, nor to any single embodiment, nor to any combination and/or substitution of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present disclosure may be used alone or in combination with one or more other aspects and/or embodiments thereof.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure, and they should all be covered by the scope of the claims and the description of the present disclosure.

Claims

1. A method for screening key sentences in medical literature based on deep learning, characterized by comprising:

S1. Splitting the medical document to be processed into sentences, segmenting each sentence into words, and generating a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed;

S2. Inputting the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
2. The method according to claim 1, characterized in that the convolutional layer of the convolutional neural network model uses filter windows of multiple widths, each window corresponding to a first number of filters; the input to each convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.
3. The method according to claim 2, characterized in that splitting the medical document to be processed into sentences and segmenting each sentence into words comprises:

Splitting the medical document to be processed into sentences according to punctuation marks, and segmenting the sentences into words based on a word segmentation algorithm and a medical lexicon.
4. The method according to claim 3, characterized in that generating the word vector of each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed comprises:

Identifier-encoding the words of each sentence according to the order in which they appear in the medical document to be processed, and extending each identifier-encoded sentence by zero-padding so that the number of elements in the padded sentence equals the number of words in the longest sentence, the zero-padded sentence being used as the word vector of the corresponding sentence.
5. A device for screening key sentences in medical literature based on deep learning, characterized by comprising:

A generating unit, configured to split the medical document to be processed into sentences, segment each sentence into words, and generate a word vector for each sentence by identifier-encoding the words in the order in which they appear in the medical document to be processed;

An input unit, configured to input the word vectors of the sentences into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
6. The device according to claim 5, characterized in that the convolutional layer of the convolutional neural network model uses filter windows of multiple widths, each window corresponding to a first number of filters; the input to each convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.
7. The device according to claim 6, wherein the generating unit is specifically configured to:

Split the medical document to be processed into sentences according to punctuation marks, and segment the sentences into words based on a word segmentation algorithm and a medical lexicon.
8. The device according to claim 7, wherein the generating unit is specifically configured to:

Identifier-encode the words of each sentence according to the order in which they appear in the medical document to be processed, and extend each identifier-encoded sentence by zero-padding so that the number of elements in the padded sentence equals the number of words in the longest sentence, the zero-padded sentence being used as the word vector of the corresponding sentence.
9. An electronic device, characterized in that it comprises: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;

Wherein, the processor and the memory communicate with each other through the bus;

When the processor executes the computer program, the method according to any one of claims 1-4 is implemented.
10. A non-transitory computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1-4 is implemented.