WO2020074023A1 - Deep learning-based method and device for screening for key sentences in medical document - Google Patents

Info

Publication number
WO2020074023A1
Authority
WO
WIPO (PCT)
Prior art keywords
processed
medical
clauses
segmentation
word
Prior art date
Application number
PCT/CN2019/124561
Other languages
French (fr)
Chinese (zh)
Inventor
赵荣生
宋再伟
黄振城
王则远
周旻
Original Assignee
北京大学第三医院
北京诺道认知医学科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京大学第三医院, 北京诺道认知医学科技有限公司
Publication of WO2020074023A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • Embodiments of the present disclosure relate to the field of computers, and in particular to a method and device for screening key sentences in medical literature based on deep learning.
  • The main content of a text is often carried by a set of important key sentences. These key sentences clearly express the content characteristics of the text (such as domain category, theme, and central meaning).
  • Based on this understanding, in fields such as information retrieval, information extraction, and knowledge extraction, identifying and screening the key sentences that represent the main content of a text is a very important step. It matters greatly for revealing subject literature and for surfacing the knowledge implicit in the text.
  • Key sentence screening, put simply, uses computer technology to identify and extract sentences containing useful information according to a given purpose, condensing the text so that rich information is obtained from a small amount of data.
  • Traditional key sentence screening methods are generally statistical: they use statistics such as position and frequency to find the sentences that best represent the subject of the article and take them as the key sentences.
  • Depending on how the structure of the article is treated, these methods can be divided into unstructured screening analysis and structured screening analysis.
  • The former computes a weight for each sentence of the article and takes the highest-ranked sentences as the key sentences.
  • The latter first analyzes the semantic structure of the article to find its topic structure, and then extracts sentences from each topic to form the key sentences.
  • However, statistical methods that screen by structure or weight tend in practice to ignore the content of the sentence itself: they filter out key sentences that occur sparsely in the text yet contain subject-term content, and their results are highly redundant.
  • Deep learning algorithms, which are widely used in natural language processing, focus on the content of the sentence itself and learn sample features automatically by simulating the neural network structure of the human brain, so as to screen out key sentences containing key information in preparation for further analysis.
  • However, such algorithms have so far been limited to analyzing isolated sentences.
  • How the contextual relationships between sentences constrain and influence a given sentence has not yet been studied systematically.
  • In view of the deficiencies and defects of the prior art, embodiments of the present disclosure provide a method and device for screening key sentences in medical literature based on deep learning.
  • In one aspect, an embodiment of the present disclosure proposes a method for screening key sentences in medical literature based on deep learning, including:
  • In another aspect, an embodiment of the present disclosure proposes a device for screening key sentences in medical literature based on deep learning, including:
  • a generating unit, configured to split the medical document to be processed into clauses, segment the clauses into tokens, and generate the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • an input unit, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;
  • wherein the processor and the memory communicate with each other through the bus;
  • In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program implementing the above method when executed by a processor.
  • The method and device for screening key sentences in medical literature based on deep learning provided by the embodiments of the present disclosure use a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can take contextual semantics into account and capture the local correlations of the document, this solution can improve the accuracy of key sentence screening in medical literature compared with the prior art.
  • FIG. 1 is a schematic flowchart of an embodiment of the deep learning-based method for screening key sentences in medical literature of the present disclosure;
  • FIG. 2 is a schematic structural diagram of an embodiment of the deep learning-based device for screening key sentences in medical literature of the present disclosure;
  • FIG. 3 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
  • Referring to FIG. 1, this embodiment discloses a method for screening key sentences in medical literature based on deep learning, including:
  • The method for screening key sentences in medical literature based on deep learning uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can take contextual semantics into account and capture the local correlations of the document, this solution can improve the accuracy of key sentence screening in medical literature compared with the prior art.
  • On the basis of the foregoing method embodiment, the convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponds to a first number of filters, and the argument of the convolution operation is the vector obtained by concatenating the token vectors within the corresponding filter window;
  • the pooling layer of the convolutional neural network model uses max-over-time pooling.
  • Specifically, the convolutional layer of the model uses multiple filter windows with widths of 3, 4, and 5, each window corresponds to 100 filters, and the windows slide over every token in the clause.
  • After the convolution is computed, each filter yields a feature map.
  • The feature map is computed as follows:
  • The pooling layer of the model uses max-over-time pooling: for the feature map produced by each filter of each filter window, the maximum value is taken as its representative feature. In this way, the features from different sliding-window sizes have a fixed length and are concatenated into a feature vector of length 3 × 100.
  • The last layer of the model is a fully connected softmax layer, which outputs the probability of each class. The optimal model parameters are found through multiple rounds of iterative training and parameter tuning.
  • When training the model, the model output (target value) must be constructed for the training data.
  • The specific method is as follows. According to the PICO index matrix, if a clause does not contain any element of the matrix, the clause contains no key information worth studying in the relevant field, so its target value is set to 0. If the clause contains one or more elements of the matrix, the clause may contain important information; to avoid missing key information, it must be screened out for subsequent in-depth study, so its target value is set to 1.
  • Splitting the medical document to be processed into clauses and segmenting the clauses into tokens includes:
  • splitting the medical document to be processed into clauses according to punctuation marks, and segmenting the clauses into tokens based on a word segmentation algorithm and a medical lexicon.
  • Generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed includes:
  • id-encoding the tokens of the clauses, and zero-padding the id-encoded clauses so that every padded clause has as many elements as the longest clause has tokens; the zero-padded clauses are used as the word vectors of the corresponding clauses.
  • The tokens of the clauses are id-encoded according to the order in which they appear in the document.
  • The starting value of the encoding is 1 and the ending value is the vocabulary size of the document.
  • Let max_sentence_len be the largest number of tokens in any clause. Each id-encoded clause is then padded with 0s until its length reaches max_sentence_len, which gives the word vector of the clause; the number of 0s in the word vector equals max_sentence_len minus the number of tokens in the clause.
  • Referring to FIG. 2, this embodiment discloses a device for screening key sentences in medical literature based on deep learning, including:
  • a generating unit 1, configured to split the medical document to be processed into clauses, segment the clauses into tokens, and generate the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • an input unit 2, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • Specifically, the generating unit 1 splits the medical document to be processed into clauses, segments the clauses into tokens, and generates the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • the input unit 2 inputs the word vectors of the clauses into the pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • The device for screening key sentences in medical literature based on deep learning uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can take contextual semantics into account and capture the local correlations of the document, this solution can improve the accuracy of key sentence screening in medical literature compared with the prior art.
  • On the basis of the foregoing device embodiment, the convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponds to a first number of filters, and the argument of the convolution operation is the vector obtained by concatenating the token vectors within the corresponding filter window;
  • the pooling layer of the convolutional neural network model uses max-over-time pooling.
  • The generating unit is specifically configured to:
  • split the medical document to be processed into clauses according to punctuation marks, and segment the clauses into tokens based on a word segmentation algorithm and a medical lexicon.
  • The generating unit is specifically configured to:
  • id-encode the tokens of the clauses, and zero-pad the id-encoded clauses so that every padded clause has as many elements as the longest clause has tokens; the zero-padded clauses are used as the word vectors of the corresponding clauses.
  • The device for screening key sentences in medical literature based on deep learning of this embodiment may be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
  • FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
  • The electronic device may include: a processor 11, a memory 12, a bus 13, and a computer program stored on the memory 12 and executable on the processor 11;
  • wherein the processor 11 and the memory 12 communicate with each other through the bus 13;
  • When the processor 11 executes the computer program, the method provided by each of the above method embodiments is implemented, for example including: splitting a medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; and inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • Embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the method provided by the foregoing method embodiments is implemented, for example including: splitting a medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; and inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operating steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • The terms "installation", "connected", and "connection" should be understood in a broad sense: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two components.

Abstract

A deep learning-based method and device for screening for key sentences in a medical document, capable of enhancing the accuracy of key sentence screening in medical documents. The method comprises: S1, performing sentence segmentation on a medical document to be processed, performing word segmentation on the component sentences, labeling and encoding the component words according to the order of appearance of the component words in the medical document to be processed, and generating word vectors for the component sentences; S2, inputting the word vectors of the component sentences into a pre-trained deep learning-based convolutional neural network model, and obtaining the key sentences in the medical document to be processed.

Description

Method and device for screening key sentences in medical literature based on deep learning
Cross-reference to related applications
This application claims priority to Chinese patent application No. 2018111880412, filed on October 12, 2018 and entitled "Method and Device for Screening Key Sentences in Medical Literature Based on Deep Learning", which is incorporated into the present disclosure by reference in its entirety.
Technical field
Embodiments of the present disclosure relate to the field of computers, and in particular to a method and device for screening key sentences in medical literature based on deep learning.
Background
The main content of a text is often carried by a set of important key sentences, which clearly express the content characteristics of the text (such as domain category, theme, and central meaning). Based on this understanding, in fields such as information retrieval, information extraction, and knowledge extraction, identifying and screening the key sentences that represent the main content of a text is a very important step, and it matters greatly for revealing subject literature and for surfacing the knowledge implicit in the text. Key sentence screening, put simply, uses computer technology to identify and extract sentences containing useful information according to a given purpose, condensing the text so that rich information is obtained from a small amount of data.
Traditional key sentence screening methods are generally statistical: they use statistics such as position and frequency to find the sentences that best represent the subject of the article and take them as the key sentences. Depending on how the structure of the article is treated, they can be divided into unstructured screening analysis and structured screening analysis. The former computes a weight for each sentence of the article and takes the highest-ranked sentences as the key sentences. The latter first analyzes the semantic structure of the article to find its topic structure, and then extracts sentences from each topic to form the key sentences. However, statistical methods that screen by structure or weight tend in practice to ignore the content of the sentence itself: they filter out key sentences that occur sparsely in the text yet contain subject-term content, and their results are highly redundant. Deep learning algorithms, which are widely used in natural language processing, focus on the content of the sentence itself and learn sample features automatically by simulating the neural network structure of the human brain, so as to screen out key sentences containing key information in preparation for further analysis. However, such algorithms have so far been limited to analyzing isolated sentences; how the contextual relationships between sentences constrain and influence a given sentence has not yet been studied systematically.
Summary of the invention
In view of the deficiencies and defects of the prior art, embodiments of the present disclosure provide a method and device for screening key sentences in medical literature based on deep learning.
In one aspect, an embodiment of the present disclosure proposes a method for screening key sentences in medical literature based on deep learning, including:
S1. splitting the medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
S2. inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
In another aspect, an embodiment of the present disclosure proposes a device for screening key sentences in medical literature based on deep learning, including:
a generating unit, configured to split the medical document to be processed into clauses, segment the clauses into tokens, and generate the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
an input unit, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;
wherein the processor and the memory communicate with each other through the bus;
the processor implements the above method when executing the computer program.
In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program implementing the above method when executed by a processor.
The method and device for screening key sentences in medical literature based on deep learning provided by the embodiments of the present disclosure use a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can take contextual semantics into account and capture the local correlations of the document, this solution can improve the accuracy of key sentence screening in medical literature compared with the prior art.
Brief description of the drawings
FIG. 1 is a schematic flowchart of an embodiment of the deep learning-based method for screening key sentences in medical literature of the present disclosure;
FIG. 2 is a schematic structural diagram of an embodiment of the deep learning-based device for screening key sentences in medical literature of the present disclosure;
FIG. 3 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present disclosure without creative effort fall within the protection scope of the embodiments of the present disclosure.
Referring to FIG. 1, this embodiment discloses a method for screening key sentences in medical literature based on deep learning, including:
S1. splitting the medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
S2. inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
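The id encoding in step S1 can be illustrated with a minimal sketch. It follows what this document states about the encoding (ids are assigned in order of appearance, start at 1, end at the vocabulary size, and clauses are zero-padded to the length of the longest clause). The function name `encode_clauses` and the choice to give repeated tokens the same id are assumptions made for illustration, not details confirmed by the source.

```python
from typing import Dict, List


def encode_clauses(clauses: List[List[str]]) -> List[List[int]]:
    """Id-encode tokens in order of first appearance, then zero-pad each clause."""
    vocab: Dict[str, int] = {}
    encoded: List[List[int]] = []
    for tokens in clauses:
        ids = []
        for tok in tokens:
            if tok not in vocab:
                # Ids start at 1; 0 is reserved for padding.
                # Assumption: a repeated token reuses its first id.
                vocab[tok] = len(vocab) + 1
            ids.append(vocab[tok])
        encoded.append(ids)
    # max_sentence_len is the largest number of tokens in any clause.
    max_sentence_len = max((len(ids) for ids in encoded), default=0)
    return [ids + [0] * (max_sentence_len - len(ids)) for ids in encoded]


if __name__ == "__main__":
    print(encode_clauses([["目的", "评价", "相关性"], ["方法", "检索"]]))
```

Each returned row has the same length, so the rows can be stacked into a fixed-size input for the model in step S2.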
The method for screening key sentences in medical literature based on deep learning provided by the embodiment of the present disclosure uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can take contextual semantics into account and capture the local correlations of the document, this solution can improve the accuracy of key sentence screening in medical literature compared with the prior art.
On the basis of the foregoing method embodiment, the convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponds to a first number of filters, and the argument of the convolution operation is the vector obtained by concatenating the token vectors within the corresponding filter window; the pooling layer of the convolutional neural network model uses max-over-time pooling.
In this embodiment, before the convolutional neural network model is used for key sentence screening, the model must be constructed and then trained with training data. Specifically, the convolutional layer of the model uses multiple filter windows with widths of 3, 4, and 5, and each window corresponds to 100 filters. The windows slide over every token in the clause, and after the convolution is computed each filter yields a feature map. The feature map is computed as follows:
$$C_i = f(w \cdot x_{i:i+h-1} + b),$$
where $x_{i:i+h-1}$ denotes the vector formed by concatenating the word vectors in a window of size $h$ starting from the $i$-th token $x_i$, $w$ is a filter matrix corresponding to this window, $b$ is the bias term, $f$ is a nonlinear function, and $C_i$ is the new feature produced. Then, corresponding to $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$, the feature map can be expressed as:
$$C = [C_1, C_2, \ldots, C_{n-h+1}].$$
The pooling layer of the model uses max-over-time pooling: for the feature map produced by each filter of each filter window, the maximum value is taken as its representative feature. In this way, the features from different sliding-window sizes have a fixed length and are concatenated into a feature vector of length 3 × 100 = 300. The last layer of the model is a fully connected softmax layer, which outputs the probability of each class. The optimal model parameters are found through multiple rounds of iterative training and parameter tuning.
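The forward pass described above (convolution over windows of widths 3, 4, and 5 with 100 filters each, max-over-time pooling, and a softmax output) can be sketched in plain Python. This is only an illustration under stated assumptions: the embedding dimension `k`, the use of ReLU for the nonlinear function f, the random weights, and the two output classes are all hypothetical choices, and each token is assumed to be already mapped to a k-dimensional vector.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Filter = Tuple[int, Vector, float]  # (window width h, weights w, bias b)


def conv_feature_map(x: List[Vector], w: Vector, b: float, h: int) -> Vector:
    """Compute C_i = f(w . x_{i:i+h-1} + b) for every window of h token vectors."""
    feats = []
    for i in range(len(x) - h + 1):
        window = [v for vec in x[i:i + h] for v in vec]  # concatenate h vectors
        z = sum(wi * xi for wi, xi in zip(w, window)) + b
        feats.append(max(0.0, z))  # assumption: f is ReLU
    return feats


def pooled_features(x: List[Vector], filters: List[Filter]) -> Vector:
    """Max-over-time pooling: one maximum per filter, concatenated.

    Assumes the clause has at least as many tokens as the widest window.
    """
    return [max(conv_feature_map(x, w, b, h)) for (h, w, b) in filters]


def softmax(z: Vector) -> Vector:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


random.seed(0)
k, n = 8, 10  # hypothetical embedding size and clause length
x = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
filters = [(h, [random.uniform(-0.1, 0.1) for _ in range(h * k)], 0.0)
           for h in (3, 4, 5) for _ in range(100)]
features = pooled_features(x, filters)  # length 3 * 100
dense = [[random.uniform(-0.1, 0.1) for _ in features] for _ in range(2)]
probs = softmax([sum(w * f for w, f in zip(row, features)) for row in dense])
```

Because pooling keeps a single value per filter, the feature vector has the same length regardless of the clause length or window width, which is what allows the three window sizes to be concatenated.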
It should be noted that, when training the model, the model output (target value) must be constructed for the training data. The specific method is as follows. According to the PICO index matrix, if a clause does not contain any element of the matrix, the clause contains no key information worth studying in the relevant field, so its target value is set to 0. If the clause contains one or more elements of the matrix, the clause may contain important information; to avoid missing key information, it must be screened out for subsequent in-depth study, so its target value is set to 1.
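The labeling rule above can be written as a short sketch. The source does not specify the layout of the PICO index matrix or how a clause is matched against it, so this example assumes the matrix has been flattened into a collection of terms and that a clause "contains" an element when one of its tokens equals that term exactly. The function name `build_targets` and the sample terms are hypothetical.

```python
from typing import Iterable, List


def build_targets(clauses: List[List[str]], pico_terms: Iterable[str]) -> List[int]:
    """Return 1 for a clause containing any PICO term, otherwise 0."""
    terms = set(pico_terms)  # assumption: matrix elements flattened to terms
    return [1 if any(tok in terms for tok in tokens) else 0 for tokens in clauses]


if __name__ == "__main__":
    sample_terms = ["甲氨喋呤", "白血病"]  # hypothetical matrix elements
    sample_clauses = [["甲氨喋呤", "治疗", "白血病"], ["通过", "计算机", "检索"]]
    print(build_targets(sample_clauses, sample_terms))
```

Setting the target to 1 whenever any element matches favors recall, consistent with the stated aim of not missing key information.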
在前述方法实施例的基础上,所述对待处理的医学文献进行分句,对分句进行分词,包括:On the basis of the foregoing method embodiments, the medical documents to be processed are segmented, and the segmentation is segmented, including:
依据标点符号对所述待处理的医学文献进行分句,基于分词算法与医学词库对分句进行分词。The medical documents to be processed are segmented according to punctuation marks, and the segmentation is segmented based on the segmentation algorithm and the medical lexicon.
In this embodiment, the word segmentation process is illustrated as follows:
Take the example text: 目的评价亚甲基四氢叶酸还原酶基因多态性在甲氨喋呤治疗急性淋巴细胞白血病过程中毒副反应的相关性。方法通过计算机检索国内外相关数据库:EMBASE,CNKI,维普中文科技期刊数据库以及万方数据库,… (in English: "Objective: to evaluate the correlation of methylenetetrahydrofolate reductase gene polymorphism with toxic side effects during methotrexate treatment of acute lymphocytic leukemia. Methods: relevant domestic and international databases were searched by computer: EMBASE, CNKI, the VIP Chinese Science and Technology Journal Database, and the Wanfang database, ..."). The text is first split into clauses according to punctuation marks. The result is:
(1) 目的评价亚甲基四氢叶酸还原酶基因多态性在甲氨喋呤治疗急性淋巴细胞白血病过程中毒副反应的相关性;
(2) 方法通过计算机检索国内外相关数据库:EMBASE,CNKI,维普中文科技期刊数据库以及万方数据库。
A word segmentation algorithm is then applied to each clause. The result is:
1) ['目的','评价','亚','甲基','四氢叶酸','还原酶','基因','多态性','在','甲氨喋呤','治疗','急性','淋巴','细胞','白血病','过程','中','毒副','反应','的','相关性'];
2) ['方法','通过','计算机','检索','国内外','相关','数据库','EMBASE','CNKI','维普','中文','科技','期刊','数据库','以及','万方','数据库'].
Finally, some of the segmented words are merged with reference to the medical lexicon. For segmentation 1) of clause (1), "亚", "甲基", "四氢叶酸" and "还原酶" are merged into the complete medical term "亚甲基四氢叶酸还原酶" (methylenetetrahydrofolate reductase); "淋巴" and "细胞" are merged into "淋巴细胞" (lymphocyte); and "毒副" and "反应" are merged into "毒副反应" (toxic side effects). The merged result is:
a) ['目的','评价','亚甲基四氢叶酸还原酶','基因','多态性','在','甲氨喋呤','治疗','急性','淋巴细胞','白血病','过程','中','毒副反应','的','相关性'];
b) ['方法','通过','计算机','检索','国内外','相关','数据库','EMBASE','CNKI','维普','中文','科技','期刊','数据库','以及','万方','数据库'].
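The lexicon-based merging in this example can be approximated with a greedy longest-match pass over adjacent words. This is only a sketch: the disclosure does not say how the merge is performed, and the function name, the `max_span` limit, and the three-entry lexicon are assumptions chosen to reproduce the example.

```python
def merge_with_lexicon(tokens, lexicon, max_span=4):
    """Greedily merge up to max_span adjacent tokens whose concatenation
    is a term in the lexicon, preferring the longest match."""
    merged = []
    i = 0
    while i < len(tokens):
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token term starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical medical lexicon containing only the terms from the example.
lexicon = {"亚甲基四氢叶酸还原酶", "淋巴细胞", "毒副反应"}
tokens = ['目的', '评价', '亚', '甲基', '四氢叶酸', '还原酶', '基因', '多态性',
          '在', '甲氨喋呤', '治疗', '急性', '淋巴', '细胞', '白血病', '过程',
          '中', '毒副', '反应', '的', '相关性']
merged = merge_with_lexicon(tokens, lexicon)
```

Applied to the word list of the first clause, this pass yields the merged list a) given in the example.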
On the basis of the foregoing method embodiment, generating the word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed includes:
The words of each clause are identifier-encoded in the order in which they appear in the medical document to be processed, and each encoded clause is zero-padded so that its number of elements equals the number of words in the longest clause; the zero-padded clause is used as the word vector of the corresponding clause.
In this embodiment, to generate the word vector of a clause, each word of the clause is first identifier-encoded (id-encoded) according to the order in which it appears in the document; the ids start at 1 and end at the vocabulary size of the document. The largest number of words in any clause is then recorded as max_sentence_len, and each id-encoded clause is padded with zeros until its length reaches max_sentence_len, which yields the word vector of the clause. The number of zeros in a word vector equals max_sentence_len minus the number of words in the clause.
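The id encoding and zero padding can be sketched as below. The name `max_sentence_len` comes from the text; everything else (the function name, and the choice to append the zeros at the end of each clause) is an assumption, since the disclosure does not say where the zeros are placed.

```python
def encode_clauses(clauses):
    """Assign each distinct word an id in order of first appearance
    (starting at 1), then pad every clause with trailing zeros."""
    vocab = {}
    encoded = []
    for words in clauses:
        ids = []
        for w in words:
            if w not in vocab:
                vocab[w] = len(vocab) + 1  # ids run from 1 to vocabulary size
            ids.append(vocab[w])
        encoded.append(ids)
    max_sentence_len = max((len(ids) for ids in encoded), default=0)
    padded = [ids + [0] * (max_sentence_len - len(ids)) for ids in encoded]
    return padded, vocab, max_sentence_len

padded, vocab, max_sentence_len = encode_clauses([
    ["方法", "通过", "检索", "数据库"],
    ["检索", "数据库"],
])
```

Here the second clause has two words, so it receives max_sentence_len - 2 zeros.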
Referring to FIG. 2, this embodiment discloses a deep learning-based device for screening key sentences in medical literature, including:
a generating unit 1, configured to split the medical document to be processed into clauses, segment the clauses into words, and generate the word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed;
an input unit 2, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
Specifically, the generating unit 1 splits the medical document to be processed into clauses, segments the clauses into words, and generates the word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed; the input unit 2 inputs the word vectors of the clauses into the pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
The deep learning-based device for screening key sentences in medical literature provided by the embodiments of the present disclosure uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can take contextual semantics into account and capture local correlations in the document, this solution can improve the accuracy of key-sentence screening in medical literature compared with the prior art.
On the basis of the foregoing device embodiment, the convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponding to a first number of filters; the input of the convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.
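To make the convolution input concrete, the sketch below slides one filter of window size h over a clause, concatenating the h word vectors inside the window at each position. The ReLU activation, the toy two-dimensional word vectors, and the weight values are assumptions for illustration only.

```python
def convolve_clause(word_vectors, weights, bias, h):
    """Slide a window of h words over the clause; at each position,
    concatenate the h word vectors and apply one filter."""
    n = len(word_vectors)
    features = []
    for i in range(n - h + 1):
        # Concatenate the word vectors inside the current window.
        window = [x for vec in word_vectors[i:i + h] for x in vec]
        z = sum(w * x for w, x in zip(weights, window)) + bias
        features.append(max(0.0, z))  # ReLU is an assumed activation
    return features

# Toy clause of n = 4 words with 2-dimensional word vectors, window h = 2.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
features = convolve_clause(word_vectors, weights=[1, 1, 1, 1], bias=0.0, h=2)
```

A clause of n words and a window of h words give n - h + 1 features per filter, which is the feature set that max-over-time pooling then reduces to a single value.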
On the basis of the foregoing device embodiment, the generating unit is specifically configured to:
split the medical document to be processed into clauses according to punctuation marks, and segment the clauses into words based on a word segmentation algorithm and a medical lexicon.
On the basis of the foregoing device embodiment, the generating unit is specifically configured to:
identifier-encode the words of each clause in the order in which they appear in the medical document to be processed, and zero-pad each encoded clause so that its number of elements equals the number of words in the longest clause, the zero-padded clause being used as the word vector of the corresponding clause.
The deep learning-based device for screening key sentences in medical literature of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments; its implementation principle and technical effects are similar and are not repeated here.
FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 3, the electronic device may include: a processor 11, a memory 12, a bus 13, and a computer program stored in the memory 12 and executable on the processor 11;
wherein the processor 11 and the memory 12 communicate with each other through the bus 13;
when executing the computer program, the processor 11 implements the method provided by the foregoing method embodiments, for example including: splitting the medical document to be processed into clauses, segmenting the clauses into words, and generating the word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed; and inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
An embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing method embodiments, for example including: splitting the medical document to be processed into clauses, segmenting the clauses into words, and generating the word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed; and inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. Orientations or positional relationships indicated by terms such as "upper" and "lower" are based on the orientations or positional relationships shown in the drawings; they are used only to facilitate and simplify the description of the present disclosure, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present disclosure. Unless otherwise expressly specified and limited, the terms "mounted", "coupled", and "connected" should be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present disclosure according to the specific circumstances.
Numerous specific details are set forth in the specification of the present disclosure. It will be understood, however, that embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification. Similarly, it should be understood that, in order to streamline the present disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present disclosure, various features of the present disclosure are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present disclosure. It should be noted that, where no conflict arises, the embodiments in the present application and the features in the embodiments may be combined with one another. The present disclosure is not limited to any single aspect or any single embodiment, nor to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present disclosure may be used alone or in combination with one or more other aspects and/or embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure, and they shall all be covered by the scope of the claims and specification of the present disclosure.

Claims (10)

  1. A deep learning-based method for screening key sentences in medical literature, characterized by including:
    S1. splitting a medical document to be processed into clauses, segmenting the clauses into words, and generating word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed;
    S2. inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  2. The method according to claim 1, characterized in that the convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponding to a first number of filters; the input of the convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.
  3. The method according to claim 2, characterized in that splitting the medical document to be processed into clauses and segmenting the clauses into words includes:
    splitting the medical document to be processed into clauses according to punctuation marks, and segmenting the clauses into words based on a word segmentation algorithm and a medical lexicon.
  4. The method according to claim 3, characterized in that generating the word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed includes:
    identifier-encoding the words of each clause in the order in which they appear in the medical document to be processed, and zero-padding each encoded clause so that its number of elements equals the number of words in the longest clause, the zero-padded clause being used as the word vector of the corresponding clause.
  5. A deep learning-based device for screening key sentences in medical literature, characterized by including:
    a generating unit, configured to split a medical document to be processed into clauses, segment the clauses into words, and generate word vectors of the clauses by identifier-encoding the words in the order in which they appear in the medical document to be processed;
    an input unit, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  6. The device according to claim 5, characterized in that the convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponding to a first number of filters; the input of the convolution operation is the vector obtained by concatenating the word vectors within the corresponding filter window; and the pooling layer of the convolutional neural network model uses max-over-time pooling.
  7. The device according to claim 6, characterized in that the generating unit is specifically configured to:
    split the medical document to be processed into clauses according to punctuation marks, and segment the clauses into words based on a word segmentation algorithm and a medical lexicon.
  8. The device according to claim 7, characterized in that the generating unit is specifically configured to:
    identifier-encode the words of each clause in the order in which they appear in the medical document to be processed, and zero-pad each encoded clause so that its number of elements equals the number of words in the longest clause, the zero-padded clause being used as the word vector of the corresponding clause.
  9. An electronic device, characterized by including: a processor, a memory, a bus, and a computer program stored in the memory and executable on the processor;
    wherein the processor and the memory communicate with each other through the bus;
    and the processor, when executing the computer program, implements the method according to any one of claims 1-4.
  10. A non-transitory computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and the computer program, when executed by a processor, implements the method according to any one of claims 1-4.
PCT/CN2019/124561 2018-10-12 2019-12-11 Deep learning-based method and device for screening for key sentences in medical document WO2020074023A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811188041.2A CN109472021A (en) 2018-10-12 2018-10-12 Critical sentence screening technique and device in medical literature based on deep learning
CN201811188041.2 2018-10-12

Publications (1)

Publication Number Publication Date
WO2020074023A1 true WO2020074023A1 (en) 2020-04-16

Family

ID=65663812

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124561 WO2020074023A1 (en) 2018-10-12 2019-12-11 Deep learning-based method and device for screening for key sentences in medical document

Country Status (2)

Country Link
CN (1) CN109472021A (en)
WO (1) WO2020074023A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667781A (en) * 2020-12-31 2021-04-16 北京万方数据股份有限公司 Malignant tumor document acquisition method and device

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472021A (en) * 2018-10-12 2019-03-15 北京诺道认知医学科技有限公司 Critical sentence screening technique and device in medical literature based on deep learning
CN110298028B (en) * 2019-05-21 2023-08-18 杭州未名信科科技有限公司 Method and device for extracting key sentences of text paragraphs
CN110728143A (en) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and equipment for identifying document key sentences
CN112735543A (en) * 2020-12-30 2021-04-30 杭州依图医疗技术有限公司 Medical data processing method and device and storage medium
CN112749277B (en) * 2020-12-30 2023-08-04 杭州依图医疗技术有限公司 Medical data processing method, device and storage medium
CN114860900A (en) * 2022-04-07 2022-08-05 海信集团控股股份有限公司 Sentencing prediction method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776580A (en) * 2017-01-20 2017-05-31 中山大学 The theme line recognition methods of the deep neural network CNN and RNN of mixing
CN106980683A (en) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Blog text snippet generation method based on deep learning
US20170213130A1 (en) * 2016-01-21 2017-07-27 Ebay Inc. Snippet extractor: recurrent neural networks for text summarization at industry scale
CN108595411A (en) * 2018-03-19 2018-09-28 南京邮电大学 More text snippet acquisition methods in a kind of same subject text set
CN109359300A (en) * 2018-10-12 2019-02-19 北京大学第三医院 Keyword screening technique and device in medical literature based on deep learning
CN109472021A (en) * 2018-10-12 2019-03-15 北京诺道认知医学科技有限公司 Critical sentence screening technique and device in medical literature based on deep learning

Also Published As

Publication number Publication date
CN109472021A (en) 2019-03-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19870423

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19870423

Country of ref document: EP

Kind code of ref document: A1