WO2020074017A1 - Deep learning-based method and device for screening for keywords in medical document - Google Patents

Info

Publication number
WO2020074017A1
Authority
WO
WIPO (PCT)
Prior art keywords
medical
processed
segmentation
layer
word vector
Prior art date
Application number
PCT/CN2019/118858
Other languages
French (fr)
Chinese (zh)
Inventor
赵荣生
宋再伟
刘爽
周旻
Original Assignee
北京大学第三医院
北京诺道认知医学科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京大学第三医院, 北京诺道认知医学科技有限公司
Publication of WO2020074017A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Definitions

  • Generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed includes:
  • id-encoding the tokens of each clause and padding each id-encoded clause with zeros, so that the number of elements in every padded clause equals the number of tokens in the longest clause;
  • generating the word vector matrix from the zero-padded result.
  • In this embodiment, the word vector matrix of the test data is generated in the same way as the word vector matrix of the training samples during model training, so the process is not repeated here.
  • Each token of a clause is id-encoded according to the order in which it appears in the document; the encoding starts at 1 and ends at the vocabulary size of the document.
  • The largest number of tokens in any clause is recorded as max_sentence_len, and each id-encoded clause is then padded with zeros to length max_sentence_len, which gives the word vector of the clause; the number of zeros added equals max_sentence_len minus the number of tokens in the clause.
  • This embodiment discloses a deep learning-based device for screening keywords in medical literature, including:
  • a generating unit 1, configured to split the medical document to be processed into clauses, segment each clause into tokens, and generate the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • an input unit 2, configured to input the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
  • In operation, the generating unit 1 splits the medical document to be processed into clauses, segments the clauses into tokens, and generates the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the document; the input unit 2 then inputs that word vector matrix into the pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
  • The device uses the trained deep learning-based Bilstm-CRF model to screen keywords in medical literature. Because the constructed Bilstm-CRF model can take contextual semantics into account and capture the local relevance of the document, the solution can improve the accuracy of keyword screening in medical literature compared with the prior art.
  • The second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.
  • The generating unit is specifically configured to split the medical document to be processed into clauses according to punctuation marks, and to segment the clauses into tokens based on a word segmentation algorithm and a medical lexicon.
  • The generating unit is further specifically configured to id-encode the tokens of each clause, pad each id-encoded clause with zeros so that the number of elements in every padded clause equals the number of tokens in the longest clause, and generate the word vector matrix from the zero-padded result.
  • The device of this embodiment may be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
  • FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present application.
  • The electronic device may include a processor 11, a memory 12, a bus 13, and a computer program stored on the memory 12 and executable on the processor 11;
  • the processor 11 and the memory 12 communicate with each other through the bus 13;
  • When the processor 11 executes the computer program, the method provided by each of the above method embodiments is implemented, for example: splitting the medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the document; and inputting the word vector matrix of the clauses into the pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
  • Embodiments of the present application provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the method provided by the foregoing method embodiments is implemented, for example: splitting the medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the document; and inputting the word vector matrix of the clauses into the pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device; the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operating steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • The terms “installation”, “connected”, and “connection” should be understood in a broad sense: a connection can be fixed, detachable, or integral; it can be mechanical or electrical; and it can be direct, indirect through an intermediary, or internal between two components.

Abstract

Embodiments of the present application disclose a deep learning-based method and device for screening keywords in a medical document, capable of improving the accuracy of keyword screening in medical documents. The method comprises: performing sentence segmentation on a medical document to be processed, performing word segmentation on the component sentences, labeling and encoding the component words according to their order of appearance in the medical document to be processed, and generating a word vector matrix for the component sentences (S1); and inputting the word vector matrix of the component sentences into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed (S2).

Description

Deep learning-based method and device for screening keywords in medical literature
Cross-reference
This application refers to Chinese patent application No. 2018111880516, entitled "Deep learning-based method and device for screening keywords in medical literature" and filed on October 12, 2018, which is incorporated into this application by reference in its entirety.
Technical field
Embodiments of the present application relate to the field of computers, and in particular to a deep learning-based method and device for screening keywords in medical literature.
Background
Keyword extraction uses computer technology to select, according to given requirements, the words or terms that reflect the subject matter of reports and documents. The keywords give a brief summary of a document, so that readers can grasp its important information and core content in a short time. Because keywords are highly condensed, they can also be used to measure text similarity at low computational cost. Keyword extraction therefore has important applications in literature retrieval, automatic summarization, text classification, text clustering, and similar tasks.
Existing keyword extraction methods fall into three main categories. (1) Methods based on statistical features determine the weights of candidate words from the frequency or position of the words and select those with larger weights as keywords. Although simple to operate, these methods overlook words that occur rarely and in marginal positions in the text yet are key to the article. (2) Methods based on word networks map the document into a word network according to certain rules and use the network to compute the criticality of words. These methods mainly build the word network from the co-occurrence of high-frequency words, so they likewise cannot extract words that are important to the document but infrequent. (3) Semantics-based methods judge the importance of words from a semantic perspective and extract keywords. At present, however, these methods only match synonyms and near-synonyms, whereas most keywords that express the same topic are neither. Most same-topic words therefore remain semantically unlinked, and the method cannot play its intended role.
Summary of the invention
In view of the deficiencies and defects of the prior art, embodiments of the present application provide a deep learning-based method and device for screening keywords in medical literature.
In one aspect, an embodiment of the present application provides a deep learning-based method for screening keywords in medical literature, including:
S1. Splitting the medical document to be processed into clauses, segmenting each clause into tokens, and generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
S2. Inputting the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
In another aspect, an embodiment of the present application provides a deep learning-based device for screening keywords in medical literature, including:
a generating unit, configured to split the medical document to be processed into clauses, segment each clause into tokens, and generate the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
an input unit, configured to input the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;
wherein the processor and the memory communicate with each other through the bus;
and the processor implements the above method when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; the computer program implements the above method when executed by a processor.
The method and device provided by the embodiments of the present application use a trained deep learning-based Bilstm-CRF model to screen keywords in medical literature. Because the constructed Bilstm-CRF model can take contextual semantics into account and capture the local relevance of the document, the solution can improve the accuracy of keyword screening in medical literature compared with the prior art.
Brief description of the drawings
FIG. 1 is a schematic flowchart of an embodiment of the deep learning-based method for screening keywords in medical literature of the present application;
FIG. 2 is a schematic structural diagram of an embodiment of the deep learning-based device for screening keywords in medical literature of the present application;
FIG. 3 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the embodiments of the present application.
Referring to FIG. 1, this embodiment discloses a deep learning-based method for screening keywords in medical literature, including:
S1. Splitting the medical document to be processed into clauses, segmenting each clause into tokens, and generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
S2. Inputting the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
The method provided by this embodiment uses a trained deep learning-based Bilstm-CRF model to screen keywords in medical literature. Because the constructed Bilstm-CRF model can take contextual semantics into account and capture the local relevance of the document, the solution can improve the accuracy of keyword screening in medical literature compared with the prior art.
On the basis of the foregoing method embodiment, the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.
In this embodiment, before the Bilstm-CRF model is used for keyword screening, the model must be constructed and trained on training data. Specifically, the Bilstm-CRF model is trained as follows:
(1) The word vector sequence $(x_1, x_2, \ldots, x_{\text{max\_len}})$ formed by the tokens of a clause in the training sample is used as the input to the successive time steps of the bidirectional LSTM.
(2) The second layer of the model is a bidirectional LSTM layer, which automatically extracts word features. The hidden state sequence output by the forward LSTM,
$$(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_{\text{max\_len}}})$$
and the hidden state sequence output by the backward LSTM,
$$(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_{\text{max\_len}}})$$
are concatenated position by position to give the complete hidden state sequence:
$$(h_1, h_2, \ldots, h_{\text{max\_len}}) \in \mathbb{R}^{\text{max\_len} \times 2n}$$
where
$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \in \mathbb{R}^{2n}$$
(The original publication gives these four expressions as images, Figures PCTCN2019118858-appb-000001 to PCTCN2019118858-appb-000004; they are reconstructed here from the surrounding text.)
(3) A linear layer follows, mapping each vector of the hidden state sequence from 2n dimensions to k dimensions, where k = 4 is the number of token categories. Let the output matrix be $P = (p_1, p_2, \ldots, p_{\text{max\_len}})$; each component $p_{ij}$ of $p_i$ is the score for assigning token $x_i$ to the j-th label.
(4) The fourth layer of the model is the CRF layer, which has a state transition matrix A of size (k+2) × (k+2); $A_{ij}$ is the transition score from the i-th label to the j-th label. The matrix expresses that, when a token in a clause is labelled, the labels already assigned before it must be taken into account. If the target label sequence of a clause is $y = (y_1, y_2, \ldots, y_{\text{max\_len}})$, the model's score for clause x having label sequence y is:
$$\text{score}(x, y) = \sum_{i=1}^{\text{max\_len}} p_{i, y_i} + \sum_{i=1}^{\text{max\_len}+1} A_{y_{i-1}, y_i}$$
(The original publication gives this formula as an image, Figure PCTCN2019118858-appb-000005; it is reconstructed here in the usual Bilstm-CRF form, with $y_0$ and $y_{\text{max\_len}+1}$ taken to be the start and end labels that account for the two extra rows and columns of A.)
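The log-likelihood function of the model is defined as:
$$\log P(y \mid x) = \text{score}(x, y) - \log \sum_{y' \in Y_x} \exp\big(\text{score}(x, y')\big)$$
(The original publication gives this formula as an image, Figure PCTCN2019118858-appb-000006; it is reconstructed here in the usual CRF form.) In the formula, $Y_x$ is the set of values of the dependent variable, that is, all candidate label sequences.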
(5) Through multiple rounds of iterative training and parameter tuning, the optimal parameters and state transition probabilities that maximize the objective function are found.
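Steps (2) to (4) and the log-likelihood can be sketched in plain Python as follows. This is a minimal illustration, not the applicants' implementation: the function names, the toy values, the use of `K = 2` labels instead of the k = 4 stated above, and the convention that indices `K` and `K + 1` of the transition matrix are the start and end labels are all assumptions made for the example. The normalizing term is computed with the standard forward algorithm.

```python
import math
from itertools import product


def concat_hidden(forward, backward):
    """Concatenate forward and backward hidden vectors position by position."""
    return [f + b for f, b in zip(forward, backward)]


def linear(h, W, bias):
    """Map one 2n-dimensional hidden vector to k label scores (W is k x 2n)."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W, bias)]


def path_score(P, A, y, start, end):
    """Emission scores plus transition scores, including start and end labels."""
    tags = [start] + list(y) + [end]
    emission = sum(P[t][tag] for t, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(tags, tags[1:]))
    return emission + transition


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(P, A, k, start, end):
    """log of the sum of exp(score) over all label sequences (forward algorithm)."""
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for t in range(1, len(P)):
        alpha = [logsumexp([alpha[i] + A[i][j] for i in range(k)]) + P[t][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])


def log_likelihood(P, A, y, k, start, end):
    return path_score(P, A, y, start, end) - log_partition(P, A, k, start, end)


# Toy values (hypothetical): 2 labels, 3 tokens, start label = 2, end label = 3.
K, START, END = 2, 2, 3
A = [[0.1, -0.2, 0.0, 0.3],
     [0.4, 0.2, 0.0, -0.1],
     [0.5, -0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
P = [[1.0, 0.2], [0.3, 0.8], [-0.4, 0.6]]

ll = log_likelihood(P, A, [0, 1, 1], K, START, END)
# Sanity check: the probabilities of all K**len(P) label sequences sum to 1.
total = sum(math.exp(log_likelihood(P, A, y, K, START, END))
            for y in product(range(K), repeat=len(P)))
```

Training in step (5) would then adjust the LSTM weights, the linear layer, and `A` to raise this log-likelihood on the labelled clauses; that optimization loop is outside the scope of the sketch.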
Before the model is trained, the word vector matrix of the clauses must be generated from the training sample data, as follows:
(1) Each token of a clause is id-encoded according to the order in which it appears in the document; the encoding starts at 1 and ends at N, the vocabulary size of the document.
(2) The largest number of tokens in any clause is recorded as max_len, and each id-encoded clause is then padded with zeros to length max_len; the number of zeros added is (max_len - number of tokens).
(3) The word vector matrix is randomly initialized. Each row of the matrix is a word vector, and the rows correspond, from top to bottom, to the tokens encoded 0 to N; the number of columns is the word vector length n = 300.
(4) The word vector of each id-encoded token in a clause is looked up. If the number of training samples is m, a three-dimensional matrix of size [m, max_len, 300] is built as the input of the model.
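The four steps above can be sketched as follows. The helper names, the toy clauses, and the uniform initialization range are hypothetical, and the sketch assumes that a repeated token reuses the id assigned at its first appearance, which is one reading of "ends at the vocabulary size N".

```python
import random


def build_vocab(clauses):
    """Assign ids 1..N in order of first appearance; id 0 is reserved for padding."""
    vocab = {}
    for clause in clauses:
        for token in clause:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode_and_pad(clauses, vocab):
    """Id-encode each clause and pad it with zeros up to max_len."""
    max_len = max(len(clause) for clause in clauses)
    encoded = [[vocab[t] for t in clause] + [0] * (max_len - len(clause))
               for clause in clauses]
    return encoded, max_len


def init_embeddings(vocab_size, dim, seed=0):
    """Randomly initialized word vector matrix with rows for ids 0..N."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size + 1)]


def lookup(encoded, embeddings):
    """Build the [m, max_len, dim] model input by row lookup."""
    return [[embeddings[i] for i in clause] for clause in encoded]


clauses = [["目的", "评价", "相关性"], ["方法", "检索"]]  # toy segmented clauses
vocab = build_vocab(clauses)
encoded, max_len = encode_and_pad(clauses, vocab)
embeddings = init_embeddings(len(vocab), dim=300)
model_input = lookup(encoded, embeddings)
shape = (len(model_input), len(model_input[0]), len(model_input[0][0]))
```

In a trainable model the embedding rows would be parameters updated during training rather than a fixed list; the list of lists here only illustrates the shapes involved.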
It should be noted that, for training, the model output must also be constructed from the training data. Specifically, every token in a clause is labelled according to the PICO index matrix: if the token appears in the index matrix, its label is set to P, I-C, or O according to the corresponding entry; if it does not appear, its label is N. The whole label sequence is used as the target value of the model.
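A minimal sketch of this labelling step is shown below. The source does not specify the layout of the PICO index matrix, so it is represented here as a plain dictionary from term to label; the dictionary contents, the label order in `TAGS`, and the function name are assumptions for illustration only.

```python
TAGS = ["P", "I-C", "O", "N"]  # the k = 4 categories; the order is an assumption


def label_tokens(tokens, pico_index):
    """Return the target label sequence: the index label, or N if absent."""
    return [pico_index.get(token, "N") for token in tokens]


# Hypothetical index entries, not taken from the source.
pico_index = {
    "急性淋巴细胞白血病": "P",
    "甲氨喋呤": "I-C",
    "毒副反应": "O",
}

tokens = ["甲氨喋呤", "治疗", "急性淋巴细胞白血病", "毒副反应"]
labels = label_tokens(tokens, pico_index)
label_ids = [TAGS.index(label) for label in labels]  # integer targets for the model
```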
The model constructed in this application can take the contextual semantics of tokens into account and, based on the intrinsic relations within the label set, restricts the output of unreasonable label sequences by computing state transition probabilities.
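How transition scores can suppress an unreasonable label sequence is illustrated by the Viterbi decoding sketch below. The source does not name a decoding algorithm, so Viterbi is an assumption (it is the usual choice for CRF layers); the toy scores, the two-label setup, and the start and end label indices are also hypothetical. The emissions alone would pick the sequence `[0, 1]`, but the large negative score on the transition 0 → 1 makes the decoder return a different sequence.

```python
def viterbi(P, A, k, start, end):
    """Return the label sequence with the highest emission + transition score."""
    score = [A[start][j] + P[0][j] for j in range(k)]
    backpointers = []
    for t in range(1, len(P)):
        prev = score
        step, score = [], []
        for j in range(k):
            best = max(range(k), key=lambda i: prev[i] + A[i][j])
            step.append(best)
            score.append(prev[best] + A[best][j] + P[t][j])
        backpointers.append(step)
    last = max(range(k), key=lambda j: score[j] + A[j][end])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return path[::-1]


# Toy values (hypothetical): 2 labels, start label = 2, end label = 3.
K, START, END = 2, 2, 3
A = [[0.0, -1e4, 0.0, 0.0],   # transition 0 -> 1 is effectively forbidden
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
P = [[2.0, 1.0], [0.0, 1.5]]

greedy = [max(range(K), key=lambda j: row[j]) for row in P]  # ignores transitions
decoded = viterbi(P, A, K, START, END)
```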
On the basis of the foregoing method embodiment, splitting the medical document to be processed into clauses and segmenting the clauses into tokens includes:
splitting the medical document to be processed into clauses according to punctuation marks, and segmenting the clauses into tokens based on a word segmentation algorithm and a medical lexicon.
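One plausible way to combine a general word segmentation algorithm with a medical lexicon is to post-process the segmenter's output and merge adjacent tokens whose concatenation is a lexicon term. The sketch below assumes a greedy longest-match merge; the function name, the `max_span` limit, and the lexicon contents are hypothetical and are not taken from the source.

```python
def merge_with_lexicon(tokens, lexicon, max_span=4):
    """Greedily merge runs of adjacent tokens that together form a lexicon term."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to a span of two tokens.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


lexicon = {"亚甲基四氢叶酸还原酶", "淋巴细胞", "毒副反应"}  # toy medical lexicon
tokens = ["亚", "甲基", "四氢叶酸", "还原酶", "基因", "多态性"]
result = merge_with_lexicon(tokens, lexicon)
```

Joining tokens with an empty string suits Chinese text, where words are written without spaces; a space-delimited language would need a different join.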
In this embodiment, the word segmentation process is illustrated as follows:
Take the example passage: 目的评价亚甲基四氢叶酸还原酶基因多态性在甲氨喋呤治疗急性淋巴细胞白血病过程中毒副反应的相关性。方法通过计算机检索国内外相关数据库:EMBASE,CNKI,维普中文科技期刊数据库以及万方数据库,… (Objective: to evaluate the association between methylenetetrahydrofolate reductase gene polymorphisms and toxic side effects during methotrexate treatment of acute lymphoblastic leukemia. Methods: relevant domestic and international databases were searched by computer: EMBASE, CNKI, the VIP (维普) Chinese scientific journal database, and the Wanfang (万方) database, …). The passage is first split into clauses according to punctuation marks, giving:
(1) 目的评价亚甲基四氢叶酸还原酶基因多态性在甲氨喋呤治疗急性淋巴细胞白血病过程中毒副反应的相关性 (the Objective clause);
(2) 方法通过计算机检索国内外相关数据库:EMBASE,CNKI,维普中文科技期刊数据库以及万方数据库 (the Methods clause).
然后利用分词算法对分句进行分词,分词结果为:Then use the word segmentation algorithm to segment the sentence, and the result is:
1)['目的','评价','亚','甲基','四氢叶酸','还原酶','基因','多态性','在','甲氨喋呤','治疗','急性','淋巴','细胞','白血病','过程','中','毒副','反应','的','相关性'];1) ['Purpose', 'Evaluation', 'Sub', 'Methyl', 'Tetrahydrofolate', 'Reductase', 'Gene', 'Polymorphism', 'In', 'Methotrexate ',' Treatment ',' Acute ',' Lymph ',' Cell ',' Leukemia ',' Process', 'Medium', 'Poison Side', 'Reaction', 'The', 'Relevance'];
2)['方法','通过','计算机','检索','国内外','相关','数据库','EMBASE','CNKI','维普','中文','科技','期刊','数据库','以及','万方','数据库']。2) ('Method', 'Via', 'Computer', 'Retrieve', 'Home and Abroad', 'Related', 'Database', 'EMBASE', 'CNKI', 'Vipu', 'Chinese', 'Technology ',' Journal ',' Database ',' and ',' Wanfang ',' Database '].
最后结合医学词库对部分分词进行合并,则对于第一个分句(1)的分词1),需要将“亚”、“甲基”、“四氢叶酸”和“还原酶”合并成一个完整的医学名词“亚甲基四氢叶酸还原酶”,需要将“淋巴”和“细胞”合并成一个完整的医学名词“淋巴细胞”,需要将“毒副”和“反应”合并成一个完整的医学名词“毒副反应”。合并结果为:Finally, part of the participles are merged in combination with the medical vocabulary. For the participle 1) of the first clause (1), it is necessary to merge "Ya", "Methyl", "Tetrahydrofolate" and "Reductase" into one The complete medical term "methylenetetrahydrofolate reductase" needs to merge "lymphoid" and "cell" into a complete medical term "lymphocyte" and needs to merge "poison" and "reaction" into a complete The medical term "toxic side effects". The merged result is:
a)['目的','评价','亚甲基四氢叶酸还原酶','基因','多态性','在','甲氨喋呤 ','治疗','急性','淋巴细胞','白血病','过程','中','毒副反应','的','相关性'];a) ['Purpose', 'Evaluation', 'Methylenetetrahydrofolate reductase', 'Gene', 'Polymorphism', 'In', 'Methotrexate', 'Treatment', 'Acute' , 'Lymphocyte', 'Leukemia', 'Process', 'Medium', 'Toxic Side Effects', 'De', 'Relevance'];
b)['方法','通过','计算机','检索','国内外','相关','数据库','EMBASE','CNKI','维普','中文','科技','期刊','数据库','以及','万方','数据库']。b) ['Method', 'Via', 'Computer', 'Retrieve', 'Home and Abroad', 'Related', 'Database', 'EMBASE', 'CNKI', 'Vip', 'Chinese', 'Technology ',' Journal ',' Database ',' and ',' Wanfang ',' Database '].
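The clause splitting and lexicon-based merging shown in this example can be sketched as follows. The source names neither the segmentation algorithm nor the merging strategy, so the set of clause-ending punctuation marks, the greedy longest-match merge, the `max_span` limit, and the function names are all assumptions; the initial word segmentation itself is taken as given.

```python
import re

def split_clauses(text):
    """Split text at clause-ending punctuation (assumed set: 。！？；)."""
    return [part.strip() for part in re.split(r"[。！？；]", text) if part.strip()]

def merge_with_lexicon(tokens, lexicon, max_span=4):
    """Greedily merge adjacent tokens whose concatenation is a lexicon term.

    Longer spans are tried first so that the most complete term wins.
    """
    merged, i = [], 0
    while i < len(tokens):
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token term starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

lexicon = {"亚甲基四氢叶酸还原酶", "淋巴细胞", "毒副反应"}
tokens = ['目的', '评价', '亚', '甲基', '四氢叶酸', '还原酶', '基因', '多态性', '在',
          '甲氨喋呤', '治疗', '急性', '淋巴', '细胞', '白血病', '过程', '中',
          '毒副', '反应', '的', '相关性']
print(merge_with_lexicon(tokens, lexicon))
```

With the three lexicon entries above, the merged list matches result a) in the example. A larger `max_span` would be needed for terms that the base segmenter splits into more than four pieces.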
On the basis of the foregoing method embodiment, generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed includes:
id-encoding the tokens of each clause in the order in which the tokens appear in the medical document to be processed, and zero-padding each id-encoded clause so that the number of elements in the padded clause equals the number of tokens in the longest clause;
generating the word vector matrix based on the zero-padded result.
In this embodiment, the word vector matrix for the test data is generated in the same way as the word vector matrix for the training samples during model training, so the process is not repeated here.
In this embodiment, when generating the word vector matrix of the clauses, each token of the clauses is first assigned an identifier code (id code) according to the order in which it appears in the document; the codes start at 1 and end at the vocabulary size of the document. The largest number of tokens contained in any clause is then recorded as max_sentence_len, and each id-encoded clause is padded with zeros until its length reaches max_sentence_len, which yields the word vector of the clause; the number of zeros in the word vector equals max_sentence_len minus the number of tokens in the clause.
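The id encoding and zero padding described above can be sketched as follows. The function name `encode_and_pad` is hypothetical, and padding on the right-hand side is an assumption, since the source does not say on which side the zeros are added.

```python
def encode_and_pad(clauses):
    """Assign ids in order of first appearance and zero-pad every clause.

    Ids start at 1 and end at the vocabulary size; 0 is used only for padding.
    Returns the padded id lists and the token -> id vocabulary.
    """
    vocab = {}
    encoded = []
    for tokens in clauses:
        ids = []
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
            ids.append(vocab[token])
        encoded.append(ids)
    max_sentence_len = max((len(ids) for ids in encoded), default=0)
    padded = [ids + [0] * (max_sentence_len - len(ids)) for ids in encoded]
    return padded, vocab

clauses = [["方法", "通过", "计算机", "检索"], ["检索", "数据库"]]
padded, vocab = encode_and_pad(clauses)
print(padded)
print(vocab)
```

Each padded clause contains max_sentence_len minus its token count zeros, and a token that recurs in a later clause reuses the id it received on first appearance.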
Referring to FIG. 2, this embodiment discloses a deep learning-based device for screening keywords in a medical document, including:
a generating unit 1, configured to split the medical document to be processed into clauses, segment the clauses into words, and generate the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
an input unit 2, configured to input the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
Specifically, the generating unit 1 splits the medical document to be processed into clauses, segments the clauses into words, and generates the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; the input unit 2 inputs the word vector matrix of the clauses into the pre-trained deep learning-based Bilstm-CRF model to obtain the keywords in the medical document to be processed.
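The last step, turning the model's per-token labels into keywords, is not spelled out in the text. A plausible reading, given the P, I-C, O, and N labels used for training, is that every token whose predicted label is not N is reported as a keyword. The sketch below illustrates that assumption; the function name `extract_keywords` and the example labels are hypothetical.

```python
def extract_keywords(tokens, labels, background="N"):
    """Collect (token, label) pairs whose label is not the background label.

    Duplicates are dropped while the order of first appearance is kept.
    """
    keywords = []
    for token, label in zip(tokens, labels):
        if label != background and (token, label) not in keywords:
            keywords.append((token, label))
    return keywords

tokens = ["在", "甲氨喋呤", "治疗", "白血病", "过程", "中", "毒副反应"]
# Hypothetical model output for the tokens above.
labels = ["N", "I-C", "N", "P", "N", "N", "O"]
print(extract_keywords(tokens, labels))
```

Padding positions would have to be removed before this step so that the token and label sequences line up.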
The deep learning-based device for screening keywords in a medical document provided by the embodiments of the present application uses the trained deep learning-based Bilstm-CRF model to screen keywords in medical documents. Because the constructed Bilstm-CRF model can take contextual semantics into account and capture the local correlations within the document, this solution can improve the accuracy of keyword screening in medical documents compared with the prior art.
On the basis of the foregoing device embodiment, the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.
On the basis of the foregoing device embodiment, the generating unit is specifically configured to:
split the medical document to be processed into clauses according to punctuation marks, and segment the clauses into words based on a word segmentation algorithm and a medical lexicon.
On the basis of the foregoing device embodiment, the generating unit is specifically configured to:
id-encode the tokens of each clause in the order in which the tokens appear in the medical document to be processed, and zero-pad each id-encoded clause so that the number of elements in the padded clause equals the number of tokens in the longest clause;
generate the word vector matrix based on the zero-padded result.
The deep learning-based device for screening keywords in a medical document of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present application. As shown in FIG. 3, the electronic device may include: a processor 11, a memory 12, a bus 13, and a computer program stored in the memory 12 and executable on the processor 11;
wherein the processor 11 and the memory 12 communicate with each other through the bus 13;
when executing the computer program, the processor 11 implements the method provided by the foregoing method embodiments, for example including: splitting the medical document to be processed into clauses, segmenting the clauses into words, and generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; and inputting the word vector matrix of the clauses into the pre-trained deep learning-based Bilstm-CRF model to obtain the key sentences in the medical document to be processed.
An embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the method provided by the foregoing method embodiments, for example including: splitting the medical document to be processed into clauses, segmenting the clauses into words, and generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; and inputting the word vector matrix of the clauses into the pre-trained deep learning-based Bilstm-CRF model to obtain the key sentences in the medical document to be processed.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. Orientations or positional relationships indicated by terms such as "upper" and "lower" are based on the orientations or positional relationships shown in the drawings and are used only to facilitate and simplify the description of the present application; they do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present application. Unless otherwise expressly specified and limited, the terms "mounted", "connected", and "connection" should be understood broadly: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present application according to the specific circumstances.
Numerous specific details are set forth in the description of the present application. It will be understood, however, that the embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description. Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present application are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This manner of disclosure, however, should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present application. It should be noted that, where no conflict arises, the embodiments in the present application and the features in the embodiments may be combined with one another. The present application is not limited to any single aspect, to any single embodiment, or to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present application may be used alone or in combination with one or more other aspects and/or embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application, and they shall all be covered by the scope of the claims and the description of the present application.

Claims (10)

  1. A deep learning-based method for screening key sentences in a medical document, characterized by comprising:
    S1. splitting a medical document to be processed into clauses, segmenting the clauses into words, and generating a word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
    S2. inputting the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the key sentences in the medical document to be processed.
  2. The method according to claim 1, characterized in that the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.
  3. The method according to claim 2, characterized in that splitting the medical document to be processed into clauses and segmenting the clauses into words comprises:
    splitting the medical document to be processed into clauses according to punctuation marks, and segmenting the clauses into words based on a word segmentation algorithm and a medical lexicon.
  4. The method according to claim 3, characterized in that generating the word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed comprises:
    id-encoding the tokens of each clause in the order in which the tokens appear in the medical document to be processed, and zero-padding each id-encoded clause so that the number of elements in the padded clause equals the number of tokens in the longest clause;
    generating the word vector matrix based on the zero-padded result.
  5. A deep learning-based device for screening key sentences in a medical document, characterized by comprising:
    a generating unit, configured to split a medical document to be processed into clauses, segment the clauses into words, and generate a word vector matrix of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
    an input unit, configured to input the word vector matrix of the clauses into a pre-trained deep learning-based Bilstm-CRF model to obtain the key sentences in the medical document to be processed.
  6. The device according to claim 5, characterized in that the second layer of the Bilstm-CRF model is a bidirectional LSTM layer, the third layer is a linear layer, and the fourth layer is a CRF layer.
  7. The device according to claim 6, characterized in that the generating unit is specifically configured to:
    split the medical document to be processed into clauses according to punctuation marks, and segment the clauses into words based on a word segmentation algorithm and a medical lexicon.
  8. The device according to claim 7, characterized in that the generating unit is specifically configured to:
    id-encode the tokens of each clause in the order in which the tokens appear in the medical document to be processed, and zero-pad each id-encoded clause so that the number of elements in the padded clause equals the number of tokens in the longest clause;
    generate the word vector matrix based on the zero-padded result.
  9. An electronic device, characterized by comprising: a processor, a memory, a bus, and a computer program stored in the memory and executable on the processor;
    wherein the processor and the memory communicate with each other through the bus;
    and the processor, when executing the computer program, implements the method according to any one of claims 1-4.
  10. A non-transitory computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and the computer program, when executed by a processor, implements the method according to any one of claims 1-4.
PCT/CN2019/118858 2018-10-12 2019-11-15 Deep learning-based method and device for screening for keywords in medical document WO2020074017A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811188051.6A CN109359300A (en) 2018-10-12 2018-10-12 Keyword screening technique and device in medical literature based on deep learning
CN201811188051.6 2018-10-12

Publications (1)

Publication Number Publication Date
WO2020074017A1 true WO2020074017A1 (en) 2020-04-16

Family

ID=65348974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118858 WO2020074017A1 (en) 2018-10-12 2019-11-15 Deep learning-based method and device for screening for keywords in medical document

Country Status (2)

Country Link
CN (1) CN109359300A (en)
WO (1) WO2020074017A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151222A (en) * 2023-09-15 2023-12-01 大连理工大学 Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472021A (en) * 2018-10-12 2019-03-15 北京诺道认知医学科技有限公司 Critical sentence screening technique and device in medical literature based on deep learning
CN109359300A (en) * 2018-10-12 2019-02-19 北京大学第三医院 Keyword screening technique and device in medical literature based on deep learning
CN111160017B (en) * 2019-12-12 2021-09-03 中电金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
US9754584B2 (en) * 2014-12-22 2017-09-05 Google Inc. User specified keyword spotting using neural network feature extractor
CN108121700A (en) * 2017-12-21 2018-06-05 北京奇艺世纪科技有限公司 A kind of keyword extracting method, device and electronic equipment
CN109359300A (en) * 2018-10-12 2019-02-19 北京大学第三医院 Keyword screening technique and device in medical literature based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108198620B (en) * 2018-01-12 2022-03-22 洛阳飞来石软件开发有限公司 Skin disease intelligent auxiliary diagnosis system based on deep learning
CN108319668B (en) * 2018-01-23 2021-04-20 义语智能科技(上海)有限公司 Method and equipment for generating text abstract

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, WEI ET AL.: "Automatic Keyword Extraction Based on BiLSTM-CRF", COMPUTER SCIENCE, vol. 45, no. 6A, 15 June 2018 (2018-06-15), pages 93 - 96 *

Also Published As

Publication number Publication date
CN109359300A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
WO2021000676A1 (en) Q&a method, q&a device, computer equipment and storage medium
US10586155B2 (en) Clarification of submitted questions in a question and answer system
US9558264B2 (en) Identifying and displaying relationships between candidate answers
Prasetya et al. The performance of text similarity algorithms
WO2020074017A1 (en) Deep learning-based method and device for screening for keywords in medical document
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
CN107818815B (en) Electronic medical record retrieval method and system
Chen et al. Short text entity linking with fine-grained topics
CN112035730B (en) Semantic retrieval method and device and electronic equipment
CN111291188B (en) Intelligent information extraction method and system
WO2021212801A1 (en) Evaluation object identification method and apparatus for e-commerce product, and storage medium
CN112861990B (en) Topic clustering method and device based on keywords and entities and computer readable storage medium
CN112559684A (en) Keyword extraction and information retrieval method
CN113239163A (en) Intelligent question-answering method and system based on traffic big data
Mehrbod et al. Tender calls search using a procurement product named entity recogniser
Zakraoui et al. Improving Arabic text to image mapping using a robust machine learning technique
CN112131453A (en) Method, device and storage medium for detecting network bad short text based on BERT
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
Sun A natural language interface for querying graph databases
Tianxiong et al. Identifying chinese event factuality with convolutional neural networks
Chenze et al. Iterative approach for novel entity recognition of foods in social media messages
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Al-Sultany et al. Enriching tweets for topic modeling via linking to the wikipedia
TW202001621A (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
TWI636370B (en) Establishing chart indexing method and computer program product by text information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871959

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19871959

Country of ref document: EP

Kind code of ref document: A1