WO2020074023A1 - Deep learning-based method and device for screening key sentences in a medical document - Google Patents

Info

Publication number
WO2020074023A1
Authority
WO
WIPO (PCT)
Prior art keywords
processed
medical
clauses
segmentation
word
Prior art date
Application number
PCT/CN2019/124561
Other languages
English (en)
Chinese (zh)
Inventor
赵荣生
宋再伟
黄振城
王则远
周旻
Original Assignee
北京大学第三医院
北京诺道认知医学科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京大学第三医院, 北京诺道认知医学科技有限公司
Publication of WO2020074023A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • Embodiments of the present disclosure relate to the field of computers, and in particular to a method and device for screening key sentences in medical literature based on deep learning.
  • The main content of a text is often captured by a set of important key sentences. These key sentences clearly express the content characteristics of the text (such as domain category, theme, and central meaning).
  • Based on this understanding, identifying and screening the key sentences that represent the main content of a text is a very important step. It matters both for indexing subject literature and for surfacing the knowledge hidden in the text.
  • Put simply, key sentence screening identifies and extracts sentences containing useful information according to a given purpose, so as to condense the text and obtain rich information from a small amount of data.
  • Traditional key sentence screening methods are generally statistical: they use information such as position and frequency to find the sentence that best represents the subject of the article and take it as the key sentence.
  • Depending on whether the structure of the article is used, these methods can be divided into unstructured screening analysis and structured screening analysis.
  • The former calculates a weight for each sentence of the article and takes the sentence with the highest weight as the key sentence.
  • The latter first analyzes the semantic structure of the article to find its topic structure, and then extracts sentences from each topic to form the key sentences.
  • In practice, statistical filtering based on structure or weight tends to ignore the content of the sentence itself: key sentences that are scattered through the text but contain subject words are filtered out, and the result is more redundant.
  • The widely used deep learning approach focuses on the content of the sentence itself and automatically learns sample features by simulating the neural network structure of the human brain, so as to screen out key sentences containing key information in preparation for further analysis.
  • However, such algorithms have so far been limited to analyzing isolated sentences.
  • The constraints and effects that the context between sentences places on a given sentence have not been systematically studied.
  • Embodiments of the present disclosure provide a method and device for screening key sentences in medical literature based on deep learning.
  • An embodiment of the present disclosure proposes a method for screening key sentences in medical literature based on deep learning, including:
  • S1: splitting the medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • S2: inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • An embodiment of the present disclosure proposes a device for screening key sentences in medical literature based on deep learning, including:
  • a generating unit, configured to split the medical document to be processed into clauses, segment the clauses into tokens, and generate the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • an input unit, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • an embodiment of the present disclosure provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;
  • The processor and the memory communicate with each other through the bus;
  • An embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above method.
  • The method and device for screening key sentences in medical literature based on deep learning use a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can combine contextual semantics to capture the local relevance of the document, the scheme can improve the accuracy of screening key sentences in medical literature compared with the prior art.
  • FIG. 1 is a schematic flowchart of an embodiment of the method for screening key sentences in medical literature based on deep learning of the present disclosure;
  • FIG. 2 is a schematic structural diagram of an embodiment of the device for screening key sentences in medical literature based on deep learning of the present disclosure;
  • FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
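  • This embodiment discloses a method for screening key sentences in medical literature based on deep learning, including:
  • S1: splitting the medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • S2: inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • The method uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can combine contextual semantics to capture the local relevance of the document, this solution can improve the accuracy of screening key sentences in medical literature compared with the prior art.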
  • The convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponds to a first number of filters, and the input (independent variable) of the convolution operation is the token vectors within the corresponding filter window;
  • The pooling layer of the convolutional neural network model adopts the max-over-time pooling method.
  • The convolutional layer of the model uses multiple filter windows with widths of 3, 4, and 5, each window corresponding to 100 filters, and slides the different windows to traverse each token in the clause.
  • Each filter yields a feature map, that is, a set of features.
  • The calculation formula of the feature set is as follows:
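The formula itself is missing from this text. As a hedged reconstruction, assuming the standard single-layer text-CNN formulation (an assumption, not something this text confirms), a filter of window width $h$ with weights $\mathbf{w}$, bias $b$, and nonlinearity $f$, applied to a clause of $n$ token vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$, would produce the feature set $\mathbf{c}$ below. All symbols are illustrative and do not come from the source.

```latex
c_i = f\left(\mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b\right), \qquad
\mathbf{c} = \left[\, c_1, c_2, \ldots, c_{n-h+1} \,\right]
```

Here $\mathbf{x}_{i:i+h-1}$ denotes the concatenation of the $h$ token vectors covered by the window at position $i$.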
  • The pooling layer of the model adopts the max-over-time pooling method. For the feature set generated by each filter in each filter window, the maximum value of the set is taken as its most representative feature. In this way, the features from the different sliding window sizes become a fixed length, and they are concatenated to form a feature vector of length 3 * 100.
  • The last layer of the model is a fully connected softmax layer, which outputs the probability of each category. The optimal model parameters are found through multiple rounds of iterative training and parameter adjustment.
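The forward pass described above (filter windows of width 3, 4, and 5, 100 filters per window, max-over-time pooling, then a softmax layer) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the embedding size `EMBED_DIM`, the ReLU nonlinearity, the two output categories, and the random untrained weights are all assumptions, and the token vectors are assumed to come from an embedding lookup that the text does not describe.

```python
import math
import random

EMBED_DIM = 8          # assumed size of each token vector
WIDTHS = (3, 4, 5)     # filter window widths from the description
N_FILTERS = 100        # filters per window width, from the description
N_CLASSES = 2          # assumed: key sentence vs. non-key sentence


def conv_feature_map(clause, weights, bias):
    """Slide one filter over a clause and return its feature map.

    clause  : list of token vectors (lists of floats), length >= len(weights)
    weights : one weight vector per position in the filter window
    """
    h = len(weights)
    feature_map = []
    for i in range(len(clause) - h + 1):
        s = bias
        for k in range(h):
            s += sum(w * x for w, x in zip(weights[k], clause[i + k]))
        feature_map.append(max(0.0, s))  # ReLU, an assumed nonlinearity
    return feature_map


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(clause, filters, out_weights, out_bias):
    """Return (pooled feature vector, class probabilities) for one clause."""
    pooled = []
    for width in sorted(filters):
        for weights, bias in filters[width]:
            # Max-over-time pooling: keep only the largest feature per filter.
            pooled.append(max(conv_feature_map(clause, weights, bias)))
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(out_weights, out_bias)]
    return pooled, softmax(scores)


# Random, untrained parameters purely for illustration.
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


filters = {h: [([rand_vec(EMBED_DIM) for _ in range(h)], 0.0)
               for _ in range(N_FILTERS)]
           for h in WIDTHS}
out_weights = [rand_vec(len(WIDTHS) * N_FILTERS) for _ in range(N_CLASSES)]
out_bias = [0.0] * N_CLASSES

clause = [rand_vec(EMBED_DIM) for _ in range(12)]  # a 12-token clause
pooled, probs = forward(clause, filters, out_weights, out_bias)
```

Whatever the clause length, `pooled` has 3 * 100 entries, which is why pooling lets clauses of different lengths share one fully connected layer. The sketch requires each clause to have at least as many tokens as the widest window.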
  • For the training data, the target output of the model needs to be constructed.
  • The specific method is as follows. According to the PICO index matrix, if a clause does not contain any element of the matrix, the clause contains no key information worth studying for the relevant field, so the target value of the clause is set to 0. If the clause contains one or more elements of the matrix, the clause may contain important information; to avoid missing key information, it needs to be screened out for subsequent in-depth study, so the target value of the clause is set to 1.
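The labeling rule above can be sketched as a small function. The layout of the PICO index matrix is not given in this text, so the dictionary below (one row per PICO category: population, intervention, comparison, outcome) and every term in it are hypothetical placeholders; matching on whole tokens is also an assumption.

```python
# Hypothetical PICO index matrix: one row of index terms per category.
PICO_MATRIX = {
    "population": ["patients", "adults"],
    "intervention": ["aspirin"],
    "comparison": ["placebo"],
    "outcome": ["mortality"],
}


def build_target(clause_tokens, pico_matrix):
    """Return 1 if the clause contains any element of the matrix, else 0."""
    elements = {term for row in pico_matrix.values() for term in row}
    return 1 if any(token in elements for token in clause_tokens) else 0


clauses = [
    ["the", "trial", "enrolled", "patients"],
    ["funding", "was", "provided", "by", "the", "sponsor"],
]
targets = [build_target(tokens, PICO_MATRIX) for tokens in clauses]
```

The rule deliberately favors recall: a single matching element is enough to mark a clause for further study.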
  • Splitting the medical document to be processed into clauses and segmenting the clauses into tokens includes:
  • The medical document to be processed is split into clauses according to punctuation marks, and the clauses are segmented into tokens based on a word segmentation algorithm and a medical lexicon.
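A minimal sketch of these two steps follows. The text names neither the punctuation set nor the word segmentation algorithm, so both are assumptions here: clauses are split on a small set of sentence-ending marks, and tokens are found by forward maximum matching against the lexicon, used as a simple stand-in. The lexicon entries are hypothetical.

```python
import re

# Assumed delimiter set; the source only says "punctuation marks".
CLAUSE_DELIMITERS = r"[。！？；!?;]"


def split_clauses(document):
    """Split a document into clauses at the assumed punctuation marks."""
    return [c.strip() for c in re.split(CLAUSE_DELIMITERS, document) if c.strip()]


def segment(clause, lexicon):
    """Forward maximum matching: take the longest lexicon entry at each position.

    Characters not covered by the (non-empty) lexicon become single-character tokens.
    """
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(clause):
        for size in range(min(max_len, len(clause) - i), 0, -1):
            piece = clause[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical medical lexicon and input text.
MEDICAL_LEXICON = {"患者", "服用", "阿司匹林", "未见", "不良反应"}
document = "患者服用阿司匹林。未见不良反应！"

clause_list = split_clauses(document)
tokenized = [segment(c, MEDICAL_LEXICON) for c in clause_list]
```

The lexicon keeps multi-character medical terms such as 阿司匹林 intact as one token. A production system would more likely use a dedicated segmentation library with a custom dictionary.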
  • Generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed includes:
  • The tokens of the clauses are id-encoded in the order in which they appear in the medical document to be processed, and the id-encoded clauses are expanded by zero-padding so that every padded clause has the same number of elements as the number of tokens in the longest clause; the zero-padded clauses are used as the word vectors of the corresponding clauses.
  • The tokens of the clauses are id-encoded according to the order in which they appear in the document.
  • The starting value of the encoding is 1 and the ending value is the vocabulary size of the document.
  • The largest number of tokens in any clause is recorded as max_sentence_len, and each id-encoded clause is then padded with 0s until its length reaches max_sentence_len. The result is the word vector of the clause, in which the number of 0s equals max_sentence_len minus the number of tokens in the clause.
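The id encoding and zero-padding can be sketched as follows. The function name `encode_and_pad` and the sample tokens are illustrative, and appending the zeros at the end of each clause is an assumption, since the text does not say where the padding goes.

```python
def encode_and_pad(tokenized_clauses):
    """Id-encode tokens by first appearance, then zero-pad every clause.

    Ids start at 1 and end at the vocabulary size; 0 is reserved for padding.
    """
    vocab = {}
    encoded = []
    for tokens in tokenized_clauses:
        ids = []
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # next unused id
            ids.append(vocab[token])
        encoded.append(ids)

    max_sentence_len = max(len(ids) for ids in encoded)
    # Number of zeros added = max_sentence_len - number of tokens in the clause.
    padded = [ids + [0] * (max_sentence_len - len(ids)) for ids in encoded]
    return padded, vocab


tokenized_clauses = [
    ["aspirin", "reduced", "mortality"],
    ["mortality", "fell"],
]
word_vectors, vocab = encode_and_pad(tokenized_clauses)
```

Repeated tokens reuse their first id, so "mortality" is encoded as 3 in both clauses, and the largest id equals the vocabulary size of the document.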
  • This embodiment discloses a device for screening key sentences in medical literature based on deep learning, including:
  • a generating unit 1, configured to split the medical document to be processed into clauses, segment the clauses into tokens, and generate the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • an input unit 2, configured to input the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • The generating unit 1 splits the medical document to be processed into clauses, segments the clauses into tokens, and generates the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed;
  • The input unit 2 inputs the word vectors of the clauses into the pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
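The division of labor between the two units can be sketched as two small classes. The class names mirror the units in the text, but the method names, the injected callables, the 0.5 decision threshold, and the toy stand-ins for segmentation and for the trained model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GeneratingUnit:
    """Turns a document into clauses and their zero-padded id vectors."""
    split_clauses: Callable[[str], List[str]]
    segment: Callable[[str], List[str]]
    encode: Callable[[List[List[str]]], List[List[int]]]

    def run(self, document):
        clauses = self.split_clauses(document)
        tokenized = [self.segment(c) for c in clauses]
        return clauses, self.encode(tokenized)


@dataclass
class InputUnit:
    """Feeds clause vectors to a trained model and keeps the key sentences."""
    model: Callable[[List[int]], float]  # returns P(key sentence)
    threshold: float = 0.5               # assumed decision threshold

    def run(self, clauses, vectors):
        return [c for c, v in zip(clauses, vectors)
                if self.model(v) >= self.threshold]


def toy_encode(tokenized):
    """Toy id encoding with trailing zero-padding."""
    vocab, width = {}, max(len(t) for t in tokenized)
    return [[vocab.setdefault(tok, len(vocab) + 1) for tok in toks]
            + [0] * (width - len(toks)) for toks in tokenized]


generating_unit = GeneratingUnit(
    split_clauses=lambda d: [s.strip() for s in d.split(".") if s.strip()],
    segment=lambda c: c.split(),
    encode=toy_encode,
)
# Toy model standing in for the trained CNN: favors clauses with >= 3 tokens.
input_unit = InputUnit(model=lambda v: 1.0 if sum(1 for x in v if x) >= 3 else 0.0)

clauses, vectors = generating_unit.run("Aspirin reduced mortality. See appendix.")
key_sentences = input_unit.run(clauses, vectors)
```

Injecting the steps as callables keeps the unit structure separate from any particular segmentation algorithm or model.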
  • The device for screening key sentences in medical literature based on deep learning uses a trained deep learning-based convolutional neural network model to screen key sentences in medical literature. Because the constructed convolutional neural network model can combine contextual semantics to capture the local relevance of the document, this solution can improve the accuracy of screening key sentences in medical literature compared with the prior art.
  • The convolutional layer of the convolutional neural network model uses multiple filter windows, each window corresponds to a first number of filters, and the input (independent variable) of the convolution operation is the token vectors within the corresponding filter window;
  • The pooling layer of the convolutional neural network model adopts the max-over-time pooling method.
  • The generating unit is specifically configured so that:
  • The medical document to be processed is split into clauses according to punctuation marks, and the clauses are segmented into tokens based on a word segmentation algorithm and a medical lexicon.
  • The generating unit is specifically configured so that:
  • The tokens of the clauses are id-encoded in the order in which they appear in the medical document to be processed, and the id-encoded clauses are expanded by zero-padding so that every padded clause has the same number of elements as the number of tokens in the longest clause; the zero-padded clauses are used as the word vectors of the corresponding clauses.
  • The device of this embodiment may be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
  • FIG. 3 shows a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
  • The electronic device may include: a processor 11, a memory 12, a bus 13, and a computer program stored on the memory 12 and executable on the processor 11;
  • The processor 11 and the memory 12 communicate with each other through the bus 13;
  • When the processor 11 executes the computer program, the method provided by the above method embodiments is implemented, for example including: splitting a medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; and inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • Embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the method provided by the foregoing method embodiments is implemented, for example including: splitting a medical document to be processed into clauses, segmenting the clauses into tokens, and generating the word vectors of the clauses by id-encoding the tokens in the order in which they appear in the medical document to be processed; and inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model to obtain the key sentences in the medical document to be processed.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • The terms “installation”, “connected”, and “connection” should be understood in a broad sense: for example, a connection can be fixed, detachable, or integral; it can be mechanical or electrical; and it can be direct, indirect through an intermediary, or internal between two components.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

According to certain embodiments, the present disclosure provides a deep learning-based method and device for screening key sentences in a medical document, capable of improving the accuracy of screening key sentences in medical documents. The method comprises the following steps: S1, performing sentence segmentation on a medical document to be processed, performing word segmentation on the resulting clauses, id-encoding the tokens in the order in which they appear in the medical document to be processed, and generating word vectors for the clauses; S2, inputting the word vectors of the clauses into a pre-trained deep learning-based convolutional neural network model, and obtaining the key sentences in the medical document to be processed.
PCT/CN2019/124561 2018-10-12 2019-12-11 Deep learning-based method and device for screening key sentences in a medical document WO2020074023A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811188041.2 2018-10-12
CN201811188041.2A CN109472021A (zh) 2018-10-12 2018-10-12 Method and device for screening key sentences in medical literature based on deep learning

Publications (1)

Publication Number Publication Date
WO2020074023A1 (fr) 2020-04-16

Family

ID=65663812

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124561 WO2020074023A1 (fr) 2018-10-12 2019-12-11 Deep learning-based method and device for screening key sentences in a medical document

Country Status (2)

Country Link
CN (1) CN109472021A (fr)
WO (1) WO2020074023A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667781A (zh) * 2020-12-31 2021-04-16 北京万方数据股份有限公司 Method and device for acquiring malignant tumor literature

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
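CN109472021A (zh) * 2018-10-12 2019-03-15 北京诺道认知医学科技有限公司 Method and device for screening key sentences in medical literature based on deep learning
CN110298028B (zh) * 2019-05-21 2023-08-18 杭州未名信科科技有限公司 Method and device for extracting key sentences from text paragraphs
CN110728143A (zh) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and apparatus for identifying key sentences in documents
CN112749277B (zh) * 2020-12-30 2023-08-04 杭州依图医疗技术有限公司 Medical data processing method, device, and storage medium
CN112735543A (zh) * 2020-12-30 2021-04-30 杭州依图医疗技术有限公司 Medical data processing method, device, and storage medium
CN114860900A (zh) * 2022-04-07 2022-08-05 海信集团控股股份有限公司 Sentencing prediction method and device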

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
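CN106776580A (zh) * 2017-01-20 2017-05-31 中山大学 Topic sentence recognition method using a hybrid CNN and RNN deep neural network
CN106980683A (zh) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Deep learning-based method for generating blog text summaries
US20170213130A1 (en) * 2016-01-21 2017-07-27 Ebay Inc. Snippet extractor: recurrent neural networks for text summarization at industry scale
CN108595411A (zh) * 2018-03-19 2018-09-28 南京邮电大学 Method for obtaining multi-text summaries from a same-topic text collection
CN109359300A (zh) * 2018-10-12 2019-02-19 北京大学第三医院 Method and device for screening keywords in medical literature based on deep learning
CN109472021A (zh) * 2018-10-12 2019-03-15 北京诺道认知医学科技有限公司 Method and device for screening key sentences in medical literature based on deep learning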

Also Published As

Publication number Publication date
CN109472021A (zh) 2019-03-15

Similar Documents

Publication Publication Date Title
WO2020074023A1 (fr) Deep learning-based method and device for screening key sentences in a medical document
CN106776711B (zh) Deep learning-based method for constructing a Chinese medical knowledge graph
CN110399457B (zh) Intelligent question answering method and system
CN106649260B (zh) Method for constructing a product feature structure tree based on review text mining
CN108009182B (zh) Information extraction method and device
US20170330084A1 (en) Clarification of Submitted Questions in a Question and Answer System
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
US11151179B2 (en) Method, apparatus and electronic device for determining knowledge sample data set
US20170286832A1 (en) Analyzing Concepts Over Time
US11934461B2 (en) Applying natural language pragmatics in a data visualization user interface
CN104636466B (zh) Entity attribute extraction method and system for open web pages
CN104462057B (zh) Method and system for generating lexical resources for language analysis
CN110188147B (zh) Knowledge graph-based method and system for discovering entity relations in literature
CN107463548B (zh) Phrase mining method and device
KR102491172B1 (ko) Natural language question answering system and learning method thereof
CN111813950B (zh) Method for constructing a building-domain knowledge graph based on neural network adaptive optimization and parameter tuning
WO2020010834A1 (fr) Method, apparatus, and device for generalizing an FAQ question-and-answer library
CN110414004B (zh) Method and system for core information extraction
CN111625659A (zh) Knowledge graph processing method, device, server, and storage medium
WO2020074017A1 (fr) Deep learning-based method and device for screening keywords in a medical document
CN108228758A (zh) Text classification method and device
CN108280164B (zh) Short text filtering and classification method based on category-related words
CN109522396B (zh) Knowledge processing method and system for the national defense science and technology field
CN109741824B (zh) Machine learning-based medical consultation method
CN108959314A (zh) Semantic retrieval method and device

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19870423; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19870423; Country of ref document: EP; Kind code of ref document: A1)