CN113033155A - Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists - Google Patents

Publication number
CN113033155A
Authority
CN
China
Prior art keywords
data
medical
vector data
clinical
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110597714.5A
Other languages
Chinese (zh)
Other versions
CN113033155B (en)
Inventor
汤步洲
黄源航
熊英
陈清财
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Original Assignee
Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Application filed by Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Priority to CN202110597714.5A
Publication of CN113033155A
Application granted
Publication of CN113033155B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an automatic medical concept coding method and device that combine sequence generation with a hierarchical vocabulary. The task of coding medical concepts in clinical medical text is recast as a sequence generation problem, and a hierarchical vocabulary is introduced to strengthen the relationships between medical terms. During sequence generation, the standard medical terms corresponding to the clinical medical text are determined according to the hierarchical vocabulary and coded automatically. This addresses the problem in the prior art that manually mapping medical concepts in clinical medical text to standard medical term codes is costly, limited in efficiency, and not very accurate.

Figure 202110597714

Description

An automatic coding method for medical concepts combining sequence generation and a hierarchical vocabulary

Technical Field

The invention relates to the field of medical concept coding, and in particular to an automatic medical concept coding method that combines sequence generation with a hierarchical vocabulary.

Background

Automatic coding of medical concepts is an important research direction in medical information processing. In medical information systems, the same standard medical term may be expressed as a medical concept in many different ways. These inconsistent and imprecise expressions seriously hinder the integration, sharing, and use of medical big data, and cause considerable inconvenience for clinical work, teaching, and research. Medical coding is a labeling system of digits and letters that provides a unique, uniform coded representation for each diagnosis, symptom, or combination of symptoms. At present, medical institutions rely on manual coding to map medical concepts in clinical medical text to standard medical term codes. Manual coding requires a large number of professionals with medical knowledge, so it is costly, limited in efficiency, and not very accurate.

Therefore, the prior art still needs to be improved and developed.

Summary of the Invention

The technical problem to be solved by the invention is, in view of the above defects of the prior art, to provide an automatic medical concept coding method and device combining sequence generation and a hierarchical vocabulary, so as to address the problem that manually mapping medical concepts in clinical medical text to standard medical term codes is costly, limited in efficiency, and not very accurate.

The technical solution adopted by the invention is as follows:

In a first aspect, an embodiment of the invention provides an automatic medical concept coding method combining sequence generation and a hierarchical vocabulary, wherein the method comprises:

obtaining clinical medical text, inputting the clinical medical text into a preset encoder, and obtaining initial vector data of the clinical medical text;

obtaining pre-built hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm, and obtaining standard medical term vector data of the hierarchical vocabulary;

inputting the initial vector data of the clinical medical text and the already generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to several standard medical terms, and forming, from that encoded data, the encoded data corresponding to the clinical medical text.

In one embodiment, obtaining the clinical medical text, inputting the clinical medical text into the preset encoder, and obtaining the initial vector data of the clinical medical text comprises:

inputting the clinical medical text into a word embedding layer, and obtaining mapped data after the word embedding layer maps the clinical medical text;

inputting the mapped data into the encoder, and obtaining the initial vector data that the encoder generates by encoding the mapped data.

In one embodiment, obtaining the pre-built hierarchical vocabulary data, inputting the hierarchical vocabulary data into the preset learning algorithm, and obtaining the standard medical term vector data of the hierarchical vocabulary comprises:

obtaining the code information of the standard medical term data in term dictionary data, and dividing the standard medical term data into parent nodes and child nodes according to the code information;

obtaining the parent nodes, the child nodes, and the parent-child relationship information between them, and building the hierarchical vocabulary data from the parent nodes, the child nodes, and the parent-child relationship information;

inputting the hierarchical vocabulary data into the preset learning algorithm to obtain vector data representing the parent nodes, the child nodes, and the parent-child relationship information;

using the vector data representing the parent nodes, the child nodes, and the parent-child relationship information as the standard medical term vector data of the hierarchical vocabulary.

In one embodiment, the code information contains letter segment information and digit segment information.

In one embodiment, obtaining the code information of the standard medical term data in the term dictionary data, and dividing the standard medical term data into parent nodes and child nodes according to the code information comprises:

treating each item of standard medical term data as a node;

treating all nodes whose letter segment information is of the same kind, and whose digit segment information has the same digits before a preset position, as nodes of the same class;

within each class, taking the node with the shortest digit segment information as the parent node, and taking the remaining nodes as child nodes.

In one embodiment, the decoder contains a classifier, and the classifier contains labels for a plurality of standard medical terms. Inputting the initial vector data of the clinical medical text and the already generated standard medical term vector data into the preset decoder, sequentially generating encoded data corresponding to several standard medical terms, and forming, from the encoded data, the standard medical term sequence data corresponding to the clinical medical text comprises:

obtaining sequence data composed of all historical standard medical term vector data output by the decoder, the sequence data being the standard medical term vector data corresponding to the codes output by the decoder before the current time step;

determining, by the classifier and based on the initial vector data and the sequence data, the encoded data output by the decoder at the current time step for the clinical medical text, and repeating this process until no further encoded data can be generated;

forming, from the encoded data, the standard medical term sequence data corresponding to the clinical medical text.
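
As a rough illustration of the generation loop described above, the following Python sketch emits one code per time step, feeds back the codes already produced, and stops when an end marker is selected. The names `greedy_decode`, `step_fn`, `toy_step`, `EOS`, and `max_steps`, the example codes, and the toy scores are all hypothetical and not taken from the patent; in practice the step function would be a trained decoder and classifier.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<EOS>"  # assumed end marker meaning "no further code can be generated"


def greedy_decode(
    text_vector: Sequence[float],
    step_fn: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_steps: int = 10,
) -> List[str]:
    """Generate codes one time step at a time until EOS or max_steps."""
    history: List[str] = []  # codes output before the current time step
    for _ in range(max_steps):
        probs = step_fn(text_vector, history)
        best = max(probs, key=probs.get)  # most probable code at this step
        if best == EOS:
            break
        history.append(best)
    return history


def toy_step(text_vector: Sequence[float], history: List[str]) -> Dict[str, float]:
    """Stand-in for the trained classifier: ignores the text, skips emitted codes."""
    scores = {"J001": 0.6, "K358": 0.3, EOS: 0.1}  # hypothetical codes and scores
    remaining = {c: p for c, p in scores.items() if c not in history}
    total = sum(remaining.values())
    return {c: p / total for c, p in remaining.items()}


if __name__ == "__main__":
    print(greedy_decode([0.1, 0.2], toy_step))
```

The `max_steps` cap is a safety guard added for the sketch; the text itself only states that generation repeats until no further code can be produced.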

In one embodiment, the classifier includes a probability function, and determining, by the classifier and based on the initial vector data and the sequence data, the encoded data output by the decoder at the current time step for the clinical medical text comprises:

fusing the initial vector data with the vector data corresponding to the sequence data to obtain fused vector data;

inputting the fused vector data into the probability function, and obtaining the probability values that the probability function generates for several candidate items of encoded data based on the fused vector data;

sorting the probability values by magnitude, and taking the encoded data with the largest probability value as the encoded data output by the decoder at the current time step.
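
The selection step above can be sketched in a few lines of Python. The text does not say how the vectors are fused or which probability function is used, so this sketch assumes concatenation for the fusion and a softmax over linear scores for the probability function. The names `softmax`, `pick_code`, and `weights`, and the example codes, are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax (assumed form of the probability function)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_code(
    text_vector: Sequence[float],
    history_vector: Sequence[float],
    weights: Dict[str, Sequence[float]],
) -> str:
    """Fuse the two vectors, score every candidate code, return the most probable."""
    fused = list(text_vector) + list(history_vector)  # assumed fusion: concatenation
    codes = list(weights)
    # One linear score per candidate code (hypothetical classifier weights).
    scores = [sum(w * x for w, x in zip(weights[c], fused)) for c in codes]
    probs = softmax(scores)
    # Sort candidates by probability, largest first, and keep the top one.
    ranked = sorted(zip(codes, probs), key=lambda cp: cp[1], reverse=True)
    return ranked[0][0]


if __name__ == "__main__":
    toy_weights = {"J001": [2.0, 0.0, 0.0, 1.0], "K358": [0.0, 1.0, 1.0, 0.0]}
    print(pick_code([1.0, 0.0], [0.0, 1.0], toy_weights))
```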

In a second aspect, an embodiment of the invention further provides an automatic medical concept coding device combining sequence generation and a hierarchical vocabulary, wherein the device comprises:

an acquisition module, configured to obtain clinical medical text, input the clinical medical text into a preset encoder, and obtain initial vector data of the clinical medical text;

a learning module, configured to obtain pre-built hierarchical vocabulary data, input the hierarchical vocabulary data into a preset learning algorithm, and obtain standard medical term vector data of the hierarchical vocabulary;

a coding module, configured to input the initial vector data of the clinical medical text and the already generated standard medical term vector data into a preset decoder, sequentially generate encoded data corresponding to several standard medical terms, and form, from the encoded data, the standard medical term sequence data corresponding to the clinical medical text.

Beneficial effects of the invention: the embodiments of the invention recast the task of coding medical concepts in clinical medical text as a sequence generation problem and introduce a hierarchical vocabulary to strengthen the relationships between medical terms. During sequence generation, the standard medical terms corresponding to the clinical medical text are determined according to the hierarchical vocabulary and coded automatically. This addresses the problem in the prior art that manually mapping medical concepts in clinical medical text to standard medical term codes is costly, limited in efficiency, and not very accurate.

Description of the Drawings

To explain the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of an automatic medical concept coding method combining sequence generation and a hierarchical vocabulary, provided by an embodiment of the invention.

FIG. 2 is a schematic diagram of the framework of the Seq2Seq model provided by an embodiment of the invention.

FIG. 3 is a module connection diagram of an automatic medical concept coding device combining sequence generation and a hierarchical vocabulary, provided by an embodiment of the invention.

FIG. 4 is a schematic block diagram of a terminal provided by an embodiment of the invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it.

Note that any directional indications in the embodiments of the invention (such as up, down, left, right, front, back, and so on) only explain the relative positions and movements of components in a particular posture (as shown in the drawings). If that posture changes, the directional indication changes accordingly.

With the rapid development and widespread adoption of information technologies such as the Internet, big data, cloud computing, and artificial intelligence, human production and daily life have been affected to an unprecedented degree. In recent years, information technology has gradually been applied to every aspect of social life. Across industries it has changed how people manage, analyze, and use data, and many fields, including the economy and culture, now depend on it. Among these application areas, medicine is one of the most important and most promising. The medical field involves a great deal of information processing, and medical information has the following characteristics:

1) The volume of data is large and grows quickly;

2) The demand for sharing is high.

The greatest strength of information technology is its efficiency in processing data, so it is now widely used in medicine, which has given rise to medical information processing as a field of computer application. Medical information processing combines computer technology with the needs of the healthcare industry: it meets the needs of medical institutions and related departments for collecting, organizing, storing, and analyzing health information, improves the efficiency of the health sector, and meets users' functional requirements. It improves both the efficiency and the accuracy of handling medical information, taking the development of medical informatics to a new level. How to use medical information processing to raise the standard of care and advance medicine has long been an active research question for scholars in related fields.

Automatic coding of medical concepts is an important research direction in medical information processing. Clinical medical text usually refers to the written records that medical staff produce during medical activities to describe a patient's clinical presentation, and it may contain several medical concepts. In medical information systems, the same standard medical term may be expressed as a medical concept in many different ways. First, medical staff differ in how they write records, and for the sake of efficiency their records may contain many synonyms, abbreviations, loanwords, or colloquial expressions. As a result, it is common in clinical medical text for one term to correspond to several expressions. For example, in Chinese clinical medical text the term "先天性脊柱侧弯" (congenital scoliosis) may also be written "先天性脊柱侧凸" or "先天性脊柱侧弯畸形"; in English clinical medical text, "heart attack", "MI", and "myocardial infarction" can all mean myocardial infarction. Second, in some cases several diagnosis-related or symptom-related medical concepts are closely related and easily confused, and the same medical concept in clinical medical text may correspond to different standard medical terms depending on context. For example, in Chinese clinical medical text the diagnosis-related concept "鼻咽瘘" (nasopharyngeal fistula) may, depending on context, correspond to the standard medical term "鼻窦瘘" (sinus fistula) or to the standard medical term "咽瘘" (pharyngeal fistula). These inconsistent and imprecise expressions seriously hinder the integration, sharing, and use of medical big data, and cause considerable inconvenience for clinical work, teaching, and research. Medical coding is a labeling system of digits and letters that provides a unique coded representation for each diagnosis, symptom, or combination of symptoms. Normalizing the medical concepts in clinical medical text, in a uniform way, to the codes that the corresponding standard medical terms have in a medical coding system is therefore especially urgent for advancing medical informatization. At present, some medical institutions rely on manual coding to map medical concepts in clinical medical text to standard medical term codes. In this process, coders look up the medical concepts or other relevant information in the clinical medical text and then, following coding guidelines, manually assign an appropriate standard medical term code to each concept. Because medical institutions generate a huge volume of text every day, manual coding requires a large number of professionals with medical knowledge, so it is costly, limited in efficiency, and not very accurate.

In view of the above defects of the prior art, the invention provides an automatic medical concept coding method combining sequence generation and a hierarchical vocabulary. The task of coding medical concepts in clinical medical text is recast as a sequence generation problem, and a hierarchical vocabulary is introduced to strengthen the relationships between medical terms. During sequence generation, the standard medical terms corresponding to the clinical medical text are determined according to the hierarchical vocabulary and coded automatically. This addresses the problem in the prior art that manually mapping medical concepts in clinical medical text to standard medical term codes is costly, limited in efficiency, and not very accurate.

As shown in FIG. 1, this embodiment provides an automatic medical concept coding method combining sequence generation and a hierarchical vocabulary, and the method comprises the following steps:

Step S100: obtain clinical medical text, input the clinical medical text into a preset encoder, and obtain initial vector data of the clinical medical text.

To code the medical concepts in clinical medical text automatically, this embodiment first obtains the clinical medical text to be coded. Because the main goal of this embodiment is to use computer technology to map medical concepts automatically to the encoded data of the corresponding standard medical terms, the clinical medical text must first be processed into a form the computer can calculate with. Specifically, as shown in FIG. 2, this embodiment uses a Seq2Seq model as the main generation framework. A Seq2Seq model is a model used when the length of the output is not known in advance. For example, when a machine translation task translates a Chinese sentence into English, the English output may be shorter or longer than the Chinese input, so the output length is uncertain, and a Seq2Seq model suits exactly this case. The Seq2Seq model in this embodiment comprises an encoder and a decoder. The encoder may be a bidirectional LSTM, which converts the input text data into vector form; this vector can be regarded as the semantic vector of the input text. Specifically, once the clinical medical text has been obtained, it is input into the encoder of the preset Seq2Seq model. The encoder learns from this input and encodes the clinical medical text into vector form, which yields the initial vector data of the clinical medical text.

In one implementation, step S100 specifically comprises the following steps:

Step S110: input the clinical medical text into a word embedding layer, and obtain mapped data after the word embedding layer maps the clinical medical text;

Step S120: input the mapped data into the encoder, and obtain the initial vector data of the clinical medical text that the encoder generates by encoding the mapped data.
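
As a small illustration of step S110, the sketch below maps tokens to low-dimensional vectors with a lookup table and returns the mapped data that step S120 would pass to the encoder. The vocabulary, the three-dimensional vectors, and the names `VOCAB`, `EMBEDDINGS`, and `embed` are hypothetical; in a trained model the table would be learned rather than written by hand.

```python
from typing import Dict, List

# Hypothetical toy vocabulary and 3-dimensional embedding table.
VOCAB: Dict[str, int] = {"<UNK>": 0, "heart": 1, "attack": 2}
EMBEDDINGS: List[List[float]] = [
    [0.0, 0.0, 0.0],   # <UNK>
    [0.2, -0.1, 0.5],  # heart
    [0.7, 0.3, -0.4],  # attack
]


def embed(tokens: List[str]) -> List[List[float]]:
    """Map each token to its vector; unknown tokens share the <UNK> row."""
    return [EMBEDDINGS[VOCAB.get(tok, VOCAB["<UNK>"])] for tok in tokens]


if __name__ == "__main__":
    print(embed(["heart", "attack"]))
```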

To obtain the initial vector data, this embodiment first inputs the clinical medical text into the word embedding layer. The word embedding layer maps the features of the words in the text to a lower dimension and then outputs the mapped data, which reduces the number of model parameters and speeds up training. The mapped data is then input into the encoder, which encodes it and generates the initial vector data. For example, this embodiment uses a bidirectional LSTM as the encoder. On the encoder side, the words of the clinical medical text [formula image] are first mapped by the word embedding layer to vector representations [formula image], and these vectors [formula image] are then encoded with the bidirectional LSTM to obtain the hidden-layer representations:

[formula image]

[formula image]

Because the encoder used in this embodiment is a bidirectional LSTM, the two hidden-layer representations output by the bidirectional LSTM are concatenated to obtain the final hidden-layer representation:

[formula image]

where [formula image] is the initial vector data of the clinical medical text.
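
The formula images in this passage are not reproduced in the text. For orientation only, a bidirectional LSTM encoder is conventionally written as below. The symbols $x_i$ (embedded word $i$), $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ (forward and backward hidden states), $h_i$ (their concatenation), and $n$ (text length) are assumed notation and may differ from the symbols in the patent's own formulas.

```latex
\begin{aligned}
\overrightarrow{h}_i &= \overrightarrow{\mathrm{LSTM}}\left(\overrightarrow{h}_{i-1},\, x_i\right), \\
\overleftarrow{h}_i &= \overleftarrow{\mathrm{LSTM}}\left(\overleftarrow{h}_{i+1},\, x_i\right), \\
h_i &= \left[\overrightarrow{h}_i ;\, \overleftarrow{h}_i\right], \qquad i = 1, \dots, n.
\end{aligned}
```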

如图1所示,所述方法还包括如下步骤:As shown in Figure 1, the method further includes the following steps:

步骤S200、获取预先构建的层级词表数据,将所述层级词表数据输入预设的学习算法中,并获得所述层级词表的标准医学术语向量数据。Step S200: Obtain pre-built hierarchical vocabulary data, input the hierarchical vocabulary data into a preset learning algorithm, and obtain standard medical term vector data of the hierarchical vocabulary.

为了实现医学概念的自动编码过程,本实施例会需要获取预先构建的层级词表数据,所述层级词表数据实际上指的是包含各种标准医学术语之间层级关系的词表数据,由于本实施例中需要用到深度学习模型,因此还需要将所述层级词表中包含的标准医学术语转换成深度学习模型可以执行的向量格式,因此需要将所述层级词表数据输入预设的学习算法中,并获得所述层级词表的标准医学术语向量数据。In order to realize the automatic coding process of medical concepts, this embodiment will need to acquire pre-built hierarchical vocabulary data, which actually refers to vocabulary data containing hierarchical relationships between various standard medical terms. In the embodiment, a deep learning model needs to be used, so it is also necessary to convert the standard medical terms contained in the hierarchical vocabulary into a vector format that can be executed by the deep learning model, so it is necessary to input the hierarchical vocabulary data into the preset learning. algorithm, and obtain the standard medical term vector data of the hierarchical vocabulary.

In one implementation, step S200 specifically includes the following steps:

Step S210: Acquire the coding information of the standard medical term data in the term dictionary data, and divide the standard medical term data into parent nodes and child nodes according to the coding information;

Step S220: Acquire the parent nodes, the child nodes, and the parent-child relationship information between them, and construct the hierarchical vocabulary data from the parent nodes, the child nodes, and the parent-child relationship information;

Step S230: Input the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent nodes, the child nodes, and the parent-child relationship information;

Step S240: Use the vector data representing the parent nodes, the child nodes, and the parent-child relationship information as the standard medical term vector data of the hierarchical vocabulary.

First, this embodiment builds a hierarchical vocabulary in advance. To do so, it acquires the coding information of the standard medical term data in the term dictionary data and, according to that coding information, divides the standard medical term data into parent nodes and child nodes, which determines the inclusion relationships between the standard medical terms. In one implementation, the coding information may include letter segment information and number segment information; for example, the code of a standard medical term may be J001. Each standard medical term is treated as a node. Nodes whose letter segment information is of the same kind and whose number segment information shares the same digits before a preset position are grouped as nodes of the same class. Within each class, the node with the shortest number segment information is taken as the parent node, and the remaining nodes are taken as child nodes.

In one implementation, this embodiment may use an algorithm to determine the parent nodes and child nodes; the structure they form is a tree-shaped hierarchy. The algorithm is as follows:

A. Define the data structure of a tree node. Each tree node contains two parts: a code string and a list of child nodes. Go to step B.

B. Initialize the root node of the tree. The root node's code is the empty string and its list of child nodes is empty. Go to step C.

C. If the standard medical term dictionary is empty, the algorithm ends. Otherwise, take one (code, term) pair from the standard medical term dictionary and go to step D.

D. Set the current node to the root node. If the current node's code is a prefix (letter segment information) of the taken code and the two differ, set the exit-loop flag to false and go to step E; otherwise go to step H.

E. If the current node's list of child nodes is empty, go to step G. Otherwise, take one child node and go to step F.

F. If the child node's code is a prefix of the taken code and the two differ, set the current node to that child node, set the exit-loop flag to true, and go to step F; otherwise go to step E.

G. If the exit-loop flag is true, go to step C. Otherwise, go to step D.

H. Initialize a new node whose code is the taken code and whose list of child nodes is empty. Add the new node to the current node's list of child nodes and go to step C.
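The following Python sketch illustrates one way to build such a prefix tree. It is not the patent's exact control flow: the step labels above do not obviously describe a terminating loop, so the sketch implements what appears to be the intent, namely attaching each code under the deepest existing node whose code is a proper prefix of it. The names `CodeNode`, `build_code_tree`, and `parent_child_pairs` are hypothetical, and inserting shorter codes first is an assumption made so that parents exist before their children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CodeNode:
    """One tree node: a code string plus its list of child nodes (step A)."""
    code: str
    children: List["CodeNode"] = field(default_factory=list)


def build_code_tree(term_dict: Dict[str, str]) -> CodeNode:
    """Attach every code under the deepest node whose code is a proper prefix."""
    root = CodeNode(code="")  # step B: empty code, empty child list
    # Assumption: process shorter codes first so parents exist before children.
    for code in sorted(term_dict, key=lambda c: (len(c), c)):
        current = root
        descended = True
        while descended:  # keep descending while some child is a proper prefix
            descended = False
            for child in current.children:
                if code != child.code and code.startswith(child.code):
                    current = child
                    descended = True
                    break
        current.children.append(CodeNode(code=code))  # step H
    return root


def parent_child_pairs(node: CodeNode) -> List[Tuple[str, str]]:
    """Flatten the tree into (parent_code, child_code) pairs, skipping the root."""
    pairs: List[Tuple[str, str]] = []
    for child in node.children:
        if node.code:
            pairs.append((node.code, child.code))
        pairs.extend(parent_child_pairs(child))
    return pairs
```

For a dictionary containing the codes J00, J001, J002, J0011, and K10, this yields J00 as the parent of J001 and J002, and J001 as the parent of J0011, while K10 stays a top-level node.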

Once the parent nodes and child nodes are determined, the parent-child relationship information between them is acquired, and the hierarchical vocabulary data is built from the parent nodes, the child nodes, and the parent-child relationship information. The hierarchical vocabulary data is then input into a preset learning algorithm, for example the TransE algorithm, to obtain vector data representing the parent nodes, the child nodes, and the parent-child relationship information. Finally, this vector data is used as the standard medical term vector data of the hierarchical vocabulary. In short, this embodiment represents standard medical terms as low-dimensional dense vectors and captures the inclusion relationships between them, so that standard medical terms with similar semantics lie close together in the vector space. In addition, by building a hierarchical vocabulary, this embodiment determines the subordinate or hierarchical relationship of each standard medical term, which makes the hierarchy between standard medical terms explicit, allows the standard medical terms corresponding to the medical concepts in a clinical medical text to be selected more accurately, and ultimately achieves automatic coding of medical concepts.
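As a minimal illustration of the idea behind TransE, the sketch below turns parent-child pairs into (head, relation, tail) triples and scores a triple by the distance between head + relation and tail, together with the margin ranking loss commonly used for training. The relation name `parent_of`, the function names, and the toy two-dimensional vectors in the usage note are assumptions for illustration; the patent does not specify them, and no training loop is shown.

```python
import math
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def hierarchy_triples(pairs: Sequence[Tuple[str, str]],
                      relation: str = "parent_of") -> List[Triple]:
    """Turn (parent_code, child_code) pairs into triples for a TransE-style model."""
    return [(parent, relation, child) for parent, child in pairs]


def transe_distance(h: Sequence[float], r: Sequence[float],
                    t: Sequence[float]) -> float:
    """L2 norm of h + r - t; a small value means the triple is plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(pos_distance: float, neg_distance: float,
                margin: float = 1.0) -> float:
    """Margin ranking loss for one true triple and one corrupted triple."""
    return max(0.0, margin + pos_distance - neg_distance)
```

With toy vectors J00 = [1, 0], J001 = [1, 1], and parent_of = [0, 1], the true triple (J00, parent_of, J001) has distance 0, so the loss is zero whenever a corrupted triple lies at least the margin farther away.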

As shown in Figure 1, the method further includes the following step:

Step S300: Input the initial vector data of the clinical medical text and the already generated standard medical term vector data into a preset decoder, sequentially generate encoded data corresponding to a number of standard medical terms, and form the standard medical term sequence data corresponding to the clinical medical text from the encoded data.

The Seq2Seq model used in this embodiment also includes a decoder; in one implementation, a unidirectional LSTM may be used as the decoder. Specifically, once the initial vector data of the clinical medical text and the standard medical term vector data have been obtained, both are input into the decoder, which sequentially generates the encoded data corresponding to a number of standard medical terms; this encoded data forms the standard medical term sequence data corresponding to the clinical medical text. In short, when coding the medical concepts in a clinical medical text, this embodiment refers not only to the hierarchical relationships between standard medical terms but also to the sequence of standard medical terms already generated, and from these determines the correct codes for the medical concepts in the text.

In one implementation, the decoder contains a classifier, and the classifier contains labels for a plurality of standard medical terms. Step S300 specifically includes the following steps:

Step S310: Acquire sequence data composed of all historical standard medical term vector data output by the decoder; the sequence data is the standard medical term vector data corresponding to the codes output by the decoder before the current time step;

Step S320: Determine, by the classifier and based on the initial vector data and the sequence data, the encoded data output by the decoder at the current time step for the clinical medical text; repeat this process until no more encoded data can be generated;

Step S330: Form the standard medical term sequence data corresponding to the clinical medical text from the encoded data.
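Steps S310 to S330 amount to an autoregressive loop, which can be sketched as follows. The decoder and classifier are abstracted into a caller-supplied `step_fn` that returns a probability for each label given the text vectors and the codes generated so far. The name `step_fn`, the end marker `<eos>`, and the `max_steps` cap are assumptions; the patent only says the process repeats until no more encoded data can be generated.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical label meaning "no more codes to generate"

StepFn = Callable[[Sequence[Sequence[float]], List[str]], Dict[str, float]]


def generate_codes(text_vectors: Sequence[Sequence[float]],
                   step_fn: StepFn,
                   max_steps: int = 20) -> List[str]:
    """Generate codes one at a time, feeding the history back in at each step."""
    history: List[str] = []  # S310: codes output before the current time step
    for _ in range(max_steps):
        probs = step_fn(text_vectors, history)  # S320: classifier output
        best = max(probs, key=probs.get)        # most probable label
        if best == END:
            break
        history.append(best)
    return history  # S330: the code sequence for the text
```

In a real model, `step_fn` would look up the TransE vectors of the codes in `history`, update the decoder state, and apply the classifier; here it is left as a plain function so the loop itself stays visible.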

To automatically code the medical concepts in a clinical medical text, this embodiment first acquires sequence data composed of all historical standard medical term vector data output by the decoder, that is, the standard medical term vector data corresponding to the codes the decoder output before the current time step. Then, to determine the standard medical term vector data most similar to the clinical medical text at the current time step, this embodiment inputs the initial vector data and the sequence data into the classifier. The classification space of the classifier can be regarded as a label set in which each label corresponds to one standard medical term. Based on the initial vector data and the sequence data, the classifier determines the encoded data that the decoder outputs at the current time step for the clinical medical text. Specifically, the classifier is provided with a probability function. To determine the encoded data output by the decoder at the current time step, this embodiment first fuses the initial vector data with the vector data corresponding to the sequence data to obtain fused vector data. The fused vector data is then input into the probability function, which generates probability values for a number of possible encoded data. Each probability value indicates, to some extent, the correlation or degree of association between the corresponding spliced sequence data and the clinical medical text; that is, this embodiment introduces an attention mechanism to determine the standard medical terms corresponding to the clinical medical text.

For example, at the decoder, the hidden-layer representation [Figure 184055DEST_PATH_IMAGE009] of the decoder at time [Figure 152514DEST_PATH_IMAGE008] is computed as follows, where [Figure 246689DEST_PATH_IMAGE010] is the vector representation of the standard medical term corresponding to the code [Figure 950389DEST_PATH_IMAGE012] with the highest probability in the output distribution at time [Figure 386684DEST_PATH_IMAGE011], that is, the vector representation learned with the TransE algorithm:

[Figure 226649DEST_PATH_IMAGE013]

Computing the probability requires an attention mechanism, which yields the correlation coefficients between [Figure 460185DEST_PATH_IMAGE009] and the clinical medical text. The attention mechanism is computed as follows, where [Figure 821896DEST_PATH_IMAGE014] denotes the vector computed by the attention mechanism at time [Figure 674445DEST_PATH_IMAGE008]:

[Figure 805212DEST_PATH_IMAGE015]

[Figure 209649DEST_PATH_IMAGE016]

[Figure 058656DEST_PATH_IMAGE017]

where [Figure 695655DEST_PATH_IMAGE018] denotes the correlation between [Figure 946508DEST_PATH_IMAGE019] and the i-th word of the clinical medical text, [Figure 521846DEST_PATH_IMAGE018] denotes the correlation coefficient between [Figure 858149DEST_PATH_IMAGE019] and the i-th word of the clinical medical text, and [Figure 052501DEST_PATH_IMAGE020], [Figure 157861DEST_PATH_IMAGE021], and [Figure 904100DEST_PATH_IMAGE022] are learnable parameters.
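Since the attention formulas survive only as image placeholders, the block below gives the widely used additive attention formulation, which is consistent with the description (a correlation score, a normalized correlation coefficient, a resulting vector, and three learnable parameters). This is an assumed reconstruction, not the patent's verified formulas: $s_t$ (decoder hidden state at time $t$), $h_i$ (encoder representation of the $i$-th word), $e_{t,i}$, $\alpha_{t,i}$, $c_t$, and the parameters $W$, $U$, $v$ are notation chosen here.

```latex
\begin{aligned}
e_{t,i} &= v^{\top} \tanh\left(W s_t + U h_i\right), \\
\alpha_{t,i} &= \frac{\exp\left(e_{t,i}\right)}{\sum_{j=1}^{n} \exp\left(e_{t,j}\right)}, \\
c_t &= \sum_{i=1}^{n} \alpha_{t,i}\, h_i .
\end{aligned}
```

Under this reading, $e_{t,i}$ is the raw correlation with the $i$-th word, $\alpha_{t,i}$ is the normalized correlation coefficient, and $c_t$ is the vector produced by the attention mechanism at time $t$.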

The classifier then outputs the probability distribution:

[Figure 852333DEST_PATH_IMAGE023]

[Figure 975010DEST_PATH_IMAGE024]

where [Figure 934876DEST_PATH_IMAGE025] denotes the vector obtained at time [Figure 852016DEST_PATH_IMAGE026] by fusing [Figure 038278DEST_PATH_IMAGE019] and [Figure 699066DEST_PATH_IMAGE027] through a nonlinear activation function f (such as tanh or ReLU), [Figure 513439DEST_PATH_IMAGE028] denotes the output distribution of the decoder at time t, and [Figure 867060DEST_PATH_IMAGE029], [Figure 524306DEST_PATH_IMAGE030], and [Figure 988785DEST_PATH_IMAGE031] are learnable parameters.
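The fusion and classification step can be sketched in plain Python as below. This is one plausible reading, not the patent's confirmed formula: the decoder state and the attention vector are each projected, summed, and passed through tanh, and a softmax over a final projection gives one probability per standard-term label. The function names and the parameter names `W_s`, `W_c`, and `W_o` are hypothetical stand-ins for the three learnable parameters mentioned above.

```python
import math
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(state: Sequence[float], context: Sequence[float],
             W_s: Matrix, W_c: Matrix, W_o: Matrix,
             labels: Sequence[str]) -> Dict[str, float]:
    """Fuse decoder state and attention vector, then score every label."""
    # Fusion with f = tanh (assumed form: tanh(W_s * state + W_c * context)).
    fused = [math.tanh(a + b)
             for a, b in zip(matvec(W_s, state), matvec(W_c, context))]
    probs = softmax(matvec(W_o, fused))  # one probability per label
    return dict(zip(labels, probs))
```

The label with the largest probability in the returned dictionary is the code emitted at the current time step.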

The probability values are then sorted by magnitude, and the encoded data with the largest probability value is taken as the encoded data output by the decoder at the current time step. This process is repeated until no more encoded data can be generated. Finally, the standard medical term sequence data corresponding to the clinical medical text is formed from the encoded data, which completes the automatic coding of the clinical medical text.

In general, although the prior art includes machine-learning techniques for coding medical concepts, most of them generate codes with a greedy search strategy, which at each decoder time step selects only the vector of the standard medical term with the highest probability, so the search space is limited. This embodiment instead uses beam search: at each time step, the few sequences with the highest probabilities are kept as candidate sequences, and the candidate with the highest probability is selected as the final target sequence. Because this embodiment considers the relationship between each sequence and the clinical medical text and keeps the several highest-probability sequences as candidates, the beam search strategy has a larger search space than greedy search and produces more accurate codes.
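The difference between greedy search and beam search can be illustrated with a small, self-contained sketch. It is a generic beam search, not the patent's implementation: `step_fn` is a hypothetical function returning a probability for each next label given the sequence so far, `<eos>` is an assumed end marker, and sequences are ranked by summed log-probability. Setting `beam_width=1` reduces it to greedy search.

```python
import math
from typing import Callable, Dict, List, Tuple

END = "<eos>"  # hypothetical end-of-sequence label

Beam = Tuple[List[str], float]  # (labels so far, summed log-probability)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                beam_width: int = 3,
                max_steps: int = 20) -> List[str]:
    """Keep the beam_width best partial sequences at every time step."""
    beams: List[Beam] = [([], 0.0)]
    finished: List[Beam] = []
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for seq, score in beams:
            for label, p in step_fn(seq).items():
                if p <= 0.0:
                    continue  # log(0) is undefined; skip impossible labels
                cand = (seq + [label], score + math.log(p))
                # Sequences that emit END are complete; others stay in play.
                (finished if label == END else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    pool = finished or beams  # fall back to unfinished beams at the step cap
    best_seq, _ = max(pool, key=lambda c: c[1])
    return [label for label in best_seq if label != END]
```

On a toy distribution where the first label A (0.6) leads only to continuations of probability 0.5, while B (0.4) leads to a continuation of probability 0.9, greedy search commits to A and ends with sequence probability 0.30, whereas a beam of width 2 keeps B alive and finds the sequence with probability 0.36.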

Based on the above embodiment, the present invention also provides an apparatus for automatic coding of medical concepts that combines sequence generation and a hierarchical vocabulary. As shown in FIG. 3, the apparatus includes:

an acquisition module 01, configured to acquire a clinical medical text, input the clinical medical text into a preset encoder, and acquire the initial vector data generated by the encoder based on the clinical medical text;

a learning module 02, configured to obtain pre-built hierarchical vocabulary data, input the hierarchical vocabulary data into a preset learning algorithm, and obtain the standard medical term vector data of the hierarchical vocabulary;

an encoding module 03, configured to input the initial vector data of the clinical medical text and the already generated standard medical term vector data into a preset decoder, sequentially generate encoded data corresponding to a number of standard medical terms, and form the standard medical term sequence data corresponding to the clinical medical text from the encoded data.

Based on the above embodiments, the present invention also provides a terminal, whose functional block diagram may be as shown in FIG. 4. The terminal includes a processor, a memory, a network interface, and a display screen connected through a system bus. The processor of the terminal provides computing and control capabilities. The memory of the terminal includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the terminal is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements a method for automatic coding of medical concepts that combines sequence generation and a hierarchical vocabulary. The display screen of the terminal may be a liquid crystal display or an electronic ink display.

Those skilled in the art will understand that the block diagram shown in FIG. 4 shows only the part of the structure related to the solution of the present invention and does not limit the terminal to which the solution is applied. A specific terminal may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one implementation, the memory of the terminal stores one or more programs configured to be executed by one or more processors, and the one or more programs contain instructions for performing the method for automatic coding of medical concepts that combines sequence generation and a hierarchical vocabulary.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

In summary, the present invention discloses a method and apparatus for automatic coding of medical concepts that combine sequence generation and a hierarchical vocabulary. The coding of medical concepts in a clinical medical text is recast as a sequence generation problem, and a hierarchical vocabulary is introduced to strengthen the relationships between medical terms. During sequence generation, the standard medical terms corresponding to the clinical medical text are accurately determined according to the hierarchical vocabulary and coded automatically. This solves the problem in the prior art that manually mapping the medical concepts in clinical medical texts to standard medical term codes is costly, inefficient, and not very accurate.

It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or modifications based on the above description, and all such improvements and modifications fall within the protection scope of the appended claims of the present invention.

Claims (9)

1. A method for automatic coding of medical concepts in conjunction with sequence generation and hierarchical vocabularies, the method comprising:
acquiring a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text;
acquiring pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical vocabulary;
inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
2. The method of claim 1, wherein the obtaining clinical medical texts, inputting the clinical medical texts into a preset encoder, and obtaining initial vector data of the clinical medical texts comprises:
inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
inputting the mapping data into an encoder, and acquiring initial vector data of the clinical medical text generated by the encoder based on the mapping data.
3. The method as claimed in claim 1, wherein the step of obtaining pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a predetermined learning algorithm, and obtaining standard medical term vector data of the hierarchical vocabulary comprises:
acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information;
acquiring the parent node, the child node and the parent-child relationship information between the parent node and the child node, and constructing hierarchical vocabulary data according to the parent node, the child node and the parent-child relationship information between the parent node and the child node;
inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent node, the child node and the parent-child relationship information;
and taking vector data representing the parent node, the child nodes and the parent-child relationship information as standard medical term vector data of the hierarchical vocabulary.
4. The method as claimed in claim 3, wherein the coding information comprises letter segment information and number segment information.
5. The method as claimed in claim 4, wherein the obtaining of encoding information of standard medical term data in term dictionary data, and the dividing of the standard medical term data into parent nodes and child nodes according to the encoding information comprises:
taking each standard medical term data as a node;
grouping, as nodes of the same class, the nodes whose letter segment information is of the same kind and whose number segment information has the same digits before a preset position;
and, within each class of nodes, taking the node with the shortest number segment information as the parent node and taking the remaining nodes as child nodes.
6. The method as claimed in claim 1, wherein the decoder comprises a classifier containing labels of a plurality of standard medical terms, and wherein inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data comprises:
acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
determining, by the classifier, encoded data output by the decoder at a current time step corresponding to the clinical medical text based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
7. The method of claim 6, wherein the classifier comprises a probability function, and determining, by the classifier, the encoded data output by the decoder at the current time step corresponding to the clinical medical text based on the initial vector data and the sequence data comprises:
fusing the initial vector data with vector data corresponding to the sequence data to obtain fused vector data;
inputting the fused vector data into the probability function, and acquiring probability values of a plurality of possible encoded data generated by the probability function based on the fused vector data;
and sorting the probability values by magnitude, and taking the encoded data with the maximum probability value as the encoded data output by the decoder at the current time step.
8. An apparatus for automatic coding of medical concepts in conjunction with sequence generation and hierarchical vocabularies, the apparatus comprising:
the acquisition module is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder and acquiring initial vector data generated by the encoder based on the clinical medical text;
the learning module is used for acquiring pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm and acquiring standard medical term vector data of the hierarchical vocabulary;
and the encoding module is used for inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
9. A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to perform the steps of a method for automatic encoding of medical concepts in conjunction with sequence generation and hierarchical vocabularies of any of the preceding claims 1-7.
CN202110597714.5A 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists Active CN113033155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597714.5A CN113033155B (en) 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597714.5A CN113033155B (en) 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists

Publications (2)

Publication Number Publication Date
CN113033155A true CN113033155A (en) 2021-06-25
CN113033155B CN113033155B (en) 2021-10-26

Family

ID=76455886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597714.5A Active CN113033155B (en) 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists

Country Status (1)

Country Link
CN (1) CN113033155B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091411A (en) * 2021-11-30 2022-02-25 云知声智能科技股份有限公司 Method, device and system for generating standardized medical text based on pointer network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182976A (en) * 2017-12-28 2018-06-19 西安交通大学 Clinical medical information extraction method based on neural network
CN109299273A (en) * 2018-11-02 2019-02-01 广州语义科技有限公司 Multi-source multi-label text classification method and system based on improved seq2seq model
CN109408820A (en) * 2018-10-17 2019-03-01 长沙瀚云信息科技有限公司 Medical terminology mapping system and method, device and storage medium
CN110705214A (en) * 2019-08-27 2020-01-17 天津开心生活科技有限公司 Automatic coding method and device
CN110827929A (en) * 2019-11-05 2020-02-21 中山大学 Disease classification code recognition method and device, computer equipment and storage medium
CN111063446A (en) * 2019-12-17 2020-04-24 医渡云(北京)技术有限公司 Method, apparatus, device and storage medium for standardizing medical text data
CN112802568A (en) * 2021-02-03 2021-05-14 紫东信息科技(苏州)有限公司 Multi-label stomach disease classification method and device based on medical history text

Also Published As

Publication number Publication date
CN113033155B (en) 2021-10-26

Similar Documents

Publication Publication Date Title
CN114530223A (en) NLP-based cardiovascular disease medical record structuring system
CN110110324B (en) Biomedical entity linking method based on knowledge representation
CN114036934B (en) A Chinese medical entity relationship joint extraction method and system
CN118606440B (en) Data intelligent analysis method and system combining knowledge graph and rule constraints
WO2021238604A1 (en) Translation method and apparatus, and electronic device and computer readable storage medium
EP4174714A1 (en) Text sequence generation method, apparatus and device, and medium
CN116737879A (en) Knowledge base query method and device, electronic equipment and storage medium
CN116719840B (en) Medical information pushing method based on post-medical-record structured processing
CN111563380A (en) Named entity identification method and device
CN112749277A (en) Medical data processing method and device and storage medium
CN115455175A (en) Method and device for generating cross-language summarization based on multilingual model
Hifny Hybrid LSTM/MaxEnt networks for Arabic syntactic diacritics restoration
CN113033155B (en) Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists
CN114398492B (en) Knowledge graph construction method, terminal and medium in digital field
Zhao et al. A multi-scale embedding network for unified named entity recognition in Chinese electronic medical records
CN115935914A (en) Admission record missing text supplementing method
CN112883711B (en) Method and device for generating abstract and electronic equipment
Ling et al. Enhancing Chinese Address Parsing in Low-Resource Scenarios through In-Context Learning
CN114358021A (en) Task type dialogue statement reply generation method based on deep learning and storage medium
CN114490935A (en) Abnormal text detection method, device, computer readable medium and electronic device
CN114218954B (en) Method and device for distinguishing the positive and negative nature of disease entities and symptom entities in medical record text
CN115587589B (en) Statement confusion degree acquisition method and system for multiple languages and related equipment
CN117038099A (en) Medical term standardization method and device
CN115270792A (en) Medical entity identification method and device
Cao et al. Near-optimal active learning for multilingual grapheme-to-phoneme conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant