KR102611011B1

KR102611011B1 - Apparatus and method for analyzing electronic health record data

Info

Publication number: KR102611011B1
Application number: KR1020230045459A
Authority: KR
Inventors: 문예찬; 한상철; 안병은; 이기병; 김광준; 정경수
Original assignee: 주식회사 에이아이트릭스
Priority date: 2022-04-19
Filing date: 2023-04-06
Publication date: 2023-12-07
Also published as: KR20230149228A

Abstract

An apparatus and method for analyzing electronic health record data are disclosed. The apparatus includes a memory that stores a program for analyzing electronic health record data, and a control unit that analyzes the electronic health record data by executing the program. The control unit generates an embedding vector by selectively aggregating feature information at each timestamp of electronic health record data organized by timestamp, and analyzes the electronic health record data by inputting the generated embedding vector into a pre-trained artificial intelligence model.

Description

Electronic health record data analysis device and method {APPARATUS AND METHOD FOR ANALYZING ELECTRONIC HEALTH RECORD DATA}

The embodiments disclosed herein relate to an apparatus and method for analyzing electronic health record data and, more specifically, to an apparatus and method that analyze electronic health record (EHR) data while effectively handling the missing data it contains.

Electronic health record (EHR) data is information generated when medical institutions treat patients; it refers to patient health information that is systematically collected in digital form and stored electronically. EHR data may include all information related to a patient, for example vital signs, past medical history, medications and allergies, laboratory data (laboratory test feature data), immunization dates, imaging reports, and personal demographic information such as age and sex.

Artificial intelligence (AI) models that analyze such EHR data to make predictions about various diseases are now in increasingly active use.

However, such EHR data may contain missing data.

EHR data can be analyzed after replacing the missing data with other values, but there are many ways to perform that replacement, so the performance of an AI model that analyzes EHR data may vary depending on the imputation method.
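To make the dependence on the imputation method concrete, the following minimal Python sketch fills the same series with two common strategies. The function names and the heart-rate readings are hypothetical and are not taken from the patent; the point is only that the two strategies hand a downstream model different inputs.

```python
from statistics import mean


def mean_impute(series):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in series]


def forward_fill(series):
    """Replace each None with the most recent observed value.

    A leading None stays None because nothing precedes it.
    """
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


# Hypothetical hourly heart-rate readings; None marks a missing measurement.
heart_rate = [80, None, None, 110, None]

by_mean = mean_impute(heart_rate)    # gaps filled with the mean of 80 and 110
by_ffill = forward_fill(heart_rate)  # gaps filled with the previous reading
```

The two filled series disagree at every gap, so a model trained on one is effectively trained on different data than a model trained on the other.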

Therefore, research is needed on a system that can handle the missing data in EHR data without degrading the performance of the AI model that analyzes the data.

Meanwhile, the background art described above is technical information that the inventors possessed in order to derive the present invention or acquired in the course of deriving it, and it is not necessarily known art disclosed to the general public before the filing of the present application.

Korean Patent No. 10-2304370 (published September 24, 2021)

An object of the embodiments disclosed herein is to provide an apparatus and method for analyzing electronic health record data that prevent the performance of an artificial intelligence (AI) model from degrading by effectively handling the missing data in EHR data when analyzing it.

Other objects and advantages of the present invention can be understood from the following description and will become more apparent from the embodiments. It will also be readily appreciated that the objects and advantages of the present invention can be realized by the means recited in the claims and combinations thereof.

As a technical means for achieving the above technical object, an apparatus for analyzing electronic health record data includes a memory that stores a program for analyzing electronic health record data, and a control unit that analyzes the electronic health record data by executing the program. The control unit generates an embedding vector by selectively aggregating feature information at each timestamp of electronic health record data organized by timestamp, and analyzes the electronic health record data by inputting the generated embedding vector into a pre-trained artificial intelligence model.

According to another embodiment, a method for analyzing electronic health record data, performed by an apparatus for analyzing electronic health record data, includes: generating an embedding vector by selectively aggregating feature information at each timestamp of electronic health record data organized by timestamp; and analyzing the electronic health record data by inputting the generated embedding vector into a pre-trained artificial intelligence model.

According to yet another embodiment, a recording medium is a computer-readable recording medium on which a program for performing a method for analyzing electronic health record data is recorded. The method includes: generating an embedding vector by selectively aggregating feature information at each timestamp of electronic health record data organized by timestamp; and analyzing the electronic health record data by inputting the generated embedding vector into a pre-trained artificial intelligence model.

According to yet another embodiment, a computer program is executed by an apparatus for analyzing electronic health record data and is stored on a recording medium in order to perform a method for analyzing electronic health record data. The method includes: generating an embedding vector by selectively aggregating feature information at each timestamp of electronic health record data organized by timestamp; and analyzing the electronic health record data by inputting the generated embedding vector into a pre-trained artificial intelligence model.

According to any one of the above means, the missing data in EHR data is handled effectively using an imputation-free algorithm and the EHR data is then analyzed. This prevents inappropriate bias from arising during data analysis and thereby prevents the performance of the AI model from degrading.

The effects obtainable from the disclosed embodiments are not limited to those mentioned above, and other effects not mentioned here will be clearly understood from the description below by those of ordinary skill in the art to which the disclosed embodiments belong.

The accompanying drawings illustrate preferred embodiments disclosed herein and, together with the detailed description, serve to aid understanding of the technical idea disclosed herein. The disclosure should therefore not be construed as limited to what is shown in the drawings.
FIG. 1 is a functional block diagram of an apparatus for analyzing electronic health record data according to an embodiment.
FIGS. 2 and 3 are exemplary diagrams for explaining the apparatus for analyzing electronic health record data according to an embodiment.
FIG. 4 is a flowchart of a method for analyzing electronic health record data according to an embodiment.

Various embodiments are described in detail below with reference to the accompanying drawings. The embodiments described below may be modified and implemented in many different forms. To describe the features of the embodiments more clearly, detailed descriptions of matters widely known to those of ordinary skill in the art to which the embodiments belong are omitted. In the drawings, parts unrelated to the description of the embodiments are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a component is said to be "connected" to another component, this covers not only the case where it is directly connected but also the case where it is connected with another component in between. Likewise, when a component is said to "include" another component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

Before that description, the meanings of the terms used below are defined first.

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between its input layer and output layer. Like an ordinary artificial neural network, a deep neural network can model complex non-linear relationships. For example, in a deep neural network structure for an object identification model, each object can be expressed as a hierarchical composition of basic image elements, and the additional layers can combine the features gradually gathered from the lower layers. This property allows a deep neural network to model complex data with fewer nodes than a similarly performing artificial neural network. According to one embodiment, a deep neural network can analyze electronic health record data to predict prevalence or mortality figures for various diseases. According to one embodiment, the deep neural network may be the known long short-term memory (LSTM) or bidirectional recurrent imputation for time series (BRITS) network. The LSTM can be designed with reference to the paper (Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation 1997; 9(8):1735-1780.), and BRITS can be designed with reference to the paper (Cao W, Wang D, Li J, Zhou H, Li L, Li Y. BRITS: Bidirectional Recurrent Imputation for Time Series. Advances in Neural Information Processing Systems 2018.).

Terms other than those defined above that require explanation are explained individually below.

FIG. 1 is a functional block diagram of an apparatus for analyzing electronic health record data according to an embodiment, and FIGS. 2 and 3 are exemplary diagrams for explaining the apparatus according to an embodiment.

The electronic health record data analysis device 100 according to this embodiment may be implemented as an electronic terminal on which an application capable of interacting with a user is installed, as a server, or as a server-client system. When implemented as a server-client system, it may include an electronic terminal on which an online service application for interacting with the user is installed.

The electronic terminal may be implemented as a computer, portable terminal, television, wearable device, or the like that can connect to a remote server over a network or connect to other devices and servers. Here, the computer includes, for example, a notebook, desktop, or laptop equipped with a web browser. The portable terminal is, for example, a wireless communication device that guarantees portability and mobility, and may include any kind of handheld wireless communication device, such as a PCS (Personal Communication System), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), GSM (Global System for Mobile communications), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), WiBro (Wireless Broadband Internet), smartphone, or Mobile WiMAX (Mobile Worldwide Interoperability for Microwave Access) device. The television may include IPTV (Internet Protocol Television), Internet TV, terrestrial TV, cable TV, and the like. A wearable device is an information processing device of a type that can be worn directly on the human body, such as a watch, glasses, accessories, clothing, or shoes, and can connect to a remote server or to another device over a network, either directly or through another information processing device.

The server may be implemented as a computer that can communicate over a network with an electronic terminal on which an application or web browser for interacting with the user of the electronic health record data analysis device 100 is installed, or it may be implemented as a cloud computing server. The server may also include a storage device capable of storing data, or may store data through a third-party server.

As described above, the electronic health record data analysis device 100 may be implemented as an electronic terminal, a server, or a server-client system. When it is implemented as a server, the components that make up the electronic health record data analysis device 100 may run on a plurality of physically separate servers or on a single server.

Referring to FIG. 1, the electronic health record data analysis device 100 according to this embodiment includes an input/output unit 110, a communication unit 120, a memory 130, and a control unit 140.

Referring to FIG. 2, the electronic health record data analysis device 100 can analyze electronic health record data by inputting the data into a preset algorithm (A) and then inputting the algorithm's output into a pre-trained artificial intelligence model (B).

The preset algorithm (A) may be the MADE (Masked Attention-based timestamp-wise Data Embedding) algorithm. According to one embodiment, the electronic health record data analysis device 100 may use the MADE algorithm to organize the electronic health record data by timestamp and to generate an embedding vector by selectively aggregating feature information at each timestamp of the data so organized. The electronic health record data analysis device 100 may generate the embedding vector by converting multivariate time series data into token-wise vector form. Here, the feature information may be a patient's health record information contained in the electronic health record data, for example pulse, blood pressure, or hemoglobin level. The generated embedding vector may be a fixed-length vector.

The electronic health record data analysis device 100 can then analyze the electronic health record data by inputting the generated embedding vector into the pre-trained artificial intelligence model (B). The result of the analysis may be information that predicts prevalence or mortality figures for the various diseases contained in the electronic health record data. Because the embedding vector is generated without replacing the missing values in the electronic health record data, inappropriate bias can be prevented from arising in the subsequent analysis.

The MADE (Masked Attention-based timestamp-wise Data Embedding) algorithm is described in more detail as follows.

■ MADE (Masked Attention-based timestamp-wise Data Embedding) algorithm

MADE is based on the transformer neural network architecture, and its core mechanism is the transformer's attention mechanism. In essence, MADE takes electronic health record data organized as multivariate time series data that represents the patient's state sequentially over time, generates an embedding vector by selectively aggregating feature information at each timestamp, and provides the generated embedding vector as input to a pre-trained artificial intelligence model.

Because the transformer layer takes its input in token-wise vector form, the multivariate time series data must be converted into token-wise vector form. For example, when an input feature value $x_i^t$ (where $i$ is the feature index and $t$ is the timestamp index) is fed to the MADE algorithm, it passes through the $i$-th fully connected layer $f_i$ and is converted into a $d$-dimensional vector $e_i^t$. The MADE algorithm then takes as input data all of the feature token vectors of a single timestamp $t$, denoted $E^t$. The transformer layer processes the token vectors $E^t$ using scaled dot-product attention (QKV attention). The calculation may be

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Here, the matrices $Q$, $K$, and $V$ refer to the queries, keys, and values, respectively, and $d_k$ denotes the dimension of the key matrix $K$. The matrices $Q$, $K$, and $V$ may all be computed from the same input token vectors $E^t$. In the MADE algorithm, the QKV attention matrix is computed using the token vectors $E^t$ and their mask matrix $M^t$, which indicates the validity of each feature observation at time $t$. Over the forward pass of the MADE algorithm, the information in the input token vectors $E^t$ is aggregated into the state token vector $s^t$ at time $t$, which yields the embedding vector. The generated embedding vector, that is, the state token vector $s^t$, is extracted as the output of the MADE algorithm and can be provided as an input vector (embedded token) to a deep neural network (DNN), the pre-trained artificial intelligence model. Because each state token vector $s^t$ is computed using only a single unit of timestamp data (input feature values) $x^t$, positional encoding tokens and other information tokens (for example, type embedding tokens) are not needed. Because the MADE algorithm is optimized through the loss function of LSTM and BRITS, the pre-trained algorithms described below, it can ignore missing feature values and aggregate the measured information with weights that are optimal for the given task.
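The masked aggregation step can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `masked_attention_pool` is hypothetical, the state token is used directly as the single query, and the feature tokens serve as both keys and values without the learned projection matrices a real transformer layer would apply. It only shows how a mask can give missing features exactly zero attention weight, so that no imputed value enters the embedding.

```python
import math


def masked_attention_pool(state, tokens, mask):
    """Aggregate observed feature tokens with scaled dot-product attention.

    state  : query vector of length d (stands in for the state token)
    tokens : feature token vectors of length d, used as keys and values
    mask   : 0/1 flags; 0 marks a missing feature, which gets zero weight
    """
    d = len(state)
    scores = [
        sum(q * k for q, k in zip(state, tok)) / math.sqrt(d) if m else float("-inf")
        for tok, m in zip(tokens, mask)
    ]
    if all(s == float("-inf") for s in scores):
        # Nothing was observed at this timestamp.
        return [0.0] * d, [0.0] * len(tokens)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # masked scores become 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return pooled, weights


state = [0.5, -0.5]
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # third feature is missing
embedding, weights = masked_attention_pool(state, tokens, mask=[1, 1, 0])
```

Whatever placeholder sits in the slot of a missing feature, the resulting embedding is unchanged, which is the sense in which the approach is imputation-free.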

The input/output unit 110 may include an input unit for receiving input from a user and an output unit for displaying information such as the result of a task or the status of the electronic health record data analysis device 100. For example, the input/output unit 110 may include an operation panel that receives user input and a display panel that displays a screen.

Specifically, the input unit may include devices capable of receiving various forms of input, such as a keyboard, physical buttons, a touch screen, a camera, or a microphone. The output unit may include a display panel, a speaker, or the like. The input/output unit 110 is not limited to these and may include components that support various other kinds of input and output.

The communication unit 120 can perform wired or wireless communication with other devices and/or networks. To this end, the communication unit 120 may include a communication module that supports at least one of various wired and wireless communication methods. For example, the communication module may be implemented as a chipset.

The wireless communication supported by the communication unit 120 may be, for example, Wi-Fi (Wireless Fidelity), Wi-Fi Direct, Bluetooth, Bluetooth Low Energy (BLE), UWB (Ultra Wide Band), NFC (Near Field Communication), or wireless mobile communication such as LTE or LTE-Advanced. The wired communication supported by the communication unit 120 may be, for example, USB or HDMI (High Definition Multimedia Interface).

The memory 130 can install and store various kinds of data, such as files, applications, and programs, and may be configured to include at least one of various kinds of memory, such as RAM, HDD, and SSD. The control unit 140, described below, may access and use the data stored in the memory 130 or store new data in the memory 130. The control unit 140 may also execute a program installed in the memory 130. A program for performing the method for analyzing electronic health record data according to an embodiment may be installed in the memory 130. Various electronic health record data about patients may also be stored in the memory 130 in advance.

The control unit 140 includes at least one processor, such as a CPU, GPU, or Arduino, and can control the overall operation of the electronic health record data analysis device 100. That is, the control unit 140 can control the other components of the electronic health record data analysis device 100 so that the electronic health record data can be analyzed. The control unit 140 may also execute a program stored in the memory 130, read a file stored in the memory 130, or store a new file in the memory 130.

According to an embodiment, the control unit 140 may preprocess the electronic health record data stored in the memory 130 as raw data. Electronic health record data recorded as raw data is irregularly sampled time series data with many missing values that routinely arise from non-automated, manual data collection processes, so careful preprocessing is desirable before analysis. The preprocessing of the raw electronic health record data is as follows.

The control unit 140 may organize the electronic health record data based on timestamps. The resulting electronic health record data may take the form of multivariate time series data that represents the patient's condition sequentially in time, one entry per timestamp.
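As an illustration of the timestamp-based organization described above, the following minimal sketch groups irregular measurement records into one row per timestamp and keeps track of which values were actually observed. The record layout `(timestamp, feature, value)`, the function name `to_multivariate_series`, and the sample values are assumptions made for this example and do not come from the specification.

```python
from collections import defaultdict

def to_multivariate_series(records, features):
    """Group (timestamp, feature, value) records into one row per timestamp.

    Returns (timestamps, values, mask): values[i][j] is the measurement of
    features[j] at timestamps[i] (None if not recorded), and mask[i][j] is 1
    if that value was observed and 0 if it is missing.
    """
    by_time = defaultdict(dict)
    for timestamp, feature, value in records:
        by_time[timestamp][feature] = value  # a later duplicate overwrites an earlier one
    timestamps = sorted(by_time)
    values = [[by_time[t].get(f) for f in features] for t in timestamps]
    mask = [[0 if v is None else 1 for v in row] for row in values]
    return timestamps, values, mask

# Hypothetical records: (hours since admission, feature name, value)
records = [
    (0.0, "pulse", 82.0),
    (0.0, "blood_pressure", 118.0),
    (1.5, "pulse", 90.0),
    (4.0, "hemoglobin", 13.2),
]
features = ["pulse", "blood_pressure", "hemoglobin"]
timestamps, values, mask = to_multivariate_series(records, features)
```

Missing entries are left as `None` rather than filled in, so a downstream step can decide how to treat them.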

The control unit 140 may generate an embedding vector by selectively aggregating feature information for each timestamp of the electronic health record data formed based on timestamps. Here, the feature information (also rendered as "function information") is the patient's health record information included in the electronic health record data, for example pulse, blood pressure, or hemoglobin level. The generated embedding vector may be a fixed-length vector.

More specifically, the control unit 140 may generate the embedding vector by converting the multivariate time series data into a token-wise time series vector format. The raw electronic health record data is formed as multivariate time series data that represents the patient's condition sequentially in time, one entry per timestamp. According to an embodiment, the transformer layer of MADE (Masked Attention-based timestamp-wise Data Embedding), the preset algorithm (A) shown in FIG. 2, takes its input as one vector per token, so the multivariate time series data must be converted into a token-wise time series vector format. According to an embodiment, the generated embedding vector may be a fixed-length multidimensional embedding vector.
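The specification does not give the internals of the MADE transformer layer, so the sketch below is only a generic illustration of the idea its name suggests: attention over the feature tokens of a single timestamp, with missing features masked out so that they contribute nothing and no value has to be imputed. The function name `masked_attention_pool`, the single query vector, and the toy inputs are assumptions for this example, not the patented algorithm.

```python
import math

def masked_attention_pool(feature_vectors, mask, query):
    """Pool per-feature vectors at one timestamp into one fixed-length vector.

    feature_vectors: list of d-dimensional vectors, one per feature slot.
    mask: 1 if the feature was observed at this timestamp, else 0.
    query: d-dimensional query vector (a learned parameter in a real model).
    Missing features get zero attention weight, so no value is imputed.
    """
    d = len(query)
    scores = [
        sum(q * v for q, v in zip(query, vec)) / math.sqrt(d)
        for vec in feature_vectors
    ]
    observed = [s for s, m in zip(scores, mask) if m]
    if not observed:
        return [0.0] * d  # nothing observed: return a zero vector
    top = max(observed)  # subtract the max for numerical stability
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * vec[k] for w, vec in zip(weights, feature_vectors))
        for k in range(d)
    ]

# The third feature is missing, so its vector is ignored entirely.
pooled = masked_attention_pool(
    feature_vectors=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    mask=[1, 1, 0],
    query=[0.0, 0.0],
)
```

Whatever the number of observed features, the output has the same length as the query, which is one way to obtain a fixed-length embedding per timestamp.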

The control unit 140 may analyze the electronic health record data by inputting the generated embedding vector into a pre-trained artificial intelligence model. The pre-trained artificial intelligence model may be a deep neural network. According to an embodiment, the deep neural network may be an LSTM (long short-term memory) model or a BRITS (bidirectional recurrent imputation for time series) model. The LSTM may be designed with reference to Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation 1997; 9(8):1735-1780, and BRITS may be designed with reference to Cao W, Wang D, Li J, Zhou H, Li L, Li Y. BRITS: Bidirectional Recurrent Imputation for Time Series. Advances in Neural Information Processing Systems 2018. According to an embodiment, the structure of the LSTM combined with MADE, the preset algorithm (A), is as shown in (a) of FIG. 3, and the structure of BRITS combined with MADE may be as shown in (b) of FIG. 3.

The LSTM model described above can predict the output for a given data instance by aggregating the representative vector and the static feature values through a densely connected layer. The BRITS model is similar to the GRU-D model, which uses two additional pieces of information derived from the raw time series data; the only difference is that the GRU cells are replaced with LSTM cells. The first additional piece of information used by GRU-D is the masking vector ($m_t$), which indicates which features are missing at time point ($t$). The second is the interval vector ($\delta_t$), which represents the time elapsed from the last observation to time point ($t$).

GRU-D uses the interval vector ($\delta_t$) to decay feature values toward their mean. Specifically, the interval vector ($\delta_t$) is used to compute a decay rate vector ($\gamma_t^{x}$), which shrinks the feature values at timestamp ($t$) toward the global average value of each feature in proportion to the time interval ($\delta_t$). The exact equation for this vector is $\gamma_t^{x} = \exp\left(-\max\left(0,\ W_{\gamma_x}\delta_t + b_{\gamma_x}\right)\right)$, where $W_{\gamma_x}$ and $b_{\gamma_x}$ are the weight and bias parameters, respectively. In addition, $\gamma_t^{h}$ is another decay rate vector, applied to the hidden state vector of the GRU. $\gamma_t^{h}$ is computed in the same way as $\gamma_t^{x}$, but with different weight ($W_{\gamma_h}$) and bias ($b_{\gamma_h}$) parameters.

The masking vector ($m_t$) is used together with the decay rate vector ($\gamma_t^{x}$) to compute the imputation of missing data. For example, the imputed input vector ($\hat{x}_t$) at timestamp ($t$) is $\hat{x}_t = m_t \odot x_t + (1 - m_t) \odot \left(\gamma_t^{x} \odot x_{t'} + (1 - \gamma_t^{x}) \odot \bar{x}\right)$, where $x_{t'}$ is the most recently observed value of each feature and $\bar{x}$ is the global average value of the input vector. The masking vector $m_t$ is also fed to GRU-D, together with the input vector, as input data for each timestamp.
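The decay and imputation steps described above can be sketched in plain Python. The sketch assumes the published GRU-D formulation, in which the decay rate is exp(-max(0, w·δ + b)) with one weight and one bias per feature, and a missing value is replaced by a blend of the last observation and the global mean. All function and variable names, and the numbers in the usage lines, are hypothetical.

```python
import math

def time_intervals(timestamps, mask):
    """delta[t][d]: time since feature d was last observed (0 at the first step)."""
    n_feat = len(mask[0])
    delta = [[0.0] * n_feat]
    for t in range(1, len(timestamps)):
        gap = timestamps[t] - timestamps[t - 1]
        delta.append([
            gap if mask[t - 1][d] else gap + delta[t - 1][d]
            for d in range(n_feat)
        ])
    return delta

def decay_rate(delta_t, weight, bias):
    """gamma = exp(-max(0, w * delta + b)), elementwise with per-feature w and b."""
    return [math.exp(-max(0.0, w * d + b)) for d, w, b in zip(delta_t, weight, bias)]

def impute(x_t, m_t, gamma_t, x_last, x_mean):
    """Keep observed values; otherwise blend the last observation with the mean."""
    return [
        x if m else g * last + (1.0 - g) * mean
        for x, m, g, last, mean in zip(x_t, m_t, gamma_t, x_last, x_mean)
    ]

# Two features over three timestamps; feature 0 is observed only at the first step.
delta = time_intervals([0.0, 1.0, 3.0], [[1, 1], [0, 1], [0, 1]])
gamma = decay_rate(delta[2], weight=[1.0, 1.0], bias=[0.0, 0.0])
x_hat = impute([None, 5.0], [0, 1], [0.5, 0.5], x_last=[2.0, 9.0], x_mean=[4.0, 9.0])
```

The longer a feature goes unobserved, the larger its interval and the smaller its decay rate, so the imputed value drifts from the last observation toward the mean.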

Once trained, the BRITS model can autoregressively predict the feature values of the next timestamp alongside its main prediction task. This autoregression can repeatedly fill in the missing values in the input data of subsequent timestamps until valid observed data is reached. When the training loss is computed, the mean squared error between the observed values and the regressed values can be computed together with the task-specific loss.
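The combined training loss described above can be sketched as follows. The sketch follows the text in using a mean squared error computed only over observed entries and adding it to the task-specific loss; the balancing coefficient `weight` and all names are assumptions for this example.

```python
def masked_mse(observed, estimated, mask):
    """Mean squared error over observed entries only (mask == 1)."""
    errors = [
        (x - x_hat) ** 2
        for x, x_hat, m in zip(observed, estimated, mask)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0

def total_loss(task_loss, steps, weight=1.0):
    """Task loss plus the average per-timestamp regression loss.

    steps: list of (observed, estimated, mask) triples, one per timestamp.
    weight: hypothetical coefficient balancing the two terms.
    """
    regression = sum(masked_mse(x, x_hat, m) for x, x_hat, m in steps) / len(steps)
    return task_loss + weight * regression

steps = [
    ([1.0, None], [2.0, 0.0], [1, 0]),  # second feature missing: excluded from the error
    ([3.0, 3.0], [3.0, 5.0], [1, 1]),
]
loss = total_loss(task_loss=0.5, steps=steps)
```

Because missing entries are excluded by the mask, the regression term only penalizes estimates that can be checked against a real observation.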

Because the LSTM and BRITS models described above are known technologies, a detailed description of them is omitted.

FIG. 4 is a flowchart of an electronic health record data analysis method according to an embodiment.

The electronic health record data analysis method according to the embodiment shown in FIG. 4 includes steps that are processed in time series by the electronic health record data analysis device 100 shown in FIGS. 1 to 3. Therefore, even where it is omitted below, the description given above of the electronic health record data analysis device 100 shown in FIGS. 1 to 3 also applies to the electronic health record data analysis method according to the embodiment shown in FIG. 4.

As shown in FIG. 4, the electronic health record data analysis device 100 according to this embodiment generates an embedding vector by selectively aggregating feature information for each timestamp of the electronic health record data formed based on timestamps (S410). The electronic health record data analysis device 100 may generate the embedding vector by converting the multivariate time series data into a token-wise time series vector format. Here, the feature information is the patient's health record information included in the electronic health record data, for example pulse, blood pressure, or hemoglobin level. The generated embedding vector may be a fixed-length vector.

Next, the electronic health record data analysis device 100 analyzes the electronic health record data by inputting the embedding vector generated in step S410 into a pre-trained artificial intelligence model (S420). The result of the analysis may be predicted values of the prevalence or mortality of various diseases included in the electronic health record data.

The term "unit" used in the above embodiments refers to software or a hardware component such as an FPGA (field programmable gate array) or an ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium and may be configured to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The functions provided within the components and "units" may be combined into a smaller number of components and "units" or may be further separated into additional components and "units".

In addition, the components and "units" may be implemented to run on one or more CPUs within a device or a secure multimedia card.

The electronic health record data analysis method according to an embodiment described in this specification may also be implemented in the form of a computer-readable medium that stores instructions and data executable by a computer. The instructions and data may be stored in the form of program code and, when executed by a processor, may generate a predetermined program module to perform a predetermined operation. The computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media and removable and non-removable media. The computer-readable medium may also be a computer recording medium, which includes volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. For example, the computer recording medium may be a storage medium such as an HDD or an SSD, an optical recording medium such as a CD, a DVD, or a Blu-ray disc, or memory included in a server accessible through a network.

The electronic health record data analysis method according to an embodiment described in this specification may also be implemented as a computer program (or computer program product) including instructions executable by a computer. The computer program includes programmable machine instructions processed by a processor and may be implemented in a high-level programming language, an object-oriented programming language, an assembly language, or a machine language. The computer program may also be recorded on a tangible computer-readable recording medium (for example, memory, a hard disk, a magnetic/optical medium, or a solid-state drive (SSD)).

Accordingly, the electronic health record data analysis method according to an embodiment described in this specification can be implemented by a computing device executing the computer program described above. The computing device may include at least some of a processor, memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are connected to one another using various buses and may be mounted on a common motherboard or in another suitable manner.

The processor can process instructions within the computing device. Examples of such instructions include instructions stored in the memory or the storage device for displaying graphical information to provide a GUI (graphical user interface) on an external input or output device, such as a display connected to the high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, together with multiple memories and memory types. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory stores information within the computing device. In one example, the memory may consist of a volatile memory unit or a set of such units. In another example, the memory may consist of a non-volatile memory unit or a set of such units. The memory may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device can provide a large amount of storage space to the computing device. The storage device may be a computer-readable medium or a configuration that includes such a medium. For example, it may include devices within a SAN (storage area network) or other configurations, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, flash memory, or another similar semiconductor memory device or device array.

The embodiments described above are illustrative, and a person of ordinary skill in the technical field to which they belong will understand that they can easily be modified into other specific forms without changing their technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of protection sought through this specification is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: electronic health record data analysis device
110: input/output unit
120: communication unit
130: memory
140: control unit

Claims

An electronic health record data analysis device for predicting values relating to various diseases by analyzing electronic health record data using a pre-trained artificial intelligence model, the device comprising:
a memory storing a program for analyzing electronic health record data; and
a control unit that analyzes electronic health record data by executing the program,
wherein the control unit selectively aggregates feature information for each timestamp of electronic health record data formed based on timestamps, thereby generating an embedding vector without replacing missing values of the electronic health record data, and performs analysis on the electronic health record data by inputting the generated embedding vector into the pre-trained artificial intelligence model, and
wherein the control unit converts the electronic health record data, formed as multivariate time series data, into a token-wise time series vector format, inputs it to a transformer layer of the MADE algorithm, and generates the embedding vector as an output of the transformer layer.

(Deleted)

The electronic health record data analysis device according to claim 1, wherein the embedding vector is a fixed-length multidimensional embedding vector.

The electronic health record data analysis device according to claim 1, wherein the pre-trained artificial intelligence model is a deep neural network algorithm.

An electronic health record data analysis method, performed by an electronic health record data analysis device, for predicting values relating to various diseases by analyzing electronic health record data using a pre-trained artificial intelligence model, the method comprising:
generating an embedding vector, without replacing missing values of the electronic health record data, by selectively aggregating feature information for each timestamp of the electronic health record data formed based on timestamps; and
performing analysis on the electronic health record data by inputting the generated embedding vector into the pre-trained artificial intelligence model,
wherein generating the embedding vector comprises converting the electronic health record data, formed as multivariate time series data, into a token-wise time series vector format, inputting it to a transformer layer of the MADE algorithm, and generating the embedding vector as an output of the transformer layer.

(Deleted)

The electronic health record data analysis method according to claim 5, wherein the embedding vector is a fixed-length multidimensional embedding vector.

The electronic health record data analysis method according to claim 5, wherein the pre-trained artificial intelligence model is a deep neural network algorithm.

A computer-readable recording medium on which a program for performing the method according to claim 5 is recorded.

A computer program stored on a recording medium, the program being executed by an electronic health record data analysis device to perform the method according to claim 5.