Disclosure of Invention
Aiming at the problems in the prior art, the present application provides a big-data-based data acquisition method and system, which can process large volumes of diverse data, offer strong reliability and high security, and also handle duplicate records in the acquired data.
According to an embodiment of the first aspect of the present application, a big-data-based data acquisition method includes:
acquiring a plurality of medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
summarizing the unstructured medical data;
processing the medical data;
and carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, before acquiring the medical data by the plurality of acquisition modes, the method further comprises:
performing basic configuration of each service through its corresponding yml file, and transferring medical data among the services in a queue mode.
According to some embodiments of the application, processing the medical data comprises:
verifying the quality of the medical data;
labeling the medical data after verification;
an index is created for the labeled medical data.
According to some embodiments of the application, verifying the quality of the medical data comprises:
checking the accuracy of the medical data;
performing de-duplication processing on the medical data through a neural network;
encrypting the medical data after the duplication removal.
According to some embodiments of the application, labeling the verified medical data includes:
inputting the checked medical data into a BERT neural network to acquire a text vector V;
randomly selecting a plurality of text vectors V as cluster center points a;
acquiring the distance between each remaining item of medical data and each cluster center point a, assigning each item to the category of its nearest center point, and obtaining cluster center points b of the resulting categories of text vectors V after assignment is completed;
obtaining the distance between each remaining item of medical data and each cluster center point b, assigning each item to the category of its nearest center point, obtaining cluster center points c of the resulting categories after assignment is completed, and repeating these steps to obtain a plurality of categories of texts;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
According to some embodiments of the application, labeling the verified medical data includes:
classifying existing medical data into a plurality of types;
training the existing medical data with a BERT+BiLSTM+CNN+Attention+CRF neural network until the accuracy is greater than a threshold;
the newly acquired medical data are classified by the trained BERT+BiLSTM+CNN+Attention+CRF neural network, so that the newly acquired medical data are assigned to the corresponding type.
According to some embodiments of the application, storing the processed medical data locally includes:
acquiring proxy service and port where the attribute table is located;
the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
According to some embodiments of the application, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
performing add, delete and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
According to a second aspect of the present application, a data acquisition system based on big data includes:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarizing module summarizing the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
Through the above technical solution, the following technical effects are obtained: the application integrates and stores various related medical data after collection, provides two dirty-data processing modes, can accurately filter, identify, acquire and display dirty data during processing, offers strong reliability and high security, and can also handle duplicate records in the medical data.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art.
The current three-level comprehensive hospital medical quality management and control index framework comprises 7 major indexes, 44 quality evaluation indexes, 730 single indexes, 2610 composite indexes and 400 monitoring data items, where the index classification includes hospitalization death indexes, readmission indexes, hospital infection indexes, operation complication indexes, patient safety indexes, medical institution rational medication indexes and hospital operation management indexes. The management system is huge, and monitoring it is correspondingly difficult. Moreover, the database server of each business system runs its own DBMS; questions such as whether manual data exist, how large the manual data volume is, and whether unstructured data exist are all technical barriers to medical data interoperability. The embodiments of the present application therefore provide a big-data-based data acquisition method and system. The data acquisition method may be performed on a server, a computer or a similar computing device. Taking a computer as an example, fig. 1 is a block diagram of the hardware structure of a data acquisition computer according to an embodiment of the present application. As shown in fig. 1, the computer 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer described above. For example, computer 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a data acquisition method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of computer 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
As shown in fig. 2, in some embodiments, the big data based data acquisition method includes:
S1, acquiring various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
specifically, medical data may be acquired in corresponding acquisition channels using different acquisition modes. The task scheduling center can manage different data (logs, data in a database, etc.) acquisition tasks. The acquisition mode can be as follows:
(1) Various publicly available data sets exist on the network, and data sets in the medical field can be obtained simply by finding the corresponding website and obtaining the download link. These data sets can help the medical system enrich its internal information, and the configured collector can crawl and organize medical data through methods such as web crawlers and rule matching.
(2) For the logging service of the medical system, a related log collection scheme may be employed. Several log collection tools are relatively common, such as Logstash, Filebeat, Flume, Fluentd, Logagent, rsyslog and syslog-ng. The configured collector reads log text information to collect the medical data.
(3) Corresponding medical data are acquired through social surveys, which can enrich the data content of the medical system. The configured collector acquires the medical data from the survey results.
(4) Medical systems have daily operation and business department modules, whose relevant data are recorded in certain files or systems, such as common medical system databases. A large amount of medical data is stored in these databases, and the configured collector acquires the data in different ways depending on the kind of database.
(5) A medical sensor is a detection device that can sense measured information and convert it, according to a certain rule, into an electrical signal or another required output form. Medical data are acquired through the medical sensor and uploaded, and the configured collector collects them.
S2, summarizing the unstructured medical data;
specifically, medical data are collected through a unified micro-service interface. The collector acquires the data, then sends the data to the micro-service through a Restful interface, and then the data is put in a distributed cache platform of redis for temporary storage.
The unstructured medical data acquired through different channels are summarized and then uniformly handed over to the data quality check for processing. Because the application adopts different collectors, it can process many kinds of data, solving the problem that traditional systems cannot handle diverse data well.
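The following is a minimal sketch of how a collector might push acquired records to the summarizing micro-service and how that service might stage them in Redis for the quality check; the endpoint URL, key names and record fields are illustrative assumptions, not the application's actual interfaces.

```python
import json
import requests          # collector side: send data over a RESTful interface
import redis             # micro-service side: temporary storage in Redis

# Collector side: post one unstructured medical record to the summarizing service.
# The URL and payload fields below are hypothetical placeholders.
record = {"channel": "log", "content": "patient admission note ...", "collected_at": "2023-01-01T08:00:00"}
resp = requests.post("http://collector-gateway/api/v1/medical-data", json=record, timeout=5)
resp.raise_for_status()

# Micro-service side: stage the received record in a Redis list until quality checking.
cache = redis.Redis(host="localhost", port=6379, db=0)
cache.rpush("medical_data:pending", json.dumps(record))

# The quality-check service later drains the list.
raw = cache.lpop("medical_data:pending")
if raw is not None:
    pending_record = json.loads(raw)
```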
S3, processing the medical data;
specifically, after the medical data is acquired, the quality of the medical data is firstly checked, which includes checking the accuracy of various relevant medical information, and performing duplication elimination, encryption and the like on the medical data. And after verification, marking the data subjected to duplication removal, and monitoring the information source.
S4, carrying out local storage and/or cloud storage on the processed medical data.
Specifically, storing medical data only locally carries the risk that some or all of the data may be lost, for example if local equipment is damaged, so the medical data can also be backed up in the cloud. The same data can be sent to the cloud for storage while the acquired medical data are stored locally, or the updated portion of the locally stored medical data can be compressed at intervals (e.g., half a month or one month) and then backed up to the cloud. This ensures the security of the data.
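A minimal sketch of the interval-based backup idea: the locally updated portion is compressed and handed to a cloud upload routine. The directory layout, file names and the upload function are assumptions for illustration only.

```python
import tarfile
from datetime import datetime
from pathlib import Path

def backup_updates(update_dir: str, archive_dir: str) -> Path:
    """Compress the locally updated medical data and return the archive path."""
    stamp = datetime.now().strftime("%Y%m%d")
    archive = Path(archive_dir) / f"medical_update_{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(update_dir, arcname="update")
    return archive

def upload_to_cloud(archive: Path) -> None:
    # Placeholder: in practice this would call the chosen cloud provider's SDK.
    print(f"uploading {archive} to cloud storage ...")

# Run e.g. every half month or month via a scheduler (cron, APScheduler, etc.).
upload_to_cloud(backup_updates("/data/medical/updates", "/data/medical/archives"))
```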
The application can extract medical data from distributed, heterogeneous data sources in a hospital information system (HIS), an electronic medical record system (EMR), a picture archiving and communication system (PACS), a laboratory information system (LIS), a pathology system (PS) and other hospital informatization systems into a temporary middle layer for cleaning, conversion and integration, and finally load the medical data into a database to form the basis for online analysis and mining of medical data, while enabling the storage and application of both structured and unstructured data.
In some embodiments, before the medical data are acquired through the plurality of acquisition modes, the method further comprises:
performing basic configuration of each service through its corresponding yml file, and transferring medical data among the services in a queue mode.
Specifically, all configuration files in the data collection process are stored in Nacos, which holds a number of yml files, namely the configuration files of each service, for example: the collector service, data quality verification service, user center service, gateway service, data management service, and so on. First, the ports, database addresses, startup modes and the like of each service are configured in its yml file. Each service automatically obtains the corresponding configuration from Nacos at startup.
It should be noted that the medical data are transferred between services using a middleware tool, that is, in a queue mode, which absorbs concurrency and reduces the load on the server.
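As a rough illustration of the basic configuration and the queue-based hand-off just described, the sketch below loads a hypothetical per-service yml file and passes medical data between services through a queue; the file name, configuration keys and queue name are assumptions, and in the described system the configuration would be fetched from Nacos rather than a local file.

```python
import json
import yaml    # pip install pyyaml
import redis

# Basic configuration of a service from its yml file (normally served by Nacos).
# Example contents of collector-service.yml (hypothetical):
#   server:
#     port: 8081
#   database:
#     url: jdbc:mysql://db-host:3306/medical
#   startup:
#     mode: standalone
with open("collector-service.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)
port = config["server"]["port"]

# Queue-style hand-off between services to absorb concurrency spikes.
queue = redis.Redis(host="localhost", port=6379)
queue.rpush("medical_data_queue", json.dumps({"source": "collector", "payload": "..."}))
item = queue.blpop("medical_data_queue", timeout=1)   # consuming service waits for the next record
```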
As shown in fig. 3, in some embodiments, performing quality verification and processing on the medical data includes:
S31, checking the quality of the medical data;
specifically, the quality check includes checking the accuracy of medical data through a comparison principle, performing duplication elimination on the medical data through a neural network, and performing triple des algorithm encryption on the duplicated data.
S32, marking the medical data after verification;
specifically, the method can be realized in two ways, wherein the first way is to cluster acquired medical data; the second way is to use a neural network for classification. And after classification is finished, the data source of the data is also stored as an attribute into the whole information of the data.
S33, creating an index for the labeled medical data.
Specifically, each piece of medical data is summarized to obtain a title for the data. There are two ways to obtain the title: one is to directly take the first 10-20 characters of the data as its title; the other is to obtain a summary of the data through the encoder and decoder of a Seq2Seq architecture. Creating an index in Elasticsearch for important attributes such as the summary and the generation time of the data makes it easy for users to query the data quickly.
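A minimal sketch of creating such an index and indexing one labeled record, assuming a recent (8.x) Elasticsearch Python client; the index name and field names are illustrative assumptions.

```python
from elasticsearch import Elasticsearch   # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index for the attributes that users query most often.
es.indices.create(
    index="medical_data",
    mappings={
        "properties": {
            "title":        {"type": "text"},      # first 10-20 characters or Seq2Seq summary
            "summary":      {"type": "text"},
            "category":     {"type": "keyword"},   # label produced in step S32
            "generated_at": {"type": "date"},
        }
    },
)

# Index one labeled record so it can be retrieved quickly.
es.index(index="medical_data", document={
    "title": "Admission note, cardiology ...",
    "summary": "Patient admitted with ...",
    "category": "inpatient_record",
    "generated_at": "2023-01-01T08:00:00",
})
```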
As shown in fig. 4, in some embodiments, verifying the quality of the medical data includes:
S311, checking the accuracy of the medical data;
specifically, the accuracy check of the medical data can be achieved in various ways, for example, the first way is to compare the MD5 code with the MD5 code carried by the data and the medical data sent by the middleware, if the two codes are the same, it is indicated that the transmitted data has no problem; secondly, similarity comparison is carried out through a plurality of data sources of the data, and if the difference is large, the data has a problem; and thirdly, determining whether a large error exists before and after data transmission, if the average value of the same index is greatly different and does not accord with logic, indicating that the transmission process is problematic, and acquiring inaccurate data.
S312, performing de-duplication processing on the medical data through a neural network;
specifically, the neural network is used for de-duplicating the medical data text, and the acquired complete data after inspection is compared with corresponding module information in the existing system. For example, the existing illness state information of a person is obtained and compared with the illness state information of the person in the existing system. The judging mode can use the bert in the neural network to acquire the corresponding sentence vector and then calculate the similarity. A similarity greater than 90% defines text as substantially identical, a similarity greater than 80% as substantially identical, and a similarity less than 50% as different. Filtering out the same medical data and storing different medical data.
S313, encrypting the medical data after the duplication removal.
Specifically, the triple DES algorithm transforms a 64-bit plaintext input block into a 64-bit ciphertext output block; of each 64-bit key, 8 bits are parity bits and the other 56 bits form the effective key.
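A short sketch of triple DES encryption with the pycryptodome library; the key, IV and cipher mode are illustrative, and in practice the key would be managed securely rather than generated in place.

```python
from Crypto.Cipher import DES3                 # pip install pycryptodome
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

key = DES3.adjust_key_parity(get_random_bytes(24))   # 3 x 64-bit keys, 56 effective bits each
iv = get_random_bytes(8)                              # triple DES works on 64-bit (8-byte) blocks

plaintext = b"patient_id=P001;diagnosis=..."
cipher = DES3.new(key, DES3.MODE_CBC, iv)
ciphertext = cipher.encrypt(pad(plaintext, DES3.block_size))

# Decryption with the same key and IV recovers the original medical data.
decrypted = unpad(DES3.new(key, DES3.MODE_CBC, iv).decrypt(ciphertext), DES3.block_size)
assert decrypted == plaintext
```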
In some embodiments, labeling the verified medical data includes:
inputting the checked medical data into a BERT neural network to acquire a text vector V;
randomly selecting a plurality of (e.g., 10) text vectors V as cluster center points a;
acquiring the distance between each remaining item of medical data and each cluster center point a (the closeness of meaning between texts is judged through similarity calculation), assigning each item to the category of its nearest center point, and obtaining cluster center points b for the resulting categories (for example, 10 categories) of text vectors V after assignment is completed;
obtaining the distance between each remaining item of medical data and each cluster center point b (again judged through similarity calculation), assigning each item to the category of its nearest center point, obtaining cluster center points c for the resulting categories (for example, 10 categories) after assignment is completed, and repeating these steps N times to obtain and store a plurality of categories (for example, 10 categories) of texts, as sketched in the example after this list;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
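A minimal sketch of the iterative clustering described above (BERT text vectors, randomly chosen centers, repeated reassignment and center updates); the number of clusters, iteration count, distance measure and the stand-in vectors are assumptions.

```python
import numpy as np

def cluster_text_vectors(vectors: np.ndarray, k: int = 10, n_iter: int = 20, seed: int = 0):
    """Group BERT text vectors V into k categories by iteratively updating cluster centers."""
    rng = np.random.default_rng(seed)
    # Randomly select several text vectors V as the initial cluster center points (a).
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance here is cosine similarity, i.e. closeness of meaning between texts.
        normed_v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        normed_c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        labels = (normed_v @ normed_c.T).argmax(axis=1)       # assign each text to its nearest center
        # The mean of each category becomes the next center point (b, then c, ...).
        centers = np.stack([
            vectors[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
    return labels, centers

# In the described method these vectors would come from a BERT encoder applied to the checked texts.
vectors = np.random.rand(200, 768)          # stand-in for 200 text vectors V
labels, centers = cluster_text_vectors(vectors)
```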
In some embodiments, labeling the verified medical data includes:
classifying existing medical data into a plurality of types;
in particular, the plurality of types may be 10 types, the number being determined according to the amount of existing medical data.
training the existing medical data with a BERT+BiLSTM+CNN+Attention+CRF neural network until the accuracy is greater than a threshold;
Specifically, the graphics card memory of the training device is larger than 10 GB, and the training accuracy is greater than 90%.
The newly acquired medical data are classified by the trained BERT+BiLSTM+CNN+Attention+CRF neural network, so that the newly acquired medical data are assigned to the corresponding type.
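A rough PyTorch sketch of the layered classifier named above (BERT encoding, BiLSTM, CNN, attention pooling and a classification head); the layer sizes, pretrained model name and example input are assumptions, and the CRF layer, which the description lists as part of the network, is omitted here for brevity since it is typically used for sequence labeling rather than whole-record classification.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer   # pip install transformers

class MedicalTextClassifier(nn.Module):
    """Sketch of a BERT + BiLSTM + CNN + Attention classifier for typing medical records."""

    def __init__(self, num_types: int = 10, hidden: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")   # assumed pretrained model
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.cnn = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.attn = nn.Linear(hidden, 1)                              # simple attention pooling
        self.classifier = nn.Linear(hidden, num_types)

    def forward(self, input_ids, attention_mask):
        emb = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        seq, _ = self.bilstm(emb)                                      # (batch, len, 2*hidden)
        feats = self.cnn(seq.transpose(1, 2)).transpose(1, 2)          # (batch, len, hidden)
        weights = torch.softmax(self.attn(feats).squeeze(-1), dim=-1)  # attention over positions
        pooled = (feats * weights.unsqueeze(-1)).sum(dim=1)
        return self.classifier(pooled)                                 # logits over medical data types

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = MedicalTextClassifier()
batch = tokenizer(["患者因胸痛入院..."], return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
predicted_type = logits.argmax(dim=-1)   # newly acquired data falls into the corresponding type
```

Training would proceed with an ordinary classification loss until the accuracy exceeds the chosen threshold (greater than 90% in the embodiment above).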
It should be noted that in this embodiment two or more processing modes are provided for each kind of dirty data. If the processing mode applied to some data fails, the failure signal 0 that it returns can be automatically recognized and another processing mode is immediately started to process the data, which ensures the stability of data processing. For example, when rule-based classification of the medical data fails, the neural network model is immediately used to classify the medical data and obtain an accurate classification result.
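A small sketch of this fallback mechanism: when the first processing mode reports failure (a return signal of 0), the second mode is started automatically. The rule table, function names and stand-in model are illustrative placeholders.

```python
rule_table = {"手术": "operation_record", "化验": "lab_report"}    # assumed keyword-to-type rules

def classify_by_rules(record: dict):
    """Rule-based mode; returns (signal, label) with signal 0 when no rule matches."""
    label = rule_table.get(record.get("keyword"))
    return (1, label) if label else (0, None)

def classify_by_model(record: dict) -> str:
    """Stand-in for the trained neural-network classifier used as the fallback mode."""
    return "general_medical_record"

def classify(record: dict) -> str:
    signal, label = classify_by_rules(record)
    if signal == 0:                     # failure signal 0 -> immediately start the other mode
        label = classify_by_model(record)
    return label

print(classify({"keyword": "随访", "text": "..."}))   # no rule matches, falls back to the model
```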
As shown in fig. 5, in some embodiments, locally storing the processed medical data includes:
S41, acquiring the proxy service and port where the attribute table is located;
specifically, after acquiring medical data containing complete attributes. And (3) connecting the zookeeper through the client, and finding the proxy service and the port where the attribute table is located from the node of the zookeeper.
S42, the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and then stores the current medical data in the database;
S43, storing the corresponding relation between the attribute and the proxy service in the database.
Specifically, the client directly requests the corresponding proxy service; after receiving the request from the client, the proxy service writes the medical data into the attributes.
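A rough sketch, using the kazoo ZooKeeper client, of looking up the proxy service and port for the attribute table and then handing a record to that proxy; the znode path, the data stored under it and the proxy's HTTP endpoint are assumptions about the deployment.

```python
import json
import requests
from kazoo.client import KazooClient   # pip install kazoo

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Hypothetical znode that records where the attribute table's proxy service listens.
data, _stat = zk.get("/medical/attribute_table/proxy")
proxy = json.loads(data.decode("utf-8"))          # e.g. {"host": "10.0.0.5", "port": 8090}
zk.stop()

# The client then requests the corresponding proxy service directly; the proxy scans the
# start keys configured for each attribute, decides which attribute range the record falls
# into, and writes it to the database.
record = {"patient_id": "P001", "attribute": "diagnosis", "value": "..."}
requests.post(f"http://{proxy['host']}:{proxy['port']}/store", json=record, timeout=5)
```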
In some embodiments, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
in particular, this step allows the resources in the database to be properly managed and enables control over the data;
performing add, delete and query operations on the acquisition sources of the medical data;
specifically, websites are monitored in real time according to the information source status, rule status and the like; for keyword searching and collection, acquisition sources can be added or deleted and collection can be started or stopped in real time; and the acquisition strategy is adjusted in real time according to the actual collection situation, for example by adding or removing collectors;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Specifically, a user's data request (a high-level instruction) is converted into low-level machine instructions to perform the database query and obtain the query result; the query result is then processed (its format converted) and returned to the user.
The embodiment also discloses a data acquisition system based on big data, comprising:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
the summarizing module is used for summarizing the unstructured medical data;
a processing module for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
The system treats the storage of a single batch of medical data as a Job; after receiving a Job, it starts a process to complete the storage. The Job module of the system is the central management node of a single Job and takes on functions such as data cleaning, subtask splitting (converting a single Job into several subtasks) and Task-group management. After a system Job is started, it is split into several small Tasks according to different source-splitting strategies so that they can execute concurrently. A Task is the smallest unit of system operation, and each Task is responsible for storing part of the data. After splitting into multiple Tasks, the system Job calls the Scheduler module, which reassembles the split Tasks into Task groups according to the configured concurrent data volume. Each Task group is responsible for finishing all the Tasks allocated to it with a certain degree of concurrency; by default a single Task group runs 10 Tasks concurrently. Each Task is started by its Task group, and once started it runs a Reader→Channel→Writer thread to complete the data storage work. After the system Job is running, it monitors and waits for the Task-group modules to complete their Tasks, and the Job exits successfully after all Task groups have finished.
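A condensed sketch of the Job / Task / Task-group flow described above: a Job is split into Tasks, the Tasks are run in groups with a default concurrency of 10, and each Task runs a Reader→Channel→Writer pipeline. The names, slice size and splitting strategy are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def run_task(source_slice):
    """One Task: a Reader -> Channel -> Writer pipeline for its slice of the data."""
    channel: Queue = Queue()
    for record in source_slice:        # Reader pushes records into the Channel
        channel.put(record)
    stored = []
    while not channel.empty():         # Writer drains the Channel into storage
        stored.append(channel.get())
    return len(stored)

def run_job(source, slice_size: int = 100, concurrency: int = 10):
    """The Job splits itself into Tasks, runs them in a group, and waits for all to finish."""
    tasks = [source[i:i + slice_size] for i in range(0, len(source), slice_size)]   # subtask split
    with ThreadPoolExecutor(max_workers=concurrency) as task_group:                 # default concurrency 10
        written = sum(task_group.map(run_task, tasks))
    return written     # the Job exits successfully once all Tasks in all groups have completed

print(run_job([{"id": i} for i in range(1000)]))
```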
In some embodiments, the big data based data acquisition system further comprises:
and the configuration module is used for carrying out basic configuration on the services corresponding to the yml type file, and the medical data is transferred between the services in a queue mode.
In some embodiments, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
In some embodiments, the verification module comprises:
an accuracy checking module for checking the accuracy of the medical data;
the de-duplication module is used for performing de-duplication treatment on the medical data through a neural network;
and the encryption module is used for encrypting the medical data after the duplication removal.
In some embodiments, the labeling module implementation includes:
inputting the checked medical data into a BERT neural network to acquire a text vector V;
randomly selecting a plurality of text vectors V as a clustering center point a;
acquiring the distance between each remaining item of medical data and each cluster center point a, assigning each item to the category of its nearest center point, and obtaining cluster center points b of the resulting categories of text vectors V after assignment is completed;
obtaining the distance between each remaining item of medical data and each cluster center point b, assigning each item to the category of its nearest center point, obtaining cluster center points c of the resulting categories after assignment is completed, and repeating these steps to obtain a plurality of categories of texts;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
In some embodiments, the labeling module implementation includes:
classifying existing medical data into a plurality of types;
training the existing medical data with a BERT+BiLSTM+CNN+Attention+CRF neural network until the accuracy is greater than a threshold;
the newly acquired medical data are classified by the trained BERT+BiLSTM+CNN+Attention+CRF neural network, so that the newly acquired medical data are assigned to the corresponding type.
In some embodiments, the memory module implementation includes:
acquiring proxy service and port where the attribute table is located;
the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
In some embodiments, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
performing adding, deleting and checking operation on the acquisition source of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Since the principle of the big data based data acquisition system for solving the problem is similar to that of the data acquisition method, the implementation of the big data based data acquisition system can be referred to the implementation of the method, and the description is omitted here.
The embodiments of the application also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the data acquisition method.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example, may be considered an ordered listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium, which can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated together. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
Finally, it should be noted that the above examples are only specific embodiments of the present application and are not intended to limit its protection scope. Although the present application has been described in detail with reference to the above examples, it should be understood by those skilled in the art that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions of some of their technical features, within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.