CN115543933A - Cloud-edge collaborative medical data management method and platform based on data lake

Info

Publication number: CN115543933A
Application number: CN202211226909.XA
Authority: CN
Inventors: 刘子锋; 高伟; 邱述洪; 岳强; 郑宇浩; 吴诗韵; 李永宏; 洪驹发; 方莹; 覃琳; 胡泽康; 鄞乐炜; 师雯琦; 陈强
Original assignee: Third Affiliated Hospital Sun Yat Sen University; China Unicom Guangdong Industrial Internet Co Ltd
Current assignee: Third Affiliated Hospital Sun Yat Sen University; China Unicom Guangdong Industrial Internet Co Ltd
Priority date: 2022-10-09
Filing date: 2022-10-09
Publication date: 2022-12-30

Abstract

The invention discloses a cloud-edge collaborative medical data management method and platform based on a data lake. The method is applied to a medical data management platform comprising an edge cloud and a center cloud that are in communication connection, and comprises the following steps: acquiring medical data and sending the medical data to the edge cloud or the center cloud for data storage; performing data governance on the medical data by the edge cloud and the center cloud; training, by the center cloud, a data analysis model according to the medical data and issuing the trained data analysis model to the edge cloud, the edge cloud then analyzing the medical data with the trained data analysis model; wherein both the edge cloud and the center cloud store data using a data lake technology. By constructing a cloud-edge collaboration mechanism to manage the medical data jointly, the invention guarantees computing power and storage requirements while meeting latency and security requirements; by storing the data with the data lake technology, it can accommodate complex and diverse medical data and guarantee the comprehensiveness and integrity of the medical data.

Description

Cloud-edge collaborative medical data management method and platform based on data lake

Technical Field

The invention relates to the technical field of medical information, in particular to a cloud-edge collaborative medical data management method and platform based on a data lake.

Background

The various medical data generated in a medical system play a decisive role in the diagnosis and treatment of diseases and in scientific research, and have extremely high data value, so effective management and standardization of medical data are urgently needed. At present, with the rapid development of science and technology, medical institutions are gradually completing their transition to informatization and are building medical data management platforms with big data technology. Such platforms collect data from different data sources, manage and analyze the data, and mine its potential value, thereby providing managers with management and analysis results for various services, promoting scientific and refined hospital management decisions, and supporting the normal operation of hospital services.

However, existing medical data management platforms have many problems:

1. the data storage technologies used by existing platforms mostly support storage and management of structured data only, whereas medical data are diverse and comprise not only structured data but also a large amount of unstructured and semi-structured data; existing platforms cannot store such data directly, so the comprehensiveness and integrity of the medical data cannot be guaranteed;

2. existing platforms generally need to transmit medical data to a data-center cloud platform for storage and processing, and the cloud platform is some distance away from each medical system and user terminal, so data response efficiency is low and data transmission delay is high during data transmission or access.

Disclosure of Invention

The invention aims to overcome at least one deficiency of the prior art and provides a cloud-edge collaborative medical data management method and platform based on a data lake, which solve the problems that the prior art cannot guarantee the comprehensiveness and integrity of medical data and that data response efficiency is low and transmission delay is high.

The technical scheme adopted by the invention is that the cloud-edge collaborative medical data management method based on the data lake is applied to a cloud-edge collaborative medical data management platform, the cloud-edge collaborative medical data management platform comprises an edge cloud and a center cloud, the edge cloud and the center cloud are in communication connection, and the method comprises the following steps:

acquiring medical data, and sending the medical data to the edge cloud or the center cloud for data storage;

the edge cloud and the central cloud carry out data governance on the medical data;

the center cloud trains a data analysis model by using the medical data and issues the trained data analysis model to the edge cloud, and the edge cloud performs data analysis on the medical data by using the trained data analysis model;

and the edge cloud and the central cloud both adopt a data lake technology to store the medical data.

In the invention, an edge cloud is provided in addition to the center cloud, a cloud-edge collaboration mechanism is constructed, and the edge cloud and the center cloud manage the medical data jointly. The edge cloud is a cloud platform located close to the data source or the user side that performs edge computing; it provides data services nearby and therefore responds faster, reducing data transmission delay. The center cloud is the central cloud platform located in the data center that provides cloud computing, with high computing power and large-capacity data storage space. In the invention, data can be graded and classified according to access frequency and required access response time to provide hot and cold data tiering: data with a high access frequency (such as patients' electronic medical record data and examination data from the past week) and data requiring fast access (such as images, pathological data and result data sets from the past week) are stored in the edge cloud, so that the data can be quickly delivered down to the user side for access, reducing the influence of network delay on data processing and access. In addition, with the edge cloud mechanism, data reaches a gateway through a base station; the gateway offloads traffic, diverting data that accesses the private-network MEC application system to the private network. This private networking mode ensures that data does not leave the hospital and reduces the risk of attacks from the public network on the private network, thereby strengthening the security of the medical edge cloud. The full data set is stored in the center cloud for high-computing-power operations and large-capacity data storage. Therefore, the cloud-edge collaboration mechanism guarantees the computing power and storage requirements of medical data management, and meets the latency and data security requirements of data application and presentation.
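
As an illustration only, the hot and cold data tiering described above can be sketched in Python as follows. The names `Record` and `choose_tiers`, the category labels, and the numeric access-count threshold are hypothetical assumptions introduced for the sketch; only the one-week window and the rule that the full data set goes to the center cloud come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Set

# Assumed category labels for data that needs fast access
# (the text names images, pathological data and result data sets).
LATENCY_SENSITIVE = {"image", "pathology", "result_set"}
HOT_WINDOW = timedelta(days=7)   # "within a week", per the description
HOT_ACCESS_COUNT = 10            # assumed threshold for "high access frequency"


@dataclass
class Record:
    record_id: str
    category: str
    created_at: datetime
    weekly_access_count: int


def choose_tiers(record: Record, now: datetime) -> Set[str]:
    """Return the storage tiers for a record.

    The full data set is always kept in the center cloud; recent data that is
    frequently accessed or latency-sensitive is additionally kept at the edge.
    """
    tiers = {"center"}
    recent = now - record.created_at <= HOT_WINDOW
    frequently_accessed = record.weekly_access_count >= HOT_ACCESS_COUNT
    latency_sensitive = record.category in LATENCY_SENSITIVE
    if recent and (frequently_accessed or latency_sensitive):
        tiers.add("edge")
    return tiers
```

In this sketch a month-old CT image is held only in the center cloud, while a pathology file created yesterday is held in both tiers.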

Meanwhile, in the invention, both the edge cloud and the center cloud store the medical data by adopting a data lake technology. A data lake is a centralized repository that can store structured data (such as tables in a relational database), semi-structured data (such as CSV, logs, XML and JSON), unstructured data (such as e-mail, documents and PDF) and binary data (such as images, audio and video) without structuring the data in advance, so that various types of medical data can be stored effectively and the comprehensiveness and integrity of the medical data are guaranteed.

Further, acquiring medical data, and sending the medical data to the edge cloud or the center cloud for data storage, specifically including:

acquiring medical data, when the medical data is a large-capacity file, performing file segmentation on the medical data by using a slicing mechanism to obtain a plurality of slice files, concurrently transmitting the plurality of slice files to the edge cloud or the center cloud for splicing, and performing data storage on the spliced files;

when the medical data are small-capacity files, a merging mechanism is utilized to merge a plurality of medical data to obtain merged files, and the merged files are sent to the edge cloud or the central cloud for data storage.

When the data to be stored is a large file, the slicing mechanism forms a plurality of slice files whose transmission tasks are processed in parallel, which effectively reduces the amount of data transmitted, improves data read/write efficiency, and thus improves the efficiency of data entering the lake and of data retrieval; when the data is a small file, the merging mechanism merges small files into a large file, which improves metadata access and query efficiency by reducing the number of files, reduces the number of I/O operations for file reading and writing, greatly improves data processing efficiency, and saves data transmission time.
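
A minimal Python sketch of slicing a large file, transmitting the slices concurrently and splicing them at the receiving end is given below for illustration. The function names are hypothetical, `send_slice` is only a stand-in for a real network transfer, and the 4 MB slice size is the default slice length mentioned in Example 1.

```python
from concurrent.futures import ThreadPoolExecutor

SLICE_SIZE = 4 * 1024 * 1024  # 4 MB default slice length


def slice_bytes(data: bytes, slice_size: int = SLICE_SIZE) -> list:
    """Cut data into (slice_number, payload) pairs of at most slice_size bytes."""
    return [(start // slice_size, data[start:start + slice_size])
            for start in range(0, len(data), slice_size)]


def send_slice(numbered_slice):
    """Stand-in for transmitting one slice to the edge or center cloud."""
    return numbered_slice


def transfer_and_splice(data: bytes, slice_size: int = SLICE_SIZE,
                        workers: int = 8) -> bytes:
    """Send all slices concurrently, then splice them back in slice order."""
    slices = slice_bytes(data, slice_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_slice, slices))
    # Order by slice number so splicing does not depend on arrival order.
    received.sort(key=lambda item: item[0])
    return b"".join(payload for _, payload in received)
```

Carrying the slice number with each payload is the design choice that lets the receiving end splice correctly regardless of the order in which concurrent transfers complete.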

Further, the edge cloud and the center cloud both adopt a data lake technology to perform data storage on the medical data, and specifically include:

the edge cloud and the center cloud both use a Hadoop distributed cluster as the base, and Hive and MPP data warehouses and a Hudi data lake cluster are built in a heterogeneous-cluster manner to form a lake-warehouse integrated architecture, which stores the medical data.

In the invention, a heterogeneous-cluster manner is adopted to provide an elastic distributed storage layer and a distributed data warehouse based on the data lake. The data lake offers a unified storage system, storage of raw data and rich computing models, while the data warehouse has a built-in storage system and rich ETL processes and emphasizes modeling and data management. Together they build a storage layer that supports diverse data, accommodates complex medical information data, and supports structured, unstructured and semi-structured medical information.

Further, the data governance of the medical data by the edge cloud and the center cloud specifically comprises:

the edge cloud and the center cloud establish a unified metadata management system, and the metadata management system provides data lineage, data indexing, data version and data routing functions.

Further, the data governance of the medical data by the edge cloud and the center cloud further comprises:

the edge cloud and the center cloud integrate standard SQL, offline computing, real-time computing, MPP analysis and visual computing engines, and obtain target data locations through the data service of the metadata management system to perform cross-platform computing;

the edge cloud and the center cloud integrate and encapsulate a message publishing system that supports the Kafka data subscription mode and unifies the data publishing path.

Further, the data governance of the medical data by the edge cloud and the center cloud further includes:

the edge cloud and the center cloud integrate data security, data quality control, data development, model design, data label, task orchestration, imaging and visualization modules;

the data quality control module integrates Kerberos, adopts ticket-based authentication, controls data permissions and security throughout the whole process, and performs data quality control on the medical data;

the data development module provides offline, real-time and AI capabilities and mines the value of the medical data;

and the data label module constructs a label system, provides data labels and organizes the medical data.

On the other hand, another technical scheme adopted by the invention is that a cloud-edge collaborative medical data management platform based on a data lake executes the management method, wherein the platform comprises an edge cloud and a center cloud, and the edge cloud is in communication connection with the center cloud;

the edge cloud and the center cloud include:

the data storage layer is used for acquiring medical data and storing the medical data;

the data governance layer is used for performing data governance on the medical data;

the central cloud further comprises: the model training layer is used for training a data analysis model according to the medical data and issuing the trained data analysis model to the edge cloud;

the edge cloud further comprises: and the model analysis layer is used for receiving the trained data analysis model and analyzing the medical data by utilizing the trained data analysis model.

The data storage layer stores the medical data by adopting a data lake technology.

On the other hand, another technical solution adopted by the present invention is that an electronic device includes a memory and a processor, where the memory stores a computer program, and the processor implements the cloud-edge collaborative medical data management method based on the data lake technology when executing the computer program.

On the other hand, another technical solution adopted by the present invention is a computer storage medium, on which a computer program is stored, wherein the computer program is executed by a processor to implement the cloud-edge collaborative medical data management method based on data lake technology.

Compared with the prior art, the invention has the following beneficial effects:

1. in the invention, a cloud-edge coordination mechanism is constructed, and the edge cloud and the central cloud are used for managing the medical data together, so that the computing power and the storage requirement of the medical data management can be ensured, and the time delay requirement and the data safety requirement of data application and presentation can be met;

2. in the invention, the data lake technology is used for storing the data, so that various medical data can be effectively stored, and the comprehensiveness and integrity of the medical data are guaranteed;

3. in the invention, when the data to be stored is a large file, the slicing mechanism forms a plurality of slice files whose transmission tasks are processed in parallel, which effectively reduces the amount of data transmitted, improves data read/write efficiency, and thus improves the efficiency of data entering the lake and of data retrieval; when the data is a small file, the merging mechanism merges small files into a large file, which improves metadata access and query efficiency by reducing the number of files, reduces the number of I/O operations for file reading and writing, greatly improves data processing efficiency, and saves data transmission time.

Drawings

Fig. 1 is a flowchart of the method of Example 1.

Fig. 2 is a structural diagram of the platform of Example 2.

Fig. 3 is a structural diagram of the data storage layer of Example 2.

Fig. 4 is a structural diagram of the data governance layer of Example 2.

Fig. 5 is a flowchart of applying the platform of Example 2 to the management and analysis of liver cancer medical data.

Description of reference numerals: data storage layer 100, slicing module 110, merging module 120, data governance layer 200, metadata management system 210, calculation module 220, message publishing system 230, data quality control module 240, data development module 250, data label module 260, model training layer 300 and model analysis layer 400.

Detailed Description

The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For the purpose of better illustrating the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

Example 1

As shown in fig. 1, this embodiment provides a cloud-edge collaborative medical data management method based on a data lake, which is applied to a cloud-edge collaborative medical data management platform, where the cloud-edge collaborative medical data management platform includes an edge cloud and a center cloud, and the edge cloud and the center cloud are in communication connection, and the method includes:

S1, acquiring medical data, and sending the medical data to the edge cloud or the center cloud for data storage, wherein the edge cloud and the center cloud both adopt a data lake technology to store the medical data;

S2, the edge cloud and the center cloud perform data governance on the medical data;

S3, the center cloud trains a data analysis model according to the medical data and issues the trained data analysis model to the edge cloud, and the edge cloud analyzes the medical data by using the trained data analysis model.

In this embodiment, an edge cloud is provided in addition to the center cloud, a cloud-edge collaboration mechanism is constructed, and the edge cloud and the center cloud manage the medical data jointly. The edge cloud is a cloud platform located close to the data source or the user side that performs edge computing; it provides data services nearby and therefore responds faster, reducing data transmission delay. The center cloud is the central cloud platform located in the data center that provides cloud computing, with high computing power and large-capacity data storage space. In this embodiment, data can be graded and classified according to access frequency and required access response time to provide hot and cold data tiering: data with a high access frequency (such as patients' electronic medical record data and examination data from the past week) and data requiring fast access (such as images, pathological data and result data sets from the past week) are stored in the edge cloud, so that the data can be quickly delivered down to the user side for access, reducing the influence of network delay on data processing and access. In addition, with the edge cloud mechanism, data reaches a gateway through a base station; the gateway offloads traffic, diverting data that accesses the private-network MEC application system to the private network. The full data set is stored in the center cloud for high-computing-power operations and large-capacity data storage. Therefore, the cloud-edge collaboration mechanism guarantees the computing power and storage requirements of medical data management, and meets the latency and data security requirements of data application and presentation.

Meanwhile, in this embodiment, both the edge cloud and the center cloud store the medical data by adopting a data lake technology. A data lake is a centralized repository that can store structured data (such as tables in a relational database), semi-structured data (such as CSV, logs, XML and JSON), unstructured data (such as e-mail, documents and PDF) and binary data (such as images, audio and video) without structuring the data in advance, so that various types of medical data can be stored effectively and the comprehensiveness and integrity of the medical data are guaranteed.

Further, step S1 specifically includes:

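acquiring medical data; when the medical data is a large file, performing file segmentation on the medical data by using a slicing mechanism to obtain a plurality of slice files, concurrently transmitting the plurality of slice files to the edge cloud or the center cloud for splicing, and storing the spliced file;

when the medical data are small files, merging a plurality of medical data by using a merging mechanism to obtain a merged file, and sending the merged file to the edge cloud or the center cloud for data storage.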

Specifically, in this embodiment, the slicing mechanism and the merging mechanism are implemented as follows:

For a single large file such as a pathology file, the file is cut by using a checksum-based blocking algorithm. The file is first cut evenly into a number of small data blocks (the default slice length is assumed to be 4 MB), and three items are generated for each block after cutting: a 32-bit weak checksum, a 128-bit MD5 strong checksum, and a data block number. The two check values are used as follows. A 4 MB block of the source file is taken, and its 32-bit weak checksum is calculated and looked up in the hash table of the target file. If it is found, the target file potentially contains the same data block, and the MD5 strong checksums are then compared: if the strong checksums differ, the block is a differing data block; if both the weak and strong checksums match, the target is determined to contain the same block, and the number of that block in the target file is recorded. If the weak checksum of the source block is not found in the hash table, the strong checksum is not calculated, and the block is treated as differing data. The differing data blocks identified in the source file are then transmitted to the target end, where they are inserted and spliced according to the transmitted block information and the recorded block information, completing the data synchronization. If the target file does not exist, all slice files are transmitted concurrently to the target end for splicing.
For example, for a 2 GB liver cancer pathology file, reading or writing in the conventional manner on a solid-state disk with 500 MB/s of I/O throughput takes about 4 seconds. With the slicing mechanism, the file is cut into 512 slice files of 4 MB each, and 512 distributed tasks are started on the data lake to process them in parallel, so that, in the ideal case, the time can be shortened to about 8 ms (4 MB / 500 MB/s). File slicing can therefore effectively reduce the amount of data transmitted and improve data read/write efficiency, thereby improving the efficiency of data entering the lake and of data retrieval.
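
The weak/strong checksum comparison described above can be sketched in Python as follows, for illustration only. The embodiment does not name the 32-bit weak checksum algorithm, so Adler-32 (`zlib.adler32`) is used here as an assumption; the function names and the "copy"/"send" plan format are likewise hypothetical.

```python
import hashlib
import zlib


def block_signatures(target: bytes, block_size: int) -> dict:
    """Hash table for the target file: weak checksum -> [(strong checksum, block number)]."""
    table = {}
    for number, start in enumerate(range(0, len(target), block_size)):
        block = target[start:start + block_size]
        weak = zlib.adler32(block)            # 32-bit weak checksum (Adler-32 is an assumption)
        strong = hashlib.md5(block).digest()  # 128-bit MD5 strong checksum
        table.setdefault(weak, []).append((strong, number))
    return table


def plan_sync(source: bytes, target: bytes, block_size: int) -> list:
    """For each source block emit ("copy", target_block_number) or ("send", raw_bytes)."""
    table = block_signatures(target, block_size)
    plan = []
    for start in range(0, len(source), block_size):
        block = source[start:start + block_size]
        candidates = table.get(zlib.adler32(block))
        match = None
        if candidates:
            # Weak checksum hit: only now is the strong checksum computed and compared.
            strong = hashlib.md5(block).digest()
            match = next((n for s, n in candidates if s == strong), None)
        plan.append(("copy", match) if match is not None else ("send", block))
    return plan


def apply_plan(plan: list, target: bytes, block_size: int) -> bytes:
    """Rebuild the source at the target end from reused and transmitted blocks."""
    parts = []
    for action, value in plan:
        if action == "copy":
            parts.append(target[value * block_size:(value + 1) * block_size])
        else:
            parts.append(value)
    return b"".join(parts)
```

Only the blocks marked "send" need to cross the network; when the target file is empty or absent, every block is marked "send", which corresponds to transmitting all slice files.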

For small files such as massive PACS images, a data merging optimization strategy is carried out by using the merging mechanism. First, the amount of metadata is reduced: a plurality of small files are merged into one large file, which improves metadata access and query efficiency by reducing the number of files and reduces the number of I/O operations for file reading and writing. For example, with the default 4 MB slice and 256 KB files, 16 small files fit into one slice, so the number of I/O read/write switches can be reduced by a factor of 16, which greatly improves data processing efficiency and saves data transmission time. Storing the merged large files on the file system reduces the I/O pressure on the file system and improves storage performance. When small files are merged and stored, an index file is generated and data is located through the index when accessed; the index file can be preloaded into the cache, so that a small file can be read or written with only one I/O operation. In the improved small-file merging and storage scheme, when a file is newly added, the small files are grouped using file size as the parameter, the small files in the same group are merged and stored in the same data block, and a free unit in the block is preferentially selected for storage; the original scheme does not consider file size and directly merges newly added files into the data block in append mode. When a file is modified, the improved scheme determines whether the file size has changed; if so, the file is treated as a new file and the old file is deleted. When a file is deleted, the improved scheme marks the storage unit where the file is located as free, and that unit is used preferentially when a file of the same size is written.
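
For illustration, the index-based small-file merging with reuse of free units can be sketched in Python as follows. The class `MergedBlock` and its in-memory layout are hypothetical simplifications: the sketch keeps one data block and its index in memory and does not model grouping files across several blocks or persisting the index file.

```python
class MergedBlock:
    """One merged data block holding many small files, plus its index."""

    def __init__(self):
        self.data = bytearray()
        self.index = {}        # file name -> (offset, length)
        self.free_units = {}   # length -> offsets of freed units of that size

    def add(self, name: str, payload: bytes) -> None:
        """Store a file, preferring a free unit of exactly the same size."""
        if name in self.index:
            # A modified file is treated as a new file and the old copy is deleted.
            self.delete(name)
        free = self.free_units.get(len(payload))
        if free:
            offset = free.pop()
            self.data[offset:offset + len(payload)] = payload
        else:
            offset = len(self.data)
            self.data.extend(payload)   # no free unit: append to the block
        self.index[name] = (offset, len(payload))

    def delete(self, name: str) -> None:
        """Mark the file's storage unit as free instead of compacting the block."""
        offset, length = self.index.pop(name)
        self.free_units.setdefault(length, []).append(offset)

    def read(self, name: str) -> bytes:
        """Locate the file through the index and read it in one access."""
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])
```

Because a freed unit is keyed by its length, a later file of the same size fills the gap and the block does not grow.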

The edge cloud and the center cloud both use a Hadoop distributed cluster as the base, and Hive and MPP data warehouses and a Hudi data lake cluster are built in a heterogeneous-cluster manner to form a lake-warehouse integrated architecture for storing the medical data.

In this embodiment, a heterogeneous-cluster manner is adopted to provide an elastic distributed storage layer and a distributed data warehouse based on the data lake. The data lake offers a unified storage system, storage of raw data and rich computing models, while the data warehouse has a built-in storage system and rich ETL processes and emphasizes modeling and data management. Together they build a storage layer that supports diverse data, accommodates complex medical information data, and supports structured, unstructured and semi-structured medical information.

Specifically, the lake-warehouse integrated architecture in this embodiment supports both incremental and full ingestion, supports structured, semi-structured and unstructured content, and stores examination data related to acquisition and diagnosis, such as intermediate and result data of CT, MR, color Doppler ultrasound, DR/X-ray and blood tests.

Further, step S2 specifically includes:

the edge cloud and the center cloud establish a unified metadata management system, and the metadata management system provides data lineage, data indexing, data version and data routing functions;

specifically, in this embodiment, the metadata management system provides the access entry to the data lake: it parses each data access request and extracts the target data, provides a unified data service, and routes the request to the corresponding data pool in the data lake. When data is read, an MOR (Merge On Read) mode is provided and data is returned according to whether the query targets a snapshot stream or a change stream; when data is written, a COW (Copy On Write) mode is provided; when data changes, the data lineage and data version are triggered to be re-registered and updated in the metadata management system; ACID transactional write operations are supported.
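
As a rough illustration of the lineage, version and routing functions of such a metadata management system, a minimal in-memory sketch in Python is given below. The names `DatasetMeta` and `MetadataCatalog` and the location strings are hypothetical, and the sketch does not model the MOR/COW storage modes or ACID transactions.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    name: str
    location: str                                   # data pool holding the data (hypothetical labels)
    version: int = 1
    upstream: list = field(default_factory=list)    # lineage: datasets this one derives from


class MetadataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, name: str, location: str, upstream=()) -> DatasetMeta:
        """Register a dataset; re-registering after a data change bumps its version."""
        existing = self._entries.get(name)
        version = existing.version + 1 if existing else 1
        self._entries[name] = DatasetMeta(name, location, version, list(upstream))
        return self._entries[name]

    def route(self, name: str) -> str:
        """Data routing: return the data pool that a request for `name` should go to."""
        return self._entries[name].location

    def lineage(self, name: str) -> list:
        """Data lineage: all transitive upstream datasets, nearest first."""
        seen, queue = [], list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen
```

A compute engine could call `route` to find where the target data lives and run the computation there, rather than migrating the data set between platforms.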

Further, step S2 further includes:

the edge cloud and the center cloud integrate standard SQL, offline computing, real-time computing, MPP analysis and visual computing engines, and obtain target data locations through the data service of the metadata management system to perform cross-platform computing, so that data sets do not need to be migrated between platforms.

Further, step S2 further includes:

the edge cloud and the center cloud integrate and encapsulate a message publishing system that supports the Kafka data subscription mode and unifies the data publishing path.

Further, step S2 further includes:

the edge cloud and the center cloud are integrated with a data quality control module, a data development module and a data label module;

the data development module provides offline, real-time and AI capabilities and mines the value of the medical data;

and the data label module constructs a label system, provides data labels and organizes the medical data.

Specifically, in this embodiment, data security, data quality control, data development, model design, data label, task orchestration, imaging and visualization modules are further provided on the basis of the data lake technology. Kerberos is integrated and ticket-based authentication is adopted, so that data permissions and security are controlled throughout the whole process and leakage of sensitive patient and medical-care information is prevented. Data quality control is performed on the data collected from patients' test information, diagnosis results, periodic tests and follow-up results, which prevents errors from cascading through the data and fundamentally addresses data instability. A unified computing and development module is provided, which lowers the entry barrier for researchers and provides offline, real-time and AI capabilities, making it possible to mine greater data value. A rich label system is provided: labels are extracted along multiple branches such as cancer lesions and cell carcinomas to organize the numerous data branches, so that existing data assets are accumulated on the traditional side while the development and iteration of cancer research are supported more rapidly on the innovation side.

In addition, in this embodiment, the data analysis model may be an electronic medical record post-structured processing model.

The electronic medical record is not only a massive corpus but also the basis of medical record big data analysis. The electronic medical record document not only contains completely unstructured content described by natural language text, but also contains semi-structured information and the like. In electronic medical records, the subject of the medical record and various diagnosis-related descriptions, physical examination records, ward rounds, medical orders, etc. contained in the medical record can be considered as semi-structured (or unstructured) contents containing rich semantic information.

In this embodiment, terms described in natural language such as human body parts, disease names, symptoms, examination items, operations and treatments are defined as medical named entities. A BiLSTM (bidirectional long short-term memory network) is introduced so that the context on both sides of a token is taken into account at the same time. It is combined with a CRF (Conditional Random Field) algorithm, with the output of the BiLSTM used as the input of the CRF: the BiLSTM serves as the feature-perception layer, while the CRF learns the contextual constraints among labels and predicts the label sequence with the maximum probability by combining the output-layer result with the global probability of the label sequence. Entity recognition and extraction are tested on this basis. With accurate recognition and extraction by the algorithm, a post-structuring template can further be constructed in the electronic medical record structured splitting system: each key or value comprises one or more entities, which are combined with modifiers to make up the electronic medical record, so that the electronic medical record is split accurately.
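
Predicting the maximum-probability label sequence from per-token scores and label-transition scores is commonly done with Viterbi decoding. The Python sketch below illustrates that step only; the embodiment does not state which decoding algorithm is used, and the function name, label set and hand-written scores are hypothetical stand-ins for a trained BiLSTM-CRF.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]:   score of label j at token t (stand-in for the BiLSTM output)
    transitions[i][j]: score of moving from label i to label j (learned by the CRF)
    """
    n = len(labels)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n):
            # Best previous label for ending at label j on this token.
            best_i = max(range(n), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + step[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[i] for i in path]


# Hypothetical BIO labels for a disease-name entity.
LABELS = ["O", "B-DIS", "I-DIS"]
# Transition scores: "I-DIS" directly after "O" is strongly penalised.
TRANSITIONS = [
    [0.0, 0.0, -10.0],   # from O
    [0.0, -5.0, 1.0],    # from B-DIS
    [0.0, -5.0, 1.0],    # from I-DIS
]
```

With these scores, a token whose own scores slightly favour "O" can still be decoded as "B-DIS" when the next token is clearly "I-DIS", which shows how the sequence-level score overrides a per-token choice.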

Meanwhile, the electronic medical record post-structuring model provided by this embodiment is combined with the cloud-edge collaboration mechanism, so that model training can be performed on the center cloud side, which improves the structuring accuracy of the model. The trained model is issued to the edge cloud through the scheduling mechanism of the center cloud; when data is collected at the edge cloud, the trained model can preprocess the unstructured and semi-structured data of the electronic medical record, which improves data processing efficiency and responsiveness.

Example 2

As shown in fig. 2, the present embodiment provides a cloud-edge collaborative medical data management platform based on a data lake, which executes the management method described in embodiment 1, where the platform includes an edge cloud and a center cloud, and the edge cloud and the center cloud are in communication connection;

the edge cloud and the center cloud include:

the data storage layer 100 is used for acquiring medical data and storing the medical data;

the data governance layer 200 is used for carrying out data governance on the medical data;

the central cloud further comprises: the model training layer 300 is used for training a data analysis model according to the medical data and issuing the trained data analysis model to the edge cloud;

the edge cloud further comprises: and the model analysis layer 400 is configured to receive the trained data analysis model and analyze the medical data by using the trained data analysis model.

The data storage layer 100 performs data storage on the medical data by using a data lake technology.

In this embodiment, an edge cloud is provided in addition to the central cloud, a cloud-edge collaboration mechanism is constructed, and the edge cloud and the central cloud manage the medical data together. The edge cloud is a cloud platform that is located close to the data source or the user side and performs edge computing; it can provide data services nearby, which yields a faster network service response and reduces data transmission delay. The central cloud is the central cloud platform that is located in the data center and provides cloud computing, with high computing power and large-capacity data storage space. In this embodiment, data can be graded and classified according to access frequency and required access response time to provide hot and cold data tiering. Data with a high access frequency (such as patient electronic medical record data and examination data from the past week) and data with a high requirement on access speed (such as image data, pathological data, and result data sets from the past week) are stored in the edge cloud, so that the data can be quickly pushed down to the user side for access, reducing the influence of network delay on data processing and access. In addition, under the edge cloud mechanism, data reaches the gateway through the base station; the gateway performs data distribution, and data destined for application systems on the private-network MEC is routed to the private network. The full data is stored in the central cloud for high-computing-power operations and large-capacity data storage. Therefore, the cloud-edge collaboration mechanism can guarantee the computing power and storage requirements of medical data management while meeting the latency and data security requirements of data application and presentation.
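The hot and cold tiering rule described above can be sketched as a small placement function. This is an illustrative sketch only: the seven-day window follows the "past week" examples in the text, while the access-count threshold, the `Record` fields, and the tier names are assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Set

HOT_WINDOW = timedelta(days=7)   # "past week" window, taken from the examples
HOT_ACCESS_THRESHOLD = 10        # assumed cut-off for "high access frequency"


@dataclass
class Record:
    record_id: str
    created_at: datetime
    accesses_last_week: int = 0      # hypothetical access counter
    latency_sensitive: bool = False  # e.g. image or pathology data


def placement(record: Record, now: datetime) -> Set[str]:
    """Full data always goes to the central cloud; hot data is also kept at the edge."""
    tiers = {"center"}
    recent = now - record.created_at <= HOT_WINDOW
    hot = record.accesses_last_week >= HOT_ACCESS_THRESHOLD or record.latency_sensitive
    if recent and hot:
        tiers.add("edge")
    return tiers


now = datetime(2022, 10, 9)
ct_image = Record("ct-001", now - timedelta(days=2), latency_sensitive=True)
old_note = Record("emr-042", now - timedelta(days=90), accesses_last_week=1)
print(sorted(placement(ct_image, now)))  # ['center', 'edge']
print(sorted(placement(old_note, now)))  # ['center']
```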

Meanwhile, in this embodiment, both the edge cloud and the center cloud store the medical data using data lake technology. A data lake is a centralized repository that can store structured data (such as tables in a relational database), semi-structured data (such as CSV, logs, XML, and JSON), unstructured data (such as e-mails, documents, and PDFs), and binary data (such as images, audio, and video) as-is, without structuring the data in advance. Various types of medical data can therefore be stored effectively, which guarantees the comprehensiveness and integrity of the medical data.

Further, as shown in fig. 3, the data storage layer 100 specifically includes:

the slicing module 110 is configured to, when the medical data is a large-capacity file, perform file segmentation on the medical data by using a slicing mechanism to obtain a plurality of slice files, concurrently transmit the plurality of slice files to the edge cloud or the center cloud for splicing, and perform data storage on the spliced file;

and the merging module 120 is configured to, when the medical data is a small-volume file, perform data merging on a plurality of medical data by using a merging mechanism to obtain a merged file, and send the merged file to the edge cloud or the center cloud for data storage.

Specifically, in this embodiment, the slicing module and the merging module are implemented as follows:

the slicing module cuts a single large-capacity file, such as a pathology file, using a checksum-based blocking algorithm. The file is first cut evenly into a number of small data blocks; assuming a default slice length of 4 MB, three kinds of data are generated for each block after cutting: a 32-bit weak checksum, a 128-bit MD5 strong checksum, and a data block number. That is, two check values are calculated for each block: the first is a 32-bit weak checksum, and the second is a 128-bit strong checksum obtained by MD5 hashing. A 4 MB block of the source file is taken, its 32-bit weak checksum is calculated, and the result is looked up in a hash table. If it is found, the target file potentially contains the same data block, and the MD5 strong checksums are then compared. If the strong checksums differ, the data blocks are different; if both the weak checksum and the strong checksum are the same, the block is determined to have an identical block in the target, and the block number of that block in the target file is recorded. If the weak checksum of the source block is not found in the hash table, the strong checksum is no longer calculated, since this already indicates that the data differs. The difference blocks determined across the source file are then transmitted to the target, where they are inserted and spliced according to the transmitted block information and the source block information, which completes the data synchronization. If the target file does not exist, all slice files are transmitted concurrently to the target end for splicing.
For example, suppose a liver cancer pathology file is 2 GB. Reading and writing it in the conventional way on a solid-state disk with an I/O throughput of 500 MB/s takes about 4 seconds (2048 MB / 500 MB/s). With the slicing mechanism, the file is cut into 512 slice files of 4 MB each, and 512 distributed tasks are started on the data lake to process them in parallel, which can shorten the time to about 8 ms (4 MB / 500 MB/s per slice). File slicing can therefore effectively reduce the amount of data transmitted and improve data read/write efficiency, which improves the efficiency of data lake ingestion and data retrieval.
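The two-level checksum comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment's implementation: the text does not name the 32-bit weak checksum, so `zlib.adler32` is used as a stand-in; blocks are compared only at fixed, aligned offsets; and the function names and the plan format are hypothetical.

```python
import hashlib
import zlib
from typing import Dict, List, Tuple, Union

BLOCK_SIZE = 4 * 1024 * 1024  # default slice length of 4 MB, as in the text

# A plan entry is either a target block number (reuse) or literal bytes (send).
Instruction = Union[int, bytes]


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def weak(block: bytes) -> int:
    return zlib.adler32(block) & 0xFFFFFFFF  # assumed 32-bit weak checksum


def strong(block: bytes) -> bytes:
    return hashlib.md5(block).digest()  # 128-bit MD5 strong checksum


def build_table(target: bytes, size: int) -> Dict[int, List[Tuple[bytes, int]]]:
    """Hash table: weak checksum -> [(strong checksum, block number), ...]."""
    table: Dict[int, List[Tuple[bytes, int]]] = {}
    for number, block in enumerate(split_blocks(target, size)):
        table.setdefault(weak(block), []).append((strong(block), number))
    return table


def diff(source: bytes, target: bytes, size: int = BLOCK_SIZE) -> List[Instruction]:
    table = build_table(target, size)
    plan: List[Instruction] = []
    for block in split_blocks(source, size):
        match = None
        candidates = table.get(weak(block))
        if candidates:  # strong checksum is computed only after a weak hit
            digest = strong(block)
            match = next((n for d, n in candidates if d == digest), None)
        plan.append(block if match is None else match)
    return plan


def patch(target: bytes, plan: List[Instruction], size: int = BLOCK_SIZE) -> bytes:
    """Splice reused target blocks and transmitted blocks into the synced file."""
    blocks = split_blocks(target, size)
    return b"".join(blocks[p] if isinstance(p, int) else p for p in plan)


# Tiny demonstration with a 4-byte block size instead of 4 MB.
old, new = b"AAAABBBBCCCC", b"AAAAXXXXCCCC"
plan = diff(new, old, size=4)
print(plan)                              # [0, b'XXXX', 2]
print(patch(old, plan, size=4) == new)   # True
```

Only the middle block is transmitted in this demonstration; when the target is empty, every source block ends up in the plan as literal data, which corresponds to the case where all slice files are sent.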

The merging module uses a merging mechanism as a data-merging optimization strategy for small files, such as the massive number of PACS images. First, the amount of metadata is reduced: multiple small files are merged into one large file, and reducing the number of files improves the efficiency of metadata access and query and reduces the number of I/O operations for reading and writing files. For example, with a default block of 4 MB and files of 256 KB each, 16 files fit into one block, so about 16 separate I/O read/write switches are consolidated into one, which greatly improves data processing efficiency and saves data transmission time. Storing the merged large files in the file system reduces the I/O pressure on the file system and improves storage performance. When small files are merged and stored, an index file is generated, and data is located through the index when it is accessed. The index file can be preloaded into the cache, so that a small file can be read or written with a single I/O operation. In the improved small-file merging and storage scheme, when a file is added, the small files are grouped with file size as the parameter, small files in the same group are merged and stored in the same data block, and an idle unit in the block is preferentially selected for storage. The original scheme does not consider file size, and newly added files are merged and stored directly in the data block in append-write mode. When a file is modified, the improved scheme determines whether the size of the file has changed; if so, the file is treated as a new file and the old file is deleted.
When a file is deleted, the improved scheme marks the storage unit of the file as idle, and that unit is given priority when a file of the same size is subsequently written.
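The merge, index, and idle-unit reuse behavior described above can be sketched with an in-memory container. This is a minimal illustration under assumptions: the class name, the use of a single byte buffer as the merged file, and the exact-size free list are all hypothetical simplifications of the described scheme.

```python
from typing import Dict, List, Tuple


class MergedStore:
    """Packs small files into one buffer and locates them through an index."""

    def __init__(self) -> None:
        self.data = bytearray()                      # the merged large file
        self.index: Dict[str, Tuple[int, int]] = {}  # name -> (offset, length)
        self.free: Dict[int, List[int]] = {}         # length -> idle offsets

    def put(self, name: str, content: bytes) -> None:
        if name in self.index:
            offset, length = self.index[name]
            if length == len(content):
                # Size unchanged: overwrite the existing unit in place.
                self.data[offset:offset + length] = content
                return
            # Size changed: treat as a new file and delete the old one.
            self.delete(name)
        idle = self.free.get(len(content))
        if idle:
            # Prefer an idle unit of the same size.
            offset = idle.pop()
            self.data[offset:offset + len(content)] = content
        else:
            # Otherwise append to the end of the merged file.
            offset = len(self.data)
            self.data.extend(content)
        self.index[name] = (offset, len(content))

    def get(self, name: str) -> bytes:
        # One index lookup, then a single contiguous read.
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

    def delete(self, name: str) -> None:
        # Mark the unit as idle instead of compacting the merged file.
        offset, length = self.index.pop(name)
        self.free.setdefault(length, []).append(offset)


store = MergedStore()
store.put("img-1", b"1111")
store.put("img-2", b"22")
store.delete("img-1")
store.put("img-3", b"3333")   # reuses the idle 4-byte unit at offset 0
print(store.index["img-3"])   # (0, 4)
```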

Further, the data storage layer 100 takes a Hadoop distributed cluster as a base, and builds a Hive and MPP data warehouse and a Hudi data lake cluster in a cluster heterogeneous mode to form a lake-warehouse integrated framework for data storage of the medical data.

In this embodiment, the data storage layer adopts a heterogeneous cluster mode and provides an elastic distributed storage layer based on the data lake together with a distributed data warehouse. The data lake offers a unified storage system, storage of raw data, and rich computing models; the data warehouse has a built-in storage system and rich ETL processes and emphasizes modeling and data management. Together they form a storage layer with rich data diversity that can accommodate complicated medical information data and support structured, unstructured, and semi-structured medical data.

Specifically, the lake-warehouse integrated architecture in this embodiment supports incremental and full ingestion, supports structured, semi-structured, and unstructured content, and stores acquisition- and diagnosis-related examination data, such as intermediate and result data of CT, MR, color ultrasound, DR/X-ray, and blood tests.

Further, as shown in fig. 4, the data governance layer 200 specifically includes:

a metadata management system 210 for providing data lineage, data indexing, data versioning, and data routing functions;

specifically, in this embodiment, the metadata management system 210 provides the access entry to the data lake: it parses each data access request and extracts the requested data, provides a unified data service, and routes the request to the corresponding data pool in the data lake. When data is read, an MOR (Merge On Read) mode is provided, and data is returned according to whether the query is a snapshot stream or a change stream; when data is written, a COW (Copy On Write) mode is provided. When data changes, the data lineage and the data version are triggered to be re-registered and updated in the metadata management system. ACID transactional write operations are supported.
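The routing, versioning, and lineage bookkeeping described above can be sketched with a small in-memory catalog. This sketch covers only those three functions (not the MOR/COW modes or ACID writes), and the class, method, dataset, and pool names are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class DatasetMeta:
    pool: str                                          # data pool holding the dataset
    version: int = 1
    upstream: List[str] = field(default_factory=list)  # direct lineage parents


class MetadataCatalog:
    def __init__(self) -> None:
        self.entries: Dict[str, DatasetMeta] = {}

    def register(self, name: str, pool: str, upstream: Iterable[str] = ()) -> None:
        self.entries[name] = DatasetMeta(pool=pool, upstream=list(upstream))

    def route(self, name: str) -> str:
        """Resolve a data access request to the pool that stores the dataset."""
        return self.entries[name].pool

    def on_change(self, name: str, upstream: Optional[Iterable[str]] = None) -> int:
        """Re-register version (and optionally lineage) when the data changes."""
        meta = self.entries[name]
        meta.version += 1
        if upstream is not None:
            meta.upstream = list(upstream)
        return meta.version

    def lineage(self, name: str) -> List[str]:
        """All upstream datasets, nearest first (breadth-first traversal)."""
        seen: List[str] = []
        queue = list(self.entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self.entries:
                queue.extend(self.entries[current].upstream)
        return seen


catalog = MetadataCatalog()
catalog.register("emr_raw", pool="medical_document_pool")
catalog.register("emr_structured", pool="medical_document_pool", upstream=["emr_raw"])
catalog.register("liver_cohort", pool="patient_information_pool", upstream=["emr_structured"])
print(catalog.route("liver_cohort"))     # patient_information_pool
print(catalog.lineage("liver_cohort"))   # ['emr_structured', 'emr_raw']
```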

Further, the data governance layer 200 further includes:

the calculation module 220 is used for integrating standard SQL, offline computing, real-time computing, MPP analysis, and a visual computing engine, obtaining the location of the target data through the data service of the metadata management system, and performing cross-platform computing, so that data sets do not need to be migrated between platforms;

the message publishing system 230 is used for providing an integrated, encapsulated message publishing service, supporting the Kafka data subscription mode, and unifying data publishing paths;

further, the data governance layer 200 further includes:

a data quality control module 240, a data development module 250 and a data label module 260;

the data quality control module 240 integrates Kerberos, adopts ticket-based authentication, controls data permissions and security throughout the whole process, and performs data quality control on the medical data;

the data development module 250 provides offline, real-time, and AI capabilities to mine the value of the medical data;

the data tag module 260 constructs a tag system, provides data tags, and performs data sorting on the medical data.

Specifically, in this embodiment, modules for data security, data quality control, data development, model design, data labeling, task orchestration, imaging, and visualization are further provided on the basis of the data lake technology. Kerberos is integrated and ticket-based authentication is adopted, so that data permissions and security are controlled throughout the whole process and leakage of sensitive patient and medical information is prevented. Data quality control is performed on data acquired from patients' test information, diagnosis results, regular examinations, and follow-up results, which prevents errors from cascading downstream and addresses data instability at its root. A unified computing and development module is provided, which lowers the entry barrier for researchers and carries offline, real-time, and AI capabilities, making it possible to mine greater data value. A rich label system is provided: labels are extracted along multiple branches, such as cancer lesions and cell carcinoma, to sort out the miscellaneous data branches. In this way, data assets are accumulated from existing data, and the development and iteration of cancer research are supported more rapidly through innovation.

The electronic medical record is not only a massive corpus, but also the basis of medical record big data analysis. The electronic medical record document not only contains completely unstructured content described by natural language text, but also contains semi-structured information and the like. In electronic medical records, the subject of the medical record and various diagnosis-related descriptions, physical examination records, ward rounds, medical orders, etc. contained in the medical record can be considered as semi-structured (or unstructured) content containing rich semantic information.

Meanwhile, the electronic medical record post-structuring model provided by this embodiment is combined with the cloud-edge collaboration mechanism, so that model training can be performed on the central cloud side, which improves the structuring accuracy of the model. The trained model is issued to the edge cloud through the central cloud scheduling mechanism. When data is collected at the edge cloud, the trained model can preprocess the unstructured and semi-structured data of the electronic medical record, which improves data processing efficiency and responsiveness.

In addition, this embodiment takes the application of the platform to liver cancer medical data management as an example. Fig. 5 shows the liver cancer medical data management and analysis flow.

In the process of diagnosing liver cancer lesions, the baseline diagnosis event flow basically goes through five stages: first visit, differential diagnosis, examination, treatment, and outpatient follow-up. Each stage involves the output and collection of medical data.

In the baseline procedure, the generation and flow of data is roughly:

first visit: diagnosis information related to liver space-occupying lesions and hepatocellular carcinoma appears for the first time;

differential diagnosis: patients in the liver space-occupying lesion group are tracked to determine whether related hepatocellular carcinoma appears, and patients enter the hepatocellular carcinoma group for examination;

examination: patients are grouped by examination type, according to whether patients from the previous stage have liver-cancer-related examination records after the visit and whether liver-cancer-related descriptions appear in the examination results;

treatment: patients are grouped by main material type, according to whether patients from the previous stage have liver-cancer-related outpatient and hospitalization records after the examination;

outpatient and follow-up visits: it is determined whether patients from the previous stage have liver-cancer-related outpatient and hospitalization records after treatment, and patients without such records are identified as a no-record group; whether a patient has died is determined from the discharge record.

In the baseline process, the management platform described in this embodiment undertakes the following main basic functions and runs through the whole process:

data lake storage: based on the data lake storage technology, planned data pools, such as a patient information pool, an outpatient information pool, a medical test pool, and a medical document pool, each store raw data according to the data output and lake-entry specifications, and the data accumulates into data assets;

data security: based on permission control, modification and change operations are not permitted, and sensitive data is desensitized;

data lineage: each data flow is captured to form a complete data-flow lineage for visual display;

data quality control: quality checks are performed using ICD-10 standard codes, such as R93.203 (hepatic space-occupying lesion), C22.900 (hepatic malignant tumor), C22.000 (hepatocellular carcinoma), C22.001 (hepatic malignant cell tumor), and C22.700 (hepatic malignant tumor), and data that fails the checks is reported;

data publishing: with the platform's data governance functions applied throughout, authoritative and credible data is output;
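The ICD-10 based quality check listed above can be sketched as a simple filter that passes records carrying one of the listed codes and reports the rest. The code list is taken from the text; the record layout and the field name `diagnosis_code` are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# ICD-10 codes and names as listed in the text above.
LIVER_CODES: Dict[str, str] = {
    "R93.203": "hepatic space-occupying lesion",
    "C22.900": "hepatic malignant tumor",
    "C22.000": "hepatocellular carcinoma",
    "C22.001": "hepatic malignant cell tumor",
    "C22.700": "hepatic malignant tumor",
}

PatientRecord = Dict[str, str]


def quality_check(records: List[PatientRecord]) -> Tuple[List[PatientRecord], List[PatientRecord]]:
    """Split records into (passed, reported) by diagnosis code."""
    passed: List[PatientRecord] = []
    reported: List[PatientRecord] = []
    for record in records:
        # "diagnosis_code" is a hypothetical field name.
        code = record.get("diagnosis_code", "").strip().upper()
        (passed if code in LIVER_CODES else reported).append(record)
    return passed, reported


records = [
    {"patient_id": "p1", "diagnosis_code": "C22.000"},
    {"patient_id": "p2", "diagnosis_code": "c22.900 "},  # normalized before lookup
    {"patient_id": "p3", "diagnosis_code": "C22"},       # incomplete code
    {"patient_id": "p4"},                                # missing code
]
passed, reported = quality_check(records)
print([r["patient_id"] for r in passed])    # ['p1', 'p2']
print([r["patient_id"] for r in reported])  # ['p3', 'p4']
```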

After the mass data has been collected and processed, the resulting data assets carry significant data value, and the following functions are built to mine the data information further:

visualization: a holographic view is provided based on aggregate statistics of the grouped data, showing how the data changes and flows;

data map: a diagnosis case data map is formed from the diagnosis flows and diagnosis results of different patients; combined with statistical information, the data map can be used to further mine different case conditions and the different outcomes of different diagnosis approaches, forming a liver-cancer-specific data map that supports data analysis by researchers;

data intelligence: liver disease statistics, case similarity, physicians' habits, treatment history, and liver cancer research knowledge are combined, and an artificial intelligence recommendation algorithm provides recommendation data that assists the physician in proposing a more constructive and reliable diagnosis and treatment plan and improves the experience of liver disease physicians; data intelligence based on historically correct data is provided, which can demonstrate the reliability of the diagnosis process and reduce medical disputes over liver cancer.

Example 3

This embodiment provides an electronic device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor implements the cloud-edge collaborative medical data management method based on the data lake technology according to embodiment 1 when executing the computer program.

Example 4

The present embodiment provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the cloud-edge collaborative medical data management method according to embodiment 1.

It should be understood that the above embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the claims of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A cloud-edge collaborative medical data management method based on a data lake, applied to a cloud-edge collaborative medical data management platform, wherein the cloud-edge collaborative medical data management platform comprises an edge cloud and a center cloud, the edge cloud and the center cloud are in communication connection, and the method comprises the following steps:

the center cloud trains a data analysis model according to the medical data and issues the trained data analysis model to the edge cloud, and the edge cloud performs data analysis on the medical data by using the trained data analysis model;

2. The cloud-edge collaborative medical data management method based on the data lake as claimed in claim 1, wherein the method for collecting medical data and sending the medical data to the edge cloud or the center cloud for data storage specifically comprises:

3. The cloud-edge collaborative medical data management method based on data lake technology as claimed in claim 1, wherein the edge cloud and the central cloud both use data lake technology to perform data storage on the medical data, and specifically include:

4. The cloud-edge collaborative medical data management method based on data lake technology according to claim 1, wherein the edge cloud and the center cloud perform data governance on the medical data, specifically comprising:

5. The cloud-edge collaborative medical data management method based on data lake technology according to claim 4, wherein the edge cloud and the center cloud perform data governance on the medical data, and further comprising:

6. The cloud-edge collaborative medical data management method based on data lake technology according to claim 1, wherein the edge cloud and the center cloud perform data governance on the medical data, and further comprising:

7. The cloud-edge collaborative medical data management method based on data lake technology according to claim 1, wherein the edge cloud and the center cloud perform data governance on the medical data, and further comprising:

8. A cloud-edge collaborative medical data management platform based on a data lake, wherein the method according to any one of claims 1-7 is performed, the platform comprises an edge cloud and a center cloud, and the edge cloud and the center cloud are in communication connection;

the edge cloud and the center cloud include:

the data governance layer is used for carrying out data governance on the medical data;

9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the cloud-edge collaborative medical data management method based on data lake technology according to any one of claims 1 to 7.

10. A computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the cloud-edge collaborative medical data management method according to any one of claims 1 to 7.