CN111048164A

CN111048164A - Medical big data long-term storage system

Info

Publication number: CN111048164A
Application number: CN201911166191.8A
Authority: CN
Inventors: 胡佳慧; 钱庆; 方安; 范云满; 杨晨柳; 陈凌云
Original assignee: Institute of Medical Information CAMS
Current assignee: Institute of Medical Information CAMS
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2020-04-21

Abstract

The invention provides a medical big data long-term storage system comprising five layers. The infrastructure layer allocates the operating resources of the system. The acquisition receiving layer acquires and receives multi-source heterogeneous medical data, classifies it to obtain the medical big data to be stored, and ingests that data into the data storage layer. The data storage layer packages the medical big data to be stored using a storage metadata technology to obtain standardized archival information packages for long-term storage, and monitors, manages and maintains those packages. The data application layer provides function-level microservices and plug-in-level microservices. The data service layer provides data distribution services, including data retrieval, browsing and downloading, through data access and interaction interfaces. The system safeguards the authenticity, integrity, availability and long-term interpretability of medical data, and thereby provides a strong guarantee for the long-term storage and use of important medical resources.

Description

Medical big data long-term storage system

Technical Field

The invention relates to the technical field of big data, in particular to a medical big data long-term storage system.

Background

Under the new paradigm of data-driven scientific research, medical big data is a strategic resource that plays an important supporting role in medical science and technology innovation. In particular, with the development of medical research and practice and the construction of various application platforms, a large amount of medical research data has been generated. This data spans a very wide range of subjects and, besides yielding broad economic and social benefits, has considerable big data research value for in-depth analysis, mining and reuse.

Medical big data comes from a wide range of sources, including biomedical science and technology literature, medical insurance data, clinical electronic medical records, medical forum data and the like. In addition to the general big data characteristics of large volume, diverse types, fast growth and high mining value, medical big data is also distinctively complex. First, medical data is difficult to acquire because of its sensitivity and the lack of sharing mechanisms. Second, because the instruments and equipment involved are specialized and precise, medical data can be expensive to acquire. Third, given the rigor of the discipline, medical domain knowledge generally plays the leading role in data analysis and in the interpretation of results. Finally, compared with general data, medical data is to some degree irreproducible.

Faced with large-scale, diverse and dynamically changing medical big data, how to take effective measures to ensure the authenticity, integrity, reliability and long-term interpretability of the data to the greatest possible extent is a problem that urgently needs to be solved.

Disclosure of Invention

In view of the above, in order to solve the above problems, the present invention provides a medical big data long-term storage system.

The technical scheme is as follows:

a medical big data long-term preservation system, the system comprising:

the infrastructure layer is used for distributing operation resources, and the operation resources comprise various calculation, storage and network resources;

the acquisition receiving layer is used for acquiring and receiving multi-source heterogeneous medical data, classifying the multi-source heterogeneous medical data to obtain medical big data to be stored, and importing the medical big data to be stored into the data storage layer;

the data storage layer is used for packaging the medical big data to be stored by using a metadata storage technology to obtain a standardized archiving information packet for long-term storage, and monitoring, managing and maintaining the archiving information packet;

the data application layer is used for providing functional level microservices and plug-in level microservices so as to realize the quick and flexible deployment of resources and applications;

and the data service layer is used for providing data distribution services through data access and interactive interfaces, and the data distribution services comprise data retrieval, browsing and downloading.

Preferably, the infrastructure layer is specifically configured to:

the cloud platform and the bottom layer virtualization platform are utilized to work cooperatively so as to realize abstraction, pooling and automation of computing, network and storage infrastructure services.

Preferably, the acquisition receiving layer includes:

the data acquisition module is used for acquiring multi-source heterogeneous medical data;

the data receiving module is used for receiving the multi-source heterogeneous medical data, classifying the multi-source heterogeneous medical data, and detecting the number, format and content of data to obtain medical big data to be stored;

and the data intake module is used for taking the medical big data to be stored into the data storage layer and tracking a data intake process.

Preferably, the data acquisition module is specifically configured to:

configuring an acquisition site, configuring an acquisition template and managing acquisition tasks.

Preferably, the data receiving module is specifically configured to:

creating a receiving task, and carrying out automatic processing on the receiving task by utilizing a workflow technology, wherein the automatic processing comprises data registration, SIP packet receiving, original data uploading, data virus detection, decompression test, MD5 code generation, receiving task management and receiving task monitoring.

Preferably, the data intake module is specifically configured to:

creating an ingestion task, and processing the ingestion task by combining manual intervention with system automation processing, wherein the system automation processing comprises data packet copying, MD5 detection, data backup, data packet decompression, quantity check, format check, content check, SIP standardization check, AIP generation and uploading, index creation, ingestion task management and ingestion task monitoring.

Preferably, the system further comprises:

the system management layer is used for carrying out task scheduling, workflow configuration and management and background management on the data acquisition module, the data receiving module, the data intake module, the data storage layer, the data application layer and the data service layer;

the task scheduling comprises task creation, task management, task distribution, executor management and task visual monitoring;

the workflow configuration and management comprises a creating process, an editing process, a deleting process, a releasing process and a process monitoring;

the background management comprises user management, role management, institution management, system parameter setting, authority setting, log management, format management, running environment/software management, protocol management, plug-in management, statistical analysis management and content management.

Preferably, the data storage layer includes metadata storage, service data storage, and file storage, and the data storage layer is specifically configured to:

archive object management, integrity and invariance audit, format management, data migration management, change tracking management, statistical report, data protocol management and user authority management; and the quantity and the quality of the archived information packets are checked by comprehensively utilizing the periodic audit and the irregular audit, and the integrity and the invariance of the archived information packets are ensured.
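
The integrity and invariance audit mentioned above can be pictured as a comparison between the packages currently held and a manifest of checksums recorded when they were archived. The following is only a minimal sketch under assumptions: the function name `audit_packages`, the in-memory dictionaries and the package identifiers are hypothetical, MD5 is used because it is the checksum named elsewhere in this description, and the scheduling of periodic versus irregular audits is not shown.

```python
import hashlib
from typing import Dict, List


def audit_packages(manifest: Dict[str, str], stored: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare stored packages with the checksums recorded at archiving time."""
    # Quantity audit: every package in the manifest must still exist.
    missing = sorted(pid for pid in manifest if pid not in stored)
    # Quality audit: every stored package must still match its recorded checksum.
    altered = sorted(
        pid for pid, data in stored.items()
        if pid in manifest and hashlib.md5(data).hexdigest() != manifest[pid]
    )
    return {"missing": missing, "altered": altered}


# Hypothetical archive contents at archiving time.
original = {"aip-001": b"alpha", "aip-002": b"beta", "aip-003": b"gamma"}
manifest = {pid: hashlib.md5(data).hexdigest() for pid, data in original.items()}

# Simulate one lost package and one silently modified package.
stored = {"aip-001": b"alpha", "aip-002": b"beta (edited)"}
report = audit_packages(manifest, stored)
print(report)  # {'missing': ['aip-003'], 'altered': ['aip-002']}
```

The same function could be invoked from a scheduled job for periodic audits and on demand for irregular audits.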

Preferably, the data application layer is specifically configured to:

providing functional level microservices including data acquisition, data reception, data intake, public service, storage management and cooperative storage; providing plug-in level microservices including MD5 code detection, decompression testing, virus checking, quantity checking, format checking, and data backup.

Preferably, the data service layer is specifically configured to:

the data distribution facing the cooperative organization and the data distribution facing the public are provided, the data distribution facing the cooperative organization comprises establishment of a cooperative storage protocol, cooperative storage user management, cooperative storage namespace maintenance and organization user background statistics, and the data distribution facing the public comprises IP user login, common retrieval, advanced retrieval, professional retrieval, resource retrieval, online reading, resource downloading and auxiliary tool downloading.

Compared with the prior art, the invention has the following beneficial effects:

The medical big data long-term storage system provided by the invention is based on the development needs of resource construction and information services in the medical field. It takes account of the new trend in the big data era whereby the objects of resource preservation are shifting from traditional paper literature to many types of data resources, and it aims at sound medical information assurance and service capability. The system is built around the conceptualization, acquisition, reception, ingestion, storage and access of digital resources for preservation, so as to ensure the authenticity, integrity, availability and long-term interpretability of medical data, thereby providing a strong guarantee for the long-term storage and use of important medical resources.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a system architecture diagram of a medical big data long-term storage system according to an embodiment of the present invention;

FIG. 2 is a data flow diagram of a medical big data long-term storage system according to an embodiment of the present invention;

FIG. 3 is a framework for associative integration of multi-source heterogeneous digital objects according to an embodiment of the present invention;

FIG. 4 is a diagram of the underlying metadata repository structure of a medical big data long-term storage system according to an embodiment of the present invention;

FIG. 5 shows an example of the record details of a data intake;

FIG. 6 is an interface schematic diagram of a medical big data long-term storage system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

To facilitate understanding of the present application, long-term preservation is first described below:

The Reference Model for an Open Archival Information System (OAIS) provides a basic framework for long-term preservation. The term "open" indicates that the recommendations and standards related to the model are developed openly; it does not mean that access to archived content is unrestricted. In practice, preserved information is often held in a "dark archive" mode, in which access to the archived information is enabled only when a specific event is triggered. The model was originally proposed in 1999 by the National Aeronautics and Space Administration (NASA) and the Consultative Committee for Space Data Systems (CCSDS). Through continuous development and improvement it has become ISO 14721:2012, an important standard commonly followed in the construction of digital preservation systems.

According to the OAIS definition, long-term preservation is the act of managing and maintaining preserved content over the long term. It is intended to ensure that the preserved content can be understood by a designated community, and to provide evidence supporting its authenticity, so that the content can be accurately rendered for as long as required. Here, "long term" means a period long enough for technological change, new media and data formats, and changes in the designated user community to have an impact on the preserved information.

It can be seen that long-term preservation is not the same as data backup, which is only a security management measure. Long-term preservation emphasizes life-cycle maintenance of digital content, including data auditing, data association and data monitoring: data auditing ensures integrity, data association ensures discoverability, and monitoring controls access to the preserved content so as to meet the relevant privacy permission and intellectual property restrictions.

The main goal of long-term preservation is to preserve specific information for an indefinite period of time. To preserve an information object, the long-term preservation system must adequately understand the data object and its associated representation information. The OAIS reference model emphasizes the preservation of information content, which is the key to long-term preservation.

An information package contains content information and preservation description information. The content information is the target of preservation and consists of content data objects together with the representation information related to them, so that the content data objects can be understood by the designated community. There are five types of preservation description information: reference information, provenance information, context information, fixity information and access rights information. The reference information provides an identifier that uniquely identifies the content information. The provenance information describes the source of the content information, provides an audit trail for it, and supplies a basis for its authenticity and reliability. The context information records why the content information was created and how it relates to other content information objects in its environment. The fixity information supports checking and verifying the integrity of the data, to ensure that a particular content information object has not been altered in an undocumented manner. The access rights information defines the scope of permissions for preserving, distributing and using the content information.
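
The structure of an information package described above can be sketched as a small data model. This is an illustration only, not the patent's implementation: the class and field names (`InformationPackage`, `ContentInformation`, `PreservationDescription`, `missing_fields`) and the sample values are hypothetical, and the five description fields simply mirror the five kinds of description information listed in the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ContentInformation:
    """The preservation target: data objects plus what is needed to interpret them."""
    data_objects: List[str]               # hypothetical: paths or IDs of content data objects
    representation_info: Dict[str, str]   # e.g. file format, encoding, required software


@dataclass
class PreservationDescription:
    """The five kinds of description information named in the text."""
    reference: str             # identifier that uniquely identifies the content
    provenance: List[str]      # source of the content and its audit trail
    context: str               # why the content was created, related objects
    fixity: Dict[str, str]     # checksums used to detect unrecorded changes
    access_rights: str         # permitted preservation, distribution and use


@dataclass
class InformationPackage:
    content: ContentInformation
    description: PreservationDescription

    def missing_fields(self) -> List[str]:
        """Return the names of description fields that are still empty."""
        return [name for name, value in vars(self.description).items() if not value]


package = InformationPackage(
    content=ContentInformation(
        data_objects=["records.csv"],
        representation_info={"format": "text/csv", "encoding": "UTF-8"},
    ),
    description=PreservationDescription(
        reference="pkg-0001",
        provenance=["submitted by hospital A"],
        context="clinical study dataset",
        fixity={},                 # no checksum recorded yet
        access_rights="restricted",
    ),
)
print(package.missing_fields())  # ['fixity']
```

A check of this kind could be run before archiving to flag packages whose description information is incomplete.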

On the basis, the invention provides a medical big data long-term storage system, the system architecture diagram of which is shown in fig. 1, and the system comprises: an infrastructure layer 10, an acquisition reception layer 20, a data storage layer 30, a data application layer 40, and a data service layer 50.

The infrastructure layer 10 is configured to allocate operating resources, where the operating resources include various computing, storage, and network resources.

In the embodiment, various resources such as calculation, storage, network and the like are reasonably utilized and distributed by adopting a virtualization technology, so that the utilization rate of the resources and the reliability of application are improved. Through the cooperative work of the cloud platform and the bottom layer virtualization platform, the abstraction, the pooling and the automation of computing, network and storage infrastructure services are realized.

One of the major challenges facing long-term storage is its high cost. Cloud computing and virtualization technologies can provide an economically viable long-term storage solution. In addition, cloud storage is flexible and dynamically scalable, and offers solutions such as mass storage, cooperative storage, efficient backup and real-time migration for the long-term storage of digital resources in a big data environment.

The medical big data long-term storage system adopts a cloud-based, preservation-aware storage service, which has the following advantages: ① offloading preservation-related functions to the storage system reduces the likelihood of data damage or loss and makes the digital preservation system more robust; ② a cloud-based long-term storage scheme supports logical storage of resources, so that a change in the physical location of an object in the cloud does not affect users' access to the data; ③ using cloud-based virtual appliances to store both the data content and the specific software required to render the data improves the future understandability of the stored content.

The single cloud storage mode has certain application limitation and potential safety hazard. With the development of cloud technology, the multi-cloud storage provides a new service mode for various applications in a big data environment. The medical big data can be stored for a long time by simultaneously utilizing a plurality of clouds with different functions, so that the dynamic allocation, flexible scheduling and cross-domain sharing of resources in a larger range are realized, and the overall utilization rate of the resources is improved. In addition, through flexible configuration of data management functions, long-term storage requirements of various types of digital resources at different stages along with time can be met based on a multi-cloud storage mode.

The acquisition receiving layer 20 is configured to acquire and receive multi-source heterogeneous medical data, classify the multi-source heterogeneous medical data to obtain medical big data to be stored, and import the medical big data to be stored into the data storage layer 30.

In this embodiment, the multi-source heterogeneous medical data includes medical scientific data, professional databases, web page data and other data with long-term preservation value. An appropriate acquisition and collection method is determined for each data type, so that multi-source, heterogeneous, massive medical data can be comprehensively collected and classified.

In the specific implementation process, the acquisition receiving layer 20 includes: a data acquisition module 201, a data reception module 202 and a data intake module 203.

The data acquisition module 201 is used for acquiring multi-source heterogeneous medical data.

In this embodiment, a data acquisition tool, a data capture tool and a data synchronization tool are used. For example, http live is responsible for capturing web page data, Kafka/Flume handles the transmission and capture of various data streams, and Logstash implements data synchronization and the capture and analysis of log data.

The data acquisition module 201 is specifically configured to configure an acquisition site, configure an acquisition template, and manage an acquisition task.

Configuring an acquisition site: the site name and site description of the site to be acquired are configured. The constraints are that the site name must be unique and that neither the site name nor the site description may be empty.

Configuring an acquisition template: the template information of the site to be acquired is configured, comprising acquisition parameters, an acquisition entry page and an acquisition proxy. The acquisition parameters include the number of threads, connection timeout, number of retries, retry interval, domain name, cookies, headers and related parameters. The acquisition entry page is the URL of the site's home page. The acquisition proxy includes the proxy IP, proxy port, proxy user name, proxy password and related parameters, and is configured to reduce the probability of being blocked by the site.

Managing acquisition tasks: the execution of acquisition tasks is controlled, including starting, pausing, resuming and stopping acquisition tasks and viewing acquisition logs.
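
The site constraints, template parameters and task controls described above can be sketched as follows. This is a minimal illustration under assumptions: the names `AcquisitionTemplate`, `SiteRegistry`, `TRANSITIONS` and `apply_action`, the default parameter values and the set of allowed state transitions are all hypothetical and are not specified by the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AcquisitionTemplate:
    entry_url: str                        # acquisition entry page (site home page URL)
    threads: int = 4                      # the default values below are assumptions
    connect_timeout_s: float = 10.0
    retries: int = 3
    retry_interval_s: float = 5.0
    headers: Dict[str, str] = field(default_factory=dict)
    proxy_host: Optional[str] = None      # optional proxy settings
    proxy_port: Optional[int] = None


class SiteRegistry:
    """Stores acquisition sites and enforces the name/description constraints."""

    def __init__(self) -> None:
        self._sites: Dict[str, dict] = {}

    def add_site(self, name: str, description: str, template: AcquisitionTemplate) -> None:
        if not name or not description:
            raise ValueError("site name and site description must not be empty")
        if name in self._sites:
            raise ValueError(f"site name must be unique: {name!r}")
        self._sites[name] = {"description": description, "template": template}


# Task control: (current state, action) -> next state. Assumed transition table.
TRANSITIONS = {
    ("created", "start"): "running",
    ("running", "pause"): "paused",
    ("paused", "resume"): "running",
    ("running", "stop"): "stopped",
    ("paused", "stop"): "stopped",
}


def apply_action(state: str, action: str) -> str:
    """Return the next task state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a task that is {state}") from None


registry = SiteRegistry()
registry.add_site("example-site", "Illustrative site", AcquisitionTemplate("https://example.org/"))
state = apply_action(apply_action("created", "start"), "pause")
print(state)  # paused
```

Keeping the transition table as data makes it easy to adjust which controls are permitted in which task state.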

The data receiving module 202 is configured to receive multi-source heterogeneous medical data, perform classification processing on the multi-source heterogeneous medical data, and detect the number, format, and content of the data to obtain medical big data to be stored.

See the data flow in the medical big data long-term storage system shown in Fig. 2. In the system, preserved content is exchanged in the form of information packages. There are three important types of package: the Submission Information Package (SIP), the Archival Information Package (AIP) and the Dissemination Information Package (DIP).

The data submitter submits the content to be saved to the medical big data long-term saving system, and the SIP packet contains data and content information so as to ensure that the saving system can maintain the saved content and future data users can access, understand and use the saved content through the saving system.

The medical big data long-term preservation system receives SIPs from a data submitter and, through the data intake functional entity, converts them into a group of AIPs suitable for data archiving and data management. It classifies the received information objects, determines the collection to which each object belongs, and, after AIP archiving is completed, creates a message to update the collection description.

The data archiving functional entity receives the AIPs generated by the ingestion process and adds them to the persistent repository. The data management functional entity takes the package descriptions generated in the data intake stage and extends the existing collection description. During data archiving and management, operations such as media refreshing, error correction and database maintenance are required to prevent information loss caused by changes in technology, media, data formats and user groups over time.

In response to a data access request from a data consumer, the data access functional entity interacts with data archiving and data management to obtain the AIPs corresponding to the requested DIP and their associated package descriptions. Data archiving and data management create a copy of the requested objects in temporary storage, and data access converts the set of AIPs and associated package descriptions into a set of DIPs and stores them on a physical distribution medium for delivery to data consumers in a data distribution session.
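
The flow from submission to archiving to distribution described above can be pictured with three small record types. This is a hedged sketch only: the classes `SIP`, `AIP` and `DIP`, the functions `ingest` and `disseminate`, and all field names are hypothetical simplifications, and real packages would carry far richer description information.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SIP:
    submitter: str
    files: Dict[str, bytes]


@dataclass
class AIP:
    aip_id: str
    files: Dict[str, bytes]
    description: Dict[str, str]     # package description used by data management


@dataclass
class DIP:
    source_aip: str
    files: Dict[str, bytes]


def ingest(sip: SIP, aip_id: str) -> AIP:
    """Convert a submitted package into an archival package with a description."""
    description = {"provenance": sip.submitter, "file_count": str(len(sip.files))}
    return AIP(aip_id, dict(sip.files), description)


def disseminate(aip: AIP, requested: List[str]) -> DIP:
    """Build a distribution package from copies of the requested files only."""
    selected = {name: aip.files[name] for name in requested if name in aip.files}
    return DIP(aip.aip_id, selected)


sip = SIP("hospital-a", {"records.csv": b"id,value\n1,42\n", "readme.txt": b"notes"})
aip = ingest(sip, "aip-0001")
dip = disseminate(aip, ["records.csv", "missing.txt"])
print(sorted(dip.files), len(aip.files))  # ['records.csv'] 2
```

The archived package is left untouched by `disseminate`, which reflects the idea that distribution works on a copy of the requested objects.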

A framework for the associative integration of multi-source heterogeneous digital objects is shown in Fig. 3. Because data as collected and received does not carry sufficient representation information and preservation description information, the SIP must be processed during the ingestion phase to ensure that the information needed for long-term accessibility and usability of the preserved data objects is gathered. Metadata associated with the data objects is extracted, and all content is packaged into an AIP for archiving. The captured metadata of each data object is encoded as RDF triples and stored in the index. Object formats, preserved knowledge and domain-specific concepts are modeled in an application-oriented manner, achieving efficient management of the metadata of multi-source digital objects. The data objects are made available in the form of DIPs to provide access and utilization services, and are stored in a graph database to support knowledge reasoning and mining and complex graph queries.
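
Encoding extracted metadata as RDF triples can be illustrated with plain tuples. The sketch below makes several assumptions that are not in the source: it uses the Dublin Core terms namespace for predicates, an `example.org` URI for the subject, treats every value as a plain literal, and the helper names `metadata_to_triples` and `to_ntriples` are hypothetical. A production system would more likely rely on an RDF library and a triple store or graph database.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
DCTERMS = "http://purl.org/dc/terms/"   # Dublin Core terms namespace (assumed vocabulary)


def metadata_to_triples(subject: str, metadata: Dict[str, str]) -> List[Triple]:
    """Turn flat key/value metadata into (subject, predicate, object) triples."""
    return [(subject, DCTERMS + key, value) for key, value in sorted(metadata.items())]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples with literal objects as N-Triples-style lines."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


triples = metadata_to_triples(
    "http://example.org/aip/0001",
    {"title": "Cohort study dataset", "format": "text/csv"},
)
print(to_ntriples(triples))
```

Each metadata field becomes one triple, so queries over the index can match on subject, predicate or object independently.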

The data receiving module 202 is specifically configured to create a receiving task and to process the receiving task automatically using workflow technology, wherein the automatic processing comprises data registration, SIP packet receiving, original data uploading, data virus detection, decompression test, MD5 code generation, receiving task management and receiving task monitoring.

In this embodiment, receiving tasks are created in batches by a distributed task scheduling system. The batch number of a batch of receiving tasks is generated from the date and the current machine information, and serves as the common valid identifier of the subsequent tasks.
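
A batch number derived from the date and the current machine information might be built as shown below. The source does not specify the format, so the timestamp layout, the use of the host name as the machine information and the function name `make_batch_number` are all assumptions made for illustration.

```python
import socket
from datetime import datetime
from typing import Optional


def make_batch_number(now: Optional[datetime] = None, machine: Optional[str] = None) -> str:
    """Build a batch number from a timestamp and a machine identifier."""
    now = now or datetime.now()
    machine = machine or socket.gethostname()   # assumed source of machine information
    return f"{now:%Y%m%d%H%M%S}-{machine}"


# Fixed inputs make the result reproducible for the example.
print(make_batch_number(datetime(2019, 11, 25, 9, 30, 0), "node01"))  # 20191125093000-node01
```

In a deployment with several tasks created per second on one machine, a sequence counter would also be needed to keep batch numbers unique.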

Further, workflow technology provides an automated solution for the complex process management involved in the long-term storage of medical big data. Long-term preservation must ensure the long-term availability of preserved resources despite changes in time, environment, technology, laws and regulations, and other factors. With a workflow management tool, specific functional steps in the preservation life cycle can be preconfigured, based on the preservation plan, into processes to be executed; and by monitoring external events in the preservation system in real time, processes can be reconfigured when changes occur.

In view of its advantages in data persistence, process design, native support, data access efficiency and other respects, the workflow engine Activiti is used as the basis for flexibly configuring the workflow of each step of long-term storage. Combining workflow with the task scheduling mechanism enables efficient processing of distributed tasks over massive data.

Further, in an automated process:

Data registration: the data handled during receiving is registered and recorded to facilitate later data review, and a registration number is generated from the date and the current machine information.

SIP packet receiving: SIP packets uploaded by users, as well as SIP packets obtained by the data acquisition module and the data receiving module, are received. In this step, the receiving agreement for the submitted SIP and the copyright and privacy statements in that agreement need to be checked.

Original data uploading: the submitted original data is uploaded to a specified location on the server.

Data virus detection: the uploaded file is checked for viruses and trojans; if any are found, the upload fails. Virus detection depends on the antivirus software installed on the server.

Decompression test: all uploaded data packets are test-decompressed; a data packet that fails the test is either excluded from the subsequent processes or deleted.

MD5 code generation: a valid 32-character hexadecimal MD5 code (a 128-bit digest) is generated for the data packet using the MD5 algorithm, which helps guard against problems such as the data packet being maliciously tampered with in transit or data being lost through improper operation.

Receiving task management: the overall process is managed, including processes that have not started, are in progress or are completed; processes can be started, stopped and deleted. The constraint is that a process that is already executing cannot be stopped: the stop function only affects the execution period (polling period) of the process, that is, once the task is stopped, the process is not executed the next time its scheduled time arrives.

Receiving task monitoring: the running processes are monitored; the monitored function points include process execution state query and visual display of logs.
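
Two of the automated receiving steps above, the decompression test and MD5 code generation, can be sketched with the Python standard library. The sketch assumes that data packets are ZIP archives, which the source does not state; the function names `md5_of_file`, `decompression_test` and `receive_package` and the sample file names are hypothetical.

```python
import hashlib
import os
import tempfile
import zipfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the 32-character hexadecimal MD5 code of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decompression_test(path: str) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


def receive_package(path: str) -> dict:
    """Run the decompression test and, if it passes, record the MD5 code."""
    ok = decompression_test(path)
    return {"path": path, "decompress_ok": ok, "md5": md5_of_file(path) if ok else None}


# Demo with a small archive created in a temporary directory.
workdir = tempfile.mkdtemp()
sip_path = os.path.join(workdir, "sip-0001.zip")
with zipfile.ZipFile(sip_path, "w") as archive:
    archive.writestr("data/records.csv", "id,value\n1,42\n")
receipt = receive_package(sip_path)
print(receipt["decompress_ok"], len(receipt["md5"]))  # True 32
```

The recorded MD5 code is what a later stage would compare against to confirm that the packet has not changed since it was received.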

The data intake module 203 is used for ingesting the medical big data to be stored into the data storage layer and tracking the data intake process.

The data intake module 203 is specifically configured to create an ingestion task and to process the ingestion task by combining manual intervention with system automation processing, wherein the system automation processing comprises data packet copying, MD5 detection, data backup, data packet decompression, quantity check, format check, content check, SIP standardization check, AIP generation and uploading, index creation, ingestion task management and ingestion task monitoring.

In this embodiment, the data intake module receives SIP packets in batches, and the system executes ingestion tasks using distributed task scheduling. The batch number of an ingestion task is generated from the date and the current machine information, and serves as the valid execution identifier of the subsequent task flow.

Further, in the system automation process:

data packet replication, which is mainly used for processing SIP packets received according to batches and supporting the SIP packets to be replicated from an area A to an area B; the system supports the execution of packet replication operations in a task scheduling manner.

The MD5 detects the MD5 code of the ingestion stage and compares the MD5 code with the MD5 code generated in the receiving stage, so that the problems that the data packet is maliciously tampered in the process or the data is lost due to misoperation and the like are prevented.

And data backup, namely, the backup operation of the SIP package, the backup of a plurality of copies is supported, and the copies are stored in a plurality of server file systems.

And decompressing the data packet, decompressing the SIP data packet, and simultaneously supporting the decompression of the SIP packet sub-packet.

And the quantity check is mainly used for checking the content after the SIP packet is decompressed and verifying the size of the data packet and the number of the data files according to the protocol content.

Format check examines the content of the decompressed SIP packet and verifies the file formats in the data packet against the protocol content.

Content check examines the contents of the files in the decompressed SIP packet and verifies, against the protocol content and the description file content, whether each data file contains the specified field information.

SIP standardization check comprehensively examines the contents of the decompressed SIP packet against the SIP packet specification and verifies whether an AIP packet can be generated from it.

AIP generation and uploading generates AIP packages in batches and uploads them in batches to the relevant repositories.

Index creation indexes the data files of the decompressed SIP data into the relevant repository.

Ingestion task management manages the entire processing flow, covering processes that have not started, are in progress, or are completed, and allows processes to be started, stopped and deleted. A task that has already been executed cannot be stopped.

Ingestion task monitoring monitors running processes; the monitored function points include process execution status query and visual log display.
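The MD5 detection step described above can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`; the function names `md5_of_file` and `verify_sip` and the idea of passing the reception-stage digest as a hex string are assumptions, not details taken from the system itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reading keeps memory use flat for large SIP archives.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sip(path: str, md5_at_reception: str) -> bool:
    """Compare the ingestion-stage MD5 with the one recorded at reception."""
    return md5_of_file(path) == md5_at_reception.strip().lower()
```

A mismatch means the packet changed between reception and ingestion; a task scheduler would typically halt the batch at this point rather than continue to backup and decompression.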

The medical big data long-term storage system supports flexible configuration and invocation of workflows. The core processing tools for information packages are provided as services in component form, so that each preservation institution can flexibly assemble the workflow it needs according to its own circumstances. For example, according to ISO 16363:2012, the audit and certification standard for trustworthy digital repositories, the integrity and correctness of an AIP must be verified when it is created, and the intelligibility of the AIP content information must be guaranteed. Fig. 5 shows an example of the record details of a data ingestion. By defining a data ingestion workflow, the whole sequence from data backup, data packet decompression, data check, format check, content check and SIP standardization check through to AIP generation and uploading and index creation can be automated.
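The idea of assembling a workflow from independent components can be sketched as an ordered list of named check functions. The system described here defines its workflows in Activiti; the plain-Python runner below is a stand-in for illustration, and the packet fields (`files`, `expected_file_count`, `allowed_formats`) are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Packet = Dict[str, Any]
Step = Tuple[str, Callable[[Packet], bool]]


def quantity_check(packet: Packet) -> bool:
    # The number of files must match the count agreed in the protocol.
    return len(packet["files"]) == packet["expected_file_count"]


def format_check(packet: Packet) -> bool:
    # Every file must carry one of the agreed file extensions.
    allowed = tuple(packet["allowed_formats"])
    return all(name.lower().endswith(allowed) for name in packet["files"])


def run_workflow(steps: List[Step], packet: Packet) -> List[Tuple[str, bool]]:
    """Run steps in order and stop at the first failing step."""
    results = []
    for name, check in steps:
        ok = check(packet)
        results.append((name, ok))
        if not ok:
            break
    return results


# An institution assembles only the components it needs, in its own order.
INGEST_WORKFLOW: List[Step] = [
    ("quantity check", quantity_check),
    ("format check", format_check),
]
```

Because each step shares the same signature, adding a content check or a SIP standardization check amounts to appending another entry to the list.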

The data storage layer 30 encapsulates the medical big data to be stored using a storage metadata technology to obtain standardized archive information packages for long-term storage, and monitors, manages and maintains the archive information packages.

In this embodiment, the medical big data is stored on distributed storage, comprising metadata storage, business data storage and file storage. The metadata is stored in Fedora, the data index is stored in ElasticSearch, and business data storage supports relational databases.

The underlying metadata repository structure of the medical big data long-term storage system is shown in Fig. 4. The submission information packet (SIP) is checked and processed by a workflow defined in Activiti, and the data is finally stored in Fedora and ElasticSearch.

Fedora serves as a flexible, extensible digital object storage framework. It provides a multi-version management strategy for metadata, uses the Resource Description Framework (RDF) to manage digital resources, enables association discovery and semantic retrieval services, and also supports storage of original files, which can be packaged into the archive information package (AIP) required by the long-term storage system according to different service requirements.

ElasticSearch provides a metadata-based index service, supports distributed deployment and the configuration of multiple retrieval strategies, and meets the various retrieval requirements of storage management and public service.

The data storage layer is specifically configured for: archived object management, integrity and invariance audit, format management, data migration management, change tracking management, statistical reporting, data protocol management and user authority management; and for checking the quantity and quality of the archived information packets through a combination of periodic and irregular audits, thereby ensuring the integrity and invariance of the archived information packets.

Archived object management provides administrators with retrieval, query and download of archived objects, so that they can manage the archived objects conveniently.

Integrity and invariance audit ensures the integrity and invariance of information packets by checking the number and quality of the archived information packets.

Format management provides administrators with the data formats against which ingestion workflow tasks are checked, and makes operations such as auditing easier for system users.

Data migration management covers hardware migration, software migration, format migration, version migration, access point migration and carrier migration. A migration detail form is filled in, a migration list is generated after review, and the migration status is then recorded.

Change tracking management provides an audit trail of the content information and manages its change records, providing a basis for the authenticity and reliability of the content information.

Statistical reporting records and compiles statistics on the operations performed on information packets, and provides query and download of audit reports.

Data protocol management provides management and maintenance of the protocol information of cooperating organizations, including online protocol submission, protocol confirmation, validity management, protocol status management, and management of the mapping between a protocol and its corresponding resource library, resource type and the like, so that a system administrator can manage the status of each organization in a unified manner. The main functions include entering, adding, deleting and modifying information for acquisition protocols, acceptance protocols, storage protocols, audit protocols, public service protocols and cooperative storage protocols.

User authority management assigns permissions to users in different roles and manages the users' permission information.

Here, an audit is a method of ensuring the integrity and invariance of information packets by checking the number and quality of the archived information packets.

A periodic audit is an audit of the archived information packets carried out at scheduled times according to the storage plan.

An irregular audit is an audit of the archived information packets carried out at unscheduled times.
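An audit of the kind defined above, checking both the number and the quality of archived packets, might be sketched as follows. The text does not specify how quality is measured; this sketch assumes a stored digest per packet (SHA-256 here, chosen for illustration) and represents the archive as an in-memory mapping. The names `audit_archive`, `stored` and `manifest` are hypothetical.

```python
import hashlib
from typing import Dict, List


def audit_archive(stored: Dict[str, bytes],
                  manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare archived packets with the manifest recorded at ingestion.

    stored   maps packet id -> packet bytes currently in the archive
    manifest maps packet id -> hex SHA-256 digest recorded at archiving time
    """
    # Quantity: packets that disappeared or appeared since archiving.
    missing = sorted(set(manifest) - set(stored))
    unexpected = sorted(set(stored) - set(manifest))
    # Quality (invariance): packets whose content no longer matches its digest.
    altered = sorted(
        pid for pid in set(stored) & set(manifest)
        if hashlib.sha256(stored[pid]).hexdigest() != manifest[pid]
    )
    return {"missing": missing, "unexpected": unexpected, "altered": altered}
```

The same function serves both audit types: a periodic audit would call it from a scheduled task, while an irregular audit would call it on demand.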

The data application layer 40 provides functional-level microservices and plug-in-level microservices so as to enable rapid and flexible deployment of resources and applications.

To allow resources and applications to be deployed rapidly and flexibly, the medical big data long-term storage system adopts microservice management, which supports rapid decoupling and integration as well as distributed deployment and dynamic capacity expansion without affecting existing services.

To meet various application scenarios, the microservices provided by the medical big data long-term storage system comprise application microservices, integration microservices and data microservices. Application microservices turn the application systems/modules of the constructed system into microservices; a single system or module can run independently and also supports data communication between systems and modules. Integration microservices implement integration between systems, both internal and external; the integration framework provides the basic capabilities required for synchronous and asynchronous communication between components, and interaction between systems only needs to follow the agreed REST interfaces and message definitions. Data microservices provide a data retrieval and browsing interface, a data facet summary interface and a data statistical analysis summary interface, support permission assignment and control, and support encrypted data transmission, thereby safeguarding data security.

The data application layer 40 is specifically configured to: provide functional-level microservices including data acquisition, data reception, data intake, public service, storage management and cooperative storage; and provide plug-in-level microservices including MD5 code detection, decompression testing, virus checking, quantity checking, format checking and data backup.

In this embodiment, the service modules for acquisition, reception, ingestion, storage, management, access and the like are each designed on the microservice concept, and MD5 code detection, decompression testing, virus checking, quantity checking, format checking, data backup and the like are managed as plug-ins.

Fig. 6 is an interface schematic diagram of the medical big data long-term storage system provided by the invention. The system has already been used for the long-term preservation of medical electronic publications, covering two backfile e-book collections for which long-term preservation rights have been obtained: 1827 Karger e-books and 2239 Wiley e-books. A series of software tools was produced during development of the system, such as content check tools, AIP packet generation tools and upload tools, and MedPRES has placed these tools under long-term preservation as well.

The data service layer 50 provides data distribution services through data access and interaction interfaces; the data distribution services comprise data retrieval, browsing and downloading.

In this embodiment, users can search the resources held in the medical big data long-term storage system and browse their details online, and online reading and downloading of documents are supported. Retrieval comprises ordinary retrieval, advanced retrieval and professional retrieval.

The data service layer is specifically configured to provide: data distribution for cooperating institutions, comprising cooperative storage protocol formulation, cooperative storage user management, cooperative storage namespace maintenance and institution user background statistics; and data distribution for the public, comprising IP user login, ordinary retrieval, advanced retrieval, professional retrieval, resource retrieval, online reading, resource downloading and auxiliary tool downloading.

Cooperative storage protocol formulation establishes the relevant cooperative storage protocols as required, providing a basis for subsequent steps. It mainly comprises the offline protocol workflow and protocol management.

Cooperative storage user management mainly covers the management of information such as user name, institution name, login mode, creation time, status, login password and contact person. The institution management list supports deleting and editing existing institution data.

Cooperative storage namespace maintenance maintains the uniqueness of namespaces and ensures that resource identifiers are unique across the cooperating long-term storage systems.

Institution user background statistics comprises four functions: user usage list, query, graph display and export of the user usage table. The user usage list shows the records of user usage, and the graph is a line chart showing the trend over the statistical period.

It should be noted that, to achieve long-term storage of medical big data, the medical big data long-term storage system is designed according to currently recognized international standards: the model definition complies with ISO 14721:2012, Open Archival Information System (OAIS), and trusted certification of the digital repository follows ISO 16363:2012, Audit and Certification of Trustworthy Digital Repositories.

Furthermore, the medical big data long-term storage system can also comprise a system management layer which is used for carrying out task scheduling, workflow configuration and management and background management on the data acquisition module, the data receiving module, the data intake module, the data storage layer, the data application layer and the data service layer;

task scheduling comprises task creation, task management, task distribution, executor management and task visual monitoring;

workflow configuration and management, including creating flow, editing flow, deleting flow, releasing flow and monitoring flow;

and background management, including user management, role management, organization management, system parameter setting, permission setting, log management, format management, operating environment/software management, protocol management, plug-in management, statistical analysis management and content management.

Task creation: a system user creates scheduled tasks for operations such as data receiving, data intake and auditing, so that the program can execute them automatically.

Task management comprises task creation (creating a scheduled task), task start (starting a task, which then runs according to its scheduling period), task stop (stopping a started task, which is then no longer executed according to its scheduling period), task recovery (restoring a stopped task to the state of executing according to its scheduling period) and task deletion (deleting a scheduled task; the task must be stopped before it is deleted).

Task distribution sets the distribution mode of tasks, of which there are two: automatic distribution by the program and manual distribution.

Executor management mainly includes adding executors (executor name, injection mode, executor description and server IP), editing executors and deleting executors.

Task visual monitoring lets administrators check the execution status of each task so that they can make correct and timely judgments about it. It mainly includes querying logs by executor, task name and scheduling time, and viewing the log of a single execution to check task execution in real time.
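The task management operations described above (create, start, stop, recover, delete) form a small state machine, sketched below. The state names and the class `ScheduledTask` are hypothetical. The rule that a running task must be stopped before deletion comes from the text; allowing deletion of a task that was created but never started is an assumption.

```python
from enum import Enum


class TaskState(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    DELETED = "deleted"


# (action, current state) -> next state; anything not listed is rejected.
TRANSITIONS = {
    ("start", TaskState.CREATED): TaskState.RUNNING,
    ("stop", TaskState.RUNNING): TaskState.STOPPED,
    ("recover", TaskState.STOPPED): TaskState.RUNNING,
    ("delete", TaskState.STOPPED): TaskState.DELETED,
    # Assumption: a task that never ran may also be deleted directly.
    ("delete", TaskState.CREATED): TaskState.DELETED,
}


class ScheduledTask:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = TaskState.CREATED

    def apply(self, action: str) -> TaskState:
        """Apply a management action, or raise if it is not allowed."""
        key = (action, self.state)
        if key not in TRANSITIONS:
            raise ValueError(
                f"cannot {action} task {self.name!r} in state {self.state.value}"
            )
        self.state = TRANSITIONS[key]
        return self.state
```

Keeping the allowed transitions in one table makes the "stop before delete" rule explicit and easy to audit.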

Flow creation: on the flow design page, a user designs a custom flow by editing the workflow name, the workflow description and the application modules.

Flow editing is used to redesign an existing flow; both flow editing and flow creation can be customized as required.

Flow deletion: when too many flows have been created, flows that have not been used for a long time or are no longer used can be deleted. Deletion takes effect immediately and cannot be undone.

Flow publishing checks whether the designed flow is valid; an invalid flow cannot be published. Any invalid settings in the flow must be edited and corrected, after which the flow is saved and published again. The system then checks the design again and, if no problem is found, deploys the flow directly. This cycle repeats until the flow is deployed successfully.

Flow monitoring shows the status of flows and the progress of tasks, so that administrators can follow flow progress in real time and manage it accordingly. Flow monitoring includes workflow monitoring, in which all published flows are shown and all task details of each flow can be viewed; all task details of the current flow are shown, and they are displayed dynamically in graphical form.

User management mainly comprises user list display, adding users, deleting users and querying users. The system allows administrators to add, edit, delete and query users, and supports maintaining information such as user name, password, gender, real name, e-mail address, mobile phone number, affiliated organization, educational background, specialty and research field.

Role management mainly comprises adding roles, editing roles, deleting roles, querying roles and adding users to roles. It provides administrators with the function of maintaining role information, including defining role names and role descriptions, and allows the list of users under a role to be set.

Organization management registers and maintains information about organizations. It mainly includes registering organization users (IP terminal settings, organization administrator name, contact information, etc.), user basic information, user agreements, user permissions, and administrator operations (adding, editing, viewing and deleting).

System parameter setting provides administrators with functions such as data source management, data type management, subscription protocol management, data provider management, public service configuration and data responsible person management.

Permission setting provides different permissions for users and roles. Permissions in the system are divided into operation permissions and data management permissions. Operation permissions include entering a given menu and executing a given function. Data management permissions include data access, metadata access, full-text download, data download operating environment and the like. Authorization modes include default permission assignment by the system, manual assignment by system administrators, and permissions obtained after application to the data owner.

Log management records the service operation logs of all users of the management platform, such as user login, logout, workflow start, task start and task receipt. The system records the service data of all users, such as logs of adding, updating and deleting data, and can trace when and where a user added a specific piece of data, which workflow the user entered, which task was started, and which critical services were changed. The log provides a query function that supports querying by module and date range, as well as a log export function.

Operating environment/software management manages the software and hardware environments in which data runs and supports the long-term storage of data (data migration, operating environment emulation and the like). The operating environment configuration can be managed: (1) the software environment in which the data runs is stored in the system, and testing and running the data in that software environment is supported; (2) the hardware environment required for running the data is registered in the system.

Protocol management provides management and maintenance of the protocol information of cooperating organizations, including online protocol submission, protocol confirmation, validity management, protocol status management, and management of the mapping between a protocol and its corresponding resource library, resource type and the like, so that a system administrator can manage the status of each organization in a unified manner. The main functions include entering, adding, deleting and modifying information for acquisition protocols, acceptance protocols, storage protocols, audit protocols, public service protocols and cooperative storage protocols.

Plug-in management provides simple and complex queries for plug-ins, supports updating plug-ins by version, and supports deleting plug-ins. The main plug-ins to be implemented include: the SIP package virus test plug-in, the SIP package decompression plug-in, the SIP package quantity check plug-in, the SIP package content check plug-in, the relational database to RDF plug-in, the SIP package split plug-in, the SIP package data to RDF plug-in, the full-text retrieval test plug-in, and the like.

Statistical analysis management compiles statistics on suppliers, collection, acceptance, intake and audit, and provides visual chart display and data list display.

Content management mainly comprises column management and news management, including creating, editing and deleting columns, and adding, editing, deleting and reviewing news.

The medical big data long-term storage system provided by the embodiment of the invention is based on the development requirements of resource construction and information services in the medical field. It takes into account the new trend in the big data era whereby the objects of resource preservation are shifting from traditional paper literature to data resources of various types, and it aims at sound medical information assurance and service capability. The system is built around the conceptualization, acquisition, reception, ingestion, storage and access of digital resources for preservation, so as to ensure the authenticity, integrity, availability and long-term interpretability of medical data and to provide a strong guarantee for the long-term storage and utilization of important medical resources.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims

1. A medical big data long-term preservation system, characterized in that the system comprises:

2. The system according to claim 1, characterized in that said infrastructure layer is specifically configured for:

3. The system of claim 1, wherein the acquisition receiving layer comprises:

4. The system of claim 3, wherein the data acquisition module is specifically configured to:

5. The system of claim 3, wherein the data receiving module is specifically configured to:

6. The system of claim 3, wherein the data intake module is specifically configured to:

7. The system of claim 3, further comprising:

8. The system of claim 1, wherein the data store layer comprises a metadata store, a business data store, and a file store, and the data store layer is specifically configured to:

9. The system of claim 1, wherein the data application layer is specifically configured to:

10. The system of claim 1, wherein the data service layer is specifically configured to: