CN111737216A

CN111737216A - Data user environment, data governance method, and computer-readable storage medium

Info

Publication number: CN111737216A
Application number: CN201910974971.9A
Authority: CN
Inventors: 沈唐秀蓉
Original assignee: CETC Information Science Research Institute
Current assignee: CETC Information Science Research Institute
Priority date: 2019-03-25
Filing date: 2019-10-14
Publication date: 2020-10-02
Also published as: CN111737215A

Abstract

The application provides a data user environment of a multi-user collaborative data management system, comprising: one or more data connectors; one or more data directories; one or more data sets; one or more collaborators and a data user environment service. The application also provides a data governance method, computer equipment for implementing the data governance method and a computer readable storage medium.

Description

Data user environment, data governance method, and computer-readable storage medium

Technical Field

The present application relates to data processing technologies, and in particular, to a data user environment, a data governance method, and a computer-readable storage medium.

Background

In many countries with Information Technology (IT) infrastructure, most government departments or organizations do not typically share or exchange data, although they may operate digitally. These data silos not only lower productivity within each organization but also inconvenience the public. For example, a person may need to visit multiple government offices to obtain certain certificates. As another example, without access to certain government data, private entities that build application services for the public may be unable to implement some useful product functions. On the other hand, most government departments currently rely on cumbersome paper-based approval procedures to prevent illegal use of data and to protect citizens' privacy. Even after data access is granted, staff must manually filter and transform sensitive data at export time to enforce security and privacy access control. The exported data, which is already out of date, is then placed in a shared location or converted into a portable form for download. Because of the high labor cost and the security and privacy concerns, government organizations are often reluctant to share their data.

Currently, many cities in both developing and developed regions are carrying out smart city projects. An important component of a smart city is cross-organization, cross-regional government data sharing and exchange, which is typically managed and served by data hosting services. Each government organization sends its sharable data to the data hosting service provider. The provider is responsible for managing data assets, building data catalogs, establishing digital data subscription approval processes, implementing security and privacy access control for data access, and auditing all data usage. However, many government agencies have so far been unwilling to hand their data over to data hosting service providers, because these providers cannot effectively manage the security and privacy of the data or monitor its usage. Government agencies worry that once the data is handed over, they will no longer be able to control how it is used. To solve this problem, a data hosting service provider needs a multi-tenant platform on which every government organization can manage its own data, set security and privacy access control rules for data access, and share data through a secure publish-and-subscribe mechanism. However, these service providers cannot find a suitable solution on the market. Currently, most data hosting service providers only provide government data catalogs for private or public browsing. Applying for access to government data remains a manual process, and the application flow still involves paper approval by the data service provider and by the data owner, which is inefficient. Furthermore, the downloadable data is mainly a statistical summary rather than the actual data, and real-time data is not available at all.

In addition to government smart city projects, clinical research still relies heavily on paper documents and mail exchange, even in the internet era. The clinical studies described herein involve protocol design, participant screening, protocol review and execution, and the like. The results of these clinical studies are forwarded to medical researchers and pharmaceutical companies for processing. If successful, the final result is submitted to regulatory and supervisory authorities for review and approval prior to commercialization. However, this process is not fully transparent to all participants, from the collected data to the analytical methods to the final approval step, and there is no simple or systematic way to incorporate data from other relevant studies in order to see potential side effects or benefits and understand the results more fully. This problem is by no means unique to clinical trials. While some studies may involve participants submitting data to websites, in many cases these websites are taken offline once the studies are completed, making the data unavailable to future researchers and potential collaborators. In recent years, attempts have been made to establish collaborative clinical trial research networks through which data can be shared. However, security and privacy management remains a major issue. Currently, there are no products on the market that are specifically designed to support such applications.

In addition to clinical research, universities and research institutes also generate large amounts of biological and other scientific data. Many research groups have published some of their research results on the web for sharing. However, it is not easy for researchers to find data related to their own field, because the data is scattered across the internet without a searchable subject directory, and traditional scientific publishers still dominate the publication channels, limiting the ways in which researchers can publish and share data.

Current data asset management and data sharing products are developed primarily from Business Intelligence (BI) products or data warehouse Extract, Transform, and Load (ETL) products. These traditional products are typically used by enterprises with centralized data control, in which IT administrators are responsible for managing all data.

Fig. 1 shows the structure of a conventional data management system. As shown in Fig. 1, in the conventional data management product 0100, an IT administrator 0101 first extracts, cleanses, and transforms corporate data to produce accurate, error-free data. The IT administrator then connects the collated data 0102-a, 0102-b, and 0102-c to the platform as data sources to be managed. In this platform, data source objects 0121-a, 0121-b, and 0121-c are logical entities created to manage the actual data sources 0102-a, 0102-b, and 0102-c. In addition, the IT administrator builds and manages a static data directory 0105, which lists all data sources connected to the platform.

Some of these conventional data management systems also support virtual data sources 0126, 0127. The virtual data source may combine data from multiple data sources or may present a subset of the real data sources. In these data management systems, these virtual data sources 0126, 0127 may be served by creating virtual data servers 0129-d, 0129-e. These virtual data sources are also listed in the data directory 0105 above.

An IT administrator may manually create virtual data servers 0129-a, 0129-b, 0129-c, 0129-d, 0129-e to serve each data source 0121-a, 0121-b, 0121-c, 0126, 0127 in a directory. For each virtual data server 0129-a, 0129-b, 0129-c, 0129-d, 0129-e, the IT administrator will create a granular access control policy for the user/user group. For example, for a virtual data server 0129-a mapped to a data source 0121-a, the IT administrator may configure which data user/data user group may access which row and which column of data, and which data must be masked for which user/user group, etc. The virtual data server enforces an access control policy for all data users accessing its corresponding data source.

In the above conventional data management system, the data consumer 0130 may browse 0131 the data directory 0105 to find a data source 0121-a, 0121-b, 0121-c, 0126, or 0127 and the information of its virtual data server 0129-a, 0129-b, 0129-c, 0129-d, or 0129-e. The data consumer 0130 may then connect 0132 to that virtual data server to request data. The virtual data server extracts data from the actual data source through the corresponding data source object 0121-a, 0121-b, 0121-c, 0126, or 0127 and transforms the data according to the requester's credentials and the granular access control policy before returning the data to the requester.

It can be seen that in these conventional data management systems, all data and all data usage (e.g., security and privacy controls) are centrally managed and controlled by IT administrators. Distributed participants cannot manage their own data sources, perform their own cleansing and transformation, control the sharing of their data sources, or set their own security and privacy control rules. Moreover, performing security management through virtual data servers is impractical in terms of computing resources and system management when the amount of data is large. These existing solutions are therefore not practical for the use cases described above, e.g., cross-organization, cross-regional government data sharing and exchange in smart cities, and most government organizations remain reluctant to hand their data over to the IT administrators of data hosting service providers. A new and secure data sharing scheme is thus urgently needed in smart cities, clinical research, scientific research, and similar fields.

Disclosure of Invention

An embodiment of the present application provides a data user environment of a multi-user collaborative data management system, where the data user environment includes:

one or more data connectors;

one or more data directories;

one or more data sets;

one or more collaborators; and

a data user environment service to:

associating each of the one or more data sets with a data item, wherein the data item is from one of the one or more data connectors;

associating each of the one or more data sets with a subscription data item, wherein the subscription data item is subscribed from one of the one or more data directories;

associating each of the one or more data sets with a published data set, wherein the published data set is the data set as published to one of the one or more data directories through a publishing process; and

associating each of the one or more collaborators with one or more of the data sets using permissions.

Wherein the data user environment service is further configured to: receive a user request to register, in the data user environment, a data item selected by a data user, wherein the data item is a data item selected from the data connector or a subscription data item selected from the data directory; create a data set; and associate the data set with the data item.

Wherein the data user environment service is configured to associate the data set with the data item from the data connector by linking the data set to the data item.

Wherein the data user environment service is configured to associate the data set with the subscription data item from the data directory by linking the data set to the subscription data item.
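The registration step described above can be sketched as follows. This is only a minimal illustration: the class names `DataItem`, `DataSet`, and `DataUserEnvironment`, the `origin` labels, and the integer identifiers are assumptions made for the example and are not defined by the application.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict

_ids = count(1)  # hypothetical identifier generator


@dataclass(frozen=True)
class DataItem:
    """An item exposed by a data connector, or a subscription data item."""
    name: str
    origin: str  # "connector" or "directory" (labels assumed for this sketch)


@dataclass
class DataSet:
    dataset_id: int
    linked_item: DataItem  # the association is a link, not a copy of the data


@dataclass
class DataUserEnvironment:
    owner: str
    data_sets: Dict[int, DataSet] = field(default_factory=dict)

    def register(self, item: DataItem) -> DataSet:
        """Create a data set and associate it with the selected data item."""
        if item.origin not in ("connector", "directory"):
            raise ValueError(f"unknown origin: {item.origin}")
        data_set = DataSet(dataset_id=next(_ids), linked_item=item)
        self.data_sets[data_set.dataset_id] = data_set
        return data_set


env = DataUserEnvironment(owner="user-1")
ds = env.register(DataItem(name="residents_table", origin="connector"))
```

In this sketch the data set only holds a reference to the data item, which mirrors the idea that registration links to the underlying data rather than copying it.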

The data user environment further comprises: a subscription service configured to receive a reference to a published data set selected by a data user in the data directory; create a subscription data item in the data user environment; and associate the subscription data item with the published data set by linking the subscription data item to the published data set.

The subscription service is further configured to, before the subscription data item is created in the data user environment: obtain the subscription approval process of the published data set; send an approval request to each approver designated by the subscription approval process; and receive the approval response of the approver.
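The subscription flow with its approval step might look like the sketch below. The names `PublishedDataSet`, `SubscriptionDataItem`, and `subscribe`, and the modelling of an approver as a callable that returns a boolean, are assumptions for illustration; the application does not prescribe how approvers are represented or how approval requests are delivered.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical approver: (subscriber, data set name) -> approved?
Approver = Callable[[str, str], bool]


@dataclass
class PublishedDataSet:
    name: str
    approvers: List[Approver] = field(default_factory=list)


@dataclass
class SubscriptionDataItem:
    subscriber: str
    published: PublishedDataSet  # link back to the published data set


def subscribe(subscriber: str,
              published: PublishedDataSet) -> Optional[SubscriptionDataItem]:
    """Run the approval process first; create the item only if all approve."""
    for approve in published.approvers:
        if not approve(subscriber, published.name):
            return None  # request rejected, nothing is created
    return SubscriptionDataItem(subscriber, published)


trusted = {"user-2"}
census = PublishedDataSet("census", [lambda who, _name: who in trusted])
granted = subscribe("user-2", census)
denied = subscribe("user-3", census)
```

Here the subscription data item is created only after every designated approver has responded positively, matching the ordering stated above.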

The data user environment further comprises: a publishing service configured to, when a data set is published to a data directory: receive a reference to the data set that the data user has selected for publication; verify whether the data set is publishable; receive references to a data directory and a category selected by the data user; receive a reference to the part or all of the content of the data set that the data user has selected for publication; receive metadata provided by the data user; and present the selected data set and the provided information as a published data set in the data directory under the selected category.

Wherein the metadata includes role-based security and privacy access control rules defined by the data user.

The metadata comprises a subscription approval process defined by a data user.
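The publishing steps above can be sketched roughly as follows. The names `DataSet`, `PublishedEntry`, `DataDirectory`, and `publish`, the `publishable` flag, and the representation of "part of the content" as a list of column names are assumptions introduced for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataSet:
    name: str
    columns: List[str]
    publishable: bool = True  # hypothetical flag checked before publishing


@dataclass
class PublishedEntry:
    source: DataSet
    columns: List[str]            # the part of the content being published
    metadata: Dict[str, object]   # e.g. access rules, approval process


@dataclass
class DataDirectory:
    categories: Dict[str, List[PublishedEntry]] = field(default_factory=dict)


def publish(data_set: DataSet, directory: DataDirectory, category: str,
            columns: List[str], metadata: Dict[str, object]) -> PublishedEntry:
    """Verify, then list the selected content under the chosen category."""
    if not data_set.publishable:
        raise PermissionError(f"{data_set.name} is not publishable")
    unknown = [c for c in columns if c not in data_set.columns]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    entry = PublishedEntry(data_set, list(columns), dict(metadata))
    directory.categories.setdefault(category, []).append(entry)
    return entry


directory = DataDirectory()
households = DataSet("households", ["district", "size", "income"])
entry = publish(
    households, directory, "demographics",
    columns=["district", "size"],
    metadata={"rules": {"researcher": "read"}, "approval": ["owner"]},
)
```

The metadata dictionary is where role-based access control rules and a subscription approval process would travel with the published data set in this sketch.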

The data user environment further comprises: a collaboration service configured to receive a collaborator and a data set selected by a data user; add the collaborator to the data set; set a usage permission for the collaborator to use the data set; set specific security and privacy access control rules governing the collaborator's access to the content of the data set; create a new data set in the collaborator's data user environment; and link the new data set to the data set.

Wherein the specific security and privacy access control rules are personalized security and privacy access control rules.

The data user environment further comprises:

one or more project containers; and

a project container manager configured to receive a selection of one or more data sets, and to associate each of the one or more data sets with a project container by adding the data set to the project container.

Wherein the project container manager is further configured to receive an instruction from a data user to create a data processing pipeline in the project container.

Wherein the project container manager is further configured to receive a reference to a data processing program; upload the data processing program to the project container if the data processing program does not exist in the system; and add the data processing program to the project container if the data processing program already exists in the system.

Wherein the project container manager is further configured to associate each of the one or more collaborators with a project container using permissions.

Wherein the project container manager is further configured to receive a selection of one or more collaborators, and to associate the selected one or more collaborators with a project container by: adding the selected one or more collaborators to the project container; adding the project container to the data user environment of each selected collaborator; and, for each data set associated with the project container, causing the collaboration service to add the selected one or more collaborators to that data set.
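The three actions involved in adding collaborators to a project container can be sketched as below. The class names, the `add_collaborators` function, and the string-valued `permission` are hypothetical; the sketch also folds the collaboration service's work into a single loop, which is a simplification rather than the structure described in the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class DataSet:
    name: str
    collaborators: Dict[str, str] = field(default_factory=dict)  # user -> permission


@dataclass
class ProjectContainer:
    name: str
    data_sets: List[DataSet] = field(default_factory=list)
    collaborators: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataUserEnvironment:
    owner: str
    containers: List[ProjectContainer] = field(default_factory=list)


def add_collaborators(container: ProjectContainer,
                      environments: Dict[str, DataUserEnvironment],
                      selected: Iterable[str],
                      permission: str = "read") -> None:
    for user in selected:
        # 1. add the collaborator to the project container
        container.collaborators[user] = permission
        # 2. make the container visible in the collaborator's environment
        if container not in environments[user].containers:
            environments[user].containers.append(container)
        # 3. add the collaborator to every data set in the container
        for data_set in container.data_sets:
            data_set.collaborators.setdefault(user, permission)


envs = {u: DataUserEnvironment(u) for u in ("user-1", "user-2")}
project = ProjectContainer("traffic-study", [DataSet("sensors")])
envs["user-1"].containers.append(project)
add_collaborators(project, envs, ["user-2"])
```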

The data user environment further comprises: a data profiling service configured to receive a selection of the data portion or data field of the data set to be examined; receive a selected data profiling method; perform the data profiling method on the data portion or data field; and generate a data profiling result.
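As a rough illustration of the profiling flow (select a field, select a method, run it, return a result), consider the sketch below. The three profiling methods and the `PROFILERS` registry are assumptions chosen for the example; the application does not enumerate specific profiling methods here.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

# Hypothetical registry of profiling methods, keyed by name.
PROFILERS: Dict[str, Callable[[List[Any]], Any]] = {
    "null_ratio": lambda values: sum(v is None for v in values) / len(values),
    "distinct_count": lambda values: len({v for v in values if v is not None}),
    "mean": lambda values: mean(v for v in values if v is not None),
}


def profile(rows: List[Dict[str, Any]], field_name: str, method: str) -> Dict[str, Any]:
    """Run the selected profiling method on one field of the data set."""
    values = [row.get(field_name) for row in rows]
    if not values:
        raise ValueError("data set is empty")
    return {
        "field": field_name,
        "method": method,
        "result": PROFILERS[method](values),
    }


rows = [{"age": 30}, {"age": None}, {"age": 50}, {"age": 40}]
null_report = profile(rows, "age", "null_ratio")
mean_report = profile(rows, "age", "mean")
```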

The data user environment further comprises: a data lineage service configured to receive a reference to a data set and create a data lineage graph, wherein the data lineage graph includes one or more ancestor data sets of the data set and one or more descendant data sets of the data set; the data content of the data set is derived from the data content of the one or more ancestor data sets; and the data content of each descendant data set is derived from the data content of the data set.
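One way to picture the lineage graph is as a traversal over "derived from" relationships in both directions. The sketch below assumes that those relationships are available as a simple mapping from each data set to the data sets it was derived from; that representation and the function names are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search returning every node reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def lineage_graph(data_set: str,
                  derived_from: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Collect ancestors (upstream) and descendants (downstream) of a data set."""
    # Invert the mapping so that descendants can be found the same way.
    derives: Dict[str, List[str]] = {}
    for child, parents in derived_from.items():
        for parent in parents:
            derives.setdefault(parent, []).append(child)
    return {
        "ancestors": reachable(data_set, derived_from),
        "descendants": reachable(data_set, derives),
    }


# "A" is derived from "B" and "C"; "B" from "D"; "E" from "A"; "F" from "E".
derived_from = {"A": ["B", "C"], "B": ["D"], "E": ["A"], "F": ["E"]}
graph = lineage_graph("A", derived_from)
```

With this input, the ancestors of `A` are `B`, `C`, and `D`, and its descendants are `E` and `F`.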

The data user environment service is further configured to, when a data user or an application of the data user initiates a data access request to a data set, obtain, from a virtual data set service subsystem of the data sharing system, a virtual data set corresponding to an original data set related to the data access request, and return the virtual data set to the data user or the application of the data user.

The embodiment of the application also provides a multi-user cooperative data governance method, which comprises the following steps:

associating each of the one or more data sets with a subscription data item, wherein the subscription data item is subscribed from one of one or more data directories;

associating each of the one or more collaborators with one or more of the data sets using permissions.

The above method further comprises: receiving a user request to register, in the data user environment, a data item selected by a data user, wherein the data item is a data item selected from the data connector or a subscription data item selected from the data directory; creating a data set; and associating the data set with the data item.

The above method further comprises: receiving a reference to a published data set selected by a data user in the data directory; creating a subscription data item in the data user environment; and associating the subscription data item with the published data set by linking the subscription data item to the published data set.

Prior to creating the subscription data item in the data user environment, the method further comprises: obtaining the subscription approval process of the published data set; sending an approval request to each approver designated by the subscription approval process; and receiving the approval response of the approver.

The above method further comprises, when a data set is published to a data directory: receiving a reference to the data set that the data user has selected for publication; verifying whether the data set is publishable; receiving references to a data directory and a category selected by the data user; receiving a reference to the part or all of the content of the data set that the data user has selected for publication; receiving metadata provided by the data user; and presenting the selected data set and the provided information as a published data set in the data directory under the selected category.

The above method further comprises: receiving a collaborator and a data set selected by a data user; adding the collaborator to the data set; setting a usage permission for the collaborator to use the data set; setting specific security and privacy access control rules governing the collaborator's access to the content of the data set; creating a new data set in the collaborator's data user environment; and linking the new data set to the data set.

The above method further comprises: receiving a selection of one or more data sets; and associating each of the one or more data sets with a project container by adding the data set to the project container.

The above method further comprises: receiving an instruction from a data user to create a data processing pipeline in the project container.

The above method further comprises: receiving a reference to a data processing program; uploading the data processing program to the project container if the data processing program does not exist in the system; and adding the data processing program to the project container if the data processing program already exists in the system.

The above method further comprises: associating each of the one or more collaborators with a project container using permissions.

The above method further comprises: receiving a selection of one or more collaborators; and associating the selected one or more collaborators with a project container by: adding the selected one or more collaborators to the project container; adding the project container to the data user environment of each selected collaborator; and, for each data set associated with the project container, causing the collaboration service to add the selected one or more collaborators to that data set.

The above method further comprises: receiving a data portion or data field selected in the data set to be examined; receiving a selected data profiling method; performing the data profiling method for the data portion or data field; and generating a data profiling result.

The above method further comprises: receiving a reference to a data set; and creating a data lineage graph, wherein the data lineage graph includes one or more ancestor data sets of the data set and one or more descendant data sets of the data set; the data content of the data set is derived from the data content of the one or more ancestor data sets; and the data content of each descendant data set is derived from the data content of the data set.

The above method further comprises: when a data access request is initiated to a data set by a data user or an application of the data user, a virtual data set corresponding to an original data set related to the data access request is obtained from a virtual data set service subsystem of the data sharing system, and the virtual data set is returned to the data user or the application of the data user.

Implementations of the present application also provide a non-transitory computer-readable storage medium, wherein the storage medium stores one or more instructions that, when executed by one or more processors, implement the above-described collaborative data governance method.

With the data user environment and the data governance method of the multi-user collaborative data management system, users can access, subscribe to, publish, and collaborate on data sets, which makes the use and sharing of data sets more convenient and more secure.

Furthermore, a virtual data set is created automatically whenever a data user attempts to access a data set, so that each data visitor is served by its own virtual data set (a one-to-one relationship). As a result, different data users can obtain different virtual data sets even when they access the same data set. That is, in the above scheme, the virtual data set performs online data transformation and filtering according to the personalized or role-based security and privacy access control rules that the data owner has defined for collaborators or subscribers.

Further, in embodiments of the present application, the virtual data set is created in real time upon receiving a data access request from a data user or from an application of the data user, and the data sharing system does not need to maintain the virtual data set once the data user is no longer using the data set. In other words, when no data user is accessing the data, the data sharing system has no virtual data sets to maintain. Such an approach can greatly reduce the system's memory, bandwidth, and computing resource consumption. In addition, because a virtual data set created in this way depends on both the accessed data and the visitor, the data actually obtained by different data users accessing the same data set may differ, which makes it much easier for the data owner to share data securely.
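The idea of a virtual data set that exists only for the duration of one access, and whose content depends on who is asking, can be sketched with a Python generator. Everything named here (`RULES`, `make_virtual_data_set`, `access`, the `"***"` mask, and the shape of a rule) is an assumption made for illustration and is not the mechanism defined by the application.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]

# Hypothetical per-user rules defined by the data owner.
RULES: Dict[str, Dict[str, Any]] = {
    "user-2": {
        "columns": ["name", "district", "income"],
        "masked": {"income"},
        "row_filter": lambda row: True,
    },
    "user-3": {
        "columns": ["district", "income"],
        "masked": set(),
        "row_filter": lambda row: row["district"] == "north",
    },
}


def make_virtual_data_set(rows: List[Row], rule: Dict[str, Any]) -> Iterator[Row]:
    """Lazily filter, project, and mask rows; nothing is stored or cached."""
    keep: Callable[[Row], bool] = rule["row_filter"]
    for row in rows:
        if keep(row):
            yield {
                col: ("***" if col in rule["masked"] else row[col])
                for col in rule["columns"]
            }


def access(user: str, rows: List[Row]) -> Iterator[Row]:
    """Create a fresh virtual data set for this request only."""
    return make_virtual_data_set(rows, RULES[user])


original = [
    {"name": "Ann", "district": "north", "income": 52000},
    {"name": "Bo", "district": "south", "income": 61000},
]
view_for_user_2 = list(access("user-2", original))
view_for_user_3 = list(access("user-3", original))
```

Because the generator is created per request and discarded afterwards, no per-user copy of the data persists between accesses, and the two users above receive different results from the same original data set.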

Another advantage of the embodiments of the present invention is that although all raw data sets may come from data sources of different types, different formats and/or different data access interfaces, by providing data in such a way that the above-mentioned association relationship is established, the data user environment according to the embodiments of the present invention can provide a data access interface with a uniform data format for all users to access data.

With this data user environment and data governance method, data users can use subscribed data sets as well as data sets shared by other data users, and can combine these data sets with their own to generate new data sets and reports. In addition, data users can publish the new data sets, and other data users can in turn use them to create further new data sets. In this way, the data sharing model allows novel and useful information to be created recursively.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of the internal structure of a conventional data management system;

FIG. 2 is a schematic diagram of the internal structure of a data sharing system according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an internal structure of a data user environment subsystem in the data sharing system according to the embodiment of the present application;

FIG. 4 is a schematic diagram illustrating a process of adding a data server according to an embodiment of the present application;

FIG. 5a is a schematic diagram illustrating a data set registration process according to an embodiment of the present application;

FIG. 5b is a flowchart of a method for adding collaborators to a data set according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram illustrating an internal structure of a data-sharing directory service subsystem in the data-sharing system according to the embodiment of the present application;

FIG. 7 is a schematic diagram of a data set publishing process according to an embodiment of the present application;

FIG. 8 is a schematic diagram illustrating a data set subscription process according to an embodiment of the present application;

FIG. 9 is a schematic diagram illustrating an internal structure of a virtual data service subsystem in the data sharing system according to the embodiment of the present application;

FIG. 10 is a diagram illustrating a process for initiating data set access according to an embodiment of the present application;

FIG. 11 is a schematic diagram illustrating a process for accessing a subscription data set or a shared data set according to an embodiment of the present application;

FIG. 12 is a diagram illustrating a process for accessing a directly owned data set according to an embodiment of the present application;

FIG. 13 is a schematic diagram illustrating an internal structure of a data sharing system according to another embodiment of the present application;

FIG. 14 is a flowchart illustrating a method for recursively generating a new data set, according to an embodiment of the present application;

FIG. 15 illustrates the internal structure of a data set object according to an embodiment of the present application;

FIG. 16 illustrates a data set data profiling service according to an embodiment of the present application;

FIG. 17 shows an example of data lineage of a dataset object (dataset A) of one user (user-1) according to one embodiment of the present application;

FIG. 18 illustrates an example of a data set data lineage service according to one embodiment of the present application;

FIG. 19 shows an example of the creation of an ancestor lineage graph by a data set data lineage service according to one embodiment of the present application;

FIG. 20 illustrates an example of the creation of a descendant lineage graph by a data set data lineage service according to one embodiment of the present application;

FIG. 21 illustrates an internal structure of a project container object according to one embodiment of the present application;

FIG. 22 illustrates an example of a process for adding collaborators via the project container collaborator management service according to one embodiment of the present application; and

FIG. 23 shows an example of a project container management service according to one embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For simplicity and clarity of description, the invention is described below through several representative embodiments. The numerous details in the embodiments are provided merely to aid understanding of the inventive solutions. It will be apparent, however, that the invention may be practiced without these specific details. To avoid unnecessarily obscuring aspects of the invention, some embodiments are not described in detail and are given only as frameworks. Hereinafter, "including" means "including but not limited to", and "according to ..." means "at least according to ..., but not limited to only ...". When the number of a component is not specified hereinafter, the component may be one or more, i.e., at least one.

The present application is directed to creating a collaborative environment in which a group of individuals and/or organizations may securely share their data and complete data analysis projects together. The members of such a group can process and analyze data using each other's data. They may also share the results of their work, including reports generated from existing data and new data sets.

The collaboration environment created by the present application has many use cases. For example, it can connect the participants of government smart city programs: through the collaboration environment, government organizations may collaboratively share, exchange, and analyze data, and may share information with private entities such as universities and corporations. As another example, the collaboration environment may facilitate collaboration in clinical research, where clinicians in hospitals, scientists in research institutions, researchers in pharmaceutical companies, and officials in regulatory agencies may share their data and collaborate on data analysis.

The collaboration environment created by the present application, namely the data sharing system, is described in detail below with reference to the accompanying drawings.

Fig. 2 shows an internal structure of a data sharing system 0200 provided by the embodiment of the present application. As shown in fig. 2, in some embodiments of the present application, the data sharing system 0200 may comprise: a data user environment subsystem 0221, a data sharing directory service subsystem 0230, and a virtual data set service subsystem 0240.

In some embodiments of the present application, the data user environment subsystem 0221 described above may provide an operating environment in which data users manage and process data and create project containers. Through the data user environment subsystem 0221, data users may register data, publish data, subscribe to data, process data, create project containers, and so on.

In some embodiments of the present application, the data user environment subsystem 0221 described above may comprise a plurality of data user environments 0221-a, 0221-b, 0221-c, which correspond one-to-one to data users 0203-a, 0203-b, 0203-c. The data user environment subsystem 0221 provides each data user with an operating environment for managing that user's own data and project containers. Each data user 0203-a, 0203-b, 0203-c can manage only its own data and project containers in its own data user environment, through its corresponding user interface 0212-a, 0212-b, 0212-c. Each data user environment 0221-a, 0221-b, 0221-c is isolated from all other data user environments.

In some embodiments of the present application, by means of the data sharing system 0200 described above, data users 0203-a, 0203-b, 0203-c can use their own data, data that other data users share 0213 with them, and data 0163 to which they subscribe. In addition, data users 0203-a, 0203-b, 0203-c can share 0213 their data with specific data users (e.g., collaborators), or publish their data to share 0226-a, 0226-b, 0226-c it with unknown data users. That is, the data sharing system 0200 allows data users 0203-a, 0203-b, 0203-c to share the data in their project containers and to cooperate with each other in processing and analyzing the data. As such, in embodiments of the present application, data users 0203-a, 0203-b, 0203-c may be both data consumers and data providers.

Furthermore, in some embodiments of the present application, the data owner may define personalized data security and privacy access control rules for collaborators in the sharing process 0213. The data owner may also define role-based data security and privacy access control rules in the data publication process 0226-a, 0226-b, 0226-c.

In some embodiments of the present application, the data sharing directory service subsystem 0230 may include a dynamic data directory module 0233, a data subscription service module 0231, and a data publication service 0232. The dynamic data directory module 0233 includes one or more dynamic data directories. Each dynamic data directory contains a plurality of categories, and each published data set is linked to one or more categories. Each published data set has a set of role-based data security and privacy access control rules defined by the data owner. The data subscription service module 0231 manages the process by which data users subscribe to published data sets.

In some embodiments of the present application, the virtual data set service subsystem 0240 is configured to identify a role of a data user (data visitor), retrieve associated personalized data security and privacy access control rules or role-based data security and privacy access control rules 0234 set by a data owner for a particular data set and data visitor, and create a virtual data set for a data access request according to the retrieved personalized data security and privacy access control rules or role-based data security and privacy access control rules.

In some embodiments of the present application, the virtual data set service subsystem 0240 described above includes a virtual data set 0235 and a virtual data access interface 0242. The virtual data access interface 0242 is the interface through which a data user accesses a data set. When an application project of a data user initiates an access request to a data set 0222-a, 0222-b, 0222-c, a data access connection 0224-a, 0224-b, 0224-c is established between the data set and the project container. In the embodiment of the present application, the application projects of the data user may specifically refer to external applications 0213-a, 0213-b, 0213-c and applications inside project containers 0223-a, 0223-b, 0223-c.

If the accessed data set is the user's own data set, the associated data set object 0222-a, 0222-b, 0222-c connects 0225-a, 0225-b, 0225-c to the virtual data access interface 0242 of the virtual data set service subsystem 0240. The virtual data set service subsystem 0240 then creates a virtual data set 0235 that is connected to the actual data in the user data sources 0211-a, 0211-b, 0211-c.

If the accessed data set is a shared or subscribed data set, the virtual data set service subsystem 0240 identifies the role of the user after the associated data set object 0222-a, 0222-b, 0222-c connects 0225-a, 0225-b, 0225-c to the virtual data access interface 0242 of the virtual data set service subsystem 0240. If the data user is a collaborator, the virtual data set service subsystem 0240 obtains the personalized security and privacy access control rules defined by the data owner from the shared data set itself. If the data user is a subscriber, the virtual data set service subsystem 0240 obtains 0234 the subscriber-role-based security and privacy access control rules defined by the data owner from the settings in the directory.

In an embodiment of the present application, the virtual data set service subsystem creates a virtual data set 0235 for each data access request. That is, each combination of data set and data access user has its own virtual data set 0235 when a data access request is initiated. The virtual data set can be deleted after the data access is completed; consequently, when no data user is accessing data, the virtual data set 0235 in the virtual data set service subsystem 0240 may be empty.
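The rule selection described above (own data set, collaborator, or subscriber) can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the names `DataSet` and `resolve_rules`, and the representation of rules as lists of strings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataSet:
    """Hypothetical stand-in for a registered data set object (e.g. 0222-a)."""
    owner: str
    # Personalized rules the owner attached to the data set per collaborator.
    collaborator_rules: Dict[str, List[str]] = field(default_factory=dict)
    # Role-based rules kept in the directory settings per subscriber role.
    subscriber_role_rules: Dict[str, List[str]] = field(default_factory=dict)


def resolve_rules(data_set: DataSet, visitor: str,
                  subscriber_role: Optional[str] = None) -> List[str]:
    """Select the rule set that applies to one data access request."""
    if visitor == data_set.owner:
        return []  # own data set: connect to the actual data, no restrictions
    if visitor in data_set.collaborator_rules:
        return data_set.collaborator_rules[visitor]
    if subscriber_role in data_set.subscriber_role_rules:
        return data_set.subscriber_role_rules[subscriber_role]
    raise PermissionError(f"{visitor} has no access to this data set")
```

In this sketch the collaborator check precedes the subscriber check; the description does not state an order of precedence, so that ordering is an assumption.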

How to implement data sharing by the above-described data sharing system 0200 will be described in detail below with reference to the drawings and specific examples.

In an embodiment of the present application, a data user 0203-a, 0203-b, 0203-c may select one or more data items from its data sources 0211-a, 0211-b, 0211-c and register them in its own data user environment 0221-a, 0221-b, 0221-c. A registered data item is referred to as a data set 0222-a, 0222-b, 0222-c. Registered data sets are used for application projects and for sharing, and the present system manages and controls their use. The data set owner may share 0213 a data set directly with collaborators, or publish the data set through a data directory to share 0226-a, 0226-b, 0226-c it with unknown users. Data sets shared by other users and subscribed data sets are also contained in the current data user's environment as data sets 0222-a, 0222-b, 0222-c. In its own data user environment, a data user may create project containers 0223-a, 0223-b, 0223-c. A project container 0223-a, 0223-b, 0223-c may access one or more data sets 0222-a, 0222-b, 0222-c. It can clean, convert, and filter existing data sets to create new data or data sets, and can analyze the data to generate reports. As previously described, each data user's data user environment 0221-a, 0221-b, 0221-c is isolated from all other data user environments; however, different data users can share 0213 the data in their project containers and collaborate with each other. In this case, the selected collaborators can see the shared data and projects.

In conventional systems, data sources are centrally managed, whereas in the embodiment of the present application, data sources 0211-a, 0211-b, 0211-c are managed by the individual data users 0203-a, 0203-b, 0203-c themselves. Each data user 0203-a, 0203-b, 0203-c has a corresponding data user environment 0221-a, 0221-b, 0221-c. Data users can manage their data and projects only in their own environment, through user interfaces 0212-a, 0212-b, 0212-c.

In addition, data users 0203-a, 0203-b, 0203-c may share data sets through collaboration 0213 or publish 0226-a, 0226-b, 0226-c them to the dynamic data directory module 0233 of the data sharing directory service subsystem 0230. When a data user selects a collaborator, the data user defines a specific set of personalized security and privacy access control rules for that collaborator. During publication 0226-a, 0226-b, 0226-c of the data, the data user 0203-a, 0203-b, 0203-c selects categories and directories, enters metadata, defines security and privacy access control rules based on subscriber roles, defines the subscription approval process, and so on. The personalized or role-based security and privacy access control rules define the different collaborator or subscriber roles and their corresponding data usage permissions, for example: which subscriber roles can view a published data set in a directory; which portions of the data must be filtered out for which collaborator or subscriber roles; which portions of the data must be masked or transformed for which collaborator or subscriber roles (data is transformed using an owner-defined transformation formula); and the access time range permitted for a subscriber role. The subscription approval process defines the approvers of data subscription requests, the approval sequence, and the like.

In the above example, the data users 0203-a, 0203-b, 0203-c act as data providers. Through the data sharing system 0200, each of them manages its own data and can share that data with collaborators or publish it for subscription by unknown data users. Each can also specify the roles of subscribers that access its data, together with role-based security and privacy access control rules, to ensure security during data sharing.
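As a hedged illustration of the kinds of rules listed above (filtering, masking, owner-defined transformation, and an access time range), the following Python sketch applies one rule set to a list of rows. The class `AccessRules`, the function `apply_rules`, the row representation as dictionaries, and the mask string `"***"` are all assumptions made for this example and do not appear in the application.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass
class AccessRules:
    """Hypothetical rule set defined by an owner for one collaborator or role."""
    row_filter: Optional[Callable[[Row], bool]] = None      # rows to keep
    masked_columns: Tuple[str, ...] = ()                     # values to hide
    transforms: Dict[str, Callable[[Any], Any]] = field(default_factory=dict)
    access_window: Optional[Tuple[datetime, datetime]] = None


def apply_rules(rows: List[Row], rules: AccessRules, now: datetime) -> List[Row]:
    """Return the view of `rows` that the rules permit at time `now`."""
    if rules.access_window is not None:
        start, end = rules.access_window
        if not (start <= now <= end):
            raise PermissionError("outside the permitted access time range")
    view: List[Row] = []
    for row in rows:
        if rules.row_filter is not None and not rules.row_filter(row):
            continue  # filtered out for this collaborator or role
        out = dict(row)  # copy, so the owner's original data is never changed
        for col in rules.masked_columns:
            if col in out:
                out[col] = "***"
        for col, formula in rules.transforms.items():
            if col in out:
                out[col] = formula(out[col])  # owner-defined formula
        view.append(out)
    return view
```

For instance, an owner might keep only adult records, mask the `name` column, and round ages down to the decade by passing `row_filter=lambda r: r["age"] >= 18`, `masked_columns=("name",)`, and `transforms={"age": lambda a: a // 10 * 10}`.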

On the other hand, data users 0203-a, 0203-b, 0203-c may also act as data consumers and view or subscribe to data shared by other data users. In the embodiment of the present application, when a data user 0203-a, 0203-b, 0203-c browses or searches 0262 the dynamic data directory module 0233 and finds a published data set of interest, the data user can subscribe 0263 to the data set through the data subscription service module 0231. After receiving the data subscription request, the data subscription service module 0231 may send a data subscription request 0265 to a specific data user (an approver) according to the subscriber-role-based security and privacy access control rules and the subscription approval process defined by the data owner; for example, it may send the request to a data manager designated by the data owner for approval. After the data subscription request is approved, the subscribed data set is added to the data user's data sets 0222-a, 0222-b, 0222-c. Thereafter, each time an application project of the data user initiates access to the subscribed data set, the virtual data set service subsystem 0240 creates a virtual data set 0235 for that data user. That is, a data user accessing a subscribed data set actually accesses the virtual data set 0235 that the virtual data set service subsystem 0240 creates for that data user from the data set. Similarly, when an application project of a collaborator 0213 initiates access to a shared data set, the virtual data set service subsystem 0240 creates a virtual data set 0235 to serve the collaborator. For the specific data access procedure, refer to the foregoing description.
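The subscription flow described above can be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: `PublishedDataSet`, `subscribe`, and the `decide` callback (which stands in for a human approver's decision) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class PublishedDataSet:
    """Hypothetical directory entry for one published data set."""
    name: str
    visible_roles: Set[str]   # subscriber roles allowed to view the entry
    approvers: List[str]      # approval sequence defined by the data owner


def subscribe(entry: PublishedDataSet, subscriber: str, role: str,
              decide: Callable[[str, str], bool],
              subscriptions: Dict[str, List[str]]) -> bool:
    """Route a subscription request through the owner's approval sequence."""
    if role not in entry.visible_roles:
        return False  # this role cannot see the data set in the directory
    for approver in entry.approvers:
        # Approvers are consulted in order; one rejection ends the process.
        if not decide(approver, subscriber):
            return False
    # Approved: the data set is added to the subscriber's own data sets.
    subscriptions.setdefault(subscriber, []).append(entry.name)
    return True
```

The sketch treats approval as strictly sequential; other approval policies are possible and are not excluded by the description.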

In addition to the data users 0203-a, 0203-b, 0203-c described above, in embodiments of the present application, the data sharing system 0200 described above may be provided with a system administrator 0201 to manage the software and hardware of the system, manage user accounts, and assign user roles to the data users 0203-a, 0203-b, 0203-c.

Further, the above-described data sharing system 0200 may also be provided with a directory administrator 0202 to manage the data sharing directory service subsystem 0230. The directory administrator 0202 creates and manages directories, creates and manages the categories in a directory, defines keyword tags for the categories, and the like.

In some embodiments of the present application, the data sharing system 0200 shown in fig. 2 comprises three subsystems, namely, the data user environment subsystem 0211, 0221, the data sharing directory service subsystem 0230, and the virtual data set service subsystem 0240. Each subsystem can be implemented as a service package on any physical computing device, any virtual computing device, any software deployment container, or any cloud Platform as a Service (PaaS). In some embodiments of the present application, the software deployment container may also be referred to as a PaaS container, for example, the open-source application container engine Docker. The cloud PaaS may employ, for example, Amazon Web Services (AWS). Hereinafter, the PaaS container and the cloud PaaS are collectively referred to as PaaS for convenience of description.

In addition, in some embodiments of the present application, the subsystems may be implemented by a combination of physical computing equipment, virtual computing equipment, and PaaS in a variety of different deployment manners. For example, the three subsystems may be deployed on three different computing devices (physical or virtual) or PaaS, respectively. For another example, two of the three subsystems may be implemented by the same device (physical computing device or virtual computing device) or the same PaaS, and the other subsystem may be implemented by the other device (physical computing device or virtual computing device) or the other PaaS.

In some embodiments of the present application, the subsystem may include a Command Line Interface (CLI) and/or an Application Programming Interface (API) to enable user interaction with an application. In addition, the subsystem may further include a Graphical User Interface (GUI) to facilitate user interaction with the application. Where the subsystems described above include a GUI, the GUI will be considered to be the front-end module, while the CLI, API, and the collection of other functional modules will be considered to be the back-end module. In some embodiments of the present application, the front-end module and the back-end module in any of the above subsystems may be implemented as one service package by a combination of any physical computing device, virtual computing device, and PaaS. In other embodiments of the present application, the front-end module and the back-end module in any of the above subsystems may be separately deployed. For example, the front-end module may be implemented using a WEB server or WEB cluster; the back-end module can be realized by the combination of physical computing equipment, virtual computing equipment and PaaS under any cluster or non-cluster scene. In this case, the front-end module will communicate with the back-end module through the API. Furthermore, in some embodiments of the present application, the front-end module and the back-end module may also be configured to operate in the same or different network environments.

In a clustering scenario, some or all of the three subsystems may be implemented using a clustering technique. Each of the three subsystems, or some of them, can be configured as a set of instances that run in parallel. The clustering technique may include: tightly coupled clustering with or without shared storage, loosely coupled clustering, active-active clustering, active-passive clustering, MapReduce clustering (e.g., Hadoop), and so forth. The above subsystems may be implemented by any of these clustering techniques. Similarly, the computing device used may be any physical computing device, virtual computing device, or PaaS, and when deployed, the instances can be implemented by any combination of physical devices, virtual devices, and PaaS.

It should be noted that, in some embodiments of the present application, the deployment may be hosted in a conventional data center, a private cloud, a public cloud, a hybrid cloud, or a cloud formed by connecting multiple clouds. In a hybrid cloud, the public cloud serves as an extension of the private cloud or data center.

As can be seen from the above description, the above data sharing scheme differs substantially from conventional solutions. In a conventional solution, which is a centralized data sharing system, there is one virtual data server per data source, and a single security policy is enforced for all users accessing that data source (that is, one virtual data service serves multiple visitors). Although conventional solutions may provide role-based security and privacy access control, only a system administrator can set the access control rules for each data source. Conventional solutions do not allow individual data users to share their own data under security and privacy access control rules that they define themselves. In the scheme of the present application, when a data user or an application of the data user initiates a data access request, the virtual data set service subsystem determines the original data set to which the request relates and automatically creates, from that original data set, a virtual data set corresponding to the data user's request (that is, each visitor has its own virtual service when accessing a data set; one virtual data service serves one visitor). The virtual data set service subsystem then returns the created virtual data set. Thus, different collaborators or subscribers may obtain different virtual data sets 0235 even when they access the same data set, because the accessible data set content is limited according to the role of the collaborator or subscriber.
That is, in the above scheme, the virtual data set is created so that online data transformation and filtering are performed according to the personalized or role-based security and privacy access control rules defined by the data owner for collaborators or subscribers. After the data access is finished, the virtual data set service subsystem can delete the created virtual data set, thereby releasing the storage space it occupied.
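The per-request lifecycle described above (create a virtual data set when access begins, delete it when access ends) can be sketched in Python with a context manager. The class `VirtualDataSetService`, its `access` method, and the use of an integer request counter as the key are hypothetical choices for illustration only.

```python
import itertools
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


class VirtualDataSetService:
    """Hypothetical service that holds one virtual data set per access request."""

    def __init__(self) -> None:
        self._request_ids = itertools.count(1)
        self.active: Dict[int, List[Row]] = {}  # empty when nobody is accessing

    @contextmanager
    def access(self, source_rows: List[Row],
               rule: Callable[[Row], Row]) -> Iterator[List[Row]]:
        """Create a virtual data set for one request and delete it afterwards."""
        request_id = next(self._request_ids)
        # Transform copies of the rows online; the source data is left untouched.
        self.active[request_id] = [rule(dict(row)) for row in source_rows]
        try:
            yield self.active[request_id]
        finally:
            del self.active[request_id]  # release the space once access ends
```

Because every request receives its own entry, two visitors reading the same source with different rules would see different views, which is the one-service-per-visitor behavior contrasted above with conventional solutions.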

Still further, through the data sharing system, data users can use subscribed data sets as well as data sets shared by other data users and combine these data sets with their own data sets to generate new data sets and reports. In addition, through the data sharing system, data users can publish new data sets, and other data users can create more new data sets by using the data sets. By analogy, the data sharing model allows for the recursive creation of novel and useful information.

Another advantage of the embodiments of the present invention is that although all raw data sets may come from data sources of different types, different formats and/or different data access interfaces, the data sharing system according to the embodiments of the present invention can provide a data access interface with a uniform data format for all users to access data by providing data in the form of virtual data sets.
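The idea of a uniform data access interface over heterogeneous sources can be illustrated with the following sketch. The choice of CSV and JSON as the two source formats, and the names `DataSource`, `CsvSource`, `JsonSource`, and `read_virtual`, are assumptions made for the example; the application does not name specific formats here.

```python
import csv
import io
import json
from typing import Any, Dict, Iterator, List, Protocol


class DataSource(Protocol):
    """Hypothetical uniform interface: every source yields rows as dicts."""
    def rows(self) -> Iterator[Dict[str, Any]]: ...


class CsvSource:
    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[Dict[str, Any]]:
        # csv.DictReader uses the first line as column names.
        yield from csv.DictReader(io.StringIO(self.text))


class JsonSource:
    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[Dict[str, Any]]:
        # Assumes the JSON document is an array of objects.
        yield from json.loads(self.text)


def read_virtual(source: DataSource) -> List[Dict[str, Any]]:
    """Consumers call the same function regardless of the underlying format."""
    return [dict(row) for row in source.rows()]
```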

The various subsystems within the data sharing system described above are further described in detail with reference to the figures.

Fig. 3 is a schematic diagram of the internal structure of a data user environment subsystem 0310, which corresponds to the data user environment subsystem 0221 in the data sharing system according to the embodiment of the present application.

In the embodiment of the present application, as shown in FIG. 3, the data user environment subsystem 0310 manages a plurality of data user environment objects 0312-a. Each data user environment object 0312-a has the same meaning as the data user environments 0221-a, 0221-b, 0221-c shown in FIG. 2 and is denoted by reference numeral 0312-a hereinafter.

In some embodiments of the present application, the data user environment object 0312-a is used for recording and storing user information of a corresponding data user and resources allocated to the data user. The data user environment subsystem 0310 is used to manage all data user environment objects 0312-a and provide support for the graphical user interface provided to the data user.

In some embodiments of the present application, each data user environment object 0312-a comprises the following objects: a user information object 0314 for holding the data user's account information, a set of data source objects 0320 for managing the data user's data sources, a plurality of data sets 0390, such as data sets 0330-a, …, 0330-z, and a plurality of project containers 0340-a, 0340-b, …. The user information object 0314 contains the account information, configuration files, security settings, preference settings, and the like of the data user. The set of data source objects 0320 includes data source connectors. A data source connector object contains the connection information and methods for connecting to a data source. In some embodiments of the present application, the set of data source objects 0320 may comprise: a personal online file storage data connector 0321, one or more enterprise data connectors 0323, and a subscription data connector 0326.

The personal online file storage data connector 0321 is connected to a personal online file storage of a data user, and the file storage may include a plurality of data files 0328-a. In an embodiment of the present application, a data user can upload and store a file 0328-a containing private data.

Each of the enterprise data connectors 0323 contains connection information for connecting to a data server object 0324. The data server object 0324 may interface with a data server such as a database server, document server, application data server, or the like. The data user may select the data server to be added. After a data server is added, the set of data source objects 0320 creates an enterprise data connector 0323 containing the connection information for the data server and creates a data server object 0324 for managing the data server's metadata. Data tables or data files 0328-b stored in enterprise data servers are accessible via the data server object 0324 described above.

FIG. 4 shows a process for adding a data server according to an embodiment of the present application. This process of adding a data server may be applied to the data user environment object 0312-a described above. As shown in fig. 4, the process mainly includes the following steps:

At S0400, a request to add a data server, initiated by a data user, is received. The request includes information such as the type, address, and port of the data server to be added, together with an access identifier and password.

At S0402, the data user environment object 0312-a detects whether a connection to the data server can be established using the received information (type, address, port, access identifier, and password). If the connection succeeds, S0404 is executed; otherwise, the process ends.

At S0404, an enterprise data connector 0323 is created under the set of data source objects 0320 to store the connection information for connecting to the data server.

At S0406, a data server object 0324 is created, and the metadata (e.g., data tables or data files 0328-b), description, and attributes of the data server are stored. The process then ends.

Fig. 4 above shows only one embodiment of adding a data server. In other embodiments of the present application, it is also possible to not create data server object 0324, but to store the metadata of all data servers in enterprise data connector 0323. In different embodiments, data asset information may be organized and managed in different ways, but may produce the same results.
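Steps S0400 to S0406 can be sketched as below. This is a minimal illustration under stated assumptions: the classes `ServerInfo`, `DataServerObject`, and `DataSourceGroup` are hypothetical, and the connection test and metadata lookup are passed in as callbacks (`can_connect`, `fetch_tables`) because the application does not specify how they are performed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ServerInfo:
    """Contents of the add-server request received at S0400."""
    kind: str
    address: str
    port: int
    access_id: str
    password: str


@dataclass
class DataServerObject:
    """Holds data server metadata (cf. 0324); fields are illustrative."""
    address: str
    tables: List[str]


@dataclass
class DataSourceGroup:
    """Stand-in for the set of data source objects 0320."""
    enterprise_connectors: List[ServerInfo] = field(default_factory=list)
    server_objects: List[DataServerObject] = field(default_factory=list)


def add_data_server(group: DataSourceGroup, info: ServerInfo,
                    can_connect: Callable[[ServerInfo], bool],
                    fetch_tables: Callable[[ServerInfo], List[str]]) -> bool:
    # S0402: check that the server is reachable; end the process if it is not.
    if not can_connect(info):
        return False
    # S0404: create the enterprise data connector holding connection info.
    group.enterprise_connectors.append(info)
    # S0406: create the data server object and store the server's metadata.
    group.server_objects.append(DataServerObject(info.address, fetch_tables(info)))
    return True
```

In the alternative embodiment mentioned above, the metadata would be stored on the connector itself and `DataServerObject` would be omitted.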

The subscription data connector 0326 contains data items 0328-c subscribed to by the data user who owns the data user environment 0312-a. These subscribed data items 0328-c are published by other data users in the dynamic data directory 0233. The data publication process is described in detail later and is not repeated here.

In some embodiments of the present application, the data user may also create new data files or tables in their personal online file storage data connector 0321 and enterprise data connector 0323. The data user selects useful data items (e.g., data files or tables) and registers 0329-a, 0329-b, 0329-c them as data sets 0390 for analysis, report generation, new data generation, and so on. When a data user registers data files or tables, the corresponding data user environment object 0312-a creates a corresponding data set 0330-a, 0330-b, 0330-c to track and manage them. For example, in FIG. 3, data set 0330-a represents a data set created after a data file 0328-a connected to the personal online file storage data connector 0321 is registered 0329-a; data set 0330-b represents a data set created after a data table or data file 0328-b contained in data server object 0324 is registered 0329-b; and data set 0330-c represents a data set created after a subscribed data item 0328-c is registered 0329-c.

FIG. 5a illustrates a data set registration process according to an embodiment of the present application. This data set registration process may be applied to the data user environment object 0312-a described above. It should be noted that fig. 5a is only one embodiment of the present application, and the data set registration process described in the present application may also be implemented by other methods, and the present application does not limit the specific implementation manner.

At S0502, the data source of the data set to be registered is determined. If a personal online file storage data set is to be registered, S0510 is executed; if an enterprise data set is to be registered, S0520 is executed; if a subscribed data set is to be registered, S0530 is executed.

The personal online file storage data set may also be referred to as a local data set.

At S0510, a data file to be registered selected by the data user in the folder of the personal online file storage is received as a data item to be registered S1, and then S0522 is performed.

In this step, the data user may browse the folder of the personal online file storage, which may also be referred to as a local folder, and select a data file to be registered as a data item to be registered S1.

At S0520, a data file or table to be registered, selected by the data user from the data stored on an enterprise data server, is received as the data item to be registered S1, and then S0522 is executed.

In this step, the data user may first select an enterprise data server, connect to it, browse its directory, select a data file or table to be registered, and use it as the data item to be registered S1.

At S0530, one subscribed data set selected by the data user is received as the data set to be registered S1, and then S0522 is performed.

At S0522, the registered data set name input by the data user is received.

At S0524, a registration data set object R1 (e.g., data set 0330-a, 0330-b, or 0330-c in FIG. 3) is created, and the created registration data set object R1 is connected to the selected data item (file, table, or subscribed data item) to be registered S1.

At S0526, after the registered data set object R1 is created, the following are received: metadata entered by the data user, the collaborators added to share the data set R1 (e.g., collaborators 0312-c in FIG. 3), and the personalized security and privacy access control rules defined for each collaborator (e.g., 0360-x in FIG. 3, set for collaborators 0312-c).

Through the method, the data in the personal online file storage, the data of the enterprise server and the subscribed data can be registered as the data set 0390 in the data user environment of the data user.
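The registration flow of S0502 to S0526 can be sketched as follows, leaving aside collaborators. The names `RegisteredDataSet`, `register_data_set`, and the three source-kind strings are hypothetical, and the data user environment is modeled simply as a dictionary keyed by data set name.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# The three kinds of source distinguished at S0502 (labels are illustrative).
SOURCE_KINDS = {"personal", "enterprise", "subscribed"}


@dataclass
class RegisteredDataSet:
    """Stand-in for the registered data set object R1."""
    name: str
    source_kind: str
    source_item: str                      # the selected data item S1
    metadata: Dict[str, str] = field(default_factory=dict)


def register_data_set(environment: Dict[str, RegisteredDataSet],
                      source_kind: str, source_item: str, name: str,
                      metadata: Optional[Dict[str, str]] = None) -> RegisteredDataSet:
    # S0502: determine which kind of data source the item comes from.
    if source_kind not in SOURCE_KINDS:
        raise ValueError(f"unknown data source kind: {source_kind}")
    # S0522: the name supplied by the data user (uniqueness is an assumption).
    if name in environment:
        raise ValueError(f"data set name already registered: {name}")
    # S0524: create R1 and connect it to the selected item S1.
    r1 = RegisteredDataSet(name, source_kind, source_item)
    # S0526: attach the metadata entered by the data user.
    r1.metadata.update(metadata or {})
    environment[name] = r1
    return r1
```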

In some embodiments of the present application, the method for registering data sets may further include the following step to further create a shared data set R2 in the data user environment objects of the respective collaborators and connect the shared data set R2 to the own registered data set R1 so as to share the own registered data set R1 to the collaborators. These steps include:

At S0528, it is determined whether there is any collaborator for whom the shared data set R2 has not yet been created in that collaborator's data user environment object, i.e., for whom the following step S0529 has not yet been performed. If there is such a collaborator, S0529 is performed for that collaborator's data user environment object; otherwise, the process ends.

At S0529, collaborators are added to the data set object R1.

FIG. 5b illustrates a flowchart of a method for adding collaborators to a data set according to an embodiment of the present application. As shown in FIG. 5b, the above process of adding collaborators to the data set object R1 may include:

At S0562, collaborator information provided by the data user is received.

In an embodiment of the present application, the above-mentioned collaborator information may include names and IDs of collaborators. For example, data set R1 is located in a collaborator's data user environment.

At S0563, collaborator rights set by the data user are received.

In the embodiment of the present application, the collaborator rights define or limit the operations that collaborators can perform on the data set object. For example, they define whether collaborators can modify the metadata of the data set object; whether collaborators can share and disclose the data and information of the data set object; and whether collaborators can read the data content of the data set object's data item or write data content to it.

At S0564, it is determined whether the collaborator is allowed to read or write the data content.

At S0566, if the collaborator is allowed to read or write data content, the data user defines personalized security and privacy access control rules (e.g., security and privacy access control rules 0360-x) to restrict the collaborator's access to the data content. Thus, in this step, the personalized security and privacy access control rules defined by the data user are received.

For example, the personalized security and privacy access control rules described above may define which portions of data content need to be masked, which portions of data content need to be filtered out, and which portions of data content need to be transformed before being shared to collaborators.

At S0568, the collaborators, the collaborator rights described above, and the defined personalized security and privacy access control rules are written to the source data set object (e.g., data set R1 in this example).

At S0570, a new data set object R2 is created in the collaborator's data user environment.

In an embodiment of the present application, the new data set object R2 (e.g., data set 0330-x) described above is connected to its source data set object R1 (e.g., data set 0330-a, 0330-b, or 0330-c).

Further, in embodiments of the present application, collaborators may be added or deleted at any time after registering a data set and creating a data set object.

The shared data set R2 is created in the collaborator's data user environment object and the newly created shared data set object R2 is connected to its own registered data set object R1.

It should be noted that the data set R2 is only a link to the data set R1 that was just registered, and the data user can share the data set R1 with collaborators through R2 to enforce personalized security and privacy access control rules.

It should be noted that the above-mentioned S0528-S0529 may be executed in a loop until the data user has no further collaborators to add.

It should be noted that, in the above process, sharing data directly by adding collaborators differs from sharing data through data publishing. Publishing a data set shares data with unknown subscribers, while adding collaborators to a data set shares data directly with known data users.

As shown in FIG. 3, the current data user shares (0351-a, 0351-b) his/her data set 0330-a or 0330-b with collaborator 0312-c through his/her own data user environment object 0312-a. The shared data set 0330-a or 0330-b is displayed as data set 0330-x in the collaborator's data user environment object. The current data user defines, in data user environment object 0312-a, specific security and privacy access control rules 0360-x for collaborator 0312-c. Also as shown in FIG. 3, another data user shares his/her data set 0330-y, through his/her data user environment object 0312-d, with the data user corresponding to the current data user environment object 0312-a (who is thus a collaborator of data user 0312-d). This data set 0330-y is displayed as data set 0330-z in the data user environment object 0312-a of the current data user. The data user corresponding to data user environment object 0312-d also defines a specific personalized security and privacy access control rule 0360-y for the data user (i.e., the collaborator) corresponding to the current data user environment object 0312-a. This means that when the data user corresponding to data user environment object 0312-a accesses data set 0330-z, he/she may not see the complete content of data set 0330-y. The data in data set 0330-z is obtained by transforming the data in data set 0330-y according to the security and privacy access control rules 0360-y defined by the data user corresponding to data user environment object 0312-d.

In addition, according to some embodiments of the present application, a data user may also create a project container 0340 in their data user environment object 0312-a. Project resources are managed in a project container 0340. In one project, a data user may choose to use 0342 one or more of the data sets 0330-a, 0330-b, 0330-c, 0330-z to create a user program 0344 and/or to process and analyze data using a data processing tool 0346 associated with the system and generate reports or new data into the data sets 0330-a, 0330-b, 0330-c. Project container 0340 may also create a new data set in registered data cluster 0390.

In some embodiments of the present application, as shown in FIG. 3, a collaborator's project container 0340-x may use a data set 0330-x shared with the collaborator. It should be noted that the data set 0330-x is a restricted subset of the data in 0330-a or 0330-b. These restrictions are defined in the security and privacy access control rules 0360-x. In addition, the project container 0340-a of the data user environment object 0312-a can use, through the data set 0330-z, the data set 0330-y that has been shared with that data user.

As can be seen from the above description, through the data user environment subsystem 0310 shown in FIG. 3, a data user can register, manage, use, share, publish and create his/her data set, and create projects that process and analyze the data, and so on.

Fig. 6 is a schematic diagram illustrating the internal structure of a data-sharing directory service subsystem 0610 in the collaboration system according to an embodiment of the present application. It should be noted that the data-sharing directory service subsystem 0610 denotes the same subsystem as the data-sharing directory service subsystem 0230 shown in fig. 2, and is referred to by the reference numeral 0610 below. As shown in FIG. 2, the data-sharing directory service subsystem 0610, 0230 is managed by directory administrator 0202 and used by data users 0203-a, 0203-b, 0203-c. Data users 0203-a, 0203-b, 0203-c may be data publishers (data owners) and data subscribers (data consumers). All system resources (e.g., dynamic data directories 0233, 0620) are recorded and maintained as logical objects and managed by the data-sharing directory service subsystem 0610.

In an embodiment of the present application, directory administrator 0202 may create one or more dynamic data directories 0620 through data sharing directory service subsystem 0610. In the created one or more dynamic data directories 0620, the directory administrator 0202 may create a category 0622 of the dynamic data directory and add tags or keywords to the category. In embodiments of the present application, each dynamic data directory 0620 also includes a data distribution service.

As shown in FIG. 6, in some embodiments of the present application, a data owner, such as data user 0628, may publish a data set 0330-a connected to a personal online stored data file 0328-a and a data set 0330-b connected to an enterprise data file or data table 0328-b into one or more categories in one or more directories to form a data set publication object 0626 for data sharing with an unknown number of subscribers.

Further, in some embodiments of the present application, a data subscriber, such as data user 0638-a, may subscribe to a data set, such as data set 0330-a, in data set publication object 0626. If the data subscription is successful, the subscriber's information can be added in subscriber object 0632 and subscribed data sets 0330-c can be added in the subscriber's data user context according to subscribed data items 0328-c.

Fig. 7 illustrates a process of data set publishing according to an embodiment of the present application. The publishing process for the data set may be applied to the data user context object 0312-a. As shown in fig. 7, the publishing process of the data set includes the following steps:

at S0700, a registered data set to be published selected by a data user is received.

The data set selected by the data user to be published can be the whole data set or a partial data set.

At S0702, it is verified whether the selected data set is a publishable data set. If so, S0704 is executed; otherwise, the process ends.

In the above step, if the selected data set is a subscribed data set or a data set shared by other data owners, the owner of the data set may not allow the subscriber or collaborator to publish the data set. As such, in this step, the data user environment object 0312-a may verify that the selected data set is a publishable data set by checking whether the owner of the selected data set allows publication of the data set. The data set is a publishable data set if the owner of the selected data set allows publication of the data set.

At S0704, the dynamic data directory 0620 selected by the data user for the data set to be published is received.

At S0706, one or more categories 0622 selected by the data user for the data set to be published are received.

After the above steps are performed, at S0707, the data user's selection of all or part of the data content to be published is received.

At S0708, metadata provided by the data user for the data set to be published is received.

At S0710, role-based security and privacy access control rules defined by the data user for the data set to be published are received.

In an embodiment of the present application, the role-based security and privacy access control rule defines data contents that subscribers in different roles can see. The role-based security and privacy access control rules described above may relate to desensitization and masking of some data, conversion of original information (e.g., conversion from code to name string, etc.), filtering, denial of access based on time-range criteria, i.e., access time control, prohibition of release of derivative data, and so forth.
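As a rough sketch of how a role could select both a masking rule and an access time window, consider the following. The `RoleRule` class, the `view_for_role` function, the role names, and the choice to raise `PermissionError` on denial are all illustrative assumptions; the present application does not prescribe this representation.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class RoleRule:
    """Hypothetical role-based rule: which fields to mask and when access is allowed."""
    masked_fields: List[str] = field(default_factory=list)
    access_window: Optional[Tuple[time, time]] = None   # inclusive (start, end); None means no time limit


def view_for_role(record: Dict[str, Any], role: str,
                  rules: Dict[str, RoleRule], now: time) -> Dict[str, Any]:
    """Return the content a subscriber in `role` may see at time `now`."""
    rule = rules.get(role)
    if rule is None:
        raise PermissionError(f"role {role!r} has no access to this data set")
    if rule.access_window is not None:
        start, end = rule.access_window
        if not (start <= now <= end):
            # access time control: deny outside the permitted range
            raise PermissionError("access outside the permitted time range")
    return {k: ("***" if k in rule.masked_fields else v) for k, v in record.items()}


# Illustrative rules: auditors see masked salaries during office hours only.
rules = {
    "auditor": RoleRule(masked_fields=["salary"], access_window=(time(9, 0), time(17, 0))),
    "analyst": RoleRule(),
}
record = {"name": "B", "salary": 5000}
office_hours_view = view_for_role(record, "auditor", rules, time(10, 30))
analyst_view = view_for_role(record, "analyst", rules, time(22, 0))
```

Two subscribers with different roles thus obtain different content from the same published record, and a request outside the permitted window is refused outright.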

At S0712, the subscription approval process defined by the data user for the data set to be published is received.

In the embodiment of the present application, the subscription approval process defines which management role or roles must be involved in approving a data subscription request. For example, the data owner may specify that a subscription request can be approved only if the manager of the requesting user, the directory administrator, the data owner, and the manager of the data owner all agree. As another example, the data owner may specify that a subscription request can be approved if any one of, or certain ones of, the manager of the requesting user, the directory administrator, the data owner, and the manager of the data owner agree. The subscription approval process can also define the order in which approvals are carried out: an approver later in the defined order is asked to approve only after the approvers earlier in the order have approved the request, and if an earlier approver does not approve the request, the later approvers do not need to act.
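The approval variants described above (all approvers must agree, any one approver suffices, and sequential approval that stops at the first rejection) can be sketched as follows. The `ApprovalProcess` class, the `run_approval` function, and the approver names are hypothetical and serve only to make the three cases concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ApprovalProcess:
    """Hypothetical description of a subscription approval process."""
    approvers: List[str]
    mode: str = "all"       # "all": every approver must agree; "any": one approval suffices
    ordered: bool = False   # if True (with mode "all"), approvers are asked in sequence


def run_approval(process: ApprovalProcess,
                 decide: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Return (approved, approvers actually consulted)."""
    consulted: List[str] = []
    decisions: List[bool] = []
    for approver in process.approvers:
        consulted.append(approver)
        decisions.append(bool(decide(approver)))
        # Sequential approval: once an earlier approver rejects,
        # later approvers do not need to act.
        if process.ordered and process.mode == "all" and not decisions[-1]:
            break
    if process.mode == "any":
        return any(decisions), consulted
    approved = len(decisions) == len(process.approvers) and all(decisions)
    return approved, consulted


# Illustrative votes only.
votes = {"requester_manager": True, "directory_admin": False, "data_owner": True}
names = list(votes)
sequential = run_approval(ApprovalProcess(names, "all", ordered=True), lambda n: votes[n])
any_one = run_approval(ApprovalProcess(names, "any"), lambda n: votes[n])
```

In the sequential case the data owner is never consulted, because the directory administrator has already rejected the request.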

At S0714, the data set is published.

In some embodiments of the present application, publishing the data set may specifically include: packaging the information received above, such as the reference to the data set, the reference to the data content, the reference to the data directory, and the metadata, and uploading it to the data-sharing directory service subsystem 0610, which sends the packaged information to the data publishing service in the dynamic data directory 0620 to complete the publishing of the data set. Specifically, the data-sharing directory service subsystem 0610 creates a new data set publishing object 0626 and links it to the selected category 0622 under the selected dynamic data directory 0620. The data-sharing directory service subsystem 0610 stores the metadata of the published data set, the role-based security and privacy access control rules, and the subscription approval process in the data set publishing object 0626, and links the data set publishing object 0626 to the original data set (e.g., 0330-a or 0330-b). In some embodiments of the present application, the reference to the data set may specifically be an address, a pointer, an identifier, a tag, or a unique name of the data set. The reference to the data content can generally be used to determine the location of a portion of the data content in a data set. The reference to the data directory may specifically be a pointer, an identifier, a tag, an address, or a name of the data directory.

Through the above method, a data user can publish a registered data set (for example, 0330-a or 0330-b), through his/her own data user environment object 0312-a, into a directory category of the data-sharing directory service subsystem 0610 as a data set publishing object 0626, and at the same time define the role-based security and privacy access control rules and the subscription approval process for the data set, thereby ensuring that the published data set can be used safely.
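The publishing flow of S0702 to S0714 can be illustrated with a small sketch. All class and function names below (`DataSet`, `DataSetPublication`, `DynamicDataDirectory`, `publish`) and the sample values are assumptions made for illustration; in particular, representing a category as a list of publication objects is simply one convenient choice.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataSet:
    ref: str                                # reference (e.g., identifier) of the registered data set
    owner: str
    owner_allows_publication: bool = True   # checked at S0702


@dataclass
class DataSetPublication:
    """Hypothetical stand-in for a data set publishing object such as 0626."""
    dataset_ref: str                        # link back to the original data set
    metadata: Dict[str, str]
    role_rules: Dict[str, List[str]]        # role -> fields masked for that role
    approval_process: List[str]             # approver roles


@dataclass
class DynamicDataDirectory:
    """Hypothetical stand-in for a dynamic data directory such as 0620."""
    name: str
    categories: Dict[str, List[DataSetPublication]] = field(default_factory=dict)


def publish(dataset: DataSet, directory: DynamicDataDirectory, category_names: List[str],
            metadata: Dict[str, str], role_rules: Dict[str, List[str]],
            approval_process: List[str]) -> Optional[DataSetPublication]:
    # S0702: a subscribed or shared data set may not be publishable.
    if not dataset.owner_allows_publication:
        return None
    # Categories are created by the directory administrator, so unknown ones are rejected.
    missing = [c for c in category_names if c not in directory.categories]
    if missing:
        raise ValueError(f"unknown categories: {missing}")
    # S0714: create the publishing object and link it to each selected category.
    publication = DataSetPublication(dataset.ref, dict(metadata),
                                     dict(role_rules), list(approval_process))
    for name in category_names:
        directory.categories[name].append(publication)
    return publication


# Illustrative usage.
directory = DynamicDataDirectory("city-open-data", {"health": [], "transport": []})
own = DataSet("0330-a", owner="alice")
pub = publish(own, directory, ["health"], {"title": "Clinic visits"},
              {"analyst": ["patient_id"]}, ["data_owner"])
borrowed = DataSet("0330-c", owner="bob", owner_allows_publication=False)
refused = publish(borrowed, directory, ["health"], {}, {}, [])
```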

The data-sharing directory service subsystem 0610 further includes a data subscription service module 0630, which is configured to provide data subscription service to the user.

Fig. 8 illustrates a process of data set subscription according to an embodiment of the present application. The subscription process for the data set may be applied in the data subscription service module 0630. As shown in fig. 8, when a data user browses 0262 the dynamic data directory 0233, 0620 and selects a published data set to subscribe to from the data set publishing object 0626, the following data set subscription 0263 process is initiated:

at S0802, a data subscription request 0263 sent by a data user is received.

In some embodiments of the present application, the data subscription request 0263 described above will contain a reference to a published data set selected by the data user. The reference of the published data set may specifically be a name, an address, a pointer, an identifier or a tag of the published data set.

At S0804, the received data subscription request is sent to one or more approvers for approval 0265 according to the subscription approval process defined by the data owner for the published data.

At S0806, responses 0265 from all approvers are received.

At S0808, it is determined whether all of the approvers approve the data subscription request. If so, S0810 is performed; otherwise, the process ends.

At S0810, there is a subscription object 0632 for each published data set to track all subscribers 0634-a to the data item. Thus, at step S0810, if the data user is not in the subscriber list for the data set, a subscriber object 0634-a for the data user is created in the subscriber object 0632 for the data set. The subscriber object 0634-a described above connects to the subscriber data user's environment object 0312-a and to the data set publishing object 0626.

At S0812, the data subscription service module 0630 creates, in the data user environment 0638-a of the subscriber, a data source object 0326 for the subscribed data and a subscription data connector 0328-c, so as to connect to the data item subscribed to by the data user.

At S0814, the subscriber may register the subscribed data set 0330-c in the project container of his/her data user environment 0638-a.

Alternatively, as an alternative to S0814 described above, the system may also automatically register the subscribed data set.

Therefore, it can be seen that, through the data sharing directory service subsystem 0610, a data user can publish its own data set and also subscribe to data sets published by other data users.
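The subscription steps S0802 to S0812 can be sketched as follows, assuming for simplicity that every approver named by the data owner must agree. The `Publication` and `UserEnvironment` classes and the `subscribe` function are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Publication:
    """Hypothetical published data set with its subscriber list (cf. 0626 and 0632)."""
    dataset_ref: str
    approvers: List[str]
    subscribers: List[str] = field(default_factory=list)


@dataclass
class UserEnvironment:
    """Hypothetical data user environment holding subscribed data items (cf. 0328-c)."""
    user: str
    subscribed_items: Dict[str, Publication] = field(default_factory=dict)


def subscribe(env: UserEnvironment, publication: Publication,
              approves: Callable[[str], bool]) -> bool:
    # S0804-S0806: send the request to every approver and collect all responses.
    responses = [approves(approver) for approver in publication.approvers]
    # S0808: end the process unless all approvers agree.
    if not all(responses):
        return False
    # S0810: add the user to the subscriber list only if not already present.
    if env.user not in publication.subscribers:
        publication.subscribers.append(env.user)
    # S0812: create the subscribed data item that connects to the publication.
    env.subscribed_items[publication.dataset_ref] = publication
    return True


# Illustrative usage.
pub = Publication("0330-a", approvers=["data_owner", "directory_admin"])
alice = UserEnvironment("alice")
first = subscribe(alice, pub, lambda approver: True)
again = subscribe(alice, pub, lambda approver: True)       # repeated request does not duplicate
bob = UserEnvironment("bob")
rejected = subscribe(bob, pub, lambda approver: approver != "directory_admin")
```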

Fig. 9 shows the internal structure of the virtual data service subsystem 0910 according to the embodiment of the present application. Note that the virtual data service subsystem 0910 denotes the same subsystem as the virtual data service subsystem 0240 shown in fig. 2, and is referred to by the reference numeral 0910 below. Fig. 9 shows how data from different data sources is accessed by the application items 0911-a, 0911-b, 0911-c of the data user. It should be noted that the application items 0911-a, 0911-b, 0911-c include the applications 0213-a, 0213-b, 0213-c of the data user and the programs in the item containers 0223-a, 0223-b, 0223-c.

How a data user's application project accesses data sets from different data sources can be illustrated by the structure shown in fig. 9. In embodiments of the present application, there are three data sources: 1) Data sets directly owned by data users and registered from a local or enterprise data server, referred to as directly owned data sets, e.g., data sets 0330-a, 0330-b in fig. 3. 2) Data sets obtained by subscription, referred to as subscribed data sets, e.g., data set 0330-c in fig. 3. 3) Data sets shared by other data users, referred to as shared data sets. For example, the data set 0330-x in fig. 3 is a data set owned by the data user corresponding to the data user environment object 0312-a (hereinafter referred to as data user 0312-a) and shared with the data user corresponding to the data user environment object 0312-c (hereinafter referred to as data user 0312-c), and the data set 0330-z in fig. 3 is a data set shared with data user 0312-a by another data user corresponding to the data user environment object 0312-d (hereinafter referred to as data user 0312-d).

As shown in fig. 9, the virtual data service subsystem 0910 mainly includes a virtual data set access interface 0912. The virtual data service subsystem 0910 will also generate one or more virtual data sets 0916-a, 0916-b, …, 0916-e when a data user accesses the directly owned data set, the subscribed data set, or a data set shared by other data users.

In an embodiment of the present application, all data access by the data user will be via the virtual data set access interface 0912. In other embodiments, different dataset types may have different access interfaces, or the functions of the access interfaces may be performed by the dataset object itself.

Specifically, in some embodiments of the present application, as shown in FIG. 9, when an application item 0911-a of a data user 0312-a wants to access a data set 0330-c to which the user subscribes (the data set 0330-c corresponds to the subscription data item 0328-c), the data user 0312-a sends a data access request to the virtual data set access interface 0912. After the virtual data set access interface 0912 receives the data access request, the virtual data service subsystem 0910 determines that the data set 0330-c related to the data access request is a data set published by the data owner 0917 through the data publishing process, and then generates a virtual data set 0916-a according to the data set published by the data owner 0917, the role-based security and privacy access control rules, and the role of the subscribing data user 0312-a.

When an application item 0911-b of a data user 0312-a wants to access a data set 0330-a or 0330-b that the data user directly owns, the data user 0312-a sends a data access request to the virtual data set access interface 0912. After the virtual data set access interface 0912 receives the data access request, the virtual data service subsystem 0910 determines that the data set 0330-a or 0330-b related to the data access request is a data set directly owned by the data owner 0932, and then generates the virtual data set 0916-b, 0916-c, or 0916-d according to the data table or data file 0328-a or 0328-b requested to be accessed.

When an application item 0911-x of a data user who is a collaborator 0312-c wants to access a data set 0330-x shared by other data users, the data user 0312-c sends a data access request to the virtual data set access interface 0912. After the virtual data set access interface 0912 receives the data access request, the virtual data service subsystem 0910 determines that the data set 0330-x related to the data access request is a data set shared by other data users, and then generates a virtual data set 0916-e according to the data set 0330-a or 0330-b requested to be accessed and the personalized security and privacy access control rules set by the data user 0922 for the collaborator 0312-c.

FIG. 10 illustrates a process for initiating data set access according to an embodiment of the present application. In an embodiment of the present application, the process of initiating data set access may be applied to the virtual data service subsystem 0910. If the process completes successfully, the virtual data service subsystem 0910 may provide a virtual data set handle to the application item that sent the data access request. The application item can then access the data through the virtual data set handle. The access operation may specifically be a read (READ) or write (WRITE) operation. The read and write data access operations are illustrated in fig. 11 and 12 and are described later.

As shown in fig. 10, the process of initiating data set access includes the following steps:

at S1000, a request for access to data set A by an application item is received.

For example, the process begins at S1000 when an application item 0911-a of a data user 0312-a initiates access to a data set 0330-c. Wherein the data set 0330-c is a data set subscribed to by the data user 0312-a. As another example, the process also begins at S1000 when an application item 0911-x of a data user 0312-c initiates access to a data set 0330-x. Where the data set 0330-x is a shared data set linked to data set 0330-a or 0330-b. This data set 0330-x is shared by another data user collaborating with the current user 0312-c. As another example, the process also begins at S1000 when an application item 0911-b of a data user 0312-a begins accessing a data set 0330-a or 0330-b. Wherein the data set 0330-a or 0330-b is a data set directly owned by the data user 0312-a.

At S1001, the access request is connected to the virtual data set access interface 0912.

At S1002, the virtual data set access interface 0912 determines the data source of the data set: if the data set is the subscribed data set 0330-c, S1012 is performed; if the data set is the shared data set 0330-x, S1062 is performed; if the data set is the directly owned data set 0330-a or 0330-b described above, S1030 is performed.

At S1012, the virtual data set access interface 0912 searches for the subscription data item 0328-c and data set publication object 0626 corresponding to the subscription data set 0330-c, and searches for the associated published data set 0330-a or 0330-b through the data set publication object 0626.

At S1014, a specific role-based security and privacy access control rule is extracted from the corresponding data set publishing object 0626 according to the subscriber's role.

At S1016, a virtual data set 0916-a is created according to the extracted role based security and privacy access control rules.

The virtual data set 0916-a converts the extracted security and privacy access control rules into data conversion logic and loads the data conversion logic on the virtual data set 0916-a.

At S1018, the original data set 0330-a or 0330-b is looked up through the data set publishing object 0626, data set A is set as the original data set 0330-a or 0330-b, and then S1031-S1041 are performed to open the original data set 0330-a or 0330-b and save the original data handle in the virtual data set 0916-a.

In S1062, the original data set 0330-a or 0330-b associated with the shared data set 0330-x is searched for.

At S1064, specific personalized security and privacy access control rules defined by the data owner for collaborators sharing the original data set 0330-a or 0330-b are extracted.

At S1066, virtual data sets 0916-e are created according to the particular security and privacy access control rules described above.

In this step, the virtual data set 0916-e converts the extracted security and privacy access control rules described above into data conversion logic, which is then loaded into itself.

At S1068, data set A is set as the original data set 0330-a or 0330-b, and S1031-S1041 are performed to open the actual data set 0330-a or 0330-b and save the original data handle in the virtual data set 0916-e.

At S1030, opening of the directly owned data set 0330-a or 0330-b begins: the virtual data set 0916-d is created first, then data set A is set as the original data set 0330-a or 0330-b, and S1031-S1041 are performed to open the actual data set 0330-a or 0330-b and save the original data handle in the virtual data set 0916-d.

At this point, all of the virtual data sets 0916-a, 0916-d and 0916-e follow the same path, starting from S1031, to open the actual data items.

As described in the previous section, both virtual data sets 0916-a and 0916-e include data transformation logic for enforcing security and privacy access control rules. The data transformation logic in virtual data set 0916-a transforms the original data and sends the transformed data to the data user according to role-based security and privacy access control rules defined by the publisher for the role of the particular subscriber. The data transformation logic in the virtual data sets 0916-e will transform the data according to the personalized security and privacy access control rules defined for the collaborators when the original data set is shared to the collaborators by the data owner. It should be noted that collaborators are current data users. When the data user extracts the data, the original data is converted and then sent to the data user. Virtual data sets 0916-d do not include data transformation logic.

At S1031, the data source of data set A is tested. If the source of the data set is a stored personal online file (see personal online file storage data connector 0321 in fig. 3, in which case the data set is 0330-a), S1032 is performed; if the source of the data set is an enterprise data server (see enterprise data server object 0324 in FIG. 3), S1034 is performed.

At S1032, the virtual data set access interface 0912 connects to the personal online file stored by the data user, and then S1036 is performed.

At S1034, the virtual data set access interface 0912 connects to the data server associated with the data server object 0324, and then S1036 is performed.

At S1036, the data type of data set A is examined, where the data type can be a file, a file object, or a database table. If data set A is a file or a file object, S1038 is performed; if data set A is a database table, S1040 is performed.

At S1038, the associated data file or data file object 0328-a or 0328-b is opened and the file handle is obtained, and then S1041 is performed.

At S1040, a handle is created, the relevant data table 0328-b is found at the data server, and the database table is associated with the handle. Then S1041 is performed.

In S1041, a file handle or a database table handle is saved in the virtual data set 0916-a, 0916-d or 0916-e, and then the virtual data set 0916-a, 0916-d or 0916-e can be returned.
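The branching of S1002 to S1041 can be summarized in a short sketch. Everything here is an illustrative assumption: the in-memory `PERSONAL_STORE` and `ENTERPRISE_STORE` dictionaries stand in for the personal online file storage and the enterprise data server, and the names `VirtualDataSet`, `open_original`, and `start_access` are not taken from the present application.

```python
import io
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical in-memory stand-ins for the two kinds of data source.
PERSONAL_STORE: Dict[str, Dict[str, Any]] = {
    "files": {"survey.csv": "name,age\nB,34\n"},
    "tables": {},
}
ENTERPRISE_STORE: Dict[str, Dict[str, Any]] = {
    "files": {"report.txt": "annual report\n"},
    "tables": {"permits": [{"id": 1, "holder": "B"}]},
}


@dataclass
class VirtualDataSet:
    origin: str                                    # "subscribed", "shared" or "owned"
    transform: Optional[Callable[[dict], dict]]    # data conversion logic; None for owned data
    handle: Any = None                             # original data handle saved at S1041


def open_original(source: str, data_type: str, name: str) -> Any:
    # S1031-S1034: pick the connection according to the data source.
    if source == "personal":
        store = PERSONAL_STORE
    elif source == "enterprise":
        store = ENTERPRISE_STORE
    else:
        raise ValueError(f"unknown data source: {source}")
    # S1036-S1040: open a file handle or bind a table handle.
    if data_type == "file":
        return io.StringIO(store["files"][name])
    if data_type == "table":
        return store["tables"][name]
    raise ValueError(f"unknown data type: {data_type}")


def start_access(origin: str, source: str, data_type: str, name: str,
                 rule_logic: Optional[Callable[[dict], dict]] = None) -> VirtualDataSet:
    # S1002: branch on where the data set comes from.
    if origin in ("subscribed", "shared"):
        if rule_logic is None:
            raise ValueError("subscribed and shared data sets need conversion logic")
        vds = VirtualDataSet(origin, rule_logic)   # cf. S1016 / S1066
    elif origin == "owned":
        vds = VirtualDataSet(origin, None)         # cf. S1030: no conversion logic
    else:
        raise ValueError(f"unknown origin: {origin}")
    vds.handle = open_original(source, data_type, name)   # common path S1031-S1041
    return vds


owned = start_access("owned", "personal", "file", "survey.csv")
shared = start_access("shared", "enterprise", "table", "permits",
                      lambda row: {**row, "holder": "***"})
```

Note that the three branches differ only in whether conversion logic is attached; opening the underlying file or table is the same for all of them.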

Once the virtual data sets 0916-a, 0916-d and 0916-e have been created, as shown in FIG. 9 or 10, data extraction access requests (READ or WRITE) from the user items can be processed through the virtual data sets 0916-a, 0916-d and 0916-e.

FIG. 11 illustrates a process of accessing a subscribed or shared data set after a virtual data set is created. As shown in FIG. 10, at S1016, a virtual data set 0916-a is created to access the subscribed data sets; at S1066, a virtual data set 0916-e is created to access the shared data set (shared with the current user by another data user).

At S1100, a data access request for a data set 0330-c or 0330-x from an application item 0911, 0911-x is received. In an embodiment of the present application, the data access request may be a read request (READ) or a write request (WRITE).

At S1102, data set 0330-c or 0330-x forwards the data access request to the virtual data set 0916-a or 0916-e obtained from S1000.

At S1104, the virtual data set 0916-a or 0916-e forwards the data access request to the original data set (file or database table) according to the handle of the original data set obtained from S1041.

That is, in some embodiments of the present application, if the data access request is a read request, the corresponding data is obtained from the original data set according to the obtained handle of the original data set in the above step, and if the data access request is a write request, the corresponding data is written into the original data set according to the obtained handle of the original data set in the above step.

At S1106, it is determined whether the data access request is a read request (READ) or a write request (WRITE). If it is a read request, S1108 is performed; if it is a write request, the write operation has already been completed at S1104, and S1110 is performed.

At S1108, the virtual data set 0916-a or 0916-e transforms the obtained data using the data transformation logic it has already loaded (at S1016 or S1066), and then sends the transformed data back to the invoking application item.

At S1110, the result of the data access request is sent back to the invoking application item.
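The read and write paths of S1100 to S1110 can be sketched as follows, under the simplifying assumption that the original data set is a list of records held in memory. The `VirtualDataSet` class here, its `read` and `write` methods, and the sample mask function are illustrative names, not part of the present application.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


class VirtualDataSet:
    """Hypothetical virtual data set wrapping an original handle (here, a list of rows)."""

    def __init__(self, original: List[Row],
                 transform: Optional[Callable[[Row], Row]] = None) -> None:
        self._original = original      # stands in for the handle saved at S1041
        self._transform = transform    # conversion logic; None for directly owned data

    def read(self) -> List[Row]:
        # S1104: fetch from the original data set through the handle.
        rows = [dict(row) for row in self._original]
        # S1108: apply the loaded conversion logic before returning data.
        if self._transform is None:
            return rows
        return [self._transform(row) for row in rows]

    def write(self, row: Row) -> bool:
        # S1104: the write is completed directly against the original data set.
        self._original.append(dict(row))
        return True                    # S1110: result returned to the caller


# Illustrative usage.
original = [{"name": "B", "phone": "555-0100"}]
mask_phone = lambda row: {**row, "phone": "***"}
shared_view = VirtualDataSet(original, mask_phone)   # subscribed or shared access
owner_view = VirtualDataSet(original)                # directly owned access
before = shared_view.read()
shared_view.write({"name": "C", "phone": "555-0101"})
after_owner = owner_view.read()
```

The conversion logic is applied only on the read path, so the same original rows yield a masked view for the collaborator and an unmasked view for the owner.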

FIG. 12 illustrates a process of accessing a directly owned dataset after a virtual dataset object is created.

At S1230, the application item 0911 issues a data access request to a data set 0330-a or 0330-b. In an embodiment of the present application, the data access request may be a read request (READ) or a write request (WRITE).

At S1232, data set 0330-a or 0330-b forwards the data access request to virtual data set 0916-d obtained from S1000.

At S1234, the virtual data set 0916-d forwards the data access request to the file or database table according to the file or database handle obtained from S1041. If the data access request is a read request, the corresponding data is acquired from the original file or database according to the obtained file or database handle. If the data access request is a write request, the corresponding data is written into the original file or database according to the obtained file or database handle.

At S1236, the result is returned to the application item.

Through the virtual data service subsystem 0910, when a data user accesses a subscribed data set, a virtual data set can be created for the data user according to the security and privacy access control rules for the role of the data user, so that the accessibility of the content of the data set can be flexibly limited according to that role. It should be noted that, in the embodiment of the present application, the virtual data sets 0916-a, 0916-d and 0916-e are created in real time after the virtual data service subsystem 0910 receives a data access request from a data user. The virtual data service subsystem 0910 also does not need to maintain the virtual data sets once the data users no longer use them. That is, when no data user is accessing data, there is no virtual data set that needs to be maintained in the virtual data service subsystem 0910. Such an approach can therefore save memory, bandwidth, and computational resources of the system. In addition, in the embodiment of the present application, a virtual data set established in the above manner depends on both the accessed data and the visitor, so the data actually obtained by different data users accessing the same data set may differ, which makes it much easier for the data owner to share data securely.

The structure of the various data objects used in the embodiments of the present application will be described in detail below by means of some specific exemplary tables.

In some embodiments of the present application, the data-related objects are objects for managing resources (e.g., data servers, files, tables, items, etc.) and managing resource usage. It should be noted that the information shown in the following tables is only one example of the data objects according to the embodiment of the present application, and the present application is not limited to the form shown in the following tables. For example, the information shown in the tables below may be a subset of the information needed for a data-related object, or some of the information may be redundant. In different embodiments it is also possible to combine several pieces of information into a single information item, or to split one information item below into several different information items. These variations do not affect the technical solution and the protection scope of the present application.

Table 1 shows the information contained in enterprise data connector 0323. The enterprise data connector 0323 includes connection information for connecting to the enterprise's data server object 0324.

As shown in table 1, enterprise data connector 0323 may include: the type of data server (e.g., Oracle database, MySQL, Hive, HDFS, NFS, CIFS, Salesforce, etc.), the address of the data server, the data server port, and the data owner credentials. Wherein the type of data server indicates whether the data server is a database or a file server. The address and port of the data server allow the data user environment 0312-a to connect to the data server. The data owner credential enables the data user environment 0312-a to establish a secure and reliable connection with the data server, which may specifically include access identification and password information.

TABLE 1

It can be seen that enterprise data connector 0323 containing the above information may enable data user environment 0312-a to establish a connection with a data server using a suitable data server protocol.
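As a minimal illustrative sketch, the connector fields listed for Table 1 could be held in a small record such as the one below. The class name, field names, and the grouping of server types into database and file servers are assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass

# Assumed grouping of the server types named in the description.
DATABASE_SERVER_TYPES = {"Oracle", "MySQL", "Hive"}
FILE_SERVER_TYPES = {"HDFS", "NFS", "CIFS"}


@dataclass(frozen=True)
class EnterpriseDataConnector:
    """Hypothetical record of the connection information in Table 1."""
    server_type: str   # e.g. "MySQL", "HDFS", "Salesforce"
    address: str       # address of the data server
    port: int          # data server port
    access_id: str     # data owner credential: access identification
    password: str      # data owner credential: password

    def kind(self) -> str:
        """Report whether the server is a database or a file server."""
        if self.server_type in DATABASE_SERVER_TYPES:
            return "database"
        if self.server_type in FILE_SERVER_TYPES:
            return "file"
        return "other"

    def endpoint(self) -> str:
        """Address and port that the data user environment connects to."""
        return f"{self.address}:{self.port}"
```

A data user environment could then choose a protocol driver from `kind()` and `server_type` before opening the connection at `endpoint()`.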

Table 2 shows the information contained in data server object 0324. The data server object 0324 includes information necessary to manage the enterprise data servers. Wherein the enterprise server may be a database server or a file/object server. In other embodiments of the present application, the data server object 0324 can be incorporated with the enterprise data connector 0323.

As shown in table 2, the data server object 0324 may include: a connection to enterprise data connector object 0323; data source name (e.g., file store, folder or database name), creation time, modification time, metadata associated with the data server (e.g., identity of data owner, security level, attributes and characteristics, etc.), usage control policy, and registered data items.

TABLE 2

The owner of the data server can set the security type of the data server object 0324 by setting a usage control policy of the data server. For example, the owner can restrict the downloading of data or restrict the storage location of derived data sets. A database server may contain tens to hundreds of tables, and a file server may contain hundreds or thousands of files. When a data user selects and registers one or more tables/files to be processed, those tables/files are marked as registered data items.

Table 3 shows the information contained in the data file 0328-a object associated with personal online file storage data connector 0321. The personal online file storage data connector 0321 is a connector to a personal online folder, where a data user can upload and store personal files. When an uploaded personal data file 0328-a is registered as a data set 0330-a for use in a project, the file is marked as registered data set 0330-a.

As shown in table 3, the data file 0328-a may include: a connection to personal online file storage data connector 0321; a unique data item ID; a file path name linking to the actual file; the file content format (file type); summary (including column name, description, type, etc.); the registration date; and the registered data set ID. The registered data set ID associates data file 0328-a with registered data set object 0330-a.

TABLE 3

Table 4 shows the information contained in the data item (file or table) 0328-b object associated with enterprise data server object 0324. When a data user selects and registers a data item 0328-b from data server object 0324, the data item is labeled as registered data set 0330-b.

As shown in table 4, the data item (file or table) 0328-b object may include: a connection to the corresponding data server object 0324; a unique data object ID; type (file or table); a data item name linking to the actual data item; summary (including column name, description, type, etc.); the registration date; and the registered data set ID.

TABLE 4

Table 5 shows the information contained in the subscribed data item 0328-c object. The subscribed data item 0328-c object contains a data set that the current data user has subscribed to from the dynamic data directory (0233, 0626). Data users may publish their data sets to the dynamic data directory for data sharing, while other data users may subscribe to these published data sets. In data user environment 0312-a, subscribed data items 0328-c are grouped under subscription data connector 0326.

As shown in table 5, the subscribed data item 0328-c object may include: a unique data ID; a connection to subscription data connector 0326; a connection to published data set 0626; the catalog and category under which the data set was published; a data type; summary (including column name, description, type, etc.); metadata; the registration date; and the associated registered data set ID.

TABLE 5

Table 6 shows the information contained in the registered data set. A registered data set is a managed list of data selected by a data user for data processing and analysis. The registered data sets may be personal data sets selected by the data user from personal online file storage data connector 0321, enterprise data sets selected from data server object 0324, data sets shared by other collaborators, or data items subscribed to from dynamic data directory 0620.

As shown in table 6, the registered data set may include: a connection to its own data user environment object 0312-a; a connection to the data ID of the actual data item, e.g. 0328-a, 0328-b, 0328-c, or 0330-y; the registered data set name and ID; a registration date; a data type; summary (including column name, description, type, etc.); metadata; data lineage; data profile; subscription information; collaboration information (shared to me or shared by me); and publication information.

TABLE 6

If a registered data set is a subscribed data item 0328-c, it is associated with subscription data connector 0326 by means of the subscription ID. If a registered data set is a shared data set that other collaborators have shared with the current data user, the registered data set object will contain "shared to me" information, including the collaborators' IDs, the original data set ID, the access permissions (read/write metadata) granted by the data owner, and the data security and privacy access control rules defined by the data owner. The current data user may also share their data sets with other data users by adding collaborators. When the current data user adds a collaborator, a "shared by me" record is created in the registered data set object, allowing the data user to enter collaborator information, modify access permissions (read/write metadata), and set data security and privacy access control rules. The current data user may also publish registered data sets under specific catalogs in order to share data with unspecified data users. If a registered data set is published, the associated dynamic data directory 0620, categories 0622, publication ID, metadata, role-based data security and privacy access control rules, and the subscription approval process provided by the user are recorded in the registered data set object.

Table 7 shows the information contained in the project container 0340. The project container 0340 is a container in which a data user can manage the resources of a project, create a user program 0344, or form a data processing pipeline using data processing tools 0346 to process and analyze data. A data user may create a project in their data user environment object 0312-a, add one or more registered data sets to the project, write programs or assemble data processing tools into a data processing pipeline, and schedule a program or data processing pipeline as a job that performs a particular task. These jobs may be run manually, on a schedule, or triggered by some event. Thus, a project object controls the scheduling and execution of one or more jobs, and records the progress and results of the execution. Each project object is connected to its corresponding data user environment 0312-a.

As shown in table 7, the above project container 0340 may include: a connection to its own data user environment object 0312-a; a project ID; a project name; dates of creation and update; metadata; one or more registered data sets; one or more data pipelines and programs; one or more jobs; a job execution plan; and the execution progress and results.

TABLE 7

Table 8 shows the information contained in published data set 0626. Data users (data owners) may publish their data sets in the dynamic data directory 0620 to share their data with unknown data users (other users may subscribe to the data after the data sets are published). Data users (data consumers) may browse or retrieve the dynamic data directory 0620 and subscribe to data sets that are of interest to themselves.

As shown in table 8, the published data set 0626 may include: a publication ID; a publication name; an associated data set ID and name; the catalog and category under which the data set is published; metadata (e.g., information, attributes, and keywords describing the data, etc.); the role-based data security and privacy access control rules and the subscription approval process defined by the data owner or publisher; and a list of subscribers (including subscriber ID, role, subscribed data item, and subscription time).

TABLE 8

Table 9 shows the information contained in the personalized or role-based data security and privacy access control rule object. As shown in table 9, the object includes the following rules. Rule 1 is a desensitization and data masking rule, through which the data owner can define the data fields that need to be masked for certain user roles. Rule 2 is a transformation rule, which defines a conversion function by which particular data fields are converted for particular user roles. Rule 3 is a filtering rule, through which the data owner can define the data that needs to be filtered out for certain user roles. Rule 4 is a publication restriction rule, through which the data owner can restrict certain user roles from publishing the data. Rule 5 is a time constraint rule, through which the data owner can add time constraints for certain user roles. The security and privacy access control rules in table 9 are only examples, and practical applications may have more rules.

TABLE 9

When a data collaborator attempts to access shared data, or a data subscriber attempts to access a subscribed data set, a virtual data set is first established by the method shown in fig. 10 above and is connected to the shared or published data set (i.e., the original data). Once the virtual data set is created, the original data set may be accessed in the manner of FIG. 11. At this point, the virtual data set converts and filters the original data online according to the personalized or role-based data security and privacy access control rules of table 9 above, and provides the results to the visitor's project.
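The online conversion and filtering step can be sketched as follows. This is only an illustration under stated assumptions: the `AccessRules` container, the `apply_rules` function, the `"***"` mask value, and the representation of rows as dictionaries are all hypothetical and are not defined in the application.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


@dataclass
class AccessRules:
    """Hypothetical container for the five rule kinds of Table 9."""
    masked_fields: List[str] = field(default_factory=list)        # rule 1
    transforms: Dict[str, Callable[[object], object]] = field(
        default_factory=dict)                                     # rule 2
    row_filter: Optional[Callable[[Row], bool]] = None            # rule 3 (True = keep)
    allow_publish: bool = True                                    # rule 4
    valid_until: Optional[datetime] = None                        # rule 5


def apply_rules(rows: Iterable[Row], rules: AccessRules, now: datetime) -> List[Row]:
    """Return the view of `rows` that a visitor holding `rules` may see."""
    if rules.valid_until is not None and now > rules.valid_until:
        raise PermissionError("access window has expired")
    visible: List[Row] = []
    for row in rows:
        # Filtering is evaluated on the original row, before any conversion.
        if rules.row_filter is not None and not rules.row_filter(row):
            continue
        out = dict(row)  # copy, so the original data is never modified
        for name, convert in rules.transforms.items():
            if name in out:
                out[name] = convert(out[name])
        for name in rules.masked_fields:
            if name in out:
                out[name] = "***"
        visible.append(out)
    return visible
```

Because the rules are applied to a copy of each row at read time, the original data set stays unchanged and two visitors with different rules can receive different results from the same source.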

FIG. 13 depicts a scenario of the present application in which data users can securely share their data with each other through publish and subscribe processes. In addition, in the scenario shown in FIG. 13, personalization of shared data and role-based security access control are enforced through virtual dataset objects that are created (instantiated) when a data user initiates data access.

FIG. 13 is similar to FIG. 2. For example, in FIG. 13, data users 1301-a, 1301-b and 1301-c are similar to data users 0203-a, 0203-b, 0203-c in FIG. 2; data user environments 1302-a, 1302-b, 1302-c are similar to data user environments 0221-a, 0221-b, and 0221-c in FIG. 2; the data-sharing directory service subsystem 1304 is similar to the data-sharing directory service subsystem 0230 in FIG. 2; and the virtual data set services subsystem 1305 is similar to the virtual data set services subsystem 0240 in fig. 2.

However, compared with fig. 2, fig. 13 discloses more details about how secure sharing of data is implemented internally. In FIG. 2, individual data users may create items in item containers 0223-a, 0223-b, 0223-c to access data sets 0222-a, 0222-b, 0222-c in the respective data user environments. In FIG. 13, application items 1303-a, 1303-b, 1303-c include the user's applications 0213-a, 0213-b, 0213-c and the items in item containers 0223-a, 0223-b, 0223-c. Each data user may create an application item 1303-a, 1303-b, 1303-c to access (1350-a, 1350-b, 1350-c) the data sets 0222-a, 0222-b, 0222-c in the respective data user environment 1302-a, 1302-b, 1302-c. In an embodiment of the present application, the above-mentioned application items 1303-a, 1303-b, 1303-c may read data in the data sets 0222-a, 0222-b, 0222-c, and may create new data sets in the respective data user environments 1302-a, 1302-b, 1302-c or create new data in the existing data sets 0222-a, 0222-b, 0222-c.

In an embodiment of the present application, the data sets 0222-a, 0222-b, 0222-c in the data user environments 1302-a, 1302-b, 1302-c described above may also constitute data set groups, as illustrated by data set group 0390 in FIG. 3. As previously described, data sets 0330-a, 0330-b, 0330-c and 0330-z may be included in data set group 0390 in FIG. 3. In FIG. 3, some of the data sets of a data user are data sets that the user directly owns, such as data sets 0330-a and 0330-b; some are data sets that collaborators share directly with the data user, such as data set 0330-z; and others may come from the data user's subscriptions through dynamic data catalogs, such as data set 0330-c. Through the data set registration process shown in FIG. 5, data sets directly owned by a data user (personal and corporate data), subscribed data sets, and data sets shared by other collaborators may all be registered into the data user environment 1302-a, 1302-b, 1302-c of the data user. For a subscribed data set, the role-based security and privacy access control rules are defined by the data owner during the publishing process of the data. For a data set shared by a collaborator, the personalized security and privacy access control rules are defined by the data owner when adding the collaborator.

In an embodiment of the present application, as shown in FIG. 13, data user 0203-a may publish 1310-a one or more data sets 0222-a to a dynamic data directory 0233. During data set publication, data user 0203-a should define role-based security and privacy access control rules for data set subscribers. The data user can further define the subscription approval process of the data set. Similarly, data users 0203-b and 0203-c may also publish 1310-b, 1310-c their data sets 0222-b, 0222-c. For the publishing process of the data set, refer to fig. 7 described above. In embodiments of the present application, published data sets may also include data sets created by application items 1303-a, 1303-b, 1303-c, which may be created from data sets owned by the data users themselves, data sets shared by other collaborators, and/or subscribed data sets.

As shown in FIG. 13 above, the data sets 0222-a, 0222-b, 0222-c from data users 0203-a, 0203-b, 0203-c can include data sets 1320-a, 1320-b, 1320-c subscribed to by data users from the dynamic data catalog module 0233. In embodiments of the present application, a data set owner may define a set of role-based security and privacy access control rules for published data sets. These rules are stored in the dataset subscription object. Different data user roles can see different data according to these role-based security and privacy access control rules. Once the subscription process is complete, the data users 0203-a, 0203-b, 0203-c can register the subscribed data sets in their data user environments. The process of subscribing to a data set may refer to fig. 8 described above.

As shown in FIGS. 2 and 13, data users 0203-a, 0203-b, 0203-c can also collaborate by sharing 0213 their data set directly with collaborators. In fig. 3, data sets 0330-a, 0330-b, 0330-x, 0330-y, 0330-z shared in a collaborative manner are shown. During the sharing of a data set, the data owner may define personalized security and privacy access control rules for its collaborators.

In embodiments of the present application, when data users 0203-a, 0203-b, 0203-c share their data through direct data sharing or through publish and subscribe processes, they can define personalized (for direct data sharing) and role-based (for publish and subscribe) security and privacy access control rules for collaborators and data subscribers. Because these personalized and role-based security and privacy access control rules are enforced, different data subscribers and collaborators accessing the same shared data set can see different information. When an application item 1303-a, 1303-b, 1303-c initiates a connection with one of the data sets 0222-a, 0222-b, 0222-c of a data user, the data set object automatically creates (instantiates) a virtual data set 0235 based on the role of the data visitor (i.e., the item owner) and loads the above-defined personalized or role-based security and privacy access control rules into the created virtual data set 0235. When an application item 1303-a, 1303-b, 1303-c accesses data through a data set 0222-a, 0222-b, 0222-c, the actual data access is performed by the created virtual data set 0235. Specifically, the corresponding virtual data set 0235 accesses the actual data, transforms it according to data transformation logic determined by the personalized or role-based security and privacy access control rules, and then forwards (1330-a, 1330-b, 1330-c) the transformed data back to the application items 1303-a, 1303-b, 1303-c via the corresponding data sets 0222-a, 0222-b, 0222-c.
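The on-demand creation of a virtual data set can be illustrated with the sketch below. It assumes, purely for illustration, that a data set keeps one transformation function per role and that rows are dictionaries; the class and method names (`DataSet`, `VirtualDataSet`, `connect`, `read`, `disconnect`) are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[Row], Row]


class VirtualDataSet:
    """Per-visitor view that transforms the actual data on every read."""

    def __init__(self, source_rows: List[Row], transform: Transform) -> None:
        self._source_rows = source_rows
        self._transform = transform

    def read(self) -> List[Row]:
        # Each row is copied first so the actual data is never altered.
        return [self._transform(dict(row)) for row in self._source_rows]


class DataSet:
    """Data set object that instantiates a virtual data set on connection."""

    def __init__(self, rows: List[Row], rules_by_role: Dict[str, Transform]) -> None:
        self._rows = rows
        self._rules_by_role = rules_by_role
        self.active: Dict[str, VirtualDataSet] = {}  # visitor id -> virtual data set

    def connect(self, visitor: str, role: str) -> None:
        if role not in self._rules_by_role:
            raise PermissionError(f"no access rules defined for role {role!r}")
        # The virtual data set exists only from connection to disconnection.
        self.active[visitor] = VirtualDataSet(self._rows, self._rules_by_role[role])

    def read(self, visitor: str) -> List[Row]:
        # The data set delegates the actual access to the virtual data set.
        return self.active[visitor].read()

    def disconnect(self, visitor: str) -> None:
        self.active.pop(visitor, None)
```

In this sketch nothing is retained once every visitor has disconnected, which mirrors the point that no virtual data set needs to be maintained while no data user is accessing the data.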

In embodiments of the present application, the application items 1303-a,1303-b,1303-c may further create new data sets 0222-a, 0222-b, 0222-c by combining information from multiple data sets. The sources of the above information may be data sets shared by collaborators, data sets subscribed to by data users, and data sets owned by data users themselves. Moreover, the application items 1303-a,1303-b,1303-c may further publish 1310-a, 1310-b, 1310-c the created new data sets 0222-a, 0222-b, 0222-c to the dynamic data directory 0233 as long as the personalized and role-based security and privacy access control rules allow publication of the data.

Through the above mechanism, a data owner can manage data users' access to its shared data according to their roles and collaborator status, and new data can be generated recursively by combining data shared by collaborators, subscribed data, and data owned by the data users themselves.

FIG. 14 shows a process by which data users share their data while maintaining access control to their shared data as described in an embodiment of the present application. Through the process shown in FIG. 14, new data sets may be recursively generated and shared in turn.

In fig. 14, reference numerals 1401-a and 1401-b mark the beginning of the operational process for two different data users (data user A and data user B).

In the process shown in FIG. 14, data servers are added to the data user environments of the two data users, respectively, at steps 1402-a and 1402-b.

In steps 1403-a and 1403-b, the data sets selected by the two data users from one of the added data servers are received, respectively.

In steps 1404-a and 1404-b, the selected data sets are registered in the data user environments of the two data users, respectively.

It should be noted that the above process is repeatable: as long as other data servers can be added, there are other data sets that can be registered. Since the complete scenario of the present invention is too complex to be depicted in a single flow chart, the process shown in fig. 14 is only one example of the present application. The application does not limit the execution sequence of the above steps; that is, the above steps may also be executed concurrently. For example, while data sets are being registered, the data users (data user A and data user B) may continue to add more data servers.

When there are one or more registered data sets, the data user may choose to publish their data sets to the dynamic data catalog, as can be seen in steps 1405-a and 1405-b. As part of the data publication preparation, the data user may provide metadata, prepare role-based security and privacy access control rules, and define a subscription approval process. Finally, the data user can publish the metadata, the role-based security and privacy access control rules, and the subscription approval process into the dynamic data directory.

The data sets published by the data users (user A and user B) may be subscribed to by other data users, as can be seen in steps 1406-b to 1407-a and steps 1406-a to 1407-b. A subscribed data set may be added to the subscribing user's registered data sets, as shown in steps 1407-a through 1404-a and 1407-b through 1404-b.

Data users (user A and user B) may also collaborate with each other and share their data sets directly by defining personalized security and privacy access control rules, as shown in steps 1415-a and 1415-b. This process is also shown in fig. 5. In steps 1415-a through 1404-b and 1415-b through 1404-a of FIG. 14, the directly shared data set is added to the collaborator's registered data sets.
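The publish and subscribe steps can be sketched with a small in-memory catalog. This is an assumption-laden illustration: the `DynamicDataCatalog` class, its method names, and the modelling of the subscription approval process as a callable are hypothetical choices, not the structure defined in the application.

```python
from typing import Callable, Dict, List, Tuple


class DynamicDataCatalog:
    """Hypothetical in-memory stand-in for the dynamic data catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}

    def publish(self, publication_id: str, data_set_id: str, metadata: dict,
                role_rules: Dict[str, object],
                approve: Callable[[str, str], bool]) -> None:
        """Publish metadata, role-based rules, and an approval process (step 1405)."""
        self._entries[publication_id] = {
            "data_set_id": data_set_id,
            "metadata": metadata,
            "role_rules": role_rules,
            "approve": approve,
            "subscribers": [],
        }

    def subscribe(self, publication_id: str, subscriber_id: str, role: str) -> dict:
        """Run the approval process and return a subscribed data item (steps 1406-1407)."""
        entry = self._entries[publication_id]
        if role not in entry["role_rules"]:
            raise PermissionError(f"no rules published for role {role!r}")
        if not entry["approve"](subscriber_id, role):
            raise PermissionError("subscription was not approved")
        entry["subscribers"].append((subscriber_id, role))
        # The returned item is what the subscriber would then register (step 1404).
        return {"publication_id": publication_id,
                "data_set_id": entry["data_set_id"],
                "role": role}

    def subscribers(self, publication_id: str) -> List[Tuple[str, str]]:
        return list(self._entries[publication_id]["subscribers"])
```

The subscriber never receives the data itself at subscription time, only a reference that can be registered and later resolved under the subscriber's role.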

In steps 1408-a and 1408-b, the project of the data user is connected to one or more registered data sets to merge, clean, analyze and create new data sets. The process of connecting to each data set may be as shown in fig. 10. Once connected to the data set, the items read data from or write data to the data set at steps 1409-a and 1409-b. The process of reading data from or writing data to a data set can be described with reference to fig. 11 and 12.

At steps 1410-a and 1410-b, the project accesses the one or more registered data sets to which it is connected and generates a new data set through merging, cleaning, analysis, and other operations.

When a new data set is generated for a project, the data user environment will automatically register the newly generated data set at steps 1410-a through 1404-a and 1410-b through 1404-b.

If security and privacy access control rules allow, the newly generated data set may be further published to a dynamic data directory at steps 1405-a and 1405-b or shared with other collaborators at steps 1415-a and 1415-b.

It should be noted that the above process of adding registered data sets, and then sharing and creating new data sets, is continuous and ongoing. The newly generated data sets may be published to a dynamic data catalog to be subscribed to by other data users, who may then combine the subscribed data sets with their own data sets and with data sets shared by other collaborators to generate new data sets again, and may then publish or share those new data sets in turn.

It should be noted that fig. 14 shows an example with only two data users. In a practical application scenario, multiple data users may share their data sets with multiple other data users at the same time. Data owners can share their data either directly with others or through publication, while retaining full control over the personalized and role-based security and privacy access rules through the mechanisms shown in fig. 9, 11 and 12. In this way, the present application supports recursive generation of new data sets through combinations of newly generated and shared data among data users.

In an embodiment of the present application, the data set object may include metadata and a service method as shown in tables 1 to 9. FIG. 15 shows an example of a data set object according to an embodiment of the present application.

As previously described, a data user may generate project containers 0340-a, 0340-b, … in their data user environment 0312-a. Project resources are managed within project containers 0340-a, 0340-b, etc. In a project container, a data user can select 0342 one or more data sets 0330-a, 0330-b, 0330-c, 0330-z, create a user program 0344, and/or use a data processing tool 0346 to process and analyze data, as well as generate data reports or new data 0342 in the data sets. Some data sets may be used for data input, some for data output (generating new data sets), and some for both input and output. User programs 0344 and data processing tools 0346 in the project container may generate new data sets in the registered data set pool 0390. In embodiments of the present application, the above-described project container may also be referred to as a project container object. Fig. 21 shows an example of a project container object.

FIG. 3 also shows an example of sharing data sets 0330-x in a collaborator project container 0340-x by data users/collaborators 0312-c. In FIG. 3, data sets 0330-z shared by collaborators 0312-d may also be used in a collaborator project container 0340-y. Moreover, collaborators 0312-d may also share project container 0340-y to users of data user environment 0312-a.

Various data set services will be described in detail below in conjunction with the accompanying drawings.

FIG. 15 shows an example of the internal structure of a data set object according to an embodiment of the present application. As shown in fig. 15, one data set object 1502 may include: a metadata management module 1510, a collaborator management module 1520, a data profiling module 1530, and a data lineage module 1540. Note that the data sets 0330-a, 0330-b, 0330-c, etc. in fig. 3 are also examples of the data set object 1502.

The metadata management module 1510 is configured to manage metadata 1550 of the data set. In an embodiment of the present application, the data set metadata may be as shown in table 10 below, and specifically includes:

TABLE 10

The foregoing table 6 also shows an example of metadata for a data set. The metadata management module 1510 is used to manage and store metadata of the data set object. Through the metadata, the data set object 1502 links to its own data user environment (e.g., 0312-a), original data items (e.g., 0328-a, 0328-b, 0328-c, …), subscription data, collaborators, and data published in a directory. The metadata also includes a description of the data content, such as data format, schema, attributes, tags, security level, and privacy level. Still further, the data profiling module 1530 can generate data profiling information that provides a summary of the data content to the data user in a visual manner, and the data lineage module 1540 may generate a data lineage graph. In embodiments of the present application, the data profiling information and data lineage graphs are also managed by the metadata management module 1510.

The collaborator management module 1520 is used to manage collaborators. It allows data users to share the current data set with other data users (collaborators), and allows the data owner to add collaborators, delete collaborators, change sharing rules, modify personalized security and privacy access control rules, and so on. FIG. 5b shows an example of a process for adding collaborators by the collaborator management module 1520 according to an embodiment of the present application. It should be noted that fig. 5b is only an example, and in another embodiment, collaborator management may be performed outside the data set. For example, the collaborator management module 1520 may be provided as part of a data user environment. Once a collaborator is added, the collaborator information is sent to the metadata management module 1510, which stores it in the data set object.

FIG. 16 shows an example of an operational process 1602 of the data profiling module 1530 of a data set according to one embodiment of the present application. The data profiling module 1530 described above allows a data user to add data profiling methods (1604, 1606) and examine data content (1604, 1608). For example, the user may check whether a data field contains only unique values and therefore whether the field can serve as a key field. As another example, the user may examine the value distribution, average, etc. of a data field. Embodiments of the present application may have built-in data profiling methods, as shown in steps 1604 and 1606 in FIG. 16. The data profiling module 1530 also allows a data user to add custom data profiling methods. Each data profiling method includes a name, the types of data that the method can examine, and the specific algorithm that examines the data fields. For example, an algorithm that checks the time stamps of a particular data set can only be used to process data types such as date and time. To perform data inspection, the data profiling module 1530 allows a data user to select the portions of data and the data fields to be reviewed or studied in step 1608. At step 1610, the data user selects a data profiling method from the methods provided by the system and the methods customized by the data user. The data profiling module 1530 executes the selected method on the selected data and returns the generated data profiling result to the metadata management module 1510, which stores the result in the data set object at step 1612. The data analysis process can be triggered by the data user or triggered automatically by the data profiling module 1530. In addition, specific data types and specific profiling methods may be bound together. In this way, the data profiling module 1530 of the data set object can automatically perform the data analysis process and automatically generate the data profiling result. It should be noted that fig. 16 shows only one embodiment of the present application. In further embodiments, the data profiling service may be implemented external to the data set. For example, it may be provided as part of a data user environment.
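A profiling method as described above (a name, the data types it can examine, and an algorithm) can be sketched as follows. The `ProfilingMethod` record, the `BUILT_IN` registry, and `run_profile` are hypothetical names introduced only for this illustration, and the two built-in methods are examples chosen to match the uniqueness and average checks mentioned in the text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class ProfilingMethod:
    name: str
    accepted_types: Tuple[type, ...]        # data types the method can examine
    algorithm: Callable[[Sequence], object]  # algorithm applied to a data field


# Assumed built-in methods: a key-candidate check and an average.
BUILT_IN: Dict[str, ProfilingMethod] = {
    "is_unique": ProfilingMethod(
        "is_unique", (int, str), lambda values: len(set(values)) == len(values)),
    "average": ProfilingMethod(
        "average", (int, float), lambda values: mean(values)),
}


def run_profile(methods: Dict[str, ProfilingMethod], method_name: str,
                values: Sequence) -> object:
    """Run the selected method on the selected field values."""
    method = methods[method_name]
    if not all(isinstance(v, method.accepted_types) for v in values):
        raise TypeError(f"{method.name} cannot examine this data type")
    return method.algorithm(values)
```

A custom method is added by inserting another `ProfilingMethod` into a copy of the registry; binding a method to a data type for automatic profiling could then be a lookup over `accepted_types`.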

Fig. 18, 19, and 20 show an example of the data lineage module 1540 of a data set according to an embodiment of the application. The data lineage module 1540 may generate a data lineage graph for the data set. The generated data lineage graph may be stored in the data set object and managed by the metadata management module 1510.

FIG. 17 shows an example of the data lineage of a data set object (i.e., data set A 1701) according to one embodiment of the present application. In this data lineage graph, the left-hand portion shows the ancestors 1703 of data set A, and the right-hand portion shows the descendants of data set A. The data content of the ancestors, also referred to as ancestor objects 1703, is the source of the data content of data set A (e.g., the data items of data set A). That is, the data content of data set A is derived from the data content of its ancestors. Likewise, the data content of the descendants of data set A is derived from the data content of data set A. For example, a descendant of data set A (data set M) may be the product of the data of data set A and the data of other data sets. Those other data sets are not shown in the data lineage graph of fig. 17.

The flow illustrated in FIG. 18 begins after the data lineage module 1540 receives a reference to a data set (e.g., data set A). The reference may be an identification, name, address or label of the data set, or the like. In FIG. 18, at step 1804, a new data lineage graph is created with a single node (data set A). In this graph, data set A 1701 initially has neither ancestors nor descendants. A cursor may also be placed at the location of data set A on the graph at step 1804. Data set A is then referred to as the cursored data set, and a data lineage graph centered on data set A is subsequently generated. At step 1806, the entire ancestor portion is added to the left of data set A on the data lineage graph. At step 1808, the entire descendant portion is added to the right of data set A on the data lineage graph.

FIG. 19 shows an example process 1902 for adding the ancestor portion of the lineage graph to a cursored data set, according to one embodiment of the present application.

In step 1903, it is checked where the cursored data set came from.

In embodiments of the present application, a data set may come from several places:

(1) direct registration (e.g., data set 0330-a or 0330-b in FIG. 3) of a data item of the data owner (e.g., 0328-a or 0328-b in FIG. 3) from either the data owner's local data 0321 or the data server 0324 (obtained through data connector 0323);

(2) sharing by collaborators (e.g., data set 0330-z shared by collaborator 0312-d in FIG. 3); or

(3) subscription (e.g., data set 0330-c in FIG. 3, made up of subscribed data item 0328-c).

If the cursored data set is a directly registered data set, at step 1910, the data lineage module 1540 detects whether the data set contains generated data (e.g., the data set is an output data set). If the data set is an output data set, at step 1912, the data lineage module 1540 locates the project container that generated the contents of the cursored data set and then finds all of the input data sets that were used to generate those contents. Taking FIG. 17 as an example, the cursored data set is currently data set A 1701. The data lineage module 1540 finds that data set A 1701 is integrated from data set I 1722, data set J-1 1728, and data set K-1 1734. Thus, as shown in FIG. 17, the data lineage module 1540 adds data set I 1722, data set J-1 1728, and data set K-1 1734 and their ancestors to the left of data set A through subsequent steps 1912, 1914, 1916, and 1918.

Taking data set I 1722 as an example, at step 1914, data set I 1722 is the first data set added to the left of data set A 1701 (the cursored data set). The cursor is then set at data set I 1722. At step 1916, building the ancestors of data set I 1722 (the currently cursored data set) begins, and the process returns to step 1902. Since data set I 1722 is also a directly registered data set, the process passes through step 1903 and continues to step 1910. In this case, the currently cursored data set I 1722 is not an output data set, so step 1920 is performed next, adding the data source to the left of the cursored data set. In this step, as shown in FIG. 17, the data connector 1720 of the current data user environment (user 1) is added to the left of data set I 1722.

Returning now to step 1914, data set J-1 1728 is added to the left of data set A 1701 and is then set as the cursored data set. Step 1916 is then performed to build the ancestors of data set J-1 1728; that is, the process returns to step 1902. In step 1903, the data lineage module 1540 finds that data set J-1 1728 is a shared data set and therefore jumps to step 1930. In this step, the real data set J 1726 shared by user 2 is found in user 2's data user environment. At step 1932, data set J 1726 is added to the left of data set J-1 1728 and set as the cursored data set. Step 1934 is then executed to build the ancestors of data set J 1726 (the currently cursored data set), and the process returns to step 1902. Since data set J 1726 is a data set directly registered in user 2's data user environment, steps 1903 and 1910 are performed next. Since the currently cursored data set J 1726 is not an output data set, the flow jumps directly to step 1920, adding the data source to the left of the cursored data set. In this step, as shown in FIG. 17, user 2's local data set 1724 is added to the left of data set J 1726.

Returning again to step 1914, data set K-1 1734 is added to the left of data set A 1701 and is set as the cursored data set. Then, at step 1916, building the ancestors of data set K-1 1734 (the currently cursored data set) begins, and the process returns to step 1902. In step 1903, the data lineage module 1540 finds that data set K-1 1734 is a subscription data set and therefore jumps to step 1940. In this step, the real data set K 1732 published by user 3 is found in the data user environment of user 3. Note that because user 3 published data set K 1732 and user 1 subscribed to it, data set K-1 1734 was generated within the data user environment of user 1. At step 1942, data set K 1732 is added to the left of data set K-1 1734 and is set as the cursored data set. Step 1944 is then performed to build the ancestors of data set K 1732 (the currently cursored data set), and the process returns to step 1902 again. Since data set K 1732 is a data set directly registered in user 3's data user environment, steps 1903 and 1910 are performed next. The currently cursored data set K 1732 is not an output data set, so the flow jumps to step 1920, adding the data source to the left of the cursored data set. In this step, as shown in FIG. 17, user 3's data connector K 1730 is added to the left of data set K 1732.
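
The ancestor walk of FIG. 19 can be sketched as a short recursive routine. The following is a minimal illustration in Python and not the implementation of the present application: the names `Origin`, `DataSet`, and `add_ancestors` are hypothetical, and the graph is reduced to a list of (ancestor, descendant) pairs instead of a drawn layout with a cursor.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Origin(Enum):
    REGISTERED = "registered"  # registered directly from a data source
    SHARED = "shared"          # received from a collaborator
    SUBSCRIBED = "subscribed"  # created by subscribing to a published data set


@dataclass
class DataSet:
    name: str
    origin: Origin = Origin.REGISTERED
    source: Optional[str] = None           # data source of a plain registered set
    inputs: List["DataSet"] = field(default_factory=list)  # set only for output data sets
    upstream: Optional["DataSet"] = None   # real data set behind a shared/subscribed one


Edge = Tuple[Optional[str], str]  # (ancestor, descendant)


def add_ancestors(cursor: DataSet, edges: List[Edge]) -> None:
    """Record every ancestor of `cursor` as (ancestor, descendant) edges."""
    if cursor.origin in (Origin.SHARED, Origin.SUBSCRIBED):
        # Steps 1930-1934 and 1940-1944: follow the link to the real data set.
        real = cursor.upstream
        edges.append((real.name, cursor.name))
        add_ancestors(real, edges)
    elif cursor.inputs:
        # Steps 1912-1918: an output data set, so walk each input data set.
        for parent in cursor.inputs:
            edges.append((parent.name, cursor.name))
            add_ancestors(parent, edges)
    else:
        # Step 1920: the data source is the left-most ancestor.
        edges.append((cursor.source, cursor.name))


# The ancestor side of FIG. 17, rebuilt by hand for the example.
i = DataSet("I", source="data connector 1720")
j1 = DataSet("J-1", Origin.SHARED,
             upstream=DataSet("J", source="user 2 local data 1724"))
k1 = DataSet("K-1", Origin.SUBSCRIBED,
             upstream=DataSet("K", source="data connector K 1730"))
a = DataSet("A", inputs=[i, j1, k1])

edges: List[Edge] = []
add_ancestors(a, edges)
for ancestor, descendant in edges:
    print(f"{ancestor} -> {descendant}")
```

Each recursive call plays the role of moving the cursor and returning to step 1902; the recursion ends when a data source is reached.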

With the above operations, i.e., step 1806 of FIG. 18, the entire ancestor graph of data set A 1701 is created. The entire descendant graph of data set A 1701 may then be created at step 1808. FIG. 20 shows an example of a process 2002 for adding a descendant lineage graph to a cursored data set according to one embodiment of the present application.

Still taking FIG. 17 as an example, data set A 1701 is set as the cursored data set. At step 2010 of FIG. 20, the data lineage module 1540 checks whether the cursored data set (data set A 1701) is used as input in any project container to create a new output data set. All such new data sets are descendants of data set A 1701. If the check at step 2010 is positive, subsequent steps 2012, 2014, 2016, 2018, and 2020 are performed, going through all project containers to add the new output data sets to the right of data set A 1701. After the output data sets of data set A 1701 have been checked, step 2030 checks whether data set A 1701 is shared with other collaborators. If so, subsequent steps 2032, 2034, 2036, and 2038 are performed to add the corresponding data sets in the collaborators' data user environments to the graph as descendants. Further, in step 2050, the data lineage module 1540 also checks whether data set A 1701 has been published. If so, subsequent steps 2052, 2054, 2056, 2058, and 2060 are performed, going through all publications and adding the published data sets to the lineage graph as descendants at step 2054.

The following paragraphs give a detailed description of the implementation process of each step in fig. 20 by taking fig. 17 as an example.

In step 2010, the data lineage module 1540 checks whether the cursored data set A 1701 has an output data set (derived data set). If so, steps 2012 to 2020 are performed for each project container and, within each project container, for each output data set that has the cursored data set as input. Specifically, in step 2014, the data lineage module 1540 adds the output data set to the right of the cursored data set. In FIG. 17, data set L 1760 is the only derived data set, so data set L 1760 is added to the right of data set A 1701. Then, in step 2016, data set L 1760 is checked for descendants by looping back to step 2002, and any descendants of data set L 1760 are added. In this example, data set L 1760 has no descendants.

In step 2030, the data lineage module 1540 checks whether the cursored data set (currently data set A 1701) is shared with other collaborators. If so, steps 2032 to 2038 are performed for each collaborator. At step 2034, the data lineage module 1540 looks up the corresponding data set in the collaborator's data user environment and adds it to the right of the cursored data set. In this example, the cursored data set is data set A 1701, which is shared with user 4. Thus, in step 2034, the corresponding data set A-1 1762 is added to the right of data set A 1701. Next, step 2036 returns to step 2002 to add the descendants of data set A-1 1762 (now the cursored data set). User 4 has created a new data set M 1764 using the shared data set A-1 1762. Thus, by performing steps 2002, 2010, 2012, and 2014, data set M 1764 is discovered and added to the right of data set A-1 1762.

In step 2050, the data lineage module 1540 checks whether the cursored data set (currently data set A 1701) is published. If so, the data lineage module 1540 examines each published data set in turn in steps 2052 to 2060. For each published data set, each subscription is checked in turn at steps 2054 to 2056 and 2058. Taking FIG. 17 as an example, data set A 1701 is published and has two subscribers, user 5 and user 6. At step 2054, the subscription data set A-2 1766 of user 5 and the subscription data set A-3 1768 of user 6 are added to the right of data set A 1701. Next, at step 2056, for each of the two subscription data sets A-2 1766 and A-3 1768, the data lineage module 1540 returns to step 2002 to discover its descendants. To find the descendants of data set A-2 1766, data set A-2 1766 is set as the cursored data set before returning to step 2002. Since user 5 did not perform any operations on data set A-2 1766, it has no descendants. Likewise, to find the descendants of data set A-3 1768, data set A-3 1768 is set as the cursored data set before returning to step 2002. User 6 has created a new data set N 1770 using data set A-3 1768. Data set N 1770 is therefore found through steps 2010, 2012, and 2014 and is added to the right of data set A-3 1768.
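
The descendant walk of FIG. 20 follows the same recursive pattern in the other direction. The sketch below is illustrative only: `DataSet`, its three list fields, and `add_descendants` are hypothetical names, and the lookups that the data lineage module performs in project containers, collaborator environments, and publications are replaced by lists filled in by hand. It also assumes the lineage contains no cycles.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DataSet:
    name: str
    outputs: List["DataSet"] = field(default_factory=list)        # derived in project containers
    shared_copies: List["DataSet"] = field(default_factory=list)  # copies held by collaborators
    subscriptions: List["DataSet"] = field(default_factory=list)  # copies held by subscribers


def add_descendants(cursor: DataSet, edges: List[Tuple[str, str]]) -> None:
    """Record every descendant of `cursor` as (ancestor, descendant) edges."""
    # Output data sets (steps 2010-2020), then shared copies (steps 2030-2038),
    # then subscription data sets (steps 2050-2060).
    for children in (cursor.outputs, cursor.shared_copies, cursor.subscriptions):
        for child in children:
            edges.append((cursor.name, child.name))
            add_descendants(child, edges)  # back to step 2002 with a new cursor


# The descendant side of FIG. 17.
a1 = DataSet("A-1", outputs=[DataSet("M")])  # shared with user 4, who derived M
a2 = DataSet("A-2")                          # subscribed by user 5, unused
a3 = DataSet("A-3", outputs=[DataSet("N")])  # subscribed by user 6, who derived N
a = DataSet("A", outputs=[DataSet("L")], shared_copies=[a1], subscriptions=[a2, a3])

edges: List[Tuple[str, str]] = []
add_descendants(a, edges)
for ancestor, descendant in edges:
    print(f"{ancestor} -> {descendant}")
```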

Through the processes shown in FIG. 19 and FIG. 20, a complete data lineage graph can be created for data set A; that is, step 1810 in FIG. 18 is implemented.

FIG. 21 illustrates the internal structure of a project container object according to one embodiment of the present application. As shown in FIG. 21, a typical project container object 2102 includes a collaborator management module 2110, a project container manager 2115, a task management module 2150, and one or more data set objects 2120. The project container object 2102 may further include programs 2130 and process pipelines 2140.

FIG. 3 shows an example of a data user environment. The data user environment includes a plurality of project container objects 0340-a, 0340-b, 0340-x, 0340-y, and so on. In the example shown in FIG. 3, project container objects 0340-a and 0340-b belong to user 0312-a; project container object 0340-x belongs to user 0312-c; and project container object 0340-y belongs to user 0312-d.

FIG. 22 shows an example flow 2202 in which the project container collaborator management module 2110 adds collaborators to the project container 2102, according to one embodiment of the present application.

At step 2204, the project container collaborator management module 2110 checks whether all data sets in the project container 2102 are sharable.

If there is a data set 2120 that cannot be shared, no collaborators 2220 can be added. If all data sets 2120 can be shared, step 2206 is performed.

At step 2206, the project container collaborator management module 2110 locates the collaborator's data user environment based on the user-provided collaborator information.

At step 2208, the rights set by the user for collaborators to use the project container 2102 are received.

The above rights include: whether collaborators are allowed to edit metadata in the project container 2102; whether collaborators are allowed to edit the data content of the data sets; whether collaborators are allowed to edit the programs 2130, process pipelines 2140, and tasks 2152; and whether collaborators are allowed to execute tasks 2152, etc.

At step 2210, the collaborators are added to the project container 2102.

At step 2212, the project container 2102 is added to the collaborator's data user environment.

At steps 2214 to 2218, the project container collaborator management module 2110 traverses all data sets 2120 in the project container 2102 and adds the collaborators to each data set 2120 by invoking S0560 in FIG. 5b.
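
The flow of FIG. 22 can be summarized in a few lines of code. This is a sketch under stated assumptions, not the implementation of the present application: `Permissions`, `DataSet`, `ProjectContainer`, and `add_collaborator` are hypothetical names, a data user environment is reduced to a list of project container names, and the call to S0560 is reduced to adding the collaborator's name to a set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Permissions:
    edit_metadata: bool = False
    edit_data: bool = False
    edit_programs: bool = False  # programs, process pipelines, and tasks
    run_tasks: bool = False


@dataclass
class DataSet:
    name: str
    sharable: bool = True
    collaborators: Set[str] = field(default_factory=set)


@dataclass
class ProjectContainer:
    name: str
    data_sets: List[DataSet] = field(default_factory=list)
    collaborators: Dict[str, Permissions] = field(default_factory=dict)


def add_collaborator(container: ProjectContainer, collaborator: str,
                     perms: Permissions,
                     environments: Dict[str, List[str]]) -> bool:
    # Step 2204: every data set in the container must be sharable.
    if not all(ds.sharable for ds in container.data_sets):
        return False
    # Step 2206: locate the collaborator's data user environment.
    env = environments[collaborator]
    # Steps 2208-2210: record the rights and add the collaborator.
    container.collaborators[collaborator] = perms
    # Step 2212: make the container visible in the collaborator's environment.
    env.append(container.name)
    # Steps 2214-2218: add the collaborator to each data set.
    for ds in container.data_sets:
        ds.collaborators.add(collaborator)
    return True


envs: Dict[str, List[str]] = {"user-d": []}
ok = ProjectContainer("p1", [DataSet("s1"), DataSet("s2")])
locked = ProjectContainer("p2", [DataSet("s3", sharable=False)])

added = add_collaborator(ok, "user-d", Permissions(run_tasks=True), envs)
refused = add_collaborator(locked, "user-d", Permissions(), envs)
print(added, refused, envs)
```

Checking sharability before any state is changed keeps the operation all-or-nothing: a container with one unsharable data set is left untouched.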

It should be noted that fig. 22 shows only one embodiment of the present application, and in this example, the process of deleting collaborators is not shown.

FIG. 23 illustrates an example project container management process 2302 of the project container management module 2115 according to an embodiment of the present application. In this example, the project container management module 2115 may determine an operation type at step 2304. The operation types include: managing project container metadata 2116 (e.g., as shown in Table 7); adding or deleting data sets 2117, 2120; uploading or selecting 2118 a program 2130; and managing process pipelines 2119, 2140.

At step 2310, the project container management module 2115 manages the editing of the project container metadata (e.g., as shown in Table 7).

At step 2320, the project container management module 2115 adds a new data set 2120 to the project container 2102.

At steps 2322 to 2326, the project container management module 2115 traverses the collaborators in the project container 2102.

At step 2324, the project container management module 2115 adds collaborators to the new data set using the process illustrated in FIG. 5b.

Step 2330 deletes an existing data set 2120 from the project container 2102. The corresponding shared data set is also deleted from each collaborator's environment.
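
The add and delete branches of FIG. 23 both keep the collaborators' environments in step with the project container. The following Python sketch illustrates that bookkeeping only; `ProjectContainer`, `add_data_set`, `delete_data_set`, and the `shared` mapping (collaborator name to the names of data sets shared with that collaborator) are hypothetical simplifications, and the FIG. 5b process is reduced to a set insertion.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ProjectContainer:
    collaborators: Set[str] = field(default_factory=set)
    data_sets: Set[str] = field(default_factory=set)


def add_data_set(container: ProjectContainer, name: str,
                 shared: Dict[str, Set[str]]) -> None:
    container.data_sets.add(name)                     # step 2320
    for collaborator in container.collaborators:      # steps 2322-2326
        shared.setdefault(collaborator, set()).add(name)  # step 2324


def delete_data_set(container: ProjectContainer, name: str,
                    shared: Dict[str, Set[str]]) -> None:
    container.data_sets.discard(name)                 # step 2330
    for collaborator in container.collaborators:
        # Also remove the shared data set from the collaborator's environment.
        shared.get(collaborator, set()).discard(name)


shared: Dict[str, Set[str]] = {}
container = ProjectContainer(collaborators={"user-c", "user-d"})
add_data_set(container, "sales", shared)
add_data_set(container, "stores", shared)
delete_data_set(container, "sales", shared)
print(container.data_sets, shared)
```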

Step 2340 uploads a program 2130 to the project container 2102.

Specifically, in some embodiments of the present application, the project container management module 2115 may first receive a reference to the program 2130 and then upload the program 2130 to the project container 2102. The reference of the program indicates the location of the program, and may specifically be an address, a unique identifier, a name, or the like of the program.

In further embodiments, the project container management module 2115 allows programs that have already been uploaded to be selected for use in the project container 2102.

At step 2350, the project container management module 2115 allows the user to create or edit process pipelines using tools already in the system. If a new process pipeline is created, the name of the process pipeline is set.

Once the data sets 2120, programs 2130, and/or process pipelines 2140 are determined, the task management module 2150 of the project container 2102 allows tasks 2152 to be created. Each task 2152 includes one or more programs 2130-a, 2130-b or process pipelines 2140-c, one or more input data sets 2120-a, 2120-b, and one or more output data sets 2120-x, 2120-y, 2120-z. Once a task is established, it may be executed to generate a report or new data sets 2120-x, 2120-y, 2120-z.
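
A task of this kind can be pictured as a small record that ties programs to named input and output data sets. The sketch below is an assumption-laden illustration: the `Task` class, the `Program` signature (a function from named data sets to newly produced named data sets), and the in-memory `store` are all hypothetical and are not defined by the present application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[dict]
# Hypothetical program signature: named data sets in, new named data sets out.
Program = Callable[[Dict[str, Rows]], Dict[str, Rows]]


@dataclass
class Task:
    programs: List[Program]
    inputs: List[str]
    outputs: List[str]

    def run(self, data_sets: Dict[str, Rows]) -> None:
        # Only the declared inputs are visible to the programs.
        working = {name: data_sets[name] for name in self.inputs}
        for program in self.programs:
            working.update(program(working))
        # Only the declared outputs are written back as new data sets.
        for name in self.outputs:
            data_sets[name] = working[name]


def keep_adults(ds: Dict[str, Rows]) -> Dict[str, Rows]:
    return {"adults": [row for row in ds["people"] if row["age"] >= 18]}


store: Dict[str, Rows] = {
    "people": [{"name": "Ann", "age": 34}, {"name": "Bo", "age": 12}],
}
Task(programs=[keep_adults], inputs=["people"], outputs=["adults"]).run(store)
print(store["adults"])
```

Because inputs and outputs are declared by name, the same record also supplies what the data lineage module needs in order to link an output data set to its input data sets.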

Corresponding to the data sharing system, the present application further provides a data user environment of the multi-user collaborative data management system, where the data user environment may include:

one or more data connectors;

one or more data directories;

one or more data sets;

one or more collaborators; and

a data user environment service to:

The data user environment service may be further configured to: receive a user request to register a data item selected by a data user to the data user environment, wherein the data item is a data item selected from the data connector or a subscription data item selected from the data catalog; create a data set; and associate the data set with the data item.

The data user environment service may associate the data set with the data item from the data connector by linking the data set with the data item.

The data user environment service may associate the data set with the subscription data item from the data catalog by linking the data set with the subscription data item.

The data user environment may further include: a subscription service for receiving a reference to a published data set selected by a data user in the data catalog; creating a subscription data item in the data user environment; and associating the subscription data item with the published data set by linking the subscription data item with the published data set.

The subscription service may be further configured to, before creating a subscription data item in the data user environment: obtain the subscription approval process of the published data set; send an approval request to an approver designated by the subscription approval process; and receive an approval response from the approver.
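
The ordering described above, approval first and subscription data item second, can be sketched as follows. The names `PublishedDataSet`, `SubscriptionItem`, `subscribe`, and the `ask` callback are hypothetical, and the approval process is assumed here to be a simple ordered list of approvers who must all agree; the present application does not prescribe that form.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PublishedDataSet:
    name: str
    approvers: List[str] = field(default_factory=list)  # assumed approval process


@dataclass
class SubscriptionItem:
    subscriber: str
    published: PublishedDataSet  # the link to the published data set


def subscribe(subscriber: str, published: PublishedDataSet,
              ask: Callable[[str, str, str], bool]) -> Optional[SubscriptionItem]:
    """Create a subscription data item only after every approver agrees."""
    for approver in published.approvers:
        # Send the approval request and wait for the approver's response.
        if not ask(approver, subscriber, published.name):
            return None
    return SubscriptionItem(subscriber, published)


def ask(approver: str, subscriber: str, data_set: str) -> bool:
    # Stand-in for a real approval round trip: this approver rejects user-9.
    return subscriber != "user-9"


census = PublishedDataSet("census", approvers=["officer-1"])
granted = subscribe("user-5", census, ask)
denied = subscribe("user-9", census, ask)
print(granted, denied)
```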

The data user environment may further include: a publishing service for, when a data set is published to a data directory: receiving a reference to the data set selected for publication by a data user; verifying whether the data set is publishable; receiving references to the data directory and category selected by the data user; receiving a reference to part or all of the contents of the data set selected for publication by the data user; receiving metadata provided by the data user; and presenting the selected data set and the provided information as a published data set in the catalog under the selected category.

The metadata may include role-based security and privacy access control rules defined by the data user.

The metadata may include a subscription approval process defined by a data user.

The data user environment may further include: a collaboration service for receiving collaborators and data sets selected by a data user; adding the collaborators to the data sets; setting a usage permission for the collaborators to use the data sets; setting specific security and privacy access control rules for the collaborators to access the data set content; creating a new data set in each collaborator's data user environment; and linking the new data set with the data set.
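
One way to picture the collaboration service is that the new data set in the collaborator's environment holds no data of its own and only links back to the original, with the access control rules applied on every read. The sketch below makes that assumption explicit. `Share`, `DataSet`, `share`, and `read` are hypothetical names, and the access control rule is reduced to a set of hidden field names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Share:
    can_edit: bool            # usage permission
    hidden_fields: Set[str]   # simplified security and privacy rule


@dataclass
class DataSet:
    name: str
    owner: str
    rows: List[dict] = field(default_factory=list)
    shares: Dict[str, Share] = field(default_factory=dict)
    linked_to: Optional["DataSet"] = None


def share(ds: DataSet, collaborator: str, rule: Share,
          environments: Dict[str, List[DataSet]]) -> DataSet:
    ds.shares[collaborator] = rule                       # add collaborator and rules
    copy = DataSet(ds.name, collaborator, linked_to=ds)  # new, linked data set
    environments.setdefault(collaborator, []).append(copy)
    return copy


def read(ds: DataSet) -> List[dict]:
    if ds.linked_to is None:
        return ds.rows
    rule = ds.linked_to.shares[ds.owner]
    # Rows come from the original; hidden fields are dropped on the way out.
    return [{k: v for k, v in row.items() if k not in rule.hidden_fields}
            for row in read(ds.linked_to)]


envs: Dict[str, List[DataSet]] = {}
original = DataSet("residents", "user-1", rows=[{"name": "Ann", "id_no": "123"}])
linked = share(original, "user-4", Share(can_edit=False, hidden_fields={"id_no"}), envs)
print(read(linked))
```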

The data user environment may further include:

one or more item containers; and

Wherein the project container manager may be further configured to receive an instruction from a data user to create a data processing pipeline in the project container.

The project container manager may be further configured to: receive a reference to a data processing program; upload the data processing program to the project container if the data processing program does not exist in the system; and add the data processing program to the project container if the data processing program already exists in the system.
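
The upload-or-add decision can be sketched as follows. This is a minimal illustration with hypothetical names: `attach_program`, the `system_programs` registry, and the `fetch` callback that retrieves the program behind a reference are all assumptions made for the example.

```python
from typing import Callable, Dict, Set


def attach_program(ref: str,
                   system_programs: Dict[str, bytes],
                   container_programs: Set[str],
                   fetch: Callable[[str], bytes]) -> str:
    """Make the referenced program available in a project container."""
    if ref in system_programs:
        action = "added"                     # already in the system: reuse it
    else:
        system_programs[ref] = fetch(ref)    # not in the system yet: upload it
        action = "uploaded"
    container_programs.add(ref)
    return action


system: Dict[str, bytes] = {"clean.py": b"# cleaning program"}
container: Set[str] = set()

first = attach_program("clean.py", system, container, fetch=lambda r: b"")
second = attach_program("join.py", system, container, fetch=lambda r: b"# join program")
print(first, second, sorted(container))
```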

Wherein the above-mentioned project container manager may be further operable to associate each of the one or more collaborators with a project container through the use of permissions.

Wherein the project container manager may be further configured to receive a selection of one or more collaborators; associating the selected one or more collaborators with a project container by: adding the selected one or more collaborators to the project container, adding the project container to a data user environment of the selected one or more collaborators; for one or more data sets associated with the project container, configuring the collaboration service to add the selected one or more collaborators to the data set.

The data user environment may further include: a data profiling service for receiving selected data portions or data fields to be examined in the data set; receiving a selected data profiling method; performing the data profiling method for the data portion or data field; and generating a data profiling result.
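
As an illustration of the four profiling steps (select a field, select a method, run it, produce a result), the sketch below uses two simple profiling methods. The method names `null_ratio` and `distinct_count`, the `PROFILERS` registry, and the shape of the result are assumptions made for the example; the present application does not enumerate specific profiling methods in this passage.

```python
from typing import Any, Callable, Dict, List


def null_ratio(values: List[Any]) -> float:
    """Share of missing values in the selected field."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def distinct_count(values: List[Any]) -> int:
    """Number of different non-missing values in the selected field."""
    return len({v for v in values if v is not None})


PROFILERS: Dict[str, Callable[[List[Any]], Any]] = {
    "null_ratio": null_ratio,
    "distinct_count": distinct_count,
}


def profile(rows: List[dict], field_name: str, method: str) -> Dict[str, Any]:
    values = [row.get(field_name) for row in rows]   # the selected data field
    result = PROFILERS[method](values)               # the selected method
    return {"field": field_name, "method": method, "result": result}


rows = [{"city": "Beijing"}, {"city": None}, {"city": "Beijing"}, {"city": "Xi'an"}]
missing = profile(rows, "city", "null_ratio")
distinct = profile(rows, "city", "distinct_count")
print(missing, distinct)
```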

The data user environment may further include: a data lineage service for receiving a reference to a data set and creating a data lineage graph; wherein the data lineage graph includes one or more ancestor data sets of the data set and one or more descendant data sets of the data set; the data content of the data set is a derivative of the data content of the one or more ancestor data sets; and the data content of the descendant data sets is a derivative of the data content of the data set.

The data user environment service may be further configured to, when a data user or an application of the data user initiates a data access request for a data set, obtain, from a virtual data set service subsystem of the data sharing system, a virtual data set corresponding to an original data set related to the data access request, and return the virtual data set to the data user or the application of the data user.

An embodiment of the present application also provides a multi-user collaborative data governance method, which may comprise the following steps: associating each of one or more data sets with a data item, wherein the data item is from one of one or more data connectors; associating each of the one or more data sets with a subscription data item, wherein the subscription data item is subscribed from one of one or more data directories; associating each of the one or more data sets with a published data set, wherein the published data set is a data set published to one of the one or more data directories through a publishing process; and associating each of one or more collaborators with the one or more data sets using permissions.

The above method may further comprise: receiving a user request to register a data item selected by a data user to the data user environment, wherein the data item is a data item selected from the data connector or a subscription data item selected from the data catalog; creating a data set; associating the data set with the data item.

The above method may further comprise: receiving a reference to a published data set selected by a data user in the data catalog; creating a subscription data item in the data user environment; and associating the subscription data item with the published data set by linking the subscription data item with the published data set.

Prior to creating a subscription data item in the data user environment, the method may further comprise: obtaining the subscription approval process of the published data set; sending an approval request to an approver designated by the subscription approval process; and receiving an approval response from the approver.

The above method may further comprise, when a data set is published to a data directory: receiving a reference to the data set selected for publication by a data user; verifying whether the data set is publishable; receiving references to the data directory and category selected by the data user; receiving a reference to part or all of the contents of the data set selected for publication by the data user; receiving metadata provided by the data user; and presenting the selected data set and the provided information as a published data set in the catalog under the selected category.

The above method may further comprise: receiving collaborators and data sets selected by a data user; adding the collaborators to the data sets; setting a usage permission for the collaborators to use the data sets; setting specific security and privacy access control rules for the collaborators to access the data set content; creating a new data set in each collaborator's data user environment; and linking the new data set with the data set.

The above method may further comprise: receiving the selected one or more data sets; and associating each of the one or more data sets with a project container by adding the data set to the project container.

The above method may further comprise: receiving an instruction from a data user to create a data processing pipeline in the project container.

The above method may further comprise: receiving a reference to a data processing program; uploading the data processing program to the project container if the data processing program does not exist in the system; and adding the data processing program to the project container if the data processing program already exists in the system.

The above method may further comprise: associating each of the one or more collaborators with a project container by using permissions.

The above method may further comprise: receiving the selected one or more collaborators; associating the selected one or more collaborators with a project container by: adding the selected one or more collaborators to the project container, adding the project container to a data user environment of the selected one or more collaborators; for one or more data sets associated with the project container, configuring the collaboration service to add the selected one or more collaborators to the data set.

The above method may further comprise: receiving a data portion or data field selected in the data set to be examined; receiving a selected data profiling method; performing the data profiling method for the data portion or data field; and generating a data profiling result.

The above method may further comprise: receiving a reference to a data set; and creating a data lineage graph; wherein the data lineage graph includes one or more ancestor data sets of the data set and one or more descendant data sets of the data set; the data content of the data set is a derivative of the data content of the one or more ancestor data sets; and the data content of the descendant data sets is a derivative of the data content of the data set.

The above method may further comprise: when a data access request is initiated to a data set by a data user or an application of the data user, a virtual data set corresponding to an original data set related to the data access request is obtained from a virtual data set service subsystem of the data sharing system, and the virtual data set is returned to the data user or the application of the data user.

It should be noted that, for the specific implementation of the data user environment and the data governance method, reference may also be made to the contents described in fig. 1 to fig. 23, and repeated descriptions are omitted.

Further, with the data user environment and data governance method, data users can use subscribed data sets as well as data sets shared by other data users and combine these data sets with their own data sets to generate new data sets and reports. In addition, through the data user environment and the data governance method, the data users can also publish new data sets, and other data users can create more new data sets by using the data sets. By analogy, the data sharing model allows for the recursive creation of novel and useful information.

It should be noted that not all steps and modules in the above flows and structures are necessary, and some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and can be adjusted as required. The division of each module is only for convenience of describing adopted functional division, and in actual implementation, one module may be divided into multiple modules, and the functions of multiple modules may also be implemented by the same module, and these modules may be located in the same device or in different devices.

The hardware modules in the embodiments may be implemented in hardware or a hardware platform plus software. The software includes machine-readable instructions stored on a non-volatile storage medium. Thus, embodiments may also be embodied as software products.

Some examples of the present application therefore also provide a computer readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the method described above and in fig. 1 to 23.

In various examples, the hardware may be implemented by specialized hardware or hardware executing machine-readable instructions. For example, the hardware may be specially designed permanent circuits or logic devices (e.g., special purpose processors, such as FPGAs or ASICs) for performing the specified operations. Hardware may also include programmable logic devices or circuits temporarily configured by software (e.g., including a general purpose processor or other programmable processor) to perform certain operations.

In addition, each example of the present application can be realized by a data processing program executed by a data processing apparatus such as a computer. It is clear that a data processing program constitutes the present application. Further, the data processing program, which is generally stored in one storage medium, is executed by directly reading the program out of the storage medium or by installing or copying the program into a storage device (such as a hard disk and/or a memory) of the data processing device. Such a storage medium therefore also constitutes the present application, which also provides a non-volatile storage medium in which a data processing program is stored, which data processing program can be used to carry out any one of the above-mentioned method examples of the present application.

The nonvolatile computer-readable storage medium may be a memory provided in an expansion board inserted into the computer or written to a memory provided in an expansion unit connected to the computer. A CPU or the like mounted on the expansion board or the expansion unit may perform part or all of the actual operations according to the instructions.

In addition, the devices and modules in the examples of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more devices or modules may be integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A data user environment of a multi-user collaborative data management system, the data user environment comprising:

one or more data connectors;

one or more data directories;

one or more data sets;

one or more collaborators; and

a data user environment service to:

2. The data user environment of claim 1, wherein the data user environment service is further to receive a user request to register a data item selected by a data user to the data user environment, wherein the data item is a data item selected from the data connector or a subscription data item selected from the data catalog; create a data set; and associate the data set with the data item.

3. The data user environment of claim 2, wherein the data user environment service is to associate the data set with the data item from the data connector by linking the data set with the data item.

4. The data user environment of claim 2, wherein the data user environment service is to associate the data set with the subscription data item from the data catalog by linking the data set with the subscription data item.

5. The data user environment of claim 4, further comprising:

a subscription service for receiving a reference to a published data set selected by a data user in the data catalog; creating a subscription data item in the data user environment; and associating the subscription data item with the published data set by linking the subscription data item with the published data set.

6. The data user environment of claim 5, wherein the subscription service is further configured to, prior to creating a subscription data item in the data user environment: obtain the subscription approval process of the published data set; send an approval request to an approver designated by the subscription approval process; and receive an approval response from the approver.

7. The data user environment of claim 1, further comprising:

a publishing service for, when a data set is published to a data directory: receiving a reference to the data set selected for publication by a data user; verifying whether the data set is publishable; receiving references to the data directory and category selected by the data user; receiving a reference to part or all of the contents of the data set selected for publication by the data user; receiving metadata provided by the data user; and presenting the selected data set and the provided information as a published data set in the catalog under the selected category.

8. The data user environment of claim 7, wherein the metadata comprises data user-defined role-based security and privacy access control rules.

9. The data user environment of claim 7, wherein the metadata comprises a data user-defined subscription approval process.

10. The data user environment of claim 1, further comprising:

a collaboration service configured to: receive collaborators and a data set selected by a data user; add the collaborators to the data set; set a use permission for the collaborators to use the data set; set specific security and privacy access control rules for the collaborators to access the data set content; create a new data set in each collaborator's data user environment; and connect the new data set with the data set.

11. The data user environment of claim 10, wherein the specific security and privacy access control rules are personalized security and privacy access control rules.

12. The data user environment of claim 1, further comprising:

one or more project containers; and

13. The data user environment of claim 12, wherein the project container manager is further configured to receive an instruction from a data user to create a data processing pipeline in the project container.

14. The data user environment of claim 12, wherein the project container manager is further configured to: receive a reference to a data processing program; upload the data processing program to the project container if the data processing program does not exist in the system; and add the data processing program to the project container if the data processing program already exists in the system.

15. The data user environment of claim 12, wherein the project container manager is further configured to associate each of the one or more collaborators with a project container by means of use permissions.

16. The data user environment of claim 15, wherein the project container manager is further configured to: receive a selection of one or more collaborators; and associate the selected one or more collaborators with a project container by: adding the selected one or more collaborators to the project container; adding the project container to the data user environment of each of the selected one or more collaborators; and, for one or more data sets associated with the project container, configuring the collaboration service to add the selected one or more collaborators to the data sets.

17. The data user environment of claim 1, further comprising:

a data profiling service configured to: receive a selection of a data portion or data field of the data set to be examined; receive a selected data profiling method; perform the data profiling method on the data portion or data field; and generate a data profiling result.

18. The data user environment of claim 1, further comprising:

a data lineage service configured to: receive a reference to a data set; and create a data lineage graph, wherein the data lineage graph includes one or more ancestor data sets of the data set and one or more descendant data sets of the data set; wherein the data content of the data set is derived from the data content of the one or more ancestor data sets, and the data content of each descendant data set is derived from the data content of the data set.

19. The data user environment of claim 1, wherein the data user environment service is further configured to, when a data user or an application of the data user initiates a data access request for a data set, obtain a virtual data set corresponding to an original data set involved in the data access request from a virtual data set service subsystem of the data sharing system, and return the virtual data set to the data user or the application of the data user.

20. A multi-user collaborative data governance method, comprising:

21. The method of claim 20, wherein the method further comprises:

receiving a user request to register a data item selected by a data user to the data user environment, wherein the data item is a data item selected from the data connector or a subscription data item selected from the data catalog;

creating a data set; and

associating the data set with the data item.

22. The method of claim 20, wherein the method further comprises:

receiving a reference to a published data set selected by a data user in the data catalog;

creating a subscription data item in the data user environment; and

associating the subscription data item with the published data set by connecting the subscription data item with the published data set.

23. The method of claim 22, wherein, prior to creating the subscription data item in the data user environment, the method further comprises:

acquiring a subscription approval process of the published data set;

sending an approval request to an approver designated by the subscription approval process; and

receiving an approval response from the approver.

24. The method of claim 20, wherein the method further comprises:

receiving, when a data set is to be published to the data catalog, a reference to the data set selected for publication by a data user;

verifying whether the data set is publishable;

receiving references to a data catalog and a category selected by a data user;

receiving a reference to part or all of the contents of a data set selected for publication by a data user;

receiving metadata provided by a data user; and

presenting the selected data set and the provided information as a published data set in the data catalog under the selected category.

25. The method of claim 20, wherein the method further comprises:

receiving collaborators and data sets selected by a data user;

adding the collaborators to the data set;

setting a use permission for the collaborator to use the data set;

setting specific security and privacy access control rules for the collaborators to access the data set content;

creating a new data set in the collaborator's data user environment; and

connecting the new data set and the data set.

26. The method of claim 20, wherein the method further comprises:

receiving a selection of one or more data sets; and

associating each of the one or more data sets with a project container by adding the data set to the project container.

27. The method of claim 26, wherein the method further comprises:

receiving an instruction from a data user to create a data processing pipeline in the project container.

28. The method of claim 26, wherein the method further comprises:

receiving a reference to a data processing program;

uploading the data processing program to the project container if the data processing program does not exist in the system; and

adding the data processing program to the project container if the data processing program already exists in the system.

29. The method of claim 26, wherein the method further comprises:

associating each of one or more collaborators with the project container by means of use permissions.

30. The method of claim 29, wherein the method further comprises:

receiving a selection of one or more collaborators; and

associating the selected one or more collaborators with a project container by: adding the selected one or more collaborators to the project container; adding the project container to the data user environment of each of the selected one or more collaborators; and, for one or more data sets associated with the project container, configuring the collaboration service to add the selected one or more collaborators to the data sets.

31. The method of claim 20, wherein the method further comprises:

receiving a selection of a data portion or data field of the data set to be examined;

receiving a selected data profiling method;

performing the data profiling method on the data portion or data field; and

generating a data profiling result.

32. The method of claim 20, wherein the method further comprises:

receiving a reference to a data set;

creating a data lineage graph, wherein the data lineage graph includes one or more ancestor data sets of the data set and one or more descendant data sets of the data set; wherein the data content of the data set is derived from the data content of the one or more ancestor data sets, and the data content of each descendant data set is derived from the data content of the data set.

33. The method of claim 20, wherein the method further comprises:

when a data user or an application of the data user initiates a data access request for a data set, obtaining a virtual data set corresponding to an original data set involved in the data access request from a virtual data set service subsystem of the data sharing system, and returning the virtual data set to the data user or the application of the data user.

34. A non-transitory computer-readable storage medium, wherein the storage medium stores one or more instructions that, when executed by one or more processors, implement the collaborative data governance method according to any one of claims 20 to 33.
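
By way of non-limiting illustration only, the data lineage graph recited in claims 18 and 32 may be sketched as follows. The sketch assumes that derivation relationships are available as (source, derived) pairs of data set identifiers; the function name `build_lineage`, the pair representation, and the sample identifiers are hypothetical and do not appear in the application.

```python
from collections import defaultdict, deque


def build_lineage(derivations, dataset):
    """Return the ancestor and descendant data sets of `dataset`.

    `derivations` is an iterable of (source, derived) pairs, meaning that the
    content of `derived` is a derivative of the content of `source`.
    This representation is an assumption made for illustration.
    """
    parents = defaultdict(set)   # derived data set -> data sets it derives from
    children = defaultdict(set)  # source data set -> data sets derived from it
    for source, derived in derivations:
        parents[derived].add(source)
        children[source].add(derived)

    def reachable(start, neighbours):
        # Breadth-first traversal collecting every data set reachable from start.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt != start and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    return {
        "dataset": dataset,
        "ancestors": reachable(dataset, parents),
        "descendants": reachable(dataset, children),
    }


# Hypothetical derivation records: "clean" is derived from "raw", and so on.
derivations = [
    ("raw", "clean"),
    ("clean", "report"),
    ("clean", "stats"),
    ("other", "stats"),
]
lineage = build_lineage(derivations, "clean")
print(sorted(lineage["ancestors"]))    # ['raw']
print(sorted(lineage["descendants"]))  # ['report', 'stats']
```

In this sketch the graph is traversed in both directions from the referenced data set, so that ancestors are the data sets whose content it is derived from and descendants are the data sets derived from it. Other representations of the lineage graph are equally possible.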