CN111259006B

CN111259006B - Universal distributed heterogeneous data integrated physical aggregation, organization, release and service method and system

Info

Publication number: CN111259006B
Application number: CN202010020974.1A
Authority: CN
Inventors: 刘峰; 周园春; 韩芳; 沈志宏; 夏景隆
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2019-11-19
Filing date: 2020-01-09
Publication date: 2023-06-27
Anticipated expiration: 2040-01-09
Also published as: CN111259006A

Abstract

The invention relates to a universal method and system for the integrated physical aggregation, organization, release and service of distributed heterogeneous data. The method comprises the following steps: 1) registering public basic data at a central end; 2) aggregating, transmitting and synchronizing distributed heterogeneous data from the distribution end to the central end; 3) building databases for, organizing and editing the aggregated data resources at the central end; 4) uniformly publishing and auditing the data resources at the central end; 5) providing an integrated sharing service for the data resources at the central end. The invention achieves efficient aggregation, transmission and synchronization of distributed heterogeneous entity data; centralized database construction, organization management and unified release of data resources; and the integration and sharing of multiple data release services in a data resource portal. Because it is integrated and generically customizable, the data aggregation, management, release and service processes are connected end to end, highly customizable and highly reusable, which greatly improves the generality and flexibility of data service encapsulation.

Description

Universal distributed heterogeneous data integrated physical aggregation, organization, release and service method and system

Technical Field

The invention relates to the field of data management and shared services, in particular to a universal method and system for the integrated physical aggregation, organization, release and service of distributed heterogeneous data. With it, a user can carry out the physical aggregation and transmission, the organization and release, and the integrated sharing service of heterogeneous data in a unified way.

Background

Against the background of rapidly developing cloud computing, big data and artificial intelligence technologies, every field generates large numbers of diverse data resources. Society widely recognizes their importance, and they have been elevated to the level of important national strategic resources. Meanwhile, as demand for open access and data sharing grows, more and more data resources need to be shared and used. Driven by informatization projects at home and abroad, information (data) resource sharing service platforms in many fields continue to emerge.

On a conventional data sharing service platform, most organizations share data resources in the form of data sets that contain only metadata and data files. Structured data, most commonly stored in relational data tables, is served as table files (such as Excel and CSV) or simply shared as raw data tables, without integrated data organization or metadata description. The main deficiencies are:

(1) Unified sharing service of heterogeneous data resources (relational and file type) cannot be realized, and entity data is offered only as individual files. This weakens the advantages of serving relational structured data online, of serving relational data fused with associated file data, and of serving the relationships among relational database tables.

(2) Traditional distributed data aggregation and exchange are mainly in a file form, and remote transmission aggregation and synchronous management of relational data are not supported.

(3) Past platform systems support only a limited subset of services for one or a few processes and are purpose-built to specific construction requirements. They lack a customizable, generalized and decoupled design, which lowers development efficiency, causes a large amount of repeated work, and increases development cost.

(4) In terms of the shared data organization model, internationally accepted unique identifiers and standardized data citation are lacking.

(5) In terms of service forms, the following are lacking: full-text retrieval over the contents of entity data files; customized retrieval over all fields of relational data; integrated fusion services for relational data (such as association with files, images and videos, with data sub-tables, with various URL displays, and with enumeration lists); multiple association-based recommendation modes for data sets; encapsulated data resource API services; user-oriented personalized service support; and internationalization support for the platform.

Disclosure of Invention

Aiming at the above deficiencies in distributed data management and shared services, the invention provides a general method and system design for the integrated physical aggregation (centralized aggregation, storage and organization of entity data), organization, release and service of distributed heterogeneous data.

The technical scheme adopted by the invention is as follows:

A general distributed heterogeneous data integrated physical aggregation, organization, release and service method comprises the following steps:

1) Registering public basic data at the central end, including registering data nodes of the distribution end, registering metadata extension elements, registering a classification system and registering license agreements;

2) Aggregating, transmitting and synchronizing distributed heterogeneous data from the distribution end to the central end;

3) Building databases for, organizing and editing the aggregated data resources at the central end;

4) Uniformly publishing and auditing the data resources at the central end;

5) Providing an integrated sharing service for the data resources at the central end.

Further, the data node registration realizes registration management of the data node information and node administrator authentication information of the distribution end;

the metadata extension element registration supports customized configuration management of extension metadata items, and the configuration items of each metadata element comprise: Chinese name, English name, field type, whether required, whether repeatable, sequence number and remark;

the classification system registration supports registration, editing and deletion of tree-structured data classification systems, the classification system information comprises classification names, classification codes and classification descriptions, and a user can add, edit, insert and delete node information in any tree classification system;

the license agreement registration supports standard license agreements, and supports registering, editing and deleting custom license contents, wherein the registration information comprises an agreement identification code, an agreement name, an agreement identification picture and an agreement description text.

Further, the aggregate transmission and synchronization of the distributed heterogeneous data includes:

2.1 Registering heterogeneous data sources, including unified registration and connection management of relational data sources and file data sources;

2.2 Constructing data transmission tasks, including relational data tasks and file data tasks;

2.3 Managing the running of transmission tasks, so that data tasks of the distribution end are transmitted remotely to the central end efficiently and stably;

2.4 Managing the synchronization of relational data, so that each record in a relational table or logical table in a transmission task of the distribution end is synchronized to a relational database table of the central end on a schedule.

Further, the database construction organization and editing of the aggregated data resources include:

3.1 Creating a relational database, including creating a new table by Excel template import, or creating a new table by associating existing, already-described relational data tables;

3.2 Describing the structure information of relational database tables and configuring field fusion; the structure description comprises describing relational data table names and relational data table field names; field fusion is configured by setting the display type of a field of a relational data table, the display types comprising text, URL, enumeration, sub-table and file types;

3.3 Data management of all relation library tables at the center end is carried out, and data viewing, adding, editing and deleting operations are supported;

3.4 File type data management, including network disk management of all data files and directories at the central end.
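The field fusion configuration of step 3.2 can be pictured as a per-field display type that decides how a stored value is presented. The following Python sketch is illustrative only: the names `DisplayType`, `FieldConfig` and `render_value`, the `options` keys and the shape of the returned dictionary are assumptions made for this example and are not taken from the invention.

```python
from dataclasses import dataclass, field
from enum import Enum


class DisplayType(Enum):
    """The five display types named in step 3.2."""
    TEXT = "text"
    URL = "url"
    ENUM = "enumeration"
    SUB_TABLE = "sub-table"
    FILE = "file"


@dataclass
class FieldConfig:
    """Hypothetical description and fusion settings for one table field."""
    column: str                       # physical column name
    label: str                        # descriptive field name shown to users
    display: DisplayType = DisplayType.TEXT
    options: dict = field(default_factory=dict)   # type-specific settings


def render_value(cfg: FieldConfig, value):
    """Turn a raw cell value into a display-ready dictionary."""
    if cfg.display is DisplayType.URL:
        return {"label": cfg.label, "kind": "link", "href": str(value)}
    if cfg.display is DisplayType.ENUM:
        # Map a stored code to its human-readable enumeration item.
        items = cfg.options.get("items", {})
        return {"label": cfg.label, "kind": "text",
                "text": items.get(value, str(value))}
    if cfg.display is DisplayType.SUB_TABLE:
        # Point at the rows of an associated sub-table that share this key.
        return {"label": cfg.label, "kind": "sub-table",
                "table": cfg.options.get("table"), "key": value}
    if cfg.display is DisplayType.FILE:
        # Resolve the value against a base directory of the file store.
        base = cfg.options.get("base_dir", "").rstrip("/")
        return {"label": cfg.label, "kind": "file", "path": f"{base}/{value}"}
    return {"label": cfg.label, "kind": "text", "text": str(value)}
```

Keeping the display type as configuration rather than code is what would let one generic table viewer serve every relational table, which appears to be the intent of the customization described here.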

Further, the uniformly publishing and auditing the data resources comprises the following steps:

4.1 Based on the built-in metadata and the extended metadata, dynamically realizing online filling of data set metadata, both one data set at a time and in batches;

4.2 Based on the relation library table and the file system of the central end, the selection of the online relation type entity data table and the selection of the entity data file based on the file directory system are realized, and the online instant uploading selection of the file is supported;

4.3 Editing, submitting and publishing the data set;

4.4 Auditing the content of the data set to be released, focusing on checking whether the metadata is filled in according to the standard and whether the entity data is accurate; and authorizing the range of users that may access the selected data set.
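Steps 4.3 and 4.4 describe a submit-then-audit workflow. A minimal sketch of such a workflow is given below, assuming three states (`draft`, `pending_audit`, `published`) and a rejection that returns the data set to `draft`; the state names, the `Dataset` class and the audit rule are hypothetical and only illustrate one way the steps could fit together.

```python
from dataclasses import dataclass, field

# Allowed transitions; the state and action names are assumptions.
TRANSITIONS = {
    ("draft", "submit"): "pending_audit",
    ("pending_audit", "approve"): "published",
    ("pending_audit", "reject"): "draft",
}


@dataclass
class Dataset:
    title: str
    metadata: dict
    entity_data: list                 # selected tables and/or files
    state: str = "draft"
    allowed_users: set = field(default_factory=set)   # empty set = no restriction


def apply_action(ds: Dataset, action: str) -> Dataset:
    """Move the data set to its next state, refusing illegal transitions."""
    key = (ds.state, action)
    if key not in TRANSITIONS:
        raise ValueError(f"cannot {action} a dataset in state {ds.state}")
    ds.state = TRANSITIONS[key]
    return ds


def audit(ds: Dataset, required_fields, allowed_users=()):
    """Approve only if required metadata is filled and entity data is selected.

    Returns the list of missing metadata fields (empty on approval).
    """
    missing = [name for name in required_fields if not ds.metadata.get(name)]
    if missing or not ds.entity_data:
        apply_action(ds, "reject")
        return missing
    ds.allowed_users = set(allowed_users)   # authorize the user range
    apply_action(ds, "approve")
    return []
```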

Further, the integrated sharing service of the data resource includes:

5.1 Data set retrieval, including two data retrieval modes of keyword and classified navigation, and API interface encapsulation supporting multiple data retrieval modes;

5.2 The data set filtering and sorting comprises data resource tag cloud display and multi-condition step-by-step filtering service, and multi-condition re-sorting display of data resource retrieval results is supported;

5.3 Data set access and evaluation, including on-line browsing and playing display of typical entity data files in the data resources; supporting online customized query and result downloading and fusion integrated display of relational table entity data; supporting full text retrieval of text entity files; supporting metadata online downloading and API access service encapsulation; supporting a data social service;

5.4 Data set recommendation, including recommendation services based on data set metadata content associative computation, supporting data recommendation services based on user access behavior statistics;

5.5 Data set service record and statistics, including user data access behavior full log record management, supporting data set access, downloading condition statistics and display;

5.6 User personalized services including user access and presentation of download history, support user collection, evaluation and tagging management.
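The recommendation of step 5.4 is said to rest on an associative computation over data set metadata, but the computation itself is not specified. As one possible reading, the sketch below ranks other data sets by Jaccard similarity of their keywords and classification code. The choice of Jaccard similarity, the metadata keys `keywords` and `category`, and the function names are all assumptions.

```python
def keyword_set(metadata: dict) -> set:
    """Collect lower-cased keywords plus the classification code of a data set."""
    words = {k.strip().lower() for k in metadata.get("keywords", [])}
    if metadata.get("category"):
        words.add("category:" + metadata["category"])
    return words


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def recommend(target_id, catalog: dict, top_n: int = 3) -> list:
    """Return up to top_n data set ids most similar to target_id."""
    target = keyword_set(catalog[target_id])
    scored = [(jaccard(target, keyword_set(meta)), ds_id)
              for ds_id, meta in catalog.items() if ds_id != target_id]
    scored = [(score, ds_id) for score, ds_id in scored if score > 0]
    # Highest score first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [ds_id for _, ds_id in scored[:top_n]]
```

A recommendation based on user access behavior statistics, which step 5.4 also mentions, would replace the keyword sets with, for example, the sets of users who accessed each data set.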

The universal distributed heterogeneous data integrated physical aggregation, organization, release and service system comprises a central end and a distribution end, wherein the distribution end deploys a data aggregation transmission software module, the central end deploys a data management and release software module and a data sharing and service portal module, and a public basic data registration and service sub-module is integrated in the data management and release software module;

the data aggregation transmission software module is responsible for aggregation transmission and synchronization of distributed heterogeneous data from a distribution end to a central end;

the data management and release software module is responsible for registering public basic data, constructing, organizing and editing the converged data resources, and uniformly releasing and auditing the data resources;

the data sharing and service portal module is responsible for carrying out integrated sharing service of data resources.

The key innovation of the invention comprises:

1) A general method and system design for the integrated physical aggregation, organization and release, and integrated service of distributed heterogeneous data (relational and file type) is provided. The framework is easy to extend: users can add other relational data sources they need. For files, the invention implements local file system and FTP file data sources, and users can extend it with other file data sources such as Samba. In addition, users can extend it with NoSQL data sources such as MongoDB.

2) The method realizes the decoupling of the whole process of physical aggregation, organization, release and integrated service of heterogeneous data resources (especially supporting relational data), fully considers the requirements of high customization and high multiplexing in the design of the method, effectively improves the universality and the flexibility of the invention, and has universal scene applicability. The user can complete the effective physical convergence, release and service of the distributed data only through customized configuration, so that the design and development efficiency of the distributed data sharing service system is improved, and the development period of software is shortened.

3) Customized remote transmission, aggregation and synchronization management of relational data are realized.

4) Full text retrieval of text entity data file content and full field customization retrieval service of relational data are realized.

5) The fusion configuration and service functions (such as association with files, images, videos, association with data sub-tables, association with various URL displays, association with enumeration lists and the like) among heterogeneous data resources are realized.

6) A number of advanced data service functions are effectively integrated, making it convenient for users to quickly discover, acquire, share and use data resources, in line with international practice. These include multiple data set retrieval modes, multiple association-based recommendation modes, step-by-step tag cloud filtering and sorting, automatic encapsulation of data resource APIs, personalized service support for users, bilingual support of the platform, unique identification and standardized data citation services, and customizable data license agreements.

The beneficial effects of the invention are as follows:

The invention achieves efficient aggregation, transmission and synchronization of distributed heterogeneous entity data (file-type and relational data), achieves centralized database construction, organization management and unified release of data resources, and finally achieves the integration and sharing of multiple data release services in a data resource portal. (Note: the data set is the organizational model for release and comprises three parts: PID, metadata and entity data. The PID is a persistent data object identifier, meaning an internationally recognized globally unique identifier such as a Handle code or a DOI.)

Drawings

FIG. 1 is a diagram of the overall functional logic framework of the present invention.

FIG. 2 is a diagram of the overall method steps of the present invention and their relationships.

FIG. 3 is a structure diagram of the refined flow of public basic data registration.

FIG. 4 is a structure diagram of the refined flow of distributed heterogeneous data aggregation transmission and synchronization.

FIG. 5 is a prototype interface diagram for creating a new relational data source.

FIG. 6 is a prototype interface diagram of relational data task construction.

FIG. 7 is a prototype interface diagram of file-type data task construction.

FIG. 8 is a structure diagram of the refined flow of centralized database construction, organization and editing of data resources.

FIG. 9 is a diagram of the import data template for import-based table creation.

FIG. 10 is a prototype interface diagram of import-based new table creation.

FIG. 11 is a prototype interface diagram of association-based new table creation.

FIG. 12 is a prototype interface diagram of relational database table description and field fusion configuration.

FIG. 13 is a prototype interface diagram of relational database table data management.

FIG. 14 is a prototype interface diagram of file data management.

FIG. 15 is a structure diagram of the refined flow of unified publishing and auditing of data resources.

FIG. 16 is a sample diagram of online filling of data set metadata.

FIG. 17 is a sample diagram of data set PID identifiers and citation elements.

FIG. 18 is a sample diagram of data set entity data selection.

FIG. 19 is a structure diagram of the refined flow of the data resource integrated sharing service.

FIG. 20 is a block diagram of the overall system software of the present invention.

FIG. 21 is a block diagram of a system software deployment architecture of the present invention.

Detailed Description

The present invention will be further described in detail with reference to the following examples and drawings, so that the above objects, features and advantages of the present invention can be more clearly understood.

The overall functional logic framework of the present invention is shown in FIG. 1. The overall method steps and their relationships are shown in FIG. 2. The overall process is divided into five major steps (or subsystems): 1) public basic data registration management; 2) distributed heterogeneous data aggregation transmission and synchronization; 3) data resource database construction, organization and editing; 4) unified data publishing and auditing; and 5) data resource integrated sharing service.

Step 1 can be understood as the initialization process of the whole invention and mainly completes the registration of public basic data. Step 2 realizes the physical aggregation and transmission of distributed heterogeneous data resources and the synchronization management of relational data. Step 3 realizes database creation management, organization and description, and editing management of the aggregated heterogeneous data. Step 4 realizes unified publishing organization and audit authorization management of the data. Step 5 realizes the integrated sharing service and management of the (published) data resources. Step 2 is completed at the distribution end and the remaining steps are completed at the central end. The specific flow and functions of each step are described below.

1. Public underlying data registration

This step realizes the registration of public basic data, including registration management of basic operating data such as data resource nodes, metadata extension elements, classification systems and license agreements. The step is used by the system administrator, whose identity must be authenticated by the system before the step starts.

The main flow structure of this step is shown in fig. 3. The implementation details of each step in fig. 3 are described in detail below.

1.1 data resource node registration

This step realizes registration management of distribution-end data node information and node administrator authentication information. Specifically, it covers registration and editing of attribute information such as data node name, node code, node profile, node contact, contact telephone, email, node administrator account, node administrator password, data node creation time and serial number.

The node administrator account and password are used to authenticate the node administrator when the distribution end starts step 2, distributed heterogeneous data aggregation transmission and synchronization. At the same time, the central end runs an installed and deployed Vsftp service, and an FTP account is initialized with the same account name and password. The bottom layer of the system uses the FTP protocol for remote data transmission, which, compared with the traditional HTTP protocol, is more efficient and stable and makes breakpoint resume easy to implement. This also allows a distributed data node to transmit files on its own with a third-party FTP tool and the same account and password, giving broad compatibility with transmission tools. Data resource node registration is one embodiment of the general customizability of the invention.

That a distributed data node can transmit files on its own with a third-party FTP tool means the following. File-type entity data can be transferred through the transmission task operation management of section 2.3 below, whereas relational entity data can only be transferred through it. However, because the central end uses the general-purpose Vsftp service, for file-type entity files a user may skip the distributed heterogeneous data aggregation and transmission tool of part 2 below and use any third-party FTP tool software instead: after logging in directly with the FTP account and password provided in the node information, file transmission is fully compatible.

1.2 metadata extension element registration

Customized configuration management of extended metadata items is supported, including adding, editing and querying. The configuration items of each metadata element are: Chinese name, English name, field type, whether required, whether repeatable, sequence number and remark.

1) Metadata extension element registration lets users define their own extensions to the metadata structure, and is one embodiment of the general customizability of the invention.

2) In the present invention, the core metadata of a data set consists of built-in metadata elements (metadata extension elements are defined relative to these built-in core elements), which comprise:

Table 1 description of core metadata elements built into a dataset of the present invention

3) In the above table, the required column indicates that the metadata element must be filled in, and the uniqueness column indicates whether the element may be filled in more than once. Field types include character string, integer, multi-precision, date and time, enumeration, attachment and so on. The field type determines the input control shown on the metadata entry interface, such as single-line text, multi-line text, date control, drop-down list or upload control, and is itself highly customizable. The verification rule of a metadata element defines a basic format check: users can define a rule in their own notation and implement their own parsing of it, or define the rule as a regular expression and verify by regular-expression matching.

4) The English name elements in the table are closely related to the English-version data release and English-version portal supported by the invention; this is described further in the metadata filling part of data resource release.
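The configuration items and verification rule described above can be sketched as a small data structure with a validation function. The example below is a hedged illustration in Python: the class name `MetadataElement`, its attribute names and the message strings are assumptions, and only the regular-expression form of the verification rule is shown.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetadataElement:
    """Hypothetical record for one registered metadata extension element."""
    name_zh: str                    # metadata Chinese name
    name_en: str                    # metadata English name
    field_type: str                 # e.g. "string", "integer", "date", "enum"
    required: bool = False          # whether the element must be filled in
    repeatable: bool = False        # whether it may be filled in more than once
    order: int = 0                  # sequence number
    remark: str = ""
    pattern: Optional[str] = None   # optional regular-expression rule


def validate(element: MetadataElement, values: list) -> list:
    """Return a list of problems; an empty list means the values are valid."""
    problems = []
    filled = [v for v in values if v not in (None, "")]
    if element.required and not filled:
        problems.append(f"{element.name_en}: required")
    if not element.repeatable and len(filled) > 1:
        problems.append(f"{element.name_en}: only one value allowed")
    if element.pattern:
        for v in filled:
            # fullmatch requires the whole value to satisfy the rule.
            if not re.fullmatch(element.pattern, str(v)):
                problems.append(f"{element.name_en}: bad format: {v}")
    return problems
```

For instance, an element registered with `required=True` and an email-like `pattern` would reject an empty submission and any value that does not match the pattern.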

1.3 data taxonomy registration

Registration, editing and deletion of tree-structured data classification systems are supported. Classification system information includes but is not limited to the classification name, classification code and classification description. The user can add, edit, insert and delete node information in any tree classification system.

The data classification system should support user-defined extension to multiple levels of classification. It is closely related to the classified-navigation retrieval of data sets in the data resource integrated sharing service: the association is made through the built-in metadata element "classified coding" mentioned above, which the publishing user selects and fills in during data publishing. Data classification system registration is one embodiment of the general customizability of the invention.
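A multi-level classification system of this kind is naturally modeled as a tree keyed by classification code. The sketch below shows adding, inserting, deleting and locating nodes; the `Category` class, its method names and the sample codes are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    code: str                 # classification code
    name: str                 # classification name
    description: str = ""     # classification description
    children: list = field(default_factory=list)

    def find(self, code):
        """Depth-first search for a node by classification code."""
        if self.code == code:
            return self
        for child in self.children:
            hit = child.find(code)
            if hit:
                return hit
        return None

    def add(self, parent_code, node, position=None):
        """Append under a parent, or insert at a given sibling position."""
        parent = self.find(parent_code)
        if parent is None:
            raise KeyError(parent_code)
        if position is None:
            parent.children.append(node)
        else:
            parent.children.insert(position, node)

    def remove(self, code):
        """Delete a node and its subtree; return True if something was removed."""
        for i, child in enumerate(self.children):
            if child.code == code:
                del self.children[i]
                return True
            if child.remove(code):
                return True
        return False

    def path(self, code, trail=()):
        """Names from the root down to the node, e.g. for classified navigation."""
        trail = trail + (self.name,)
        if self.code == code:
            return list(trail)
        for child in self.children:
            found = child.path(code, trail)
            if found:
                return found
        return None
```

A data set would then carry only the classification code in its metadata, and the portal could call something like `path` to render the navigation trail.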

1.4 license agreement registration

Standard license agreements such as CC, ODC and PDDL are supported, as are registering, editing and deleting custom license contents. The registration information mainly comprises an agreement identification code, an agreement name, an agreement identification picture and an agreement description text.

A license agreement is a means of protecting how data is acquired, reused and propagated. The registered license agreement is shown in the data set overview of the data resource integrated sharing service. The association is made through the built-in metadata element "license agreement" described above, which the publishing user selects and fills in during data publishing. License agreement registration is also an embodiment of the general customizability of the invention.

2. Distributed heterogeneous data convergence transmission and synchronization

This step realizes unified registration and connection management of relational and file data sources. It supports the construction of customized data transmission tasks to realize physical aggregation of heterogeneous data. It also supports breakpoint resume of transmission tasks; customized timed, automatic and manual synchronization of relational data; and log management of the whole data transmission and synchronization process.

This step is used by node administrators at the data nodes of the distribution end, and the identity of the node administrator must be authenticated before the step starts.

The main flow structure is shown in fig. 4. The implementation details of each step in fig. 4 are described in detail below.

2.1 heterogeneous data Source registration

This step realizes unified registration and connection management of relational data sources and file-type data sources.

Relational data source: registration and connection testing of database connection information are supported. The data source information should at least include the data source name, database type, host address, port number, user name and password. The database type should at least support mainstream relational databases such as MySQL, Oracle and SQL Server, and other relational databases can be added by extension. The prototype interface for creating a new relational data source is shown in FIG. 5.

File-type data source: definition and management of the address information of file-type data storage are supported. The data source information at least comprises a data source name and a file access protocol. When the access protocol is a local file system, the information must further include the data file path; when the access protocol is FTP, it must further include the FTP account, the FTP password and the FTP path of the data files. Extension to other protocols such as Samba is supported.

In implementing the method, a connectivity test must be provided for both relational and file data sources to ensure that the registration information of the data source is valid. The connectivity test can be run when the data source information is saved, and the registering user must be notified promptly when a connection cannot be established.

Data source registration is the basis for shielding the heterogeneity of data resources. In subsequent data task transmission, reading of relational table structures and data is converted into standard SQL through adapters for the different database types; file-type data is read directly as files.
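A connectivity test of the kind described can be sketched with the Python standard library. The example below is an assumption-laden illustration: the function names are hypothetical, the relational check only probes whether the database port accepts a TCP connection (a real check would also log in with the database driver), and the FTP check verifies login and that the registered path can be entered.

```python
import os
import socket
from ftplib import FTP, all_errors


def check_relational(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap reachability probe: can a TCP connection to the database port open?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_file_source(protocol: str, path: str, host: str = "", user: str = "",
                      password: str = "", timeout: float = 3.0) -> bool:
    """Check a file-type data source registered as 'local' or 'ftp'."""
    if protocol == "local":
        return os.path.isdir(path)
    if protocol == "ftp":
        try:
            with FTP(host, user, password, timeout=timeout) as ftp:
                ftp.cwd(path)          # fails if the directory does not exist
                return True
        except all_errors:
            return False
    raise ValueError(f"unsupported protocol: {protocol}")
```

Running such a check when the data source is saved, and reporting a `False` result to the registering user at once, matches the behavior the paragraph above asks for.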

2.2 data Transmission task construction

This step realizes management of relational and file data tasks, such as construction, editing, viewing and deletion.

Relational data task construction: the available data tables are obtained by connecting to a relational data source, and the relevant entity data tables, or logical data tables defined by SQL, are selected to form a data transmission task. See FIG. 6 for the prototype interface.

File-type data task construction: the file directory tree is obtained by connecting to a registered file data source, the relevant entity files or directories are selected, and the target transmission directory at the central end is chosen to form a file-type data transmission task. See FIG. 7 for the prototype interface.

2.3 transport task operation management

This step realizes management of the efficient and stable remote transmission of distribution-end data tasks to the central end.

Breakpoint resume of data transmission tasks is supported.

Encrypted and compressed data transmission is supported.

Display of transmission progress is supported.

Log record management of the whole transmission process is supported.

As described above, entity data files are transmitted over the FTP protocol to the Vsftp service at the central end, which is fully compatible with third-party FTP tools.
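Breakpoint resume over FTP can be illustrated with Python's standard `ftplib`, whose `storbinary` method accepts a `rest` offset. This is a minimal sketch, not the invention's transfer tool: the function names are hypothetical, the host and credentials in the usage comment are placeholders, and encryption and compression are left out.

```python
import os
from ftplib import FTP, error_perm


def remote_size(ftp, name: str) -> int:
    """Size of the partially uploaded remote file, or 0 if it does not exist."""
    try:
        ftp.voidcmd("TYPE I")          # SIZE is only reliable in binary mode
        return ftp.size(name) or 0
    except error_perm:
        return 0


def upload_with_resume(ftp, local_path: str, remote_name: str, on_progress=None) -> int:
    """Upload a file, continuing from the bytes the server already holds.

    Returns the offset at which the transfer was resumed.
    """
    total = os.path.getsize(local_path)
    offset = min(remote_size(ftp, remote_name), total)
    if total and offset == total:
        return offset                  # nothing left to send
    sent = offset

    def callback(block: bytes) -> None:
        nonlocal sent
        sent += len(block)
        if on_progress:
            on_progress(sent, total)   # e.g. update a progress display

    with open(local_path, "rb") as fp:
        fp.seek(offset)                # skip what was already transferred
        # A non-empty rest makes ftplib send a REST command before STOR.
        ftp.storbinary(f"STOR {remote_name}", fp, blocksize=8192,
                       callback=callback, rest=offset or None)
    return offset


# Usage with placeholder host and credentials:
# with FTP("center.example.org", "node_admin", "secret") as ftp:
#     upload_with_resume(ftp, "data.zip", "data.zip", print)
```

The progress callback also gives a natural hook for the transmission progress display and log records listed for this step.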

For relational entity data, the central end builds a relational database cluster of one type, such as MySQL. The table structures and extracted data of the different relational databases at the distribution end (MySQL, Oracle, SQL Server and so on) are mapped to table-creation SQL statements and data-insertion SQL statements consistent with the central-end database structure. These statements are packaged and transmitted to the central end as a compressed file; after decompressing, the central end executes the table-creation SQL and data-insertion SQL uniformly in the cloud relational database, thereby realizing remote transmission of relational tables and their data.
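The map, package, transmit and execute sequence just described can be sketched end to end. In the example below, SQLite stands in for the central-end database so that the code is self-contained; the type mapping table, the function names and the member name `task.sql` are assumptions, and a production mapping would need to cover each source dialect far more carefully.

```python
import io
import sqlite3
import zipfile

# Hypothetical mapping from source column types to central-end types.
TYPE_MAP = {"VARCHAR2": "VARCHAR(255)", "NVARCHAR": "VARCHAR(255)",
            "NUMBER": "DECIMAL(20,6)", "INT": "INTEGER", "DATETIME2": "DATETIME"}


def sql_literal(value) -> str:
    """Render a Python value as an SQL literal, escaping single quotes."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_script(table: str, columns: list, rows: list) -> str:
    """columns is a list of (name, source_type); rows is a list of tuples."""
    cols = ", ".join(f'"{n}" {TYPE_MAP.get(t.upper(), t)}' for n, t in columns)
    lines = [f'CREATE TABLE IF NOT EXISTS "{table}" ({cols});']
    names = ", ".join(f'"{n}"' for n, _ in columns)
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        lines.append(f'INSERT INTO "{table}" ({names}) VALUES ({values});')
    return "\n".join(lines)


def pack(script: str, member: str = "task.sql") -> bytes:
    """Compress the SQL script into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, script)
    return buf.getvalue()


def unpack_and_execute(payload: bytes, conn: sqlite3.Connection) -> None:
    """Central-end side: decompress and run every script in the archive."""
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for member in zf.namelist():
            conn.executescript(zf.read(member).decode("utf-8"))
    conn.commit()
```

Shipping SQL text rather than a database-specific dump is what lets a single central-end database type accept tables from several different source databases.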

2.4 relational data synchronization management

Relational data at the distribution end is synchronized to the relational database at the central end on a schedule.

Synchronization here applies only to relational data: every record of a relational table or logical table in a given transmission task at the distribution end is synchronized to the relational database table at the central end on a schedule. This mainly addresses relational tables at the distribution end whose records are added or changed regularly or irregularly. The user can set the synchronization frequency directly, without creating a new transmission task, and the system periodically synchronizes the table data covered by the transmission task with the table data at the central end, keeping the records in the relational database tables at the distribution end and the central end consistent.

File-type entity data, by contrast, changes infrequently and can be retransmitted through a newly created transmission task, so the method does not support synchronization of file-type entity data.

Both timed and manual synchronization of relational data tables are supported; timed synchronization supports a user-defined frequency, such as 1 hour, 12 hours, 1 day or 1 week.

Manual synchronization means that the user clicks the immediate-synchronization button in a transmission task, whereupon the data records of the relational database tables in the current task are immediately synchronized to the database tables at the central end, ensuring data consistency.

Timed synchronization means that the user sets a synchronization period for a transmission task, such as 1 hour, 12 hours, 1 day or 1 week. A background process of the system checks this period setting and, when the period elapses, automatically synchronizes the data records of the relational database tables in the current task to the database tables at the central end, keeping the records in the relational database tables at the distribution end and the central end consistent.
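The period matching performed by the background process can be sketched as follows. The task dictionary layout and the function names are assumptions made for illustration; the description does not specify how the period check is implemented.

```python
from datetime import datetime, timedelta

# Synchronization periods offered to the user (values taken from the text).
PERIODS = {
    "1 hour": timedelta(hours=1),
    "12 hours": timedelta(hours=12),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
}

def is_sync_due(last_sync, period, now):
    """True when at least one full period has elapsed since the last sync."""
    return now - last_sync >= PERIODS[period]

def due_tasks(tasks, now):
    """Return the names of transmission tasks whose period has elapsed."""
    return [t["name"] for t in tasks if is_sync_due(t["last_sync"], t["period"], now)]

now = datetime(2020, 1, 9, 12, 0)
tasks = [
    {"name": "t1", "period": "1 hour", "last_sync": datetime(2020, 1, 9, 10, 30)},
    {"name": "t2", "period": "1 day", "last_sync": datetime(2020, 1, 9, 0, 0)},
]
print(due_tasks(tasks, now))
```

A manual synchronization would simply bypass `is_sync_due` and run the task immediately.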

Detailed operation logs of the data synchronization process are recorded, so that the process can be traced.

3. Data resource centralized database construction organization and editing

This step realizes the construction and management of relational databases, supports online creation of new table structures and the import and editing of table data, and provides users with online database construction and data management services. It also realizes network-disk-style management of file data, supporting management operations on file data resources such as uploading, downloading, copying, moving and deleting.

This step is performed by a node administrator, who must pass node-administrator identity authentication before the step is started.

The main flow structure is shown in fig. 8. The implementation details of each step of fig. 8 are described in detail below.

3.1 relational database building

A new relational table can be created either by importing an Excel template or by associating existing, already described relational data tables.

Import-based table creation: a new table is created from an Excel template and the data in the template is stored in the database. Excel template rules: each sheet in the Excel file represents one data table, and the sheet name is the name of the data table to be created; the first row must contain the field descriptions, the second row the field names, and the third row the data types (including varchar, text, integer, float, double, datetime, etc.); the actual data starts from the fourth row. The pattern is shown in fig. 9.
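The template rules above can be sketched in code. In this minimal example the sheet is already loaded as a list of rows, and an in-memory SQLite database stands in for the central-end database; both choices, and the names `import_sheet`, `IDENT` and `ALLOWED_TYPES`, are assumptions for illustration only.

```python
import re
import sqlite3

IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
ALLOWED_TYPES = {"varchar", "text", "integer", "float", "double", "datetime"}

def import_sheet(conn, sheet_name, rows):
    """Create a table from one template sheet and load its data.

    rows[0]: field descriptions, rows[1]: field names,
    rows[2]: data types, rows[3:]: actual data.
    Returns a mapping of field name to field description.
    """
    descriptions, names, types, data = rows[0], rows[1], rows[2], rows[3:]
    # Table and field names are interpolated into SQL, so validate them first.
    for ident in [sheet_name, *names]:
        if not IDENT.match(ident):
            raise ValueError(f"invalid identifier: {ident!r}")
    for col_type in types:
        if col_type.lower() not in ALLOWED_TYPES:
            raise ValueError(f"unsupported data type: {col_type!r}")
    cols = ", ".join(f"{n} {t}" for n, t in zip(names, types))
    conn.execute(f"CREATE TABLE {sheet_name} ({cols})")
    placeholders = ", ".join("?" for _ in names)
    conn.executemany(f"INSERT INTO {sheet_name} VALUES ({placeholders})", data)
    return dict(zip(names, descriptions))

conn = sqlite3.connect(":memory:")
rows = [
    ["Station ID", "Station name", "Elevation"],  # row 1: descriptions
    ["id", "name", "elev"],                       # row 2: field names
    ["integer", "varchar", "double"],             # row 3: data types
    [1, "A", 10.5],                               # row 4 onward: data
    [2, "B", 20.0],
]
described = import_sheet(conn, "station", rows)
```

The returned descriptions could feed the table description step of section 3.2; reading the rows from an actual Excel file would need a spreadsheet library and is omitted here.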

Association-based table creation: there are two ways.

Through the interface: the join fields of table A and table B and the fields that make up the new table are selected through the interface; after the new table name is filled in, the new table is constructed and its data can be previewed, as shown in fig. 10.

Through an SQL statement: a multi-table join SQL statement together with a new table name defines the new table. Verification of the SQL statement is supported, the result of the SQL statement (that is, the new table data) can be previewed, and synchronous update of the table data at a customized frequency is supported. The prototype interface is shown in fig. 11.

3.2 relational library table description and field fusion configuration

This step realizes the description and fusion configuration of the structure information of a selected relational library table at the central end.

1. Description covers the names of the relational data tables and the names of their fields; see fig. 12 for a prototype.

2. Fusion configuration is realized by setting the display type of a field of the relational data table. The display types are:

Text type (the default display type)

URL type (with further options including FTP, HTTP, email, picture links, etc.)

Enumeration type (with a further option to set either an enumeration string, such as: male=man, female=woman; or an SQL statement giving the storage column and the display column, such as: select user_id, user_name from user)

Sub-table type (with further selection of the table name and associated field of the associated sub-table; multiple sub-tables can be added or removed)

File type (with further selection of the main path of files, pictures or videos, the file location, and the separator for records associated with multiple files)
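How a display type might drive rendering of a field value can be sketched as follows. The configuration dictionary, the enumeration string `M=male, F=female` and the function names are hypothetical; only the text, URL and enumeration types are shown.

```python
import html

def parse_enum(spec):
    """Parse an enumeration string such as 'M=male, F=female' into a dict."""
    pairs = (item.split("=", 1) for item in spec.split(",") if "=" in item)
    return {stored.strip(): shown.strip() for stored, shown in pairs}

def render(value, field_cfg):
    """Render a stored field value according to its configured display type."""
    kind = field_cfg.get("type", "text")  # text is the default display type
    if kind == "enum":
        # Show the display value; fall back to the stored value if unmapped.
        return field_cfg["mapping"].get(value, str(value))
    if kind == "url":
        safe = html.escape(str(value), quote=True)
        return f'<a href="{safe}">{safe}</a>'
    return str(value)

gender_cfg = {"type": "enum", "mapping": parse_enum("M=male, F=female")}
print(render("M", gender_cfg))
print(render("http://example.org", {"type": "url"}))
```

The SQL-based enumeration variant would fill `mapping` from the storage and display columns of the configured query instead of from a string.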

3.3 relational library table data management

This step realizes data management of all relational library tables at the central end, supporting viewing, adding, editing and deleting data.

The user can view all data tables under the managed database and update, add, view and delete their data; retrieval over all fields of a relational table is supported. The prototype is shown in fig. 13.

3.4 File-type data management

This step realizes network-disk-style management of all data files and directories at the central end. A prototype is shown in fig. 14.

Basic file and directory operations: renaming, moving, copying and deleting files through right-click operations.

Searching files and directories: a deep search, rooted at the current path, for files and directories whose names contain a specified string.

Uploading files: files can be uploaded to the current path or to a selected target path.

Downloading files: a selected file can be downloaded by double-clicking or through the right-click menu.

New directory: a folder is created under the current path.
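The deep search described above can be sketched with a recursive directory walk. Case-insensitive matching is an assumption of this sketch, as is the function name.

```python
import os

def deep_search(root, name_part):
    """Recursively find files and directories under root whose names
    contain name_part (case-insensitive, an assumption of this sketch)."""
    needle = name_part.lower()
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for entry in dirnames + filenames:
            if needle in entry.lower():
                matches.append(os.path.join(dirpath, entry))
    return sorted(matches)
```

Calling `deep_search(current_path, "report")` would return every matching file and directory below the current path, at any depth.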

4. Unified publishing and auditing of data resources

This step realizes data resource publishing, supporting unified metadata description, data range selection and publication management of heterogeneous data. It also realizes unified auditing of data resources, supporting batch auditing, user permission setting and fusion configuration. The main flow structure is shown in fig. 15.

In this step, publication filling, editing and submission are performed by node administrators, who must pass node-administrator identity authentication before the corresponding functions are enabled; publication auditing and authorization are performed by the system administrator, who must pass system-administrator identity authentication before the corresponding functions are enabled.

The implementation details of the key steps of fig. 15 are described below.

4.1 data set metadata filling

Based on the built-in metadata and the extension metadata, online one-by-one filling and batch filling of data set metadata are realized dynamically.

1) In terms of filling, based on the required items, uniqueness, element types and check rules defined by the built-in and extended metadata elements: (1) an online metadata filling page is generated automatically to realize one-by-one online filling (see fig. 16 for an example), in which the classification system and the data license agreement are offered to the user as enumeration lists based on the definitions made in the public basic data registration part, and the system stores the corresponding enumeration item numbers; (2) a batch filling template can be generated automatically to realize batch-import filling. The data template can be in Excel, XML, JSON or similar formats.

Both filling modes should automatically check required items and check rules. In addition, table 1 marks, in its remarks, the built-in metadata elements that the system fills automatically. In the implementation, some elements are filled automatically after the user makes an online selection (such as the classification system and the license agreement), while others are filled automatically when the system saves the record in the background and need no online or batch filling by the user. For example, the PID is obtained through the background PID automatic registration interface and then filled in; the data set release time is filled from the current time; the reference text is spliced automatically according to the reference format string; and the total number of files, the total storage amount and so on are counted automatically in the background.

2) In terms of automatic filling, effective docking with a globally unique data persistent identifier allocation interface is supported, so that the PID of the current data set is generated automatically; and, according to the definition of the data reference format, the data reference text of the current data set is generated automatically, so that the built-in data reference metadata element is filled automatically. Examples of PID data identification and data referencing are shown in fig. 17.

3) As mentioned above, the invention supports Chinese-English bilingual operation. For displaying metadata elements, the English names of the built-in and extended metadata can be used. For metadata content, after online or batch filling, automatic translation of the filled Chinese metadata into English is supported (realized through an open translation interface of Baidu or Google), as is manual verification of the translation result by the user; finally, the Chinese and English metadata are stored synchronously in the system background.
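The automatic check of required items, uniqueness and check rules mentioned in 4.1 can be illustrated with a small validator. The element definitions in `ELEMENTS`, including the email pattern, are hypothetical stand-ins for the registered metadata elements.

```python
import re

# Hypothetical metadata element definitions: required flag, uniqueness
# flag and an optional regular-expression check rule.
ELEMENTS = [
    {"name": "title", "required": True, "unique": True, "pattern": None},
    {"name": "contact_email", "required": True, "unique": False,
     "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    {"name": "keywords", "required": False, "unique": False, "pattern": None},
]

def validate(record, existing_records, elements=ELEMENTS):
    """Return a list of validation errors for one filled metadata record."""
    errors = []
    for el in elements:
        name = el["name"]
        value = record.get(name)
        if value in (None, ""):
            if el["required"]:
                errors.append(f"{name}: required")
            continue
        if el.get("pattern") and not re.match(el["pattern"], str(value)):
            errors.append(f"{name}: fails check rule")
        if el.get("unique") and any(r.get(name) == value for r in existing_records):
            errors.append(f"{name}: must be unique")
    return errors

existing = [{"title": "Soil survey"}]
print(validate({"title": "Soil survey", "contact_email": "bad"}, existing))
```

The same function could serve both filling modes: once per submitted form online, and once per row of an imported batch template.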

4.2 data set entity data selection

Based on the relational library tables and the file system at the central end, online selection of relational entity data tables and selection of entity data files through the file directory system are realized (heterogeneous entity data tables and entity files can be selected separately or together), and files can also be uploaded online and selected immediately. An example of data set entity data selection is shown in fig. 18.

4.3 data set editing and submission publishing

Metadata filling (4.1) and entity data selection (4.2) are the two key steps of organizing and publishing a data set. When a data set is edited, both steps can be re-edited and re-selected. After confirming that there are no errors, the data set can be submitted to an auditor for publication audit.

When a data set is submitted for publication audit, the background should automatically extract the text content of all text entity files (such as txt, doc and pdf) under the data set, build a full-text database of these entity files and index the file content, so as to support full-text retrieval over text entity files in the integrated sharing service.
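The content index can be sketched as a simple inverted index over already extracted text. Text extraction from doc or pdf files is outside this sketch, and the tokenizer is an assumption: splitting on word characters suits space-delimited languages, while Chinese text would need a word segmentation step.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case word tokens (an assumption; not suited to unsegmented Chinese)."""
    return re.findall(r"\w+", text.lower())

def build_index(files):
    """files: mapping of file name to extracted text. Returns token -> file names."""
    index = defaultdict(set)
    for name, text in files.items():
        for token in tokenize(text):
            index[token].add(name)
    return index

def search(index, query):
    """Return the files that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

files = {
    "a.txt": "Soil moisture observations, 2019.",
    "b.txt": "Rainfall and soil runoff.",
}
index = build_index(files)
print(sorted(search(index, "soil runoff")))
```

A production system would more likely delegate this to a dedicated full-text engine; the sketch only shows the index-then-query idea.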

4.4 data set auditing and authorization publication

Content auditing is performed on the data set to be published, focusing on whether the metadata is filled in according to the standard and whether the entity data is accurate. The range of users authorized to access the data set is then selected: either fully public to all users, or open to certain users (user groups).

For data set auditing, in addition to online auditing, batch export of data sets for offline auditing is supported. In the implementation, the metadata of data sets can be exported to Excel in batches; access interfaces for entity data files and relational data are encapsulated over HTTP or FTP and automatically associated with the metadata of the data set's entity data. The auditor can then check the metadata of a batch of data sets offline in the exported metadata file, access the entity data, select audit results and enter comments. Finally, the audit results in the Excel metadata file can be imported back into the system in batches.

Data set auditing and authorized publication are closely related to the data resource integrated sharing service of step 5: a data set that has been audited and published can be queried and viewed by users in the sharing service, and the users (user groups) within the authorization range of the data set obtain full access to the entity data of the data set after logging in to the system.
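The authorization rule for entity data access can be sketched as a single check. The field names (`status`, `public`, `users`, `groups`) are hypothetical; the description only states that access is either fully public or limited to certain users or user groups.

```python
def can_access_entity_data(dataset, user):
    """Decide whether a user may access the entity data of a data set.

    dataset: dict with 'status', 'public', 'users' and 'groups'.
    user: None for an anonymous visitor, else dict with 'name' and 'groups'.
    """
    if dataset["status"] != "published":
        return False  # only audited and published data sets are served
    if dataset["public"]:
        return True   # fully public to all users
    if user is None:
        return False  # restricted data sets require a logged-in user
    in_users = user["name"] in dataset["users"]
    in_groups = bool(set(user["groups"]) & set(dataset["groups"]))
    return in_users or in_groups

restricted = {"status": "published", "public": False,
              "users": {"alice"}, "groups": {"hydrology"}}
print(can_access_entity_data(restricted, {"name": "bob", "groups": ["hydrology"]}))
```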

5. Data resource integration sharing service

Integrated discovery and access services for data resources are realized, with Chinese-English bilingual service and automatic switching. Unified classified retrieval and keyword retrieval of data resources are supported, together with tag cloud filtering and multiple ordering schemes, customized query over all fields of entity relational library tables, full-text content retrieval of text entity files, and online preview and playback of data files in multiple formats such as documents, pictures, video and audio. Recommendation and acquisition services for data resources are realized, supporting several association-based recommendation modes driven by content and by user behavior, several data acquisition modes such as online download and API access, and management-oriented classified statistics of data access. Personalized management services for data resources are also realized, supporting collection, recommendation, downloading, evaluation, tagging and similar services according to individual needs.

In this step, data set retrieval, filtering, ordering, access and recommendation are available to anonymous users; data set downloading, evaluation and personalized services are for authorized users, whose identity must be authenticated before the corresponding functions are enabled.

The main flow structure is shown in fig. 19. The implementation details of each step in fig. 19 are described in detail below.

5.1 dataset retrieval

Two data retrieval modes are supported, keyword retrieval and classified navigation (when the user-defined extension metadata of a data set includes longitude and latitude metadata, online map retrieval is also supported), and the retrieval modes are also encapsulated as API interfaces.

Keyword retrieval: supports full-text search of a keyword over data set metadata, with the retrieved data set information sorted by relevance.

Classified navigation retrieval: based on the globally set classification system, displays the data resources under each classification, or searches for data set information within a specified classification.
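Keyword retrieval with relevance ordering, optionally restricted to one classification, can be sketched as below. The scoring weights (name over keywords over profile) are an arbitrary assumption for illustration; the description does not define how relevance is computed.

```python
def relevance(dataset, keyword):
    """Toy relevance score: matches in the name weigh most, then exact
    keyword matches, then occurrences in the profile (assumed weights)."""
    kw = keyword.lower()
    score = 3 * dataset["name"].lower().count(kw)
    score += 2 * sum(k.lower() == kw for k in dataset["keywords"])
    score += dataset["profile"].lower().count(kw)
    return score

def keyword_search(datasets, keyword, category=None):
    """Return names of matching data sets, most relevant first.
    If category is given, only data sets in that classification are searched."""
    hits = [
        (relevance(d, keyword), d["name"])
        for d in datasets
        if category is None or d["category"] == category
    ]
    ranked = sorted(hits, key=lambda hit: -hit[0])
    return [name for score, name in ranked if score > 0]

datasets = [
    {"name": "Rainfall records", "keywords": ["rain"],
     "profile": "Rain and soil runoff.", "category": "earth"},
    {"name": "Soil moisture grid", "keywords": ["soil", "moisture"],
     "profile": "Daily soil moisture.", "category": "earth"},
    {"name": "Genome set", "keywords": ["dna"],
     "profile": "Sequences", "category": "bio"},
]
print(keyword_search(datasets, "soil"))
```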

5.2 dataset filtering and ordering

Tag cloud display of data resources and multi-condition step-by-step filtering based on it are supported, as is multi-condition re-ordering of data resource retrieval results.

Combined step-by-step filtering with tag clouds and other conditions: a tag cloud is generated dynamically from the data resource retrieval results, and the data resources can then be filtered step by step through the tag cloud; filtering by combining classified navigation and keywords is also supported.

Comprehensive ordering: supports dynamic ordering of data resources by time, file type, user access popularity, and so on.
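Step-by-step tag cloud filtering can be sketched as repeatedly counting tags over the current result set and narrowing it by one chosen tag. The `tags` field and the function names are assumptions for illustration.

```python
from collections import Counter

def tag_cloud(results):
    """Count how often each tag occurs in the current result set."""
    return Counter(tag for dataset in results for tag in dataset["tags"])

def filter_by_tag(results, tag):
    """Keep only the data sets that carry the chosen tag."""
    return [dataset for dataset in results if tag in dataset["tags"]]

results = [
    {"name": "A", "tags": ["soil", "2019"]},
    {"name": "B", "tags": ["soil", "rain"]},
    {"name": "C", "tags": ["rain"]},
]
step1 = filter_by_tag(results, "soil")   # first filtering step
cloud1 = tag_cloud(step1)                # cloud regenerated from the narrowed set
step2 = filter_by_tag(step1, "rain")     # second filtering step
print([d["name"] for d in step2])
```

Because the cloud is rebuilt after each step, the user only ever sees tags that still occur in the remaining results.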

5.3 dataset access and evaluation

Oriented to user needs, online browsing, playing and display of typical entity data files in data resources are realized. Online customized query, result download and fusion-integrated display of relational table entity data are supported, as are full-text retrieval of text entity files, online download of metadata (entity data) with API access service encapsulation, and user-defined data social services such as tagging, evaluation and sharing.

Online browsing of entity data files: supported file formats include, but are not limited to, mainstream data file types such as doc, xls, pdf, mp3, csv, avi and txt; the set can be extended dynamically to support preview, display and playback of other formats.

Online query and display of table data: supports full-field customized retrieval (such as combining custom field retrieval conditions), display (such as custom display columns and sorting columns) and result download of relational table data. Based on the fusion configuration of the relational library table, it supports displaying the sub-tables, files, videos and pictures associated with a row of a relational table, and supports association with enumeration dictionaries and URL link services (URL text is automatically displayed as a clickable link; supported link formats include http, ftp, email and so on).

Full-text retrieval of text files: based on the content extraction and indexing of text files (including but not limited to txt, doc, docx and pdf) performed when a data set is submitted for publication, full-text retrieval of text entity data files is supported.

Data download service: for logged-in users, provides selective download of data entities at different levels and scopes, both for the data files of a data set and for query results, and provides download of metadata. In addition to interface-based online download, API-based download is supported.

Data social service: logged-in users who have downloaded data can score and evaluate data resources; visiting users can tag data sets, and background administrators can audit, manage and filter user tags and use them to supplement and correct existing data set tag settings. Users can also conveniently share the URL of a data set to social media such as WeChat and Weibo.

5.4 dataset recommendation

A recommendation service based on correlation of data set metadata content is supported, as is a data recommendation service based on statistics of user access behavior.

Metadata content association recommendation: based on the content of the metadata element descriptions, recommends other data sets that are highly similar to the current data set, helping the user quickly find closely related data sets.

User access behavior analysis recommendation: based on a statistical analysis of which other data sets have been accessed by the users who accessed the current data set, recommends similar data sets that may interest the current user, helping the user quickly find similar data resources.
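Both recommendation modes can be sketched briefly. Jaccard similarity over keyword sets and simple co-access counting are assumed measures chosen for illustration; the description does not name the similarity or statistics it uses.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two keyword collections (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_by_content(current, others, top_n=3):
    """Rank other data sets by keyword similarity to the current one."""
    scored = [(jaccard(current["keywords"], d["keywords"]), d["name"]) for d in others]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [name for score, name in ranked if score > 0][:top_n]

def recommend_by_behavior(current_name, access_log, top_n=3):
    """access_log: list of (user, dataset_name). Recommend the data sets most
    often accessed by users who also accessed the current data set."""
    visitors = {user for user, name in access_log if name == current_name}
    counts = Counter(
        name for user, name in access_log
        if user in visitors and name != current_name
    )
    return [name for name, _ in counts.most_common(top_n)]

current = {"name": "A", "keywords": ["soil", "moisture"]}
others = [
    {"name": "X", "keywords": ["soil", "rain"]},
    {"name": "Y", "keywords": ["soil", "moisture", "grid"]},
    {"name": "Z", "keywords": ["dna"]},
]
log = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C"), ("u3", "C")]
print(recommend_by_content(current, others))
print(recommend_by_behavior("A", log))
```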

5.5 data set service records and statistics

Full logging and management of user data access behavior is supported, as are statistics and display of data set access and downloads.

User access log management: supports full logging of access behavior such as user login, access and download.

Data resource and service statistics: supports statistics and ranking of data set views, collections and downloads.

Statistical presentation of data sets: supports presenting statistical results in various forms such as bar charts and graphs.

5.6 user personalized services

Display of the user's access and download history is supported, as is management of the user's collections, evaluations and tags.

My access and download: users can quickly search and view the data resources they have accessed and downloaded.

My ratings: users can quickly search and view the data resources they have rated.

My tags: users can quickly search and view the tags they have attached to data resources.

My collection: supports collecting data resources, so that users can conveniently view and obtain the data resources that interest them.

6. System integration description

In the system implementation, the steps of the method are combined appropriately; the overall software structure of the system is shown in fig. 20. From bottom to top, the invention comprises three software systems: data aggregation transmission software, data management and release software, and the data sharing and service portal.

The overall deployment architecture of the system is shown in fig. 21. The system can be implemented with currently widely used Web development technology, adopting the MVC design pattern on a B/S (browser/server) architecture. Here, the Model is the part of the application that handles data logic, the Controller is the part that handles user interaction, and the View is the part that handles the display of data.

7. Summary

The beneficial effect of the invention is that it provides a general method and system design for integrated physical aggregation, organization, release and integration service of distributed heterogeneous data (relational and file-type).

The method decouples the whole process of physical aggregation, organization, release and integrated service of heterogeneous (relational and file-type) data resources. Its design fully considers the requirements of high customization and high reuse, which effectively improves the universality and flexibility of the invention and makes it applicable to a wide range of scenarios. A user can complete effective aggregation, release and service of distributed data through customized configuration alone, which greatly improves the design and development efficiency of distributed data sharing service systems and shortens the software development cycle.

The method also takes the advancement of its services into account. It realizes efficient centralized physical aggregation, transmission and synchronization of heterogeneous (relational and file-type) data; realizes batch filling, organization and auditing of data; connects data persistent identification and data reference standards; supports bilingual publishing; realizes full-text retrieval of text entity data and customized retrieval over all fields of data tables; realizes integrated service of structured and unstructured data; and integrates and encapsulates services such as retrieval, filtering, access, downloading, recommendation and social interaction.

The invention provides a general method, model and framework that is easy to extend. In terms of heterogeneous data sources, users can extend the system as needed. For example, the system of the invention implements mainstream relational databases such as MySQL, Oracle and SQLServer, and users can add other relational data sources they need. In terms of files, the invention implements local file system and FTP file data sources, and users can add other file data sources such as Samba. Users can also add NoSQL data sources, such as MongoDB.

The above embodiments only illustrate the technical solution of the present invention and do not limit it. Those skilled in the art may modify or substitute the technical solution of the present invention without departing from its principle and scope, and the protection scope of the present invention shall be defined by the claims.

Claims

1. The universal distributed heterogeneous data integrated physical aggregation, organization, release and service method is characterized by comprising the following steps of:

4) Uniformly publishing and auditing the data resources at the center end;

5) The method comprises the steps that integrated sharing service of data resources is carried out at a center end;

the aggregation transmission and synchronization of the distributed heterogeneous data comprises the following steps:

2.4 Performing synchronous management of relational data, and synchronizing each record in a relational table or a logic table in a transmission task of a distributed end to a relational library table of a central end at fixed time;

the database construction organization and editing of the converged data resources comprise the following steps:

2. The method according to claim 1, wherein the data node registration realizes registration management of data node information and node administrator authentication information of a distribution end;

3. The method of claim 2, wherein the data node registers, wherein the attribute information of the data node comprises: data node name, node code, node profile, node contact, contact phone, email, node administrator account, node administrator password, data node creation time, sequence number; the node administrator account number and the node administrator password are used for carrying out identity authentication of the node administrator when the distribution end executes the step 2); the metadata extension element registration includes the following metadata elements: the data set unique persistent identification, the data set cover, the data set name, the data set profile, the keywords, the classification code, the start time, the end time, the creation mechanism, the creation personnel, the latest creation/update date, the release mechanism, the contact mail, the contact phone, the latest release date, the license agreement, the reference format, the total storage amount, the total number of files, and the total number of records.

4. The method according to claim 1, wherein step 2.2) the relational data task is constructed by connecting the relational data sources described above to obtain a related data table, selecting a related entity data table or a logical data table formed by SQL to form a data transmission task; the file type data task is constructed by connecting the file data sources to determine a related file directory system, selecting related entity files or directories, and selecting the target transmission directory position of the central end to form a file type data transmission task.

5. The method according to claim 1, wherein step 2.3) the transmission task operation management includes: entity data file transmission is based on the Vsftp service at the central end and adopts the FTP protocol, supporting full compatibility with third-party FTP tools; for relational entity data transmission, based on the relational database cluster of a given type constructed at the central end, the relational table structures and data of different types at the distribution end are extracted and mapped into table-creation SQL statements and data-insertion SQL statements consistent with the database structure of the central end, the table-creation SQL statements and the data-insertion SQL statements are then packed and transmitted to the central end as a compressed file, and after decompression the central end uniformly executes the table-creation SQL and the data-insertion SQL in the cloud relational database, thereby realizing remote transmission of the relational data tables and data.

6. The method of claim 1, wherein the uniformly publishing and auditing the data resources comprises:

4.3 Editing, submitting and publishing the data set;

7. The method of claim 1, wherein the integrated sharing service of the data resource comprises:

8. The universal distributed heterogeneous data integrated physical aggregation, organization, release and service system is characterized by comprising a center end and a distribution end, wherein the distribution end deploys a data aggregation transmission software module, the center end deploys a data management and release software module and a data sharing and service portal module, and a public basic data registration and service sub-module is integrated in the data management and release software module;

the data sharing and service portal module is responsible for carrying out integrated sharing service of data resources;