CN109710667A

CN109710667A - Method and system for implementing multi-source data fusion and sharing based on a big data platform

Info

Publication number: CN109710667A
Application number: CN201811426832.4A
Authority: CN
Inventors: 张帅; 谢莹莹; 郭庆; 宋怀明; 蒋丹东
Original assignee: Zhongke Dawning International Information Industry Co Ltd
Current assignee: Zhongke Dawning International Information Industry Co Ltd; Dawning Information Industry Co Ltd
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2019-05-03

Abstract

The present invention provides a method and system for implementing multi-source data fusion and sharing based on a big data platform. The method includes: configuring at least one piece of data source information and timing rules, and executing a data access job according to the configured timing rules, where the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform; performing a data fusion job, according to the configured timing rules, on the data accessed by the data access job; storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository; and sharing data through a unified data exchange interface provided in the constructed big data platform. When faced with different scenarios and multi-source data, the invention needs only flexible configuration rather than renewed development, which greatly improves the efficiency of bringing a project online and greatly simplifies the retrieval of data in the big data platform by upper-layer applications.

Description

Method and system for implementing multi-source data fusion and sharing based on a big data platform

Technical field

The present invention relates to the field of big data technology, and in particular to a method and system for implementing multi-source data fusion and sharing based on a big data platform.

Background art

In recent years, with the rapid development of information technologies such as the internet, social networks, cloud computing, search engines, and communication technology, hundreds of millions of users generate large amounts of data every day. The emergence of large-scale data brings valuable opportunities to many industries, but the typical characteristics that accompany such data, for example large scale, multiple sources, diverse types and schemas (heterogeneity), high dimensionality, and uneven quality, pose great challenges at every stage of data representation, understanding, computation, and application. Data quality is the bottleneck that restricts the use of data. As important techniques for improving data quality, data cleaning and data fusion are active research areas in multi-source heterogeneous data processing and have significant value. However, traditional data cleaning methods implement business logic by hard coding, which leaves the system with poor reusability, scalability, and flexibility. In addition, many real-world applications need to integrate heterogeneous data from different channels, and ensuring the consistency of these data, that is, entity recognition, has become a problem that must be solved.

At present, there are 31 traffic service systems, 194 signal-controlled intersections, 66 public security speed-measurement checkpoints, red-light capture systems at 192 intersections, 86 traffic guidance systems, 369 flow monitoring systems, 652 road video units, 32 high-altitude HD video systems, 45 vehicle-mounted 3G video systems, 248 event monitoring systems, 273 mobile enforcement terminals, and so on.

In the field of traffic management, "big data" mainly comes from the following sources: archive data on motor vehicles, drivers, and roads collected through administrative management; vehicle and driver information collected by road law enforcement officers, along with data on traffic violations investigated and traffic accidents handled; road and traffic information, video, pictures, traffic flow, GPS tracks, and other data collected automatically by road electronic monitoring equipment; the various kinds of fragmented data generated by traffic management and public services; and information exchanged with related departments such as population, insurance, taxation, and planning. By type, these data include pictures, video, two-dimensional tables, and structured, semi-structured, and unstructured data. By channel, they cover data application scenarios such as traditional service windows, the internet, and the mobile internet.

Therefore, there is a need to build, according to actual business requirements and accumulated data and using advanced big data technology, an efficient, stable, high-performance big data base platform that gathers multi-source heterogeneous data and uses a unified big data storage and processing architecture to provide the corresponding data access, data fusion, data storage, data computation, and data sharing, so as to give strong support and assurance to all kinds of big data applications.

In the course of enterprise informatization, the business systems and data management systems of an enterprise are built and deployed at different stages and are affected by technical, economic, and human factors. As a result, the enterprise accumulates a large amount of business data stored in different ways, and the data management systems it uses also differ greatly, from simple file databases to complex network databases. Together these constitute the heterogeneous data sources of the enterprise.

Existing solutions usually have a high time overhead, and their running time can grow exponentially as the number of attribute dimensions in the data set increases. In a big data environment, the data differ greatly in structure, come from a wide range of sources, have low value density, and are updated in real time, all of which pose huge challenges to multi-source data fusion technology. Fusing multi-source heterogeneous data gives researchers an effective means of acquiring, organizing, and using knowledge in a big data environment. However, current knowledge fusion methods still have many shortcomings in moving from theory to practice.

Summary of the invention

The method and system for implementing multi-source data fusion and sharing based on a big data platform provided by the present invention can handle different scenarios and multi-source data through flexible configuration alone, without renewed development, which greatly improves the efficiency of bringing a project online and greatly simplifies the retrieval of data in the big data platform by upper-layer applications.

In a first aspect, the present invention provides a method for implementing multi-source data fusion and sharing based on a big data platform, comprising:

Configuring at least one piece of data source information and timing rules, and executing a data access job according to the configured timing rules, wherein the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform;

Performing a data fusion job, according to the configured timing rules, on the data accessed by the data access job;

Storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository;

Sharing data through a unified data exchange interface provided in the constructed big data platform.
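
As an illustration of the first step, the following minimal Python sketch shows one way data source information and timing rules could be held as configuration and used to decide which data access jobs are due. It is not taken from the patent: the names `TimingRule`, `DataSource`, and `due_sources`, the field names, and the example locations are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimingRule:
    """Hypothetical timing rule: run a job once per `period`."""
    period: timedelta


@dataclass
class DataSource:
    """Hypothetical data source entry; all field names are illustrative."""
    name: str
    kind: str        # e.g. "database", "file", "internet"
    location: str    # connection string, path, or URL (placeholder values)
    access_rule: TimingRule


def due_sources(sources, last_run, now):
    """Return the names of sources whose data access job is due at `now`."""
    due = []
    for src in sources:
        previous = last_run.get(src.name)
        if previous is None or now - previous >= src.access_rule.period:
            due.append(src.name)
    return due


sources = [
    DataSource("vehicle_registry", "database", "jdbc:example://host/db",
               TimingRule(timedelta(hours=24))),
    DataSource("road_sensors", "file", "/data/incoming/sensors",
               TimingRule(timedelta(minutes=10))),
]
now = datetime(2018, 11, 27, 12, 0)
last_run = {
    "vehicle_registry": datetime(2018, 11, 27, 0, 0),  # 12 h ago: not due
    "road_sensors": datetime(2018, 11, 27, 11, 45),    # 15 min ago: due
}
print(due_sources(sources, last_run, now))  # ['road_sensors']
```

Adding a new source in this sketch means adding one more configuration entry rather than writing new code, which is the property the method relies on.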

Optionally, performing the data fusion job, according to the configured timing rules, on the data accessed by the data access job includes:

When the accessed data are record-level data, the fusion job on the record-level data includes performing information verification on the data of each record against each condition;

When the accessed data are field-level data, the fusion job on the field-level data includes field verification or field conversion.
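
The distinction between the two levels can be sketched as follows. This is a hypothetical Python illustration, not the implementation in the patent: the function names, the example conditions, and the `plate` and `speed` fields are assumptions chosen for the example.

```python
def record_passes(record, conditions):
    """Record-level check: keep the record only if every condition holds."""
    return all(cond(record) for cond in conditions)


def fuse_fields(record, validators, converters):
    """Field-level step: verify each field, then convert it.

    Returns the converted record, or None if any field fails verification.
    """
    out = {}
    for name, value in record.items():
        check = validators.get(name)
        if check is not None and not check(value):
            return None
        convert = converters.get(name, lambda v: v)
        out[name] = convert(value)
    return out


# Illustrative rules only; real rules would come from configuration.
conditions = [lambda r: "plate" in r, lambda r: r.get("speed", "") != ""]
validators = {"speed": str.isdigit}
converters = {"plate": str.upper, "speed": int}

raw = [
    {"plate": "abc123", "speed": "62"},
    {"plate": "xyz789", "speed": ""},      # fails the record-level check
    {"plate": "def456", "speed": "fast"},  # fails the field-level check
]
fused = []
for r in raw:
    if not record_passes(r, conditions):
        continue
    f = fuse_fields(r, validators, converters)
    if f is not None:
        fused.append(f)
print(fused)  # [{'plate': 'ABC123', 'speed': 62}]
```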

Optionally, the data fusion job processes the data to be fused by an ETL method; wherein,

In the ETL method, the ETL implementation classes use the decorator pattern, and a corresponding configuration file is configured so that a filtering process, a conversion process, and a filtering process are carried out in sequence.

Optionally, storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository, includes:

Inputting one or any combination of the following parameters: data directory, number of data fields, data rowkey field, and thematic database name;

Instantiating a connection according to the HBase connection type and the thematic database name;

Reading the most recent load date or time record for the corresponding data type, and calculating the time interval since the last load;

Determining whether the time interval is greater than the time cycle configured in the timing rule;

When the time interval is greater than the time cycle configured in the timing rule, recording in the log that the previous one or more load cycles failed, then checking the log and executing a reload operation;

Alternatively, when the time interval is not greater than the time cycle configured in the timing rule, splitting the records one by one according to the supplied separator;

Comparing the length of the array produced by each split with the supplied total number of fields, and retaining the records for which the two are equal;

Combining fields into the primary key according to the supplied field indexes;

Putting the data into HBase;

After execution completes, recording that the load for this time cycle succeeded.
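
The loading steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a plain dictionary stands in for the HBase table, a list stands in for the log, the rowkey is formed by joining the selected fields with an underscore, and the function name and parameters are hypothetical. A missed cycle is only logged and reported to the caller, which would then trigger the reload.

```python
from datetime import datetime, timedelta


def load_records(lines, separator, field_count, key_indexes,
                 last_load, now, period, table, log):
    """Load delimited records into `table` (a dict standing in for HBase)."""
    if now - last_load > period:
        # One or more load cycles were missed: log it and let the caller reload.
        log.append("missed load cycle(s) since %s" % last_load.isoformat())
        return False
    for line in lines:
        fields = line.split(separator)
        if len(fields) != field_count:
            continue  # discard records whose field count does not match
        rowkey = "_".join(fields[i] for i in key_indexes)
        table[rowkey] = fields  # stand-in for an HBase put
    log.append("load succeeded for cycle ending %s" % now.isoformat())
    return True


table, log = {}, []
ok = load_records(
    ["A1|2018-11-27|62", "broken|row", "B2|2018-11-27|48"],
    separator="|", field_count=3, key_indexes=[0, 1],
    last_load=datetime(2018, 11, 26, 12, 0),
    now=datetime(2018, 11, 27, 12, 0),
    period=timedelta(days=1), table=table, log=log)
print(ok, sorted(table))  # True ['A1_2018-11-27', 'B2_2018-11-27']
```

The malformed record `broken|row` is dropped because it splits into two fields rather than the three that were declared.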

Optionally, after storing the fused data in layers and across separate databases to form a repository and building a secondary index database on the repository, the method further includes:

Setting up configurable scripts to automate the creation of databases and tables and the loading of data.
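
One possible reading of a configurable script is a small generator that turns a configuration mapping into HBase shell statements. The sketch below is an assumption, not the script described in the patent: it maps each database to an HBase namespace, and the namespace, table, and column family names are invented for the example. `create_namespace` and `create` are standard HBase shell commands.

```python
def hbase_shell_script(config):
    """Render HBase shell statements from {namespace: {table: [families]}}."""
    lines = []
    for namespace, tables in config.items():
        lines.append("create_namespace '%s'" % namespace)
        for table, families in tables.items():
            cols = ", ".join("'%s'" % f for f in families)
            lines.append("create '%s:%s', %s" % (namespace, table, cols))
    return "\n".join(lines)


# Hypothetical layout: one namespace per layer of the repository.
config = {
    "base": {"vehicle": ["info"]},
    "thematic": {"violation": ["info", "media"]},
}
print(hbase_shell_script(config))
```

The generated text could be saved to a file and passed to the HBase shell, so that adding a table only requires a change to the configuration mapping.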

Optionally, sharing data through a standard unified data exchange interface established in the big data platform includes:

When the data sharing performed is data query sharing, providing a request-response sharing process to upper-layer applications through a Java API or REST;

When the data sharing performed is data retrieval, setting retrieval permissions that are constrained through access control in system management, wherein the retrieval can return the retrieved data for any request;

When the data sharing performed is data access, recording a data access log through the external sharing interface.
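
The three sharing behaviours (request-response query, permission constraint, access logging) can be combined in one small interface. The following Python sketch is illustrative only: the patent names a Java API and REST services, whereas the class `SharingInterface`, its in-memory store, and its permission model are assumptions made for this example.

```python
class SharingInterface:
    """Hypothetical unified query interface with access control and logging."""

    def __init__(self, store, permissions=None):
        self.store = store              # dict: table name -> {rowkey: row}
        self.permissions = permissions  # None means any request is allowed
        self.access_log = []

    def query(self, user, table, rowkey):
        allowed = (self.permissions is None
                   or table in self.permissions.get(user, set()))
        # Every call is logged, whether or not it is allowed.
        self.access_log.append((user, table, rowkey, allowed))
        if not allowed:
            raise PermissionError("%s may not read %s" % (user, table))
        return self.store.get(table, {}).get(rowkey)


store = {"vehicle": {"A1": {"plate": "A1", "type": "car"}}}
api = SharingInterface(store, permissions={"analyst": {"vehicle"}})
print(api.query("analyst", "vehicle", "A1"))  # {'plate': 'A1', 'type': 'car'}
```

Leaving `permissions` as `None` corresponds to the default behaviour in which the retrieval can return data for any request.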

In a second aspect, the present invention provides a system for implementing multi-source data fusion and sharing based on a big data platform, comprising:

A configuration module, for configuring at least one piece of data source information and timing rules;

A data access module, for executing a data access job according to the configured timing rules, wherein the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform;

A data fusion module, for performing a data fusion job, according to the configured timing rules, on the data accessed by the data access job;

A storage module, for storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository;

A data sharing module, for sharing data through a unified data exchange interface provided in the constructed big data platform.

Optionally, the data fusion module includes:

A first fusion submodule, for use when the accessed data are record-level data, in which case the fusion job on the record-level data includes performing information verification on the data of each record against each condition;

A second fusion submodule, for use when the accessed data are field-level data, in which case the fusion job on the field-level data includes field verification or field conversion.

Optionally, the storage module includes:

A parameter input submodule, for inputting one or any combination of the following parameters: data directory, number of data fields, data rowkey field, and thematic database name;

A connection instantiation submodule, for instantiating a connection according to the HBase connection type and the thematic database name;

A calculation submodule, for reading the most recent load date or time record for the corresponding data type and calculating the time interval since the last load;

A judgment submodule, for determining whether the time interval is greater than the time cycle configured in the timing rule;

A first job submodule, for, when the time interval is greater than the time cycle configured in the timing rule, recording in the log that the previous one or more load cycles failed, then checking the log and executing a reload operation;

A second job submodule, for, when the time interval is not greater than the time cycle configured in the timing rule, splitting the records one by one according to the supplied separator; comparing the length of the array produced by each split with the supplied total number of fields, and retaining the records for which the two are equal; combining fields into the primary key according to the supplied field indexes; putting the data into HBase; and, after execution completes, recording that the load for this time cycle succeeded.

Optionally, the data sharing module includes:

A data query sharing submodule, for providing a request-response sharing process to upper-layer applications through a Java API or REST;

A data retrieval submodule, for setting retrieval permissions that are constrained through access control in system management, wherein the retrieval can return the retrieved data for any request;

A data access submodule, for recording a data access log through the external sharing interface.

In the method and system for implementing multi-source data fusion and sharing based on a big data platform provided by the embodiments of the present invention, the method works mainly by directly and flexibly configuring, in the big data platform, the data source information and the timing rules corresponding to each job in data processing. First, because at least one piece of data source information is configured directly and flexibly, the method can handle different scenarios and multi-source data through configuration alone, without renewed development; the data access and loading process is fully automated, which greatly improves the efficiency of bringing a project online. Second, the method can also configure the timing rule corresponding to each job and run the data access job and the data fusion job according to those rules, so that the resulting big data platform can use a timed scheduling framework to access multi-source heterogeneous data automatically and incrementally. Third, the method stores the data in a unified way, in layers and across separate databases, to form a repository, for example by setting up configurable unified storage for the multi-source heterogeneous data to be stored; it also builds a secondary index database on the unified repository, which speeds up queries over multi-source data on the big data platform. Fourth, the method can also share data queries through a unified data exchange interface, which greatly reduces the complexity of retrieving data in the big data platform for upper-layer applications.

Brief description of the drawings

Fig. 1 is a flowchart of a method for implementing multi-source data fusion and sharing based on a big data platform according to one embodiment of the present invention;

Fig. 2 is a flowchart of a method for implementing multi-source data fusion and sharing based on a big data platform according to another embodiment of the present invention;

Fig. 3 is a flowchart of the data fusion job in one embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a system for implementing multi-source data fusion and sharing based on a big data platform according to one embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.

An embodiment of the present invention provides a method for implementing multi-source data fusion and sharing based on a big data platform. As shown in Fig. 1, the method includes:

S11, configuring at least one piece of data source information and the timing rule corresponding to each job, and executing a data access job according to the configured timing rules, wherein the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform;

S12, performing a data fusion job, according to the configured timing rules, on the data accessed by the data access job;

S13, storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository;

S14, sharing data through a unified data exchange interface provided in the constructed big data platform.

The method for implementing multi-source data fusion and sharing based on a big data platform provided by this embodiment works mainly by directly and flexibly configuring, in the big data platform, the data source information and the timing rules corresponding to each job in data processing. First, because at least one piece of data source information is configured directly and flexibly, the method can handle different scenarios and multi-source data through configuration alone, without renewed development; the data access and loading process is fully automated, which greatly improves the efficiency of bringing a project online. Second, the method can also configure the timing rule corresponding to each job and run the data access job and the data fusion job according to those rules, so that the resulting big data platform can use a timed scheduling framework to access multi-source heterogeneous data automatically and incrementally. Third, the method stores the data in a unified way, in layers and across separate databases, to form a repository, for example by setting up configurable unified storage for the multi-source heterogeneous data to be stored; it also builds a secondary index database on the unified repository, which speeds up queries over multi-source data on the big data platform. Fourth, the method can also share data queries through a unified data exchange interface, which greatly reduces the complexity of retrieving data in the big data platform for upper-layer applications.
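
The patent does not say how incremental access is tracked between scheduled runs. One common technique, shown here purely as an assumption, is a watermark: each run extracts only rows modified after the highest modification value seen so far. The function name and the `updated_at` field are hypothetical.

```python
def incremental_extract(rows, watermark):
    """Return the rows newer than `watermark` and the advanced watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    if new_rows:
        watermark = max(r["updated_at"] for r in new_rows)
    return new_rows, watermark


# First scheduled run: everything is new.
source_rows = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 205}]
batch1, mark = incremental_extract(source_rows, watermark=0)

# A row arrives before the second scheduled run; only that row is extracted.
source_rows.append({"id": 3, "updated_at": 310})
batch2, mark = incremental_extract(source_rows, mark)
print([r["id"] for r in batch1], [r["id"] for r in batch2], mark)  # [1, 2] [3] 310
```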

Specifically, in the method of this embodiment, the data access job extracts data from the data sources of multiple different business systems and multiple platforms, collects internet data, converts data, or loads data into the big data platform. Data extraction uses a data extraction client to collect and extract data through the steps of configuring the data source, formulating collection rules, and mounting the data extraction job; the extraction process does not affect the normal operation of the source system.

Data reception provides for receiving source data, whether from data sources inside the system or outside it, and can be divided into two functional modules: a data reception service and a data collection client.

Internet data collection uses the collection URL (Uniform Resource Locator) addresses and rule configuration supplied by the user to collect data from internet web pages, and ultimately produces HDFS (Hadoop Distributed File System) files.

Optionally, as shown in Fig. 2, performing the data fusion job, according to the configured timing rules, on the data accessed by the data access job includes:

When the accessed data are record-level data, the fusion job on the record-level data includes performing information verification on the data of each record against each condition, wherein the format of the accessed data may be heterogeneous or non-heterogeneous;

Optionally, the data fusion job processes the data to be fused by an ETL method, where ETL is short for Extraction-Transformation-Loading, that is, the process of extracting (Extract), transforming (Transform), and loading (Load) data.

Specifically, in the method of this embodiment, data fusion is performed by configurable data fusion jobs at the record level and the field level. The fusion job for record-level data includes cleaning and verifying records against multiple conditions; the fusion job for field-level data includes operations such as field verification and field conversion. As shown in Fig. 3, the data fusion job passes the HDFS files corresponding to the data to be fused through the TextInputETLMapper framework and the TextInputETLReducer framework for fusion processing, and ultimately produces HDFS files in a new format; this flow involves repeated calls to the same processing functions. In addition, the ETL implementation classes use the decorator pattern and are configured in a configuration file, for example to implement the repeated sequence Filter A (FilterA) → Filter B (FilterB) → Transfer A (TransferA) → Filter A (FilterA) → Filter B (FilterB). The job is then submitted for execution by the job scheduling module.
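
A decorator-pattern chain driven by configuration can be sketched as follows. Only the step names FilterA, FilterB, and TransferA come from the text; the base classes, the `build_pipeline` helper, and what each step actually does to a record are assumptions made for illustration, and the list `configured` stands in for the configuration file.

```python
class Step:
    """Base ETL step: passes every record through unchanged."""
    def process(self, records):
        return list(records)


class StepDecorator(Step):
    """Wraps another step and applies its own processing afterwards."""
    def __init__(self, inner):
        self.inner = inner

    def process(self, records):
        return self.apply(self.inner.process(records))

    def apply(self, records):
        raise NotImplementedError


class FilterA(StepDecorator):
    def apply(self, records):
        return [r for r in records if r.strip()]  # drop blank records


class FilterB(StepDecorator):
    def apply(self, records):
        return [r for r in records if len(r.split(",")) == 2]  # need 2 fields


class TransferA(StepDecorator):
    def apply(self, records):
        return [r.strip().upper() for r in records]  # normalize the text


STEP_CLASSES = {"FilterA": FilterA, "FilterB": FilterB, "TransferA": TransferA}


def build_pipeline(step_names):
    """Wrap the base step once per configured name, in the order given."""
    pipeline = Step()
    for name in step_names:
        pipeline = STEP_CLASSES[name](pipeline)
    return pipeline


configured = ["FilterA", "FilterB", "TransferA", "FilterA", "FilterB"]
pipeline = build_pipeline(configured)
print(pipeline.process(["a,b", "  ", "c", " d,e "]))  # ['A,B', 'D,E']
```

Because each decorator calls the step it wraps before applying its own logic, the steps run in the configured order, and the same class can appear more than once in the chain.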

Combining fields into the primary key according to the supplied field indexes;

Putting the data into HBase (Hadoop Database, a distributed storage system);

After execution completes, recording that the load for this time cycle succeeded.

Specifically, the method of this embodiment stores the data produced by the data fusion job in layers and across separate databases to form a unified repository, which includes a base database, thematic databases, a full-text database, and so on. Configurable scripts then automate the creation of databases and tables and the loading of data. A secondary index database is built on the repository to ensure efficient queries over big data.
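
The purpose of a secondary index can be shown with a small in-memory model: the primary table is keyed by rowkey, and a second mapping from an indexed field value to rowkeys lets a query by that field avoid a full scan. This Python sketch is an illustration of the general idea, not the index structure used in the patent; the class and field names are hypothetical.

```python
from collections import defaultdict


class IndexedTable:
    """Rows keyed by rowkey, plus a secondary index on one field."""

    def __init__(self, indexed_field):
        self.rows = {}
        self.indexed_field = indexed_field
        self.index = defaultdict(set)  # field value -> set of rowkeys

    def put(self, rowkey, row):
        old = self.rows.get(rowkey)
        if old is not None:
            # Remove the stale index entry before overwriting the row.
            self.index[old[self.indexed_field]].discard(rowkey)
        self.rows[rowkey] = row
        self.index[row[self.indexed_field]].add(rowkey)

    def find_by_field(self, value):
        """Look up rows through the index instead of scanning every row."""
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


t = IndexedTable("plate")
t.put("r1", {"plate": "A1", "speed": 62})
t.put("r2", {"plate": "B2", "speed": 48})
t.put("r3", {"plate": "A1", "speed": 71})
print(t.find_by_field("A1"))
```

The index must be maintained on every write, which is the usual cost paid for faster reads on a non-key field.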

Specifically, in the method of this embodiment, the data query sharing job provides a request-response sharing service to upper-layer applications in two unified ways: a Java API (Application Programming Interface) and REST services. The retrieval permission job is constrained by access control in system management, and by default the corresponding module can return the retrieved data for any request. The data access job records a data access log whenever the external sharing interface calls the above methods.

An embodiment of the present invention also provides a system for implementing multi-source data fusion and sharing based on a big data platform. As shown in Fig. 4, the system includes:

A configuration module 11, for configuring at least one piece of data source information and timing rules;

A data access module 12, for executing a data access job according to the configured timing rules, wherein the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform;

A data fusion module 13, for performing a data fusion job, according to the configured timing rules, on the data accessed by the data access job;

A storage module 14, for storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository;

A data sharing module 15, for sharing data through a unified data exchange interface provided in the constructed big data platform.

The system for implementing multi-source data fusion and sharing based on a big data platform provided by this embodiment works mainly through the configuration module, which directly and flexibly configures, in the big data platform, the data source information and the timing rules corresponding to each job in data processing. First, because the configuration module configures at least one piece of data source information directly and flexibly, the system can handle different scenarios and multi-source data through configuration alone, without renewed development; the data access and loading process is fully automated, which greatly improves the efficiency of bringing a project online. Second, the configuration module can also configure the timing rule corresponding to each job, and the data access module and the data fusion module run the data access job and the data fusion job according to those rules, so that the resulting big data platform can use a timed scheduling framework to access multi-source heterogeneous data automatically and incrementally. Third, the storage module stores the data in a unified way, in layers and across separate databases, to form a repository, for example by setting up configurable unified storage for the multi-source heterogeneous data to be stored; it also builds a secondary index database on the unified repository, which speeds up queries over multi-source data on the big data platform. Fourth, the data sharing module can also share data queries through a unified data exchange interface, which greatly reduces the complexity of retrieving data in the big data platform for upper-layer applications.

Optionally, the data fusion module includes:

Optionally, the storage module includes:

A parameter input submodule, for inputting one or any combination of the following parameters: data directory, number of data fields, data rowkey field, and thematic database name;

Optionally, the data sharing module includes:

The apparatus of this embodiment can be used to execute the technical solution of the above method embodiment; its implementation principle and technical effect are similar and are not repeated here.

Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A method for implementing multi-source data fusion and sharing based on a big data platform, characterized by comprising:

Configuring at least one piece of data source information and timing rules, and executing a data access job according to the configured timing rules, wherein the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform;

Storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository;

2. The method according to claim 1, characterized in that performing the data fusion job, according to the configured timing rules, on the data accessed by the data access job comprises:

When the accessed data are record-level data, the fusion job on the record-level data includes performing information verification on the data of each record against each condition;

When the accessed data are field-level data, the fusion job on the field-level data includes field verification or field conversion.

3. The method according to claim 2, characterized in that the data fusion job processes the data to be fused by an ETL method; wherein,

In the ETL method, the ETL implementation classes use the decorator pattern, and a corresponding configuration file is configured so that a filtering process, a conversion process, and a filtering process are carried out in sequence.

4. The method according to any one of claims 1 to 3, characterized in that storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository, comprises:

Inputting one or any combination of the following parameters: data directory, number of data fields, data rowkey field, and thematic database name;

Reading the most recent load date or time record for the corresponding data type, and calculating the time interval since the last load;

When the time interval is greater than the time cycle configured in the timing rule, recording in the log that the previous one or more load cycles failed, then checking the log and executing a reload operation;

Alternatively, when the time interval is not greater than the time cycle configured in the timing rule, splitting the records one by one according to the supplied separator;

Combining fields into the primary key according to the supplied field indexes;

Putting the data into HBase;

After execution completes, recording that the load for this time cycle succeeded.

5. The method according to any one of claims 1 to 4, characterized in that, after storing the fused data in layers and across separate databases to form a repository and building a secondary index database on the repository, the method further comprises:

6. The method according to any one of claims 1 to 5, characterized in that sharing data through a standard unified data exchange interface established in the big data platform comprises:

When the data sharing performed is data query sharing, providing a request-response sharing process to upper-layer applications through a Java API or REST;

When the data sharing performed is data retrieval, setting retrieval permissions that are constrained through access control in system management, wherein the retrieval can return the retrieved data for any request;

7. A system for implementing multi-source data fusion and sharing based on a big data platform, characterized by comprising:

A data access module, for executing a data access job according to the configured timing rules, wherein the data access job extracts data from at least one acquired data source, collects internet data, converts data, or loads data into the big data platform;

A data fusion module, for performing a data fusion job, according to the configured timing rules, on the data accessed by the data access job;

A storage module, for storing the data produced by the data fusion job in layers and across separate databases to form a repository, and building a secondary index database on the repository;

A data sharing module, for sharing data through a unified data exchange interface provided in the constructed big data platform.

8. The system according to claim 7, characterized in that the data fusion module comprises:

A first fusion submodule, for use when the accessed data are record-level data, in which case the fusion job on the record-level data includes performing information verification on the data of each record against each condition;

A second fusion submodule, for use when the accessed data are field-level data, in which case the fusion job on the field-level data includes field verification or field conversion.

9. The system according to claim 7 or 8, characterized in that the storage module comprises:

A parameter input submodule, for inputting one or any combination of the following parameters: data directory, number of data fields, data rowkey field, and thematic database name;

A calculation submodule, for reading the most recent load date or time record for the corresponding data type and calculating the time interval since the last load;

A judgment submodule, for determining whether the time interval is greater than the time cycle configured in the timing rule;

A first job submodule, for, when the time interval is greater than the time cycle configured in the timing rule, recording in the log that the previous one or more load cycles failed, then checking the log and executing a reload operation;

A second job submodule, for, when the time interval is not greater than the time cycle configured in the timing rule, splitting the records one by one according to the supplied separator; comparing the length of the array produced by each split with the supplied total number of fields, and retaining the records for which the two are equal; combining fields into the primary key according to the supplied field indexes; putting the data into HBase; and, after execution completes, recording that the load for this time cycle succeeded.

10. The system according to any one of claims 7 to 9, characterized in that the data sharing module comprises:

A data query sharing submodule, for providing a request-response sharing process to upper-layer applications through a Java API or REST;

A data retrieval submodule, for setting retrieval permissions that are constrained through access control in system management, wherein the retrieval can return the retrieved data for any request;