CN112100265A - Multi-source data processing method and device for big data architecture and blockchain

Multi-source data processing method and device for big data architecture and blockchain

Info

Publication number
CN112100265A
CN112100265A
Authority
CN
China
Prior art keywords
data
data stream
block chain
stream
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010978288.5A
Other languages
Chinese (zh)
Inventor
孙圣力
赖凯庭
李青山
司华友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Boya Blockchain Research Institute Co ltd
Boya Chain Beijing Technology Co ltd
Peking University
Original Assignee
Nanjing Boya Blockchain Research Institute Co ltd
Boya Chain Beijing Technology Co ltd
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Boya Blockchain Research Institute Co ltd, Boya Chain Beijing Technology Co ltd, Peking University
Priority to CN202010978288.5A
Publication of CN112100265A
Legal status: Pending (Current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention provides a multi-source data processing method, apparatus and system oriented to a big data architecture and a blockchain, wherein the method comprises the following steps: collecting data from multiple data sources and converting the collected data into data streams with a unified format; caching the data streams by category and providing a data stream output interface; acquiring data streams through the data stream output interface and invoking big data open-source algorithms to consume them; and acquiring data streams through the data stream output interface and transferring them onto the blockchain. The invention provides a unified, lightweight data processing platform that can serve a variety of actual business scenarios: it collects data from different data sources and converts the collected data into data streams with a unified format, which various data query and analysis tools can read quickly. In addition, the data streams cached by category can be transferred onto the blockchain quickly and conveniently, meeting the needs of blockchain applications.

Description

Multi-source data processing method and device for big data architecture and blockchain
Technical Field
The invention relates to the field of communication technology, and in particular to a multi-source data processing method and device oriented to a big data architecture and a blockchain.
Background
In recent years, with the rapid development of science and technology and the advance of informatization, data ranging from the background caches of individual applications on mobile terminals to the user-access and running-state logs stored on server clusters is being generated and accumulated at the petabyte level at all times. Growth in data volume brings growth in data value: large volumes of data play a vital role in fields such as user behavior analysis and system security alerting, and with the support of various big data analysis technologies, masses of data that were previously discarded or ignored have begun to show new value.
On the other hand, in early enterprise development and production environments, data formats were not standardized, data storage was haphazard, and centralized storage means were lacking, which creates difficulties for big data processing today. Large amounts of data are scattered across unorganized databases with inconsistent formats, and developers must repeatedly build data pipelines and clean data on servers or local hosts before they can use it, greatly increasing development difficulty, development time, and labor. How to collect and process scattered, non-uniform data with complex sources is therefore a hard problem for data managers and developers.
To address this problem, many large enterprises at home and abroad choose to build a data warehouse or data middle platform, storing enterprise data centrally in a unified format to serve as a single data source during actual development. However, a data warehouse or data middle platform takes a long time to develop, carries high labor costs, requires difficult cluster construction and a complex architecture, and needs large volumes of actual business data to support it; most small and medium-sized enterprises lack the conditions to build one. In view of this, a unified, lightweight data platform applicable to a variety of actual business scenarios is the more practical technical solution.
The growth in data volume also brings another problem: data security. Traditional databases run on a single-node server or a cluster of several servers, where data maintenance is costly and security is weak. Blockchain technology is a distributed ledger technology in which transaction records are chained together by cryptographic principles and confirmed among nodes through a consensus mechanism, guaranteeing that the records are tamper-proof, public, and transparent. This offers a new idea for encrypting important data: putting it on chain and storing it encrypted under consensus, which can achieve better performance and security guarantees than traditional database encryption.
However, the process of putting data on chain faces the same data conversion problem. Because a blockchain database server usually opens only a specific port and requires data to be sent in a specific HTTP request format, which does not correspond directly to the format in which data is stored in the database, converting between the stored data format and the request format required for communicating with the blockchain server is likewise an urgent problem to be solved.
Disclosure of Invention
To solve the above technical problems, a first aspect of the present invention provides a multi-source data processing method oriented to a big data architecture and a blockchain, which can collect heterogeneous data from different data sources and convert the collected data into data streams with a unified format. The specific technical scheme of the invention is as follows:
A multi-source data processing method oriented to a big data architecture and a blockchain comprises the following steps:
collecting data from multiple data sources and converting the collected data into data streams with a unified format;
caching the data streams by category and providing a data stream output interface;
acquiring data streams through the data stream output interface and consuming the acquired data streams; and/or
acquiring data streams through the data stream output interface and transferring the acquired data streams onto the blockchain.
In some embodiments, the multiple data sources include at least a relational database and a non-relational database, and the data streams are JSON-format data streams.
In some embodiments, acquiring the data stream from the data caching and transmission module and transferring the data onto the blockchain comprises: parsing the data stream into data fields; extracting a target data field and encapsulating the extracted target data field into a message; and transferring the message encapsulating the target data field onto the blockchain.
A second aspect of the present invention provides a multi-source data processing apparatus oriented to a big data architecture and a blockchain, comprising:
a data collection module for collecting data from multiple data sources and converting the collected data into data streams with a unified format;
a data caching and transmission module for caching the data streams by category and providing a data stream output interface;
a data consumption module for acquiring data streams through the data stream output interface and consuming the acquired data streams; and/or
a blockchain uplink module for acquiring data streams through the data stream output interface and transferring the acquired data streams onto the blockchain.
In some embodiments, the multiple data sources include at least a relational database and a non-relational database; the data collection module comprises multiple data collection components that can run in parallel and are connected to the data sources via JDBC interfaces, including a Kafka component, a Logstash component, a Canal component, and a Maxwell component; and the data streams are JSON-format data streams.
In some embodiments, the data caching and transmission module comprises the Kafka open-source platform, and the data streams are cached by category in Topics of the Kafka open-source platform.
In some embodiments, the data consumption module comprises the data query tools Hive and Impala and the data analysis tools Spark and Storm.
In some embodiments, the blockchain uplink module comprises:
a parsing submodule that parses the data stream into data fields;
an encapsulation submodule that extracts a target data field and encapsulates the extracted target data field into a message;
and an uplink submodule that transfers the message encapsulating the target data field onto the blockchain.
In some embodiments, the blockchain is a pre-arranged private chain, consortium chain, or public chain.
The invention provides a unified, lightweight data processing platform that can serve a variety of actual business scenarios: it collects heterogeneous data from different data sources and converts the collected data into data streams with a unified format for categorized storage, which various data query and analysis tools can read quickly. In addition, the categorized data streams can be transferred onto the blockchain quickly and conveniently.
Drawings
FIG. 1 is a schematic flowchart of a multi-source data processing method oriented to a big data architecture and a blockchain according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a multi-source data processing method oriented to a big data architecture and a blockchain according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a multi-source data processing apparatus oriented to a big data architecture and a blockchain according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a multi-source data processing apparatus oriented to a big data architecture and a blockchain according to an embodiment of the present invention;
FIG. 5 is an example of an environment that may be used to implement embodiments of the present invention;
FIG. 6 is a flowchart of an application example of the multi-source data processing method oriented to a big data architecture and a blockchain according to an embodiment of the present invention;
FIG. 7 is a flowchart of another application example of the multi-source data processing method oriented to a big data architecture and a blockchain according to an embodiment of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention more comprehensible, embodiments are described in further detail below with reference to the accompanying figures.
Although the present invention presents the method operation steps or apparatus structures shown in the following embodiments or figures, the method or apparatus may include more or fewer operation steps or module units based on conventional or non-inventive labor. For steps or structures with no logically necessary causal relationship, the execution order of the steps or the module structure of the apparatus is not limited to the order or structure shown in the embodiments or figures of the present invention. When applied in an actual device or end product, the described methods or module structures may be executed sequentially or in parallel according to the embodiments or the methods or module structures shown in the figures.
Collecting and processing data that is scattered, non-uniform, and drawn from complex sources generally requires building a data warehouse or data middle platform, which takes a long time to develop, carries high labor costs, involves difficult cluster construction and a complex architecture, and needs large volumes of actual business data to support it.
Aiming at these deficiencies of prior-art multi-source data collection and processing, the invention provides a multi-source data processing method oriented to a big data architecture and a blockchain, which can collect heterogeneous data from different data sources and convert the collected data into data streams with a unified format.
FIG. 1 illustrates a multi-source data processing method oriented to a big data architecture and a blockchain according to an embodiment of the present invention. For convenience of description, only the parts relevant to the embodiment are shown, detailed as follows:
s101, data acquisition is carried out on various data sources, and the acquired data are converted into data streams with a uniform format.
As shown in FIG. 5, the data sources include traditional relational databases and cloud non-relational databases deployed at the data source layer: the traditional relational databases include MySQL, SQLite, Oracle, Access, and the like, and the cloud non-relational databases include MongoDB, Redis, Hadoop, HBase, and the like.
In the implementation process, data collection pipelines are deployed: data stored in different target databases can be collected through a written JDBC interface containing the URL of the target database, and the collected heterogeneous data is converted into data streams with a unified format.
Optionally, as shown in FIG. 5, the deployed data collection pipeline layer includes the Kafka, Logstash, Canal, and Maxwell components, wherein the Kafka component collects and ingests source data from the databases, the Logstash component collects and transports database logs, and the Canal and Maxwell components parse database logs to read and output data. After processing by these components, heterogeneous data stored in different databases is collected and output as data streams in a unified JSON format.
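To illustrate the collection step, the following minimal sketch reads rows over JDBC using a target-database URL and emits each row as a unified JSON document. The database URL, credentials, table, and columns are placeholder assumptions; in the invention this role is played by the pipeline components named above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class JdbcToJson {
    public static void main(String[] args) throws Exception {
        // Assumed target-database URL; each deployed pipeline carries its own.
        String url = "jdbc:mysql://localhost:3306/appdb?user=reader&password=secret";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT id, name FROM orders")) {
            ResultSetMetaData meta = rows.getMetaData();
            while (rows.next()) {
                // Convert each heterogeneous row into one unified JSON document.
                // (Naive quoting, no escaping -- illustration only.)
                StringBuilder json = new StringBuilder("{");
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    if (i > 1) json.append(",");
                    json.append("\"").append(meta.getColumnLabel(i)).append("\":\"")
                        .append(rows.getString(i)).append("\"");
                }
                System.out.println(json.append("}"));
            }
        }
    }
}
```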
S102, cache the data streams by category and provide a data stream output interface.
Data streams in JSON format are classified and cached by topic into Topics of the Kafka open-source platform. Optionally, a data stream's topic encodes the data's source and destination, which are defined by the specific application service and are not limited here.
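To make this concrete, the following minimal sketch publishes one JSON record to a Kafka Topic using the standard Kafka Java client. The broker address, topic naming scheme, and record contents are illustrative assumptions, not values prescribed by the invention.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TopicClassifier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Hypothetical theme: the topic name encodes data source and destination.
        String topic = "mysql-orders.to.blockchain";
        String json = "{\"source\":\"mysql.orders\",\"id\":42,\"amount\":19.9}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each JSON record of the data stream is cached under its theme's Topic.
            producer.send(new ProducerRecord<>(topic, json));
        }
    }
}
```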
S103, acquire data streams through the data stream output interface and consume the acquired data streams.
This step is executed by big data processing engines, which query and acquire the required JSON-format data streams from the Kafka open-source platform and analyze and compute over them to realize the corresponding application services. Optionally, as shown in FIG. 5, the big data processing engines include the data query tools Hive and Impala and the data analysis tools Spark and Storm.
S104, acquire data streams through the data stream output interface and transfer the acquired data streams onto the blockchain.
This step is a blockchain processing step that transfers the data stream onto a blockchain, which is a pre-arranged private chain, consortium chain, or public chain.
Optionally, as shown in FIG. 2, step S104 specifically includes:
S1041, parsing the data stream into data fields;
S1042, extracting a target data field and encapsulating the extracted target data field into a message;
S1043, transferring the message encapsulating the target data field onto the blockchain.
Specifically, after the JSON-format data stream is acquired, a written parsing program analyzes it, identifying and extracting the target data fields that need to go on chain. The extracted fields are then encapsulated in the data field of a request message, and the message is transferred to the blockchain through its port in POST form.
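A minimal sketch of steps S1041 through S1043, assuming Jackson for JSON parsing and Java's built-in HTTP client; the chain endpoint and the marker field name are hypothetical, since the invention does not fix them.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChainUplink {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    // Hypothetical endpoint: the description only says the chain server opens a specific port.
    private static final String CHAIN_URL = "http://chain-server:7050/uplink";

    public static String uplink(String jsonRecord) throws Exception {
        // S1041: parse the JSON data stream record into fields.
        JsonNode fields = MAPPER.readTree(jsonRecord);

        // S1042: extract the marked target field ("important" is an assumed marker
        // name) and encapsulate it in the data field of a request message.
        ObjectNode message = MAPPER.createObjectNode();
        message.set("data", fields.get("important"));

        // S1043: transfer the message to the blockchain port in POST form.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(CHAIN_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(MAPPER.writeValueAsString(message)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // the return message carries the status code
    }
}
```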
It should be noted that in practical application, the multi-source data processing method of this embodiment may execute step S103, step S104, or both. If both are executed, they may run in parallel, or one may be selectively executed before the other.
The present invention also provides a multi-source data processing apparatus oriented to a big data architecture and a blockchain. As shown in FIG. 3, the processing apparatus includes a data collection module 201, a data caching and transmission module 202, a data consumption module 203, and a blockchain uplink module 204, wherein:
the data acquisition module 201 is configured to perform data acquisition on multiple data sources and convert the acquired data into a data stream with a uniform format.
As mentioned in the above embodiments, the data source is generally divided into a traditional relational database and a cloud non-relational database, as shown in fig. 5, the traditional relational database includes MySQL, SQLite, Oracle, access, etc., and the cloud non-relational database includes mongoOB, Redis, Hadoop, Menbase, etc.
Optionally, as shown in fig. 5, the data acquisition module 201 includes a plurality of data acquisition components that can run in parallel, and the data acquisition components are connected to the databases through JDBC interfaces. Optionally, the data acquisition component includes a Kafka component, a Logstash component, a Canal component, and a Maxwell component deployed at the data pipeline layer. Wherein: the Kafka component is used for collecting and inputting source data in the database, the Logstach component is used for collecting and conveying database logs, and the Canal component and the Maxwell component are used for analyzing the database logs to read and output the data. After the processing of the components, heterogeneous data stored in different databases are collected and output as a data stream in a uniform JSON format.
The data caching and transmission module 202 is configured to cache data streams by category and provide a data stream output interface.
Optionally, as shown in FIG. 5, the data caching and transmission module 202 comprises the Kafka open-source platform deployed at the data pipeline layer. The JSON data streams collected and output by the data collection module 201 are classified and cached in Topics of the Kafka open-source platform. Optionally, a data stream's topic encodes the data's source and destination, which are defined by the specific application service and are not limited here.
The data consumption module 203 is configured to acquire data streams through the data stream output interface and invoke big data open-source algorithms to compute over and analyze the acquired data streams. Optionally, as shown in FIG. 5, the data consumption module includes the data query tools Hive and Impala and the data analysis tools Spark and Storm deployed at the data consumption layer. These big data processing engines query and acquire the required JSON-format data streams from the Kafka open-source platform and analyze and compute over them to realize the corresponding application services.
The blockchain uplink module 204 is configured to acquire data streams through the data stream output interface and transfer the acquired data streams onto the blockchain.
As shown in FIG. 4 and FIG. 5, optionally, the blockchain uplink module 204 includes:
a parsing submodule 2041 that parses the data stream into data fields;
an encapsulation submodule 2042 that extracts a target data field and encapsulates the extracted target data field into a message;
an uplink submodule 2043 that transfers the message encapsulating the target data field onto the blockchain.
As shown in FIG. 5, these functional modules are deployed within the data consumption layer.
The invention provides a unified, lightweight data processing platform that can serve a variety of actual business scenarios: it collects heterogeneous data from different data sources and converts the collected data into data streams with a unified format for categorized storage, allowing various data query and analysis tools to read and consume them quickly. In addition, the categorized data streams can be transferred onto the blockchain quickly and conveniently, meeting the application requirements of business scenarios with high demands on security and tamper-resistance.
To show the implementation of the present invention more clearly, it is described in more detail below from two perspectives: big data application and blockchain application.
FIG. 6 shows a specific implementation flow of the present invention in a big data application. For convenience of description, only part of the flow is described below; for the rest, refer to the related description above.
As shown in FIG. 5 and FIG. 6, this example uses the Kafka middleware to read database data, and there are three reading approaches, described as follows:
the first way is by Kafka-connect-JDBC. Kafka-connect-JDBC is a third-party Kafka plug-in sourced by the confluent platform, supports copying of tables using various JDBC data types, dynamically synchronizes the state of the database, and supports addition and deletion operations on the database. It has three main modes: bulk import mode, increment mode, and Timestamp & increment combined with auto-increment mode.
The data acquisition plug-in is very simple to deploy, can be realized by adding the URL of the target database in the configuration file, supports various database sources to input data, and is easy to expand. The plug-in will output the data to the console under topic of Kafka in JSON format according to the mode selected in the configuration file, facilitating the subsequent multi-component consumption.
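As an illustration of such a deployment, the sketch below registers a JDBC source connector through Kafka Connect's REST interface (default port 8083). The connector class and configuration keys are the documented kafka-connect-jdbc ones; the connection URL, credentials, column names, topic prefix, and Connect host are placeholder assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSource {
    public static void main(String[] args) throws Exception {
        // Connector config as described above: the JDBC URL of the target database
        // plus the chosen mode. Hosts, credentials, and column names are placeholders.
        String body = """
            {
              "name": "mysql-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:mysql://localhost:3306/appdb?user=reader&password=secret",
                "mode": "timestamp+incrementing",
                "timestamp.column.name": "updated_at",
                "incrementing.column.name": "id",
                "topic.prefix": "appdb-",
                "tasks.max": "1"
              }
            }""";

        // Kafka Connect exposes a REST interface for adding connectors.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```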
The second approach is realized by dedicated data pipeline components, the main technology choices being Canal and Maxwell, wherein:
Canal is an open-source data pipeline based on parsing a database's incremental logs. In use, the component emulates the MySQL slave interaction protocol, masquerading as a MySQL slave and sending the dump protocol to the MySQL master. After receiving the dump request, the MySQL master begins pushing binary logs to the slave; Canal then receives and parses the binary logs, thereby synchronizing the MySQL database. Finally, Kafka caches and outputs Canal's data, completing the transfer of data from MySQL to Kafka.
Maxwell's advantage is that it can convert MySQL data directly into JSON-format output, making it simpler to use; Kafka can then read the MySQL data directly.
After data has been cached to a Kafka Topic through these three parallel collection channels, the data collection and reading work is finished, and Kafka then outputs the JSON-format data streams to the data consumption layer through the data output interface.
The data consumption layer mainly comprises two parts. One is a data query module composed of Hive and Impala, which completes database operations, including tasks such as insertion, deletion, modification, and query, through SQL-like statements. The other is a data computation and processing part composed of Spark and Storm, which can handle tasks such as data computation for actual scenarios, supports training models such as machine learning models within Spark and Storm, and writes labeling results that need to be returned to the database back into specific database fields.
How the data consumption layer reads the data under a specific Kafka Topic is described in detail below.
Spark already provides sufficiently rich interfaces and components for stream and batch processing of large volumes of data. In the present invention, a direct-connect mode is used when Spark interfaces with Kafka, unlike the traditional Receiver mode that calls high-level APIs: the direct mode has no Receiver layer and, based on Spark Streaming, periodically obtains the latest offsets of each partition under a specific Kafka topic, then processes each incoming batch of data according to the configured maxRatePerPartition, thereby reading Kafka data with Spark.
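A minimal sketch of this direct-connect reading, using the Java API of the spark-streaming-kafka-0-10 integration; the broker address, group id, topic name, and rate cap are illustrative assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class KafkaDirectStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("json-stream-consumer")
                // Rate cap per partition per second, as described above (value assumed).
                .set("spark.streaming.kafka.maxRatePerPartition", "1000");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-consumer");
        kafkaParams.put("auto.offset.reset", "latest");

        // Direct mode: no Receiver; offsets are tracked per partition of the topic.
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("appdb-orders"), kafkaParams));

        stream.map(ConsumerRecord::value).print(); // each value is one JSON document

        jssc.start();
        jssc.awaitTermination();
    }
}
```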
Storm provides the storm-kafka module for reading data from Kafka. The specific construction comprises two steps: first, configure the mapping between Kafka broker hosts and partitions using the BrokerHosts interface, which supports two modes, one managed via ZooKeeper and the other connecting directly to open ports; second, configure Kafka-related output information, such as the amount of data fetched per unit time and the port access timeout, using KafkaConfig.
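The two construction steps can be sketched with the classic storm-kafka API as follows; the ZooKeeper address, topic, and tuning values are assumptions for illustration.

```java
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaSpoutTopology {
    public static void main(String[] args) {
        // Step 1: map Kafka broker hosts and partitions. ZkHosts manages the
        // mapping via ZooKeeper; StaticHosts would connect to open ports directly.
        BrokerHosts hosts = new ZkHosts("localhost:2181"); // assumed ZooKeeper address

        // Step 2: Kafka-related configuration (topic, ZK root, consumer id,
        // plus tunables such as fetch size and socket timeout).
        SpoutConfig config = new SpoutConfig(hosts, "appdb-orders", "/kafka-spout", "json-reader");
        config.scheme = new SchemeAsMultiScheme(new StringScheme());
        config.socketTimeoutMs = 10_000;   // port access timeout
        config.fetchSizeBytes = 1_048_576; // data volume fetched per request

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(config), 1);
        // Downstream bolts consuming the JSON stream would be attached here.
    }
}
```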
Hive cannot synchronize Kafka data directly, but with the emergence of actual scenarios such as log processing, data communication between Hive and Kafka has received increasing attention. The present invention mainly considers two schemes, the Camus component and Gobblin (the former was merged into the latter as a subset in 2015; their implementations are essentially the same): Kafka data is extracted into HDFS by executing MapReduce tasks, and the transfer from HDFS into Hive is then completed through shell scripts. This scheme realizes a relatively simple data pipeline scenario and achieves respectable extraction rates and capacity in actual business scenarios.
Impala is a big data real-time query and analysis engine built on Hive. It directly uses Hive's metadata database, meaning Impala's metadata is stored in Hive's Metastore, and it is compatible with Hive's SQL-like statement analysis; therefore Impala stays synchronized with Kafka as long as Hive is operated accordingly.
FIG. 7 shows a specific implementation flow of the present invention in a blockchain application. For convenience of description, only part of the flow is described below; for the rest, refer to the related description above.
In this example, during the data uplink process, the manager needs to manually mark important data fields with an identifier to indicate the data on which the uplink operation should be performed.
In the present invention, for security, a simple private chain is built on the Hyperledger Fabric distributed-ledger architecture, with ports opened for data uplink, contract certificate return, and the like for the example to use; the system can subsequently interface with other external public chains simply by opening and connecting the corresponding ports.
After the JSON-format data stream output by the data transmission layer is obtained, a program parses the JSON, identifies the marker fields, and determines the data that needs to go on chain. The data is then encapsulated in the data field of a request message, a data uplink request is sent to the port in POST form, and a return message is received. It should be noted that the manager may actively query the status to confirm whether the certificate was stored successfully, and may resend the data uplink request if failure information is returned. A status code of 0 returned by the server indicates a successful uplink; -1 indicates a failed uplink, with three failure causes: -3 indicates an illegal transaction, requiring the manager to re-verify identity and other information; -2 indicates a wrong hash value, requiring the manager to re-check data integrity; and 4000 requires the manager to check whether the error occurred in the storage system.
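The status-code handling just described can be summarized in a small helper method; the codes and their meanings are exactly those listed above, while the method itself is only an illustrative sketch.

```java
public class UplinkStatus {
    /** Interprets the uplink status code returned by the blockchain server. */
    public static String interpret(int statusCode) {
        switch (statusCode) {
            case 0:    return "uplink succeeded; request the contract certificate";
            case -1:   return "uplink failed; resend the data uplink request";
            case -3:   return "illegal transaction; re-verify identity and related information";
            case -2:   return "hash value error; re-check data integrity";
            case 4000: return "storage system error; check the storage system";
            default:   return "unknown status code: " + statusCode;
        }
    }
}
```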
As described above, after receiving status code 0 from the server, indicating that the uplink completed successfully, a service is started to return the contract certificate to the data manager. The specific implementation is as follows: the client sends a GET message to the server port requesting transaction details and, upon receiving the response, returns the transaction ID to the data manager as the data certificate of the uplink. Holding this certificate, the manager can send a POST request to the blockchain server and query the uplink status via the hash value generated from the ID, comparing it against the original data in the database to ensure the data was transmitted correctly. This completes the whole process of transferring the data onto the blockchain, chaining it, and returning the certificate, and thereby the handoff from the data transfer module to the blockchain module.
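A sketch of the certificate-return request, again with Java's built-in HTTP client; the transaction-details URL and the "txId" response field are assumptions, as the description specifies only a GET for transaction details that yields the transaction ID.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CertificateClient {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical query endpoint and response field: the description specifies a
    // GET for transaction details and a transaction ID in the reply, but no paths.
    static String fetchCertificate(HttpClient client, String txDetailsUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(txDetailsUrl))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The transaction ID serves as the data certificate returned to the manager.
        return MAPPER.readTree(response.body()).path("txId").asText();
    }
}
```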
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions of the present invention as defined in the appended claims.
The invention has been described above with a certain degree of particularity. Those of ordinary skill in the art will understand that the description of the embodiments is merely exemplary, and all changes that come within the true spirit and scope of the invention are intended to be protected. The scope of the invention is defined by the appended claims rather than by the foregoing description of the embodiments.

Claims (9)

1. A multi-source data processing method oriented to a big data architecture and a blockchain, characterized by comprising the following steps:
collecting data from multiple data sources and converting the collected data into data streams with a unified format;
caching the data streams by category and providing a data stream output interface;
acquiring data streams through the data stream output interface and consuming the acquired data streams; and/or
acquiring data streams through the data stream output interface and transferring the acquired data streams onto the blockchain.
2. The multi-source data processing method of claim 1, wherein the multiple data sources comprise at least a relational database and a non-relational database, and the data streams are JSON-format data streams.
3. The multi-source data processing method of claim 1, wherein acquiring the data stream from the data caching and transmission module and transferring the data onto the blockchain comprises:
parsing the data stream into data fields;
extracting a target data field and encapsulating the extracted target data field into a message;
and transferring the message encapsulating the target data field onto the blockchain.
4. A multi-source data processing apparatus oriented to a big data architecture and a blockchain, the processing apparatus comprising:
a data collection module for collecting data from multiple data sources and converting the collected data into data streams with a unified format;
a data caching and transmission module for caching the data streams by category and providing a data stream output interface;
a data consumption module for acquiring data streams through the data stream output interface and consuming the acquired data streams; and/or
a blockchain uplink module for acquiring data streams through the data stream output interface and transferring the acquired data streams onto the blockchain.
5. The multi-source data processing apparatus of claim 4, wherein the multiple data sources comprise at least a relational database and a non-relational database; the data collection module comprises multiple data collection components capable of running in parallel and connected to the data sources via JDBC interfaces, including a Kafka component, a Logstash component, a Canal component, and a Maxwell component; and the data streams are JSON-format data streams.
6. The multi-source data processing apparatus of claim 5, wherein the data caching and transmission module comprises the Kafka open-source platform, and the data streams are cached by category within Topics of the Kafka open-source platform.
7. The multi-source data processing apparatus of claim 4, wherein the data consumption module comprises the data query tools Hive and Impala and the data analysis tools Spark and Storm.
8. The multi-source data processing apparatus of claim 4, wherein the blockchain uplink module comprises:
a parsing submodule that parses the data stream into data fields;
an encapsulation submodule that extracts a target data field and encapsulates the extracted target data field into a message;
and an uplink submodule that transfers the message encapsulating the target data field onto the blockchain.
9. The multi-source data processing method of claim 1, wherein the blockchain is a pre-arranged private chain, consortium chain, or public chain.
CN202010978288.5A 2020-09-17 2020-09-17 Multi-source data processing method and device for big data architecture and blockchain Pending CN112100265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010978288.5A CN112100265A (en) 2020-09-17 2020-09-17 Multi-source data processing method and device for big data architecture and blockchain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010978288.5A CN112100265A (en) 2020-09-17 2020-09-17 Multi-source data processing method and device for big data architecture and blockchain

Publications (1)

Publication Number Publication Date
CN112100265A 2020-12-18

Family

ID=73759722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010978288.5A Pending CN112100265A (en) Multi-source data processing method and device for big data architecture and blockchain

Country Status (1)

Country Link
CN (1) CN112100265A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684377A (en) * 2018-12-13 2019-04-26 深圳市思迪信息技术股份有限公司 General big data handles development platform and its data processing method in real time
CN109829009A (en) * 2018-12-28 2019-05-31 北京邮电大学 Configurable isomeric data real-time synchronization and visual system and method
CN110457929A (en) * 2019-08-16 2019-11-15 重庆华医康道科技有限公司 The sharing method and system of isomery HIS big data real-time encryption and decryption compression cochain
CN110597894A (en) * 2019-08-26 2019-12-20 重庆华医康道科技有限公司 Real-time inquiry system for organization mechanism data
CN110795257A (en) * 2019-09-19 2020-02-14 平安科技(深圳)有限公司 Method, device and equipment for processing multi-cluster operation records and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Chao et al., "In-Depth Practice of Flink and Kylin", China Machine Press, 1 August 2020, pages 178-181 *
Huang Yuan et al., "Big Data Technology and Applications", China Machine Press, 1 May 2020, pages 79-81 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699170A (en) * 2020-12-31 2021-04-23 上海竞动科技有限公司 Query method and system based on multi-source data structure block chain
CN112699170B (en) * 2020-12-31 2022-10-21 上海竞动科技有限公司 Query method and system based on multi-source data structure block chain
CN112800064A (en) * 2021-02-05 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Real-time big data application development method and system based on Confluent community open source edition
CN113032379A (en) * 2021-03-16 2021-06-25 广东电网有限责任公司广州供电局 Distribution network operation and inspection-oriented multi-source data acquisition method
CN113641739A (en) * 2021-07-05 2021-11-12 南京联创信息科技有限公司 Spark-based intelligent data conversion method
CN113360936A (en) * 2021-08-09 2021-09-07 湖南和信安华区块链科技有限公司 Data analysis system based on block chain
CN114417408A (en) * 2022-01-18 2022-04-29 百度在线网络技术(北京)有限公司 Data processing method, device, equipment and storage medium
CN114528346A (en) * 2022-01-27 2022-05-24 中科大数据研究院 Method for sharing transaction of multi-source heterogeneous data assets by depending on block chain
CN114528346B (en) * 2022-01-27 2023-01-13 中科大数据研究院 Method for sharing transaction of multi-source heterogeneous data assets by depending on block chain
CN114741447A (en) * 2022-03-28 2022-07-12 国网北京市电力公司 Distributed energy station data processing method and device

Similar Documents

Publication Publication Date Title
CN112100265A (en) Multi-source data processing method and device for big data architecture and blockchain
CN109492040B (en) System suitable for processing mass short message data in data center
CN109063196B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN103209087B (en) Distributed information log statistical processing methods and system
CN108681569B (en) Automatic data analysis system and method thereof
CN111400326B (en) Smart city data management system and method thereof
CN105404701A (en) Peer-to-peer network-based heterogeneous database synchronization method
Grover et al. Data Ingestion in AsterixDB.
CN106815338A (en) A kind of real-time storage of big data, treatment and inquiry system
CN110009201B (en) Electric power data link system and method based on block chain technology
CN111885439B (en) Optical network integrated management and duty management system
CN101808051B (en) Application integration gateway and control method thereof
CN104156798A (en) System data real-time push framework adopting enterprise authority source and method
CN110096545A (en) One kind being based on big data platform data processing domain architecting method
CN108924228B (en) Industrial internet optimization system based on edge calculation
CN113886055A (en) Intelligent model training resource scheduling method based on container cloud technology
CN102090039A (en) A method of performing data mediation, and an associated computer program product, data mediation device and information system
Maske et al. A real time processing and streaming of wireless network data using storm
CN116389475A (en) Kafka-based industrial enterprise real-time ubiquitous interconnection method
CN112667586B (en) Method, system, equipment and medium for synchronizing data based on stream processing
CN104333578A (en) Distributed data exchange system and method
Li et al. Research on Artificial Intelligence Industrial Big Data Platform for Industrial Internet Applications
CN112101894A (en) Coal dressing intelligent system
CN107330089B (en) Cross-network structured data collection system
CN104298718A (en) SOA based distributed drawing-document system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination