CN113282611B - Method, device, computer equipment and storage medium for synchronizing stream data - Google Patents


Info

Publication number
CN113282611B
CN113282611B (application CN202110728683.2A)
Authority
CN
China
Prior art keywords
data
log
service
stream
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110728683.2A
Other languages
Chinese (zh)
Other versions
CN113282611A (en)
Inventor
孙朝辉
李凯东
侯文京
裘金龙
Current Assignee
Shenzhen Pingan Zhihui Enterprise Information Management Co ltd
Original Assignee
Shenzhen Pingan Zhihui Enterprise Information Management Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Pingan Zhihui Enterprise Information Management Co ltd filed Critical Shenzhen Pingan Zhihui Enterprise Information Management Co ltd
Priority to CN202110728683.2A
Publication of CN113282611A
Application granted
Publication of CN113282611B


Classifications

    • G06F16/2433 — Information retrieval; querying; query formulation; query languages
    • G06F16/23 — Information retrieval of structured data, e.g. relational data; updating
    • G06F16/24568 — Query execution; data stream processing; continuous queries
    • G06F16/24578 — Query processing with adaptation to user needs using ranking
    • G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F16/283 — Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a method, a device, computer equipment and a storage medium for synchronizing stream data, belonging to the technical field of big data. The application acquires service data, which is streaming data, and sends the service data to a data warehouse; collects an execution log corresponding to the service data from a first service system through the Logstash tool; consumes the execution log through a Kafka message queue to obtain a log consumption result; performs data integration calculation on the service data and the log consumption result through a stream calculation engine to obtain integrated data; and executes SQL update statements to synchronize the integrated data to a second service library. In addition, the application also relates to blockchain technology, and the service data can be stored in a blockchain. By performing data integration calculation on the service data and the log consumption result of the first service system through the stream calculation engine, and synchronizing the resulting integrated data to the second service library, real-time synchronization of stream data is realized and the timeliness of stream data synchronization is improved.

Description

Method, device, computer equipment and storage medium for synchronizing stream data
Technical Field
The application belongs to the technical field of big data, and particularly relates to a method, a device, computer equipment and a storage medium for synchronizing stream data.
Background
With the rise of microservices, more and more systems use microservice technology. A microservice system generally comprises a plurality of services and a plurality of databases, so in practical application there are situations in which data of two or more systems must be associated and synchronized. For this scenario, there are two common schemes in the industry:
1. Synchronizing the data of the initial service system to the target service system in a synchronous or asynchronous mode, and performing table association in the target service system to realize condition filtering. For example, a personnel selection system needs data from a core personnel system, a recruitment system, a performance system, a training system and a leave-and-attendance system to be sent in a synchronous or asynchronous mode and received in the selection system. In this synchronization mode, data synchronization between systems carries the risk of data loss; data is heavily duplicated across systems, which puts storage pressure on the target system; the systems are highly coupled and influence one another; and the complexity of the system grows as the number of microservices increases.
2. Collecting the data into a data warehouse and processing it there. For example, for a personnel selection system, data from the core personnel system, recruitment system, performance system, training system and leave-and-attendance system is collected into the data warehouse in advance and then processed; the processing takes place in the data warehouse, so no pressure is put on the business databases, and the processed data is then written into the selection system. However, data collection and data writing of the data warehouse are generally carried out once a day in the early morning, so the data latency is T+1 and the timeliness of the data is low.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, computer equipment and a storage medium for synchronizing stream data, so as to solve the technical problems that the data synchronization schemes of traditional microservice systems have low timeliness and cannot realize real-time synchronization.
In order to solve the above technical problems, the embodiments of the present application provide a method for synchronizing stream data, which adopts the following technical scheme:
a method of stream data synchronization, comprising:
receiving a data synchronization instruction, acquiring service data corresponding to the data synchronization instruction, and sending the service data to the data warehouse;
collecting an execution log corresponding to the service data from the first service system through the Logstash tool;
writing the execution log into a Kafka message queue, and consuming the execution log through the Kafka message queue to obtain a log consumption result;
sending the service data in the data warehouse and the log consumption result to the stream calculation engine, and performing data integration calculation on the service data and the log consumption result through the stream calculation engine to obtain integrated data;
executing a preset SQL update statement to synchronize the integrated data to the second service library.
Further, after receiving the data synchronization instruction, acquiring the service data corresponding to the data synchronization instruction, and sending the service data to the data warehouse, the method further includes:
acquiring a data processing identifier corresponding to the service data from the data synchronization instruction;
searching a data processing script corresponding to the data processing identifier in a preset script library;
And processing the business data based on the data processing script.
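A minimal sketch of the script-lookup steps above, in Python. All names here (the script identifiers, `SCRIPT_LIBRARY`, `processing_id`) are hypothetical illustrations, not identifiers from the patent:

```python
# Preset "script library": data processing identifier -> processing function.
# The identifiers and functions are illustrative assumptions.
SCRIPT_LIBRARY = {
    "trim_and_upper": lambda record: {k: v.strip().upper() if isinstance(v, str) else v
                                      for k, v in record.items()},
    "drop_nulls": lambda record: {k: v for k, v in record.items() if v is not None},
}

def process_business_data(sync_instruction, business_data):
    """Read the data processing identifier carried by the synchronization
    instruction, search the preset script library for the matching script,
    and apply it to each business-data record."""
    script_id = sync_instruction["processing_id"]
    script = SCRIPT_LIBRARY[script_id]
    return [script(rec) for rec in business_data]
```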
Further, the step of collecting, by the Logstash tool, an execution log corresponding to the service data from the first service system specifically includes:
monitoring a database of the first service system, and acquiring an execution log of the changed transaction through the Logstash tool when a transaction of the database is monitored to have changed;
and establishing a temporary storage area in the database, and writing the execution log into the temporary storage area.
Further, the step of monitoring the database of the first service system and, when it is monitored that a database transaction has changed, obtaining an execution log of the changed transaction through the Logstash tool specifically includes:
determining log data corresponding to the service data, and acquiring a primary key of the log data;
monitoring transactions in the database of the first service system based on the primary key of the log data;
when it is monitored that a transaction having the same primary key as the service data has changed in the database of the first service system, generating a latest timestamp for the changed transaction;
and acquiring the log data at the latest timestamp to obtain the execution log.
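The change-monitoring steps above can be sketched as follows. This is an illustrative stand-in, not the patent's implementation: rows are passed in by the caller instead of being read via Logstash, and all class and field names are assumptions:

```python
import time

class ChangeMonitor:
    """Track rows of the first service system's database by primary key;
    when a monitored row changes, stamp the latest timestamp and emit an
    execution-log entry for the change."""

    def __init__(self, watched_keys):
        self.watched = set(watched_keys)   # primary keys of the log data
        self.snapshot = {}                 # last observed row per primary key

    def observe(self, primary_key, row):
        """Return an execution-log entry if this row changed, else None."""
        if primary_key not in self.watched:
            return None                    # not a monitored transaction
        if self.snapshot.get(primary_key) != row:
            self.snapshot[primary_key] = row
            # latest timestamp generated for the changed transaction
            return {"pk": primary_key, "row": row, "ts": time.time()}
        return None
```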
Further, the step of writing the execution log into a Kafka message queue and consuming the execution log through the Kafka message queue to obtain a log consumption result specifically includes:
analyzing the execution log, and generating a category creation instruction based on the content of the execution log;
creating a corresponding topic category in the kafka message queue based on the category creation instruction;
transmitting the execution log from the temporary storage area to the topic category for storage;
executing a preset log consumption instruction to consume the execution log in the topic category to obtain a log consumption result.
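The topic-creation and consumption steps above can be sketched with an in-memory stand-in for the Kafka message queue. A real deployment would use a Kafka client library; the topic-naming rule and the "last write wins" consumption result here are illustrative assumptions:

```python
from collections import defaultdict, deque

class MiniQueue:
    """In-memory stand-in for the Kafka steps: create a topic category from
    the execution log's content, store execution logs under it, then consume
    them to produce a log consumption result."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def create_topic_for(self, execution_log):
        # Derive the topic category from the log content (assumed rule:
        # one topic per source table).
        topic = execution_log.get("table", "default")
        self.topics[topic]               # ensure the topic exists
        return topic

    def produce(self, topic, execution_log):
        self.topics[topic].append(execution_log)

    def consume(self, topic):
        # Log consumption result: the current state per primary key,
        # with later log entries overwriting earlier ones.
        state = {}
        while self.topics[topic]:
            log = self.topics[topic].popleft()
            state[log["pk"]] = log["row"]
        return state
```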
Further, the step of sending the service data in the data warehouse and the log consumption result to the stream calculation engine, and performing data integration calculation on the service data and the log consumption result through the stream calculation engine to obtain integrated data specifically includes:
collecting the service data in the data warehouse according to a preset first scheduling frequency to generate a first Spark stream object;
collecting the log consumption result according to a preset second scheduling frequency to generate a second Spark stream object;
performing object conversion on the first Spark stream object and the second Spark stream object respectively, converting each of them into RDD objects;
and merging, by the stream calculation engine, the first Spark stream object and the second Spark stream object after their conversion into RDD objects, to obtain the integrated data.
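The merge step above can be illustrated with plain Python lists standing in for the RDDs obtained from the two Spark stream objects. This is an assumption-laden sketch (record shape, timestamp field, and "freshest record wins" merge rule are all illustrative), not the patent's implementation:

```python
def integrate_batches(business_batch, log_consumption_batch):
    """Union the service-data batch with the log-consumption batch and keep
    the freshest record per primary key, yielding the integrated data for
    one micro-batch."""
    merged = {}
    # Union of the two "RDDs", ordered so newer timestamps overwrite older ones.
    for record in sorted(business_batch + log_consumption_batch,
                         key=lambda r: r["ts"]):
        merged[record["pk"]] = record
    return list(merged.values())
```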
Further, the step of executing a preset SQL update statement to synchronize the integrated data to the second service library specifically includes:
calling the RDD foreach tool to partition the integrated data to obtain a plurality of RDD partitions;
creating a connection object between each of the RDD partitions and the second service library;
and synchronizing the data in each RDD partition into the second service library through the connection objects.
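The partition-and-synchronize steps above can be sketched with SQLite standing in for the second service library and simple list slices standing in for RDD partitions. Table and column names are hypothetical; the point is one connection object per partition and an SQL upsert per record:

```python
import sqlite3

def sync_partitions(integrated_data, db_path, partitions=2):
    """Split the integrated data into partitions, open one connection object
    per partition, and run a preset SQL update (upsert) for every record."""
    # Divide the integrated data into `partitions` slices.
    slices = [integrated_data[i::partitions] for i in range(partitions)]
    for part in slices:
        conn = sqlite3.connect(db_path)        # connection object per partition
        with conn:                             # commit on success
            for rec in part:
                conn.execute(
                    "INSERT INTO service(pk, val) VALUES(?, ?) "
                    "ON CONFLICT(pk) DO UPDATE SET val = excluded.val",
                    (rec["pk"], rec["val"]),
                )
        conn.close()
```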
In order to solve the above technical problems, the embodiments of the present application further provide a device for synchronizing stream data, which adopts the following technical scheme:
An apparatus for stream data synchronization, comprising:
The instruction receiving module is used for receiving a data synchronization instruction, acquiring service data corresponding to the data synchronization instruction and sending the service data to the data warehouse;
The log acquisition module is used for acquiring an execution log corresponding to the service data from the first service system through the Logstash tool;
The log consumption module is used for writing the execution log into a Kafka message queue, and consuming the execution log through the Kafka message queue to obtain a log consumption result;
The data integration module is used for sending the service data in the data warehouse and the log consumption result to the stream calculation engine, and performing data integration calculation on the service data and the log consumption result through the stream calculation engine to obtain integrated data;
And the data synchronization module is used for executing a preset SQL update statement to synchronize the integrated data to the second service library.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical scheme:
A computer device comprising a memory having stored therein computer readable instructions which, when executed by a processor, implement the steps of the method of stream data synchronization described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the method of stream data synchronization described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
The application discloses a method, a device, computer equipment and a storage medium for synchronizing stream data, belonging to the technical field of big data. The data synchronization platform of the application introduces a stream calculation engine on the basis of a data warehouse, realizes real-time synchronization of stream data by establishing a mode combining offline calculation and stream calculation, and greatly improves the real-time performance of data synchronization.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description of the drawings required for the embodiments is given below. It is apparent that the drawings described below illustrate only some embodiments of the present application, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow chart of one embodiment of a method of stream data synchronization in accordance with the present application;
Fig. 3 shows a schematic structural diagram of an embodiment of an apparatus for stream data synchronization according to the present application;
Fig. 4 shows a schematic structural diagram of an embodiment of a computer device according to the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for synchronizing stream data provided by the embodiment of the present application is generally executed by a server, and accordingly, the device for synchronizing stream data is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the application, a Hive database built on the Hadoop architecture is used as the data warehouse. Hadoop is a distributed system infrastructure developed by the Apache Foundation; a user can develop distributed programs without knowing the details of the distributed bottom layer and make full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, one component of which is HDFS (Hadoop Distributed File System). The Sqoop tool is used for transferring data between the data warehouse and traditional databases (such as MySQL and PostgreSQL). Sqoop is an open-source tool that can import data from a relational database (such as MySQL, Oracle or Postgres) into the HDFS of Hadoop, and can also export HDFS data into a relational database.
In a specific embodiment, offline processing is currently realized for a personnel selection system by configuring a data warehouse. That is, while the system is in a busy stage, data of systems such as core personnel, recruitment, performance, training and leave-and-attendance is extracted to the data warehouse, and a scheduling platform then executes scripts to clean and process the data at regular times. However, in order to save server resources, the scheduling platform schedules the data only once a day, namely in the early morning when the system enters an idle stage, and the processed data is synchronized to the personnel selection system for use in candidate selection. The timeliness of the data synchronization is therefore T+1, and real-time synchronization cannot be realized.
The data synchronization platform of the application introduces a stream calculation engine on the basis of the data warehouse, realizes real-time synchronization of stream data by establishing a mode combining offline calculation and stream calculation, and greatly improves the real-time performance of data synchronization. The stream data synchronization method of the application can be applied to a personnel selection system to improve its timeliness. The personnel selection system comprises a first service library, a second service library, a data warehouse, a Logstash tool, a Kafka message queue and a Spark Streaming stream calculation engine. The first service library comprises databases providing personnel selection evaluation data, such as a core personnel service library, a recruitment service library, a performance service library, a training service library and a leave-and-attendance service library. The first service library is connected with the Logstash tool and the data warehouse respectively; the Logstash tool is connected with the Kafka message queue; the data warehouse and the Kafka message queue are each connected with the Spark Streaming stream calculation engine; Spark Streaming is connected with the second service library; and the second service library is the selection service library.
The stream calculation engine is the Spark Streaming engine. Spark Streaming is an extension of the Spark core API that can implement high-throughput processing of real-time stream data with a fault-tolerance mechanism. Spark Streaming supports retrieving data from a variety of data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis and TCP sockets. After the data is obtained from the data source, high-level functions such as map, reduce, join and window can be used to implement complex algorithms, and the processing results can finally be stored in file systems, databases and live dashboards. Like the other sub-frameworks of Spark, Spark Streaming is also based on the Spark core. The internal processing mechanism of Spark Streaming is to receive a real-time input data stream, split it into batches of data according to a certain time interval (e.g. 1 second), and then process each batch through the Spark engine, finally obtaining processed batches of result data.
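The micro-batch mechanism described in this paragraph can be illustrated in miniature with pure Python, splitting by record count instead of by a time interval (a toy sketch, not Spark Streaming itself):

```python
def micro_batches(stream, batch_size):
    """Split a continuous input stream into small batches that a downstream
    engine would then process one batch at a time. Spark Streaming splits by
    time interval; here we split by record count for simplicity."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch                # hand a completed batch to the engine
            batch = []
    if batch:
        yield batch                    # flush the final partial batch
```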
With continued reference to fig. 2, a flow chart of one embodiment of a method of stream data synchronization according to the present application is shown. The method for synchronizing stream data comprises the following steps:
s201, receiving a data synchronization instruction, acquiring service data corresponding to the data synchronization instruction, and sending the service data to the data warehouse.
Specifically, the server receives the data synchronization instruction, acquires the service data corresponding to the data synchronization instruction, and sends the service data to the data warehouse. The data synchronization instruction carries a corresponding data processing identifier, which is generated according to the processing requirements and is used for instructing the data warehouse to pre-process the service data to be synchronized according to those requirements. The data warehouse is arranged to realize offline processing of the synchronized data so as to reduce the pressure on the service databases. The service data is stream data.
Streaming data refers to data that is continuously generated by thousands of data sources, typically sent simultaneously in the form of small data records (on the order of several kilobytes). Streaming data covers many kinds of data, such as log files generated by customers using mobile or Web applications, online shopping data, in-game player activity, social networking site information, financial trading floor or geospatial services, and telemetry data from connected devices or instruments within data centers.
In this embodiment, the electronic device (e.g., the server/terminal device shown in fig. 1) on which the method of stream data synchronization operates may receive the data synchronization instruction through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection may include, but is not limited to, 3G/4G connection, wiFi connection, bluetooth connection, wiMAX connection, zigbee connection, UWB (ultra wideband) connection, and other now known or later developed wireless connection.
S202, collecting an execution log corresponding to the service data from the first service system through the Logstash tool.
Logstash is a platform for application log and event transmission, processing, management and searching. It can be used for unified collection and management of application logs and provides a Web interface for query and statistics. In brief, Logstash serves as a bridge between data sources and data storage or analysis tools, and combined with Elasticsearch and Kibana it can greatly facilitate data processing and analysis. Logstash ships with over 200 plug-ins and can accept almost any kind of data, including logs, network requests, relational databases, and sensor or Internet-of-Things data. The data processing pipeline of Logstash mainly comprises three stages: inputs, filters and outputs; in addition, codecs can be used within inputs and outputs to process data formats. All four parts are plug-ins, and a user defines a pipeline configuration file that sets the input, filter, output and codec plug-ins required, so as to realize the specific functions of data acquisition, data processing and data output.
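As an illustration of the pipeline structure just described, a minimal Logstash pipeline configuration might look as follows. The `jdbc` input and `kafka` output are standard Logstash plug-ins, but every connection string, credential, table name and topic here is a hypothetical placeholder, not a value from the patent:

```
input {
  jdbc {
    # Hypothetical connection to the first service system's database
    jdbc_connection_string => "jdbc:mysql://first-service-db:3306/business"
    jdbc_user => "reader"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    statement => "SELECT * FROM execution_log WHERE ts > :sql_last_value"
    schedule => "* * * * *"          # poll once a minute
  }
}
filter {
  mutate { add_field => { "source" => "first_service_system" } }
}
output {
  kafka {
    bootstrap_servers => "kafka:9092"
    topic_id => "execution-log"      # the topic category of step S203
  }
}
```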
Specifically, the server collects an execution log corresponding to the service data from the first service system through the Logstash tool. It should be noted that when the service data is updated, the update information is recorded in the execution log. By collecting and reading the execution log corresponding to the service data with the Logstash tool, the current state of the service data can be determined; the execution log thus makes it possible to determine whether the service data has changed during the data synchronization process, and if so, the service data stored in the data warehouse is updated.
S203, writing the execution log into a Kafka message queue, and consuming the execution log through the Kafka message queue to obtain a log consumption result.
Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of consumers in a web site. Such actions (web browsing, searching and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, such data is typically handled by processing logs and log aggregation. This is a viable solution for log data and offline analysis systems like Hadoop, but it is limited when real-time processing is required. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through clusters.
Specifically, the server consumes the execution log acquired by the Logstash tool through a preset Kafka message queue to obtain a log consumption result, from which it can be determined whether the service data has changed after being sent to the data warehouse, together with the current state information of the service data.
S204, sending the service data in the data warehouse and the log consumption result to the stream calculation engine, and performing data integration calculation on the service data and the log consumption result through the stream calculation engine to obtain integrated data.
Specifically, the stream calculation engine of the application is Spark Streaming. The server transmits the service data stored in the data warehouse and the log consumption result output by the Kafka message queue to the Spark Streaming engine, which integrates the service data and the log consumption result to obtain the integrated data.
It should be noted that the server performs data acquisition on the service data according to a preset first scheduling frequency to obtain a first Spark stream object, and performs data acquisition on the log consumption result according to a preset second scheduling frequency to obtain a second Spark stream object. The server then converts the first Spark stream object and the second Spark stream object into RDD objects respectively, and the stream calculation engine Spark Streaming merges the first Spark stream object and the second Spark stream object after their conversion into RDD objects to obtain the integrated data.
The data synchronization platform introduces the stream calculation engine on the basis of the data warehouse, realizes real-time synchronization of stream data by establishing a mode that combines offline calculation with stream calculation, and greatly improves the real-time performance of data synchronization.
S205, executing a preset SQL update statement to synchronize the integrated data to the second service library.
Specifically, after the server obtains the integrated data, it automatically executes a preset SQL update statement to synchronize the integrated data to the second service library and complete real-time data synchronization, wherein the SQL update statement is used for indicating data synchronization and updating, and is preconfigured by a developer.
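For illustration only, the following Python sketch simulates this synchronization step with an in-memory SQLite database standing in for the second service library; the table name, column names and upsert statement are hypothetical, not the preconfigured statement of the embodiment:

```python
import sqlite3

# Hypothetical preset statement: insert each integrated row, or update it
# if a row with the same primary key already exists in the target library.
PRESET_UPSERT = """
INSERT INTO service_table (id, payload)
VALUES (?, ?)
ON CONFLICT(id) DO UPDATE SET payload = excluded.payload
"""

def sync_integrated_data(conn, integrated_rows):
    """Execute the preset SQL update statement for every integrated row."""
    with conn:  # commit once per batch
        conn.executemany(PRESET_UPSERT, integrated_rows)

# Simulated "second service library"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO service_table VALUES (1, 'stale')")

sync_integrated_data(conn, [(1, "fresh"), (2, "new")])
rows = sorted(conn.execute("SELECT id, payload FROM service_table"))
print(rows)  # [(1, 'fresh'), (2, 'new')]
```

The upsert form makes the statement idempotent, which matters when the same integrated batch may be replayed after a failure.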
According to the application, the stream calculation engine performs data integration calculation on the service data and the log consumption result of the first service system to obtain integrated data, and the integrated data is synchronized to the second service library to realize real-time synchronization of the stream data, so that the stream data synchronization instantaneity is improved.
Further, after the receiving the data synchronization instruction, acquiring service data corresponding to the data synchronization instruction, and sending the service data to the data warehouse, the method further includes:
acquiring a data processing identifier corresponding to the service data from the data synchronization instruction;
searching a data processing script corresponding to the data processing identifier in a preset script library;
And processing the business data based on the data processing script.
Specifically, the server acquires a data processing identifier corresponding to the service data from the data synchronization instruction, searches a preset script library for the data processing script corresponding to the data processing identifier, and processes the service data based on the data processing script. The data synchronization instruction carries the corresponding data processing identifier, which is generated according to the processing requirements and is used for instructing the data warehouse to pre-process the service data to be synchronized according to those requirements; the data warehouse is arranged to realize offline processing of the synchronized data so as to reduce the pressure on the service library.
In a specific embodiment of the present application, the data processing script is a hive statement preconfigured by a developer. One such data processing script is "insert into table1 select t2.column1, t3.column2 from table2 t2 join table3 t3 on t2.column1 = t3.column2 group by t2.column1, t3.column2". When executing this processing script, the server joins table2 and table3 on the given columns, groups the result, and inserts the grouped rows into table1.
In the above embodiment, the data processing identifier is obtained to search the data processing script, and the service data is processed in advance based on the data processing script, so that the offline processing of the synchronous data is realized, and the pressure of the service library is reduced.
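The identifier-to-script lookup described above can be sketched as follows; this is a minimal Python illustration in which the identifier names, the registry structure and the instruction fields are all hypothetical, not the format defined by the application:

```python
# Hypothetical preset script library: maps data processing identifiers
# carried by the synchronization instruction to preconfigured scripts.
SCRIPT_LIBRARY = {
    "dedup": "SELECT DISTINCT * FROM staging",
    "join_dims": ("insert into table1 select t2.column1, t3.column2 "
                  "from table2 t2 join table3 t3 on t2.column1 = t3.column2 "
                  "group by t2.column1, t3.column2"),
}

def lookup_script(sync_instruction):
    """Extract the processing identifier from the instruction and look up
    the corresponding data processing script in the preset library."""
    ident = sync_instruction["processing_id"]
    script = SCRIPT_LIBRARY.get(ident)
    if script is None:
        raise KeyError(f"no data processing script for identifier {ident!r}")
    return script

instruction = {"processing_id": "join_dims"}
script = lookup_script(instruction)
print(script.startswith("insert into table1"))  # True
```

Keeping the scripts in a registry keyed by identifier is what lets the warehouse pre-process data per instruction without redeploying code.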
Further, the step of collecting, by the logstash tool, an execution log corresponding to the service data from the first service system specifically includes:
Monitoring a database of the first business system, and acquiring an execution log of the changed transaction through the logstash tool when the transaction of the database is monitored to be changed;
and establishing a temporary storage area in the database, and writing the execution log into the temporary storage area.
The logstash tool is a platform for application log and event transmission, processing, management and searching. It can be used for unified collection and management of application logs, and provides a Web interface for query and statistics.
Specifically, the server monitors the database of the first service system in real time, and when it monitors that a transaction of the database has changed, it controls the logstash tool to obtain the execution log corresponding to the changed transaction. The server establishes a temporary storage area in the database in advance, and writes the execution log acquired by the logstash tool into the temporary storage area for temporary staging.
Further, the step of monitoring the database of the first service system, when it is monitored that a database transaction of the database changes, obtaining, by the logstash tool, an execution log of the changed transaction specifically includes:
Determining log data corresponding to the service data, and acquiring a primary key of the log data;
monitoring the transaction in the first business system database based on the primary key of the log data;
when it is monitored that a transaction having the same primary key as the business data in the database of the first business system has changed, generating a latest timestamp of the changed transaction;
and acquiring the log data of the latest timestamp to obtain an execution log.
Wherein the primary key of the log data is one or more fields in a table whose values uniquely identify a record in the table; the primary key serves as the reference for monitoring database transactions of the database of the first business system. A timestamp is data generated using digital signature technology, where the signed object includes information such as the original file information, the signature parameters and the signing time. A timestamping system is used to generate and manage timestamps, digitally signing the signed object to produce a timestamp that proves the original file existed before the signing time.
Specifically, the server analyzes the service data to determine the log data corresponding to the service data, and obtains the primary key of the log data. The server then searches the database of the first service system based on the primary key of the log data, determines the transactions in that database having the same primary key as the service data, and monitors the retrieved transactions in real time. When it is monitored that a transaction having the same primary key as the service data has changed, the server generates the latest timestamp of the changed transaction, and acquires the log data at the latest timestamp to obtain the execution log.
In the above embodiment, by monitoring in real time the transactions having the same primary key as the service data in the database of the first service system, the logstash tool obtains the execution log of a changed transaction as soon as such a change is monitored. The current state of the service data can then be determined from the execution log, so as to judge whether the service data changed during the data synchronization process; if so, the service data stored in the data warehouse is updated.
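The primary-key-filtered monitoring above can be sketched in Python as follows; the event dictionaries and field names are hypothetical, and a plain list stands in for the monitored transaction stream:

```python
import time

def collect_execution_log(events, primary_key):
    """Watch a stream of transaction events; whenever a transaction whose
    primary key matches the monitored business data changes, stamp it with
    the latest timestamp and keep the newest entry as the execution log."""
    latest = None
    for event in events:
        if event["pk"] != primary_key:
            continue  # transaction belongs to other business data; ignore
        stamped = dict(event, ts=time.time())  # latest timestamp of the change
        if latest is None or stamped["ts"] >= latest["ts"]:
            latest = stamped
    return latest

events = [
    {"pk": "order-1", "op": "UPDATE", "row": {"status": "paid"}},
    {"pk": "order-2", "op": "INSERT", "row": {"status": "new"}},
    {"pk": "order-1", "op": "UPDATE", "row": {"status": "shipped"}},
]
log = collect_execution_log(events, "order-1")
print(log["row"]["status"])  # shipped
```

Filtering by primary key before timestamping keeps the execution log limited to the business data actually being synchronized.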
Further, the step of writing the execution log into a kafka message queue and consuming the execution log through the kafka message queue to obtain a log consumption result specifically includes:
analyzing the execution log, and generating a category creation instruction based on the content of the execution log;
creating a corresponding topic category in the kafka message queue based on the category creation instruction;
transmitting the execution log from the temporary storage area to the topic category for storage;
Executing a preset log consumption instruction to consume the execution log in the topic category to obtain a log consumption result.
Specifically, the server analyzes the acquired execution log, acquires the content of the execution log, generates a category creation instruction of the kafka message queue based on the content of the execution log, creates a corresponding topic category in the kafka message queue based on the category creation instruction, sends the execution log from the temporary storage area to the topic category for storage, and consumes the execution log in the topic category by executing a preset log consumption instruction to obtain a log consumption result.
In the above embodiment, a category creation instruction is generated based on the content of the execution log, the corresponding topic category is created in the kafka message queue based on the category creation instruction, the execution log is sent from the temporary storage area to the topic category for storage, and a preset log consumption instruction is then executed to consume the execution log in the topic category to obtain a log consumption result, wherein whether the service data has changed after being sent to the data warehouse, the current state information of the service data and the like can be determined from the log consumption result.
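The topic-creation, transfer and consumption steps above can be illustrated with a toy in-memory queue; this Python sketch is not the Kafka client API, and the topic naming scheme derived from the log content is an assumption:

```python
from collections import defaultdict, deque

class MiniQueue:
    """Toy stand-in for a Kafka broker: topics are named deques."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def create_topic(self, name):
        self.topics[name]  # materialize the topic if absent

    def produce(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic):
        """Drain the topic in order, like the preset consumption step."""
        out = []
        while self.topics[topic]:
            out.append(self.topics[topic].popleft())
        return out

def topic_for(log_entry):
    """Derive a topic category from the content of the execution log
    (the naming scheme here is hypothetical)."""
    return f"exec-log-{log_entry['table']}"

queue = MiniQueue()
temp_area = [{"table": "orders", "op": "UPDATE"},
             {"table": "orders", "op": "INSERT"}]

for entry in temp_area:             # category creation + transfer from temp area
    topic = topic_for(entry)
    queue.create_topic(topic)
    queue.produce(topic, entry)

result = queue.consume("exec-log-orders")
print(len(result))  # 2
```

Routing each execution log into a topic named after its content is what lets downstream consumers subscribe only to the changes they care about.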
Further, the step of sending the service data and the log consumption result in the data warehouse to the flow calculation engine, and performing data integration calculation on the service data and the log consumption result by the flow calculation engine to obtain integrated data specifically includes:
Collecting business data in the data warehouse according to a preset first scheduling frequency, and generating a first Spark stream object;
collecting the log consumption result according to a preset second scheduling frequency to generate a second Spark stream object;
Performing object conversion on the first Spark stream object and the second Spark stream object respectively, and converting the first Spark stream object and the second Spark stream object into RDD objects;
and merging, by the stream calculation engine, the first Spark stream object and the second Spark stream object after their conversion into RDD objects to obtain integrated data.
The first scheduling frequency and the second scheduling frequency can be configured in advance according to requirements, and the data synchronization instruction carries the first scheduling frequency and the second scheduling frequency.
Specifically, the server collects service data in the data warehouse according to a preset first scheduling frequency, and generates a first Spark stream object. In a specific embodiment, the business data of the data warehouse is loaded, and the second parameter of StreamingContext is set to the batch time, that is, the first scheduling frequency. Because the stock data of the data warehouse is not updated frequently, the batch time can be set to 8 hours (28800 seconds), and the first Spark stream object is obtained as follows:

val conf: SparkConf = new SparkConf().setAppName("hdfsWD").setMaster("local[2]")
val sc = new StreamingContext(conf, Seconds(28800))
val textFS: DStream[String] = sc.textFileStream("hdfs://node01:8020/data")
The server controls the logstash tool to collect the execution log corresponding to the service data from the first service system according to the preset second scheduling frequency, the execution log is consumed by the kafka message queue to obtain a log consumption result, and the server reads the log consumption result to generate a second Spark stream object. In a specific embodiment, the log consumption result of kafka is loaded; because incremental data in the first service system is updated continuously, the batch time is set to 1 minute, that is, the second scheduling frequency, and the second Spark stream object is obtained as follows:

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingWithKafka")
val ssc = new StreamingContext(sparkConf, Seconds(60))
The first Spark stream object and the second Spark stream object are then converted into RDD objects, and the stream calculation engine merges the first Spark stream object and the second Spark stream object after their conversion into RDD objects to obtain the integrated data.
The RDD object is a Resilient Distributed Dataset (RDD), a core concept in Spark. An RDD is a generic data object that can be understood as a data container; it is a composite data structure that supports merging of data structures.
In the above embodiment, the service data of the data warehouse and the log consumption result are sampled at different scheduling frequencies to obtain the corresponding Spark stream objects, so that the stream data is divided into batches and processed in turn. The obtained Spark stream objects are converted into RDD objects, and the stream calculation engine Spark Streaming merges the Spark stream objects after their conversion into RDD objects to obtain the integrated data; real-time synchronization of stream data is realized by integrating the data through the stream calculation engine Spark Streaming.
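The effect of merging the slow warehouse batch with the fast incremental batch can be sketched outside Spark as a key-based merge; this Python illustration is an assumption about the merge semantics (an incremental record overrides the warehouse record with the same primary key), not the Spark Streaming API itself:

```python
def integrate(batch_snapshot, incremental_updates):
    """Merge the slow-refresh warehouse snapshot (first stream, 8 h batches)
    with the fast log-consumption updates (second stream, 1 min batches).
    An incremental record overrides the stale warehouse record that has
    the same primary key, mirroring a union of the two RDDs followed by
    a latest-wins reduction per key."""
    merged = dict(batch_snapshot)
    merged.update(incremental_updates)
    return merged

warehouse = {"order-1": "paid", "order-2": "new"}          # first stream
increments = {"order-1": "shipped", "order-3": "created"}  # second stream
integrated = integrate(warehouse, increments)
print(integrated)
# {'order-1': 'shipped', 'order-2': 'new', 'order-3': 'created'}
```

The key point is that neither source alone is sufficient: the warehouse batch carries the full stock data, while the incremental batch carries only recent changes; the merged view is what gets synchronized.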
Further, the step of executing a preset SQL update statement to synchronize the integrated data to the second service library specifically includes:
calling the RDD foreach tool to divide the integrated data to obtain a plurality of RDD partitions;
creating a connection object between each said RDD partition and said second service library;
And synchronizing the data in each RDD partition into the second service library through the connection object.
Specifically, the server calls the RDD foreach tool to traverse the integrated data and divide it into a plurality of RDD partitions, then calls the foreachPartition tool to create a connection object between each RDD partition and the second service library, and writes the data in each RDD partition into the second service library through the connection object.
In the above embodiment, by dividing the integrated data into a plurality of RDD partitions and creating a connection object between each RDD partition and the second service library, the data in each RDD partition can be written into the second service library through that connection object during synchronization, so the number of connection objects created is greatly reduced and the system pressure is lowered.
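The connection-per-partition pattern can be sketched as follows; this Python illustration simulates Spark's foreachPartition behavior with plain lists, and the Connection class is a hypothetical stand-in for a real database connection:

```python
connections_opened = 0

class Connection:
    """Toy connection to the second service library; counts how many
    connection objects are created overall."""
    def __init__(self):
        global connections_opened
        connections_opened += 1
        self.rows = []

    def write(self, record):
        self.rows.append(record)

def partition(records, n):
    """Split the integrated data into n RDD-style partitions."""
    return [records[i::n] for i in range(n)]

def sync(records, n_partitions):
    """foreachPartition pattern: one connection per partition,
    not one per record."""
    sink = []
    for part in partition(records, n_partitions):
        conn = Connection()          # created once per partition
        for record in part:
            conn.write(record)
        sink.extend(conn.rows)
    return sink

data = list(range(10))
written = sync(data, 3)
print(connections_opened, len(written))  # 3 10
```

Had the connection been created inside the inner loop, ten connections would have been opened for ten records; amortizing the connection over a partition is exactly the pressure reduction the embodiment describes.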
The application discloses a stream data synchronization method, and belongs to the technical field of big data. The data synchronization platform of the application introduces a stream calculation engine on the basis of a data warehouse, realizes real-time synchronization of stream data by establishing a mode that combines offline calculation with stream calculation, and greatly improves the real-time performance of data synchronization.
It should be emphasized that, to further ensure the privacy and security of the service data, the service data may also be stored in a node of a blockchain.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, each block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods of the embodiments described above may be accomplished by way of computer readable instructions, stored on a computer readable storage medium, which when executed may comprise processes of embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2 described above, the present application provides an embodiment of an apparatus for synchronizing stream data, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the apparatus for stream data synchronization according to the present embodiment includes:
The instruction receiving module 301 is configured to receive a data synchronization instruction, obtain service data corresponding to the data synchronization instruction, and send the service data to the data warehouse;
The log collection module 302 is configured to collect, by using a logstash tool, an execution log corresponding to the service data from the first service system;
The log consuming module 303 is configured to write the execution log into a kafka message queue, and consume the execution log through the kafka message queue to obtain a log consumption result;
the data integration module 304 is configured to send the service data and the log consumption result in the data warehouse to the flow calculation engine, and perform data integration calculation on the service data and the log consumption result through the flow calculation engine to obtain integrated data;
The data synchronization module 305 is configured to execute a preset SQL update statement to synchronize the integrated data to the second service library.
Further, the device for synchronizing stream data further includes:
The identification acquisition module is used for acquiring a data processing identification corresponding to the service data from the data synchronization instruction;
the script acquisition module is used for searching a data processing script corresponding to the data processing identifier in a preset script library;
And the data processing module is used for processing the business data based on the data processing script.
Further, the log collection module 302 specifically includes:
the log acquisition unit is used for monitoring a database of the first service system, and acquiring an execution log of the changed transaction through the logstash tool when the transaction of the database is monitored to be changed;
And the log storage unit is used for establishing a temporary storage area in the database and writing the execution log into the temporary storage area.
Further, the log collection unit specifically includes:
the key word acquisition subunit is used for determining log data corresponding to the service data and acquiring a main key word of the log data;
the transaction monitoring subunit is used for monitoring the transaction in the first business system database based on the primary key of the log data;
a timestamp generation subunit, configured to generate, when it is monitored that a transaction having the same primary key as the service data in the database of the first service system changes, a latest timestamp of the changed transaction;
and the log acquisition subunit is used for acquiring the log data of the latest timestamp to obtain an execution log.
Further, the log consumption module 303 specifically includes:
The creation instruction generation unit is used for analyzing the execution log and generating a category creation instruction based on the content of the execution log;
a topic category creation unit configured to create a corresponding topic category in the kafka message queue based on the category creation instruction;
The transfer storage unit is used for sending the execution log from the temporary storage area to the topic category for storage;
And the log consumption unit is used for executing a preset log consumption instruction to consume the execution log in the topic category to obtain a log consumption result.
Further, the data integration module 304 specifically includes:
The first object acquisition unit is used for acquiring service data in the data warehouse according to a preset first scheduling frequency and generating a first Spark stream object;
The second object acquisition unit is used for acquiring the log consumption result according to a preset second scheduling frequency and generating a second Spark stream object;
the object conversion unit is used for respectively carrying out object conversion on the first Spark stream object and the second Spark stream object and converting the first Spark stream object and the second Spark stream object into RDD objects;
And the data integration unit is used for merging, by the stream calculation engine, the first Spark stream object and the second Spark stream object after their conversion into RDD objects to obtain integrated data.
Further, the data synchronization module 305 specifically includes:
the data dividing unit is used for calling the RDD foreach tool to divide the integrated data to obtain a plurality of RDD partitions;
A connection object creation unit configured to create a connection object between each of the RDD partitions and the second service library;
And the data synchronization unit is used for synchronizing the data in each RDD partition to the second service library through the connection object.
The application discloses a stream data synchronization device, and belongs to the technical field of big data. The data synchronization platform of the application introduces a stream calculation engine on the basis of a data warehouse, realizes real-time synchronization of stream data by establishing a mode that combines offline calculation with stream calculation, and greatly improves the real-time performance of data synchronization.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 which are communicatively connected to each other via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figures, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store an operating system installed on the computer device 4 and various application software, such as computer readable instructions of a method for synchronizing stream data. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as computer readable instructions for executing the method of synchronizing stream data.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The application discloses a method, a device, computer equipment and a storage medium for synchronizing stream data, and belongs to the technical field of big data. The data synchronization platform of the application introduces a stream calculation engine on the basis of a data warehouse, realizes real-time synchronization of stream data by establishing a mode that combines offline calculation with stream calculation, and greatly improves the real-time performance of data synchronization.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of a method for synchronizing stream data as described above.
The application discloses a method, a device, computer equipment and a storage medium for synchronizing stream data, and belongs to the technical field of big data. The data synchronization platform of the application introduces a stream calculation engine on the basis of a data warehouse, realizes real-time synchronization of stream data by establishing a mode that combines offline calculation with stream calculation, and greatly improves the real-time performance of data synchronization.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment methods may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware, although in many cases the former is the preferred implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some embodiments of the present application, not all of them, and the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their features. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the application.

Claims (9)

1. A method of stream data synchronization, comprising:
Receiving a data synchronization instruction, acquiring service data corresponding to the data synchronization instruction, and sending the service data to a data warehouse;
Collecting an execution log corresponding to the service data from a first service system through a logstash tool;
writing the execution log into a kafka message queue, and consuming the execution log through the kafka message queue to obtain a log consumption result;
The business data and the log consumption result in the data warehouse are sent to a stream calculation engine, and data integration calculation is carried out on the business data and the log consumption result through the stream calculation engine to obtain integrated data;
The step of sending the business data and the log consumption result in the data warehouse to the flow calculation engine, and performing data integration calculation on the business data and the log consumption result by the flow calculation engine to obtain integrated data specifically comprises the following steps:
Collecting business data in the data warehouse according to a preset first scheduling frequency, and generating a first Spark stream object;
collecting the log consumption result according to a preset second scheduling frequency to generate a second Spark stream object;
Performing object conversion on the first Spark stream object and the second Spark stream object respectively, and converting the first Spark stream object and the second Spark stream object into RDD objects;
merging, by the stream calculation engine, the first Spark stream object and the second Spark stream object after their conversion into RDD objects to obtain integrated data;
Executing a preset SQL update statement to synchronize the integrated data to the second service library.
2. The method for synchronizing stream data according to claim 1, wherein after said receiving a data synchronization instruction, acquiring service data corresponding to said data synchronization instruction, and transmitting said service data to a data warehouse, further comprising:
acquiring a data processing identifier corresponding to the service data from the data synchronization instruction;
searching a data processing script corresponding to the data processing identifier in a preset script library;
And processing the business data based on the data processing script.
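Claim 2's identifier-to-script lookup can be sketched as a plain mapping from a data-processing identifier to a processing routine. The identifiers and routines below are hypothetical; the patent does not name any concrete scripts.

```python
# Hypothetical script library: maps a data-processing identifier carried in
# the synchronization instruction to a processing routine.
SCRIPT_LIBRARY = {
    "trim": lambda rec: {k: v.strip() if isinstance(v, str) else v
                         for k, v in rec.items()},
    "upper_name": lambda rec: {**rec, "name": rec["name"].upper()},
}

def process(record, processing_id):
    """Search the script library for the identifier and apply the matching
    script to the business data; unknown identifiers leave it unchanged."""
    script = SCRIPT_LIBRARY.get(processing_id)
    return script(record) if script else record
```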
3. The method for synchronizing stream data according to claim 1, wherein the step of collecting, by means of logstash tools, an execution log corresponding to the service data from the first service system, specifically comprises:
monitoring a database of the first business system, and, when a change of a transaction in the database is monitored, acquiring an execution log of the changed transaction through the logstash tool;
and establishing a temporary storage area in the database, and writing the execution log into the temporary storage area.
4. The method for synchronizing stream data as recited in claim 3, wherein the step of monitoring a database of the first business system and, when a transaction change in the database is monitored, obtaining an execution log of the changed transaction by the logstash tool specifically comprises:
Determining log data corresponding to the service data, and acquiring a primary key of the log data;
monitoring the transaction in the first business system database based on the primary key of the log data;
when it is monitored that a transaction having the same primary key as the business data in the database of the first business system has changed, generating a latest timestamp of the changed transaction;
and acquiring the log data of the latest timestamp to obtain an execution log.
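The primary-key-based monitoring of claim 4 amounts to tracking, per watched key, the timestamp of the most recent change, then fetching only the log rows at that timestamp. A stdlib sketch (field names `pk` and `ts` are illustrative; real timestamps would come from the database):

```python
class ChangeMonitor:
    """Per-primary-key change tracking as in claim 4: watch transactions
    whose primary key matches the business data, record a latest timestamp
    for each change, and fetch only the log rows at that timestamp."""
    def __init__(self, watched_keys):
        self.watched = set(watched_keys)
        self.latest = {}  # primary key -> latest change timestamp

    def on_transaction(self, primary_key, ts):
        """Record a change for a watched key; unwatched keys are ignored."""
        if primary_key in self.watched:
            self.latest[primary_key] = ts

    def execution_log(self, log_rows):
        """Keep only the rows matching the latest recorded change per key."""
        return [r for r in log_rows if self.latest.get(r["pk"]) == r["ts"]]
```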
5. The method for synchronizing stream data according to claim 3, wherein the step of writing the execution log into a kafka message queue and consuming the execution log through the kafka message queue to obtain a log consumption result comprises:
analyzing the execution log, and generating a category creation instruction based on the content of the execution log;
creating a corresponding topic category in the kafka message queue based on the category creation instruction;
transmitting the execution log from the temporary storage area to the topic category for storage;
executing a preset log consumption instruction to consume the execution log in the topic category to obtain a log consumption result.
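Claim 5's mechanics (derive a topic category from the log content, store the log entry under that topic, then consume it to produce a consumption result) can be imitated without a Kafka broker by a toy in-memory queue. All field names are illustrative; a real system would use Kafka topic creation and consumer APIs.

```python
from collections import defaultdict, deque

class MiniQueue:
    """Toy stand-in for the topic handling in claim 5: a topic category is
    derived from the log's content, the log is appended under that topic,
    and consumption drains the topic to yield a consumption result."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def route(self, log_entry):
        # "Category creation instruction": derive the topic from log content.
        topic = log_entry.get("category", "default")
        self.topics[topic].append(log_entry)
        return topic

    def consume(self, topic):
        """Drain a topic, returning one consumption record per log entry."""
        results = []
        while self.topics[topic]:
            entry = self.topics[topic].popleft()
            results.append({"topic": topic, "message": entry["message"]})
        return results
```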
6. The method for synchronizing stream data according to claim 1, wherein the step of executing a predetermined SQL update statement to synchronize the integrated data to the second service library comprises:
calling an RDD foreach tool to divide the integrated data to obtain a plurality of RDD partitions;
creating a connection object between each said RDD partition and said second service library;
And synchronizing the data in each RDD partition into the second service library through the connection object.
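Claim 6 describes the standard Spark idiom of `RDD.foreachPartition`: open one database connection per partition rather than per record. The stdlib sketch below imitates that pattern without a cluster; the `connect` callable and its `write`/`close` interface are assumptions for illustration.

```python
def sync_partitions(integrated_data, n_partitions, connect):
    """Split the integrated data into partitions and open exactly one
    connection object per non-empty partition, mirroring the claimed
    per-RDD-partition connection to the second service library."""
    partitions = [integrated_data[i::n_partitions] for i in range(n_partitions)]
    opened = 0
    for part in partitions:
        if not part:
            continue
        conn = connect()          # one connection per partition, not per row
        opened += 1
        for row in part:
            conn.write(row)
        conn.close()
    return opened
```

Amortizing connection setup over a whole partition is the usual reason for this design: per-record connections would dominate the cost of the writes themselves.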
7. An apparatus for synchronizing stream data, comprising:
The instruction receiving module is used for receiving a data synchronization instruction, acquiring service data corresponding to the data synchronization instruction and sending the service data to a data warehouse;
the log acquisition module is used for acquiring an execution log corresponding to the service data from the first service system through a logstash tool;
the log consumption module is used for writing the execution log into a kafka message queue, and consuming the execution log through the kafka message queue to obtain a log consumption result;
The data integration module is used for sending the business data in the data warehouse and the log consumption result to a stream calculation engine, and carrying out data integration calculation on the business data and the log consumption result through the stream calculation engine to obtain integrated data;
the data integration module specifically comprises:
The first object acquisition unit is used for acquiring service data in the data warehouse according to a preset first scheduling frequency and generating a first Spark stream object;
The second object acquisition unit is used for acquiring the log consumption result according to a preset second scheduling frequency and generating a second Spark stream object;
the object conversion unit is used for respectively carrying out object conversion on the first Spark stream object and the second Spark stream object and converting the first Spark stream object and the second Spark stream object into RDD objects;
the data integration unit is used for integrating, by the stream calculation engine, the first Spark stream object and the second Spark stream object that have been converted into RDD objects, to obtain integrated data;
And the data synchronization module is used for executing a preset SQL update statement so as to synchronize the integrated data to the second service library.
8. A computer device comprising a memory having stored therein computer readable instructions and a processor which, when executing the computer readable instructions, implements the steps of the method of stream data synchronization as claimed in any one of claims 1 to 6.
9. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the method of stream data synchronization according to any one of claims 1 to 6.
CN202110728683.2A 2021-06-29 2021-06-29 Method, device, computer equipment and storage medium for synchronizing stream data Active CN113282611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110728683.2A CN113282611B (en) 2021-06-29 2021-06-29 Method, device, computer equipment and storage medium for synchronizing stream data

Publications (2)

Publication Number Publication Date
CN113282611A CN113282611A (en) 2021-08-20
CN113282611B true CN113282611B (en) 2024-04-23

Family

ID=77286104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110728683.2A Active CN113282611B (en) 2021-06-29 2021-06-29 Method, device, computer equipment and storage medium for synchronizing stream data

Country Status (1)

Country Link
CN (1) CN113282611B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836235B (en) * 2021-09-29 2024-04-09 平安医疗健康管理股份有限公司 Data processing method based on data center and related equipment thereof
CN115017223B (en) * 2022-08-04 2022-10-25 成都运荔枝科技有限公司 System supporting large data volume import and export
CN116089545B (en) * 2023-04-07 2023-08-22 云筑信息科技(成都)有限公司 Method for collecting storage medium change data into data warehouse

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252466A (en) * 2013-06-26 2014-12-31 阿里巴巴集团控股有限公司 Stream computing processing method, equipment and system
CN109862094A (en) * 2019-01-31 2019-06-07 福建智恒软件科技有限公司 A kind of water utilities device data sharing method and device based on stream calculation
CN109951463A (en) * 2019-03-07 2019-06-28 成都古河云科技有限公司 A kind of Internet of Things big data analysis method stored based on stream calculation and novel column
CN111127077A (en) * 2019-11-29 2020-05-08 中国建设银行股份有限公司 Recommendation method and device based on stream computing
CN111209258A (en) * 2019-12-31 2020-05-29 航天信息股份有限公司 Tax end system log real-time analysis method, equipment, medium and system
CN111414416A (en) * 2020-02-28 2020-07-14 平安科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112559638A (en) * 2021-02-20 2021-03-26 恒生电子股份有限公司 Data synchronization method, device, equipment and storage medium
CN112597205A (en) * 2020-12-30 2021-04-02 哈尔滨航天恒星数据系统科技有限公司 Real-time data calculation and storage method based on stream and message scheduling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11061926B2 (en) * 2018-10-02 2021-07-13 Target Brands, Inc. Data warehouse management and synchronization systems and methods


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant