CN108804697A

CN108804697A - Spark-based data synchronization method, apparatus, computer device, and storage medium

Info

Publication number: CN108804697A
Application number: CN201810620678.8A
Authority: CN
Inventors: 黄志辉
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2018-11-13

Abstract

This application relates to a Spark-based data synchronization method, apparatus, computer device, and storage medium. The method includes: obtaining business data generated by a Spark task, the business data including multiple data records; generating a data summary for each data record; obtaining from a source database the historical summaries corresponding to multiple historical records; comparing each data summary with the stored historical summaries to obtain newly added data summaries; writing the data records corresponding to the newly added data summaries to a message queue; and, upon receiving a data pull request sent by a first terminal, synchronizing the data records in the message queue to a target database according to the data pull request. This method can improve the efficiency of synchronizing large-scale data.

Description

Spark-based data synchronization method, apparatus, computer device, and storage medium

Technical field

This application relates to the field of computer technology, and in particular to a Spark-based data synchronization method, apparatus, computer device, and storage medium.

Background

Spark is a computing engine for large-scale data processing that is increasingly widely used thanks to advantages such as versatility and fast execution. Spark distributes the computing task for a massive data set (hereinafter a "Spark task") across multiple computer devices for execution, which makes task processing efficient. A Spark task can generate the business data needed by multiple business systems, for example the "related product push information" needed by an e-commerce system or the "birthday greeting messages" needed by a social platform. In response to a data request from a business system, the Spark task synchronously writes the generated business data to the corresponding business system. The business data generated by a Spark task is typically large in scale. Traditional data synchronization approaches, however, are suited only to small-scale data; for large-scale data their synchronization efficiency is low.

Summary of the invention

In view of the above technical problem, it is necessary to provide a Spark-based data synchronization method, apparatus, computer device, and storage medium that can improve the efficiency of synchronizing large-scale data.

A Spark-based data synchronization method, the method including: obtaining business data generated by a Spark task, the business data including multiple data records; generating a data summary for each data record; obtaining from a source database the historical summaries corresponding to multiple historical records; comparing each data summary with the stored historical summaries to obtain newly added data summaries; writing the data records corresponding to the newly added data summaries to a message queue; and, upon receiving a data pull request sent by a first terminal, synchronizing the data records in the message queue to a target database according to the data pull request.

In one embodiment, before obtaining the business data generated by the Spark task, the method further includes: receiving the Spark task and a corresponding parameter file submitted by a second terminal; reading the resource allocation parameters of the Spark task from the parameter file, and allocating physical resources according to the resource allocation parameters; executing the Spark task on the physical resources and monitoring the execution efficiency of the Spark task; adjusting the resource allocation parameters in the parameter file according to the monitoring result; and scheduling the Spark task for execution on physical resources that match the adjusted resource allocation parameters.

In one embodiment, adjusting the resource allocation parameters in the parameter file according to the monitoring result includes: comparing the execution efficiency against a threshold; calculating the total task amount and the task duration of the Spark task; if the execution efficiency is below the threshold, calculating the remaining task amount from the total task amount and the amount of task already executed, calculating the remaining duration from the task duration and the current time point, and estimating the physical resources that need to be added from the remaining task amount and the remaining duration; otherwise, calculating the resource utilization from the resource usage information recorded in the run information for two adjacent time points, and estimating the physical resources that need to be released from the resource utilization; and adjusting the resource allocation parameters according to the estimate.

In one embodiment, generating a data summary for each data record includes: extracting one or more current keywords from the data record to form a current keyword set; obtaining from the source database the historical keyword sets corresponding to multiple historical records; identifying whether there is a historical keyword set that matches the current keyword set; if so, extracting supplementary keywords from the data record; and building a keyword index from the extracted supplementary keywords and the current keywords, and using the keyword index as the data summary of the data record.

In one embodiment, the data pull request carries a system identifier and a user identifier, and synchronizing the data records in the message queue to the target database according to the data pull request includes: detecting, according to the user identifier, whether the message queue contains corresponding data records; if so, calling a data synchronization script, the data synchronization script containing multiple labels; obtaining the configuration file corresponding to the system identifier and replacing the labels in the data synchronization script based on the configuration file, thereby updating the data synchronization script; and executing the updated data synchronization script to synchronize the data records in the message queue that correspond to the user identifier to the target database.
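The label replacement step can be pictured with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `$label` syntax, the `sync` command line, the `SYSTEM_CONFIGS` table, and the field names are hypothetical and are not specified by the application.

```python
from string import Template

# Hypothetical data synchronization script containing labels ($host, $database, ...).
SYNC_SCRIPT = Template(
    "sync --host $host --db $database --table $table --user-id $user_id"
)

# Hypothetical per-system configuration, keyed by system identifier.
SYSTEM_CONFIGS = {
    "crm": {"host": "10.0.0.5", "database": "crm_db", "table": "push_info"},
}

def render_sync_script(script: Template, system_id: str, user_id: str) -> str:
    """Replace the labels in the script with values from the system's configuration."""
    config = SYSTEM_CONFIGS[system_id]
    # substitute() raises KeyError if any label has no value, so a
    # half-updated script is never returned.
    return script.substitute(config, user_id=user_id)

print(render_sync_script(SYNC_SCRIPT, "crm", "u42"))
```

Keeping the script generic and the per-system values in a separate configuration lets one script serve several target databases.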

In one embodiment, the data synchronization script includes a splitting script, and synchronizing the data records in the message queue that correspond to the user identifier to the target database includes: calculating the data volume of the data records corresponding to the user identifier; detecting whether the data volume exceeds a target data volume; if so, calling the splitting script to split the data records corresponding to the user identifier into multiple data groups; and calling multiple threads to synchronize the data groups to the target database.
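As a rough sketch of the split-then-synchronize idea, the following Python example measures data volume as a record count and uses a thread pool for the multithreaded write. Both choices, and the names `split_into_groups`, `sync_records`, and `write_group`, are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_groups(records, group_size):
    """Split the records into consecutive groups of at most group_size items."""
    return [records[i:i + group_size] for i in range(0, len(records), group_size)]

def sync_records(records, write_group, target_volume, group_size, max_workers=4):
    """Write records to the target; split and use threads only above target_volume.

    write_group is a caller-supplied function that writes one group to the
    target database. Returns the number of groups written.
    """
    if len(records) <= target_volume:
        write_group(records)
        return 1
    groups = split_into_groups(records, group_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(write_group, groups))
    return len(groups)
```

The caller is responsible for making `write_group` safe to call from several threads at once, for example by giving each call its own database connection.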

A Spark-based data synchronization apparatus, the apparatus including: a data screening module, configured to obtain business data generated by a Spark task, the business data including multiple data records, to generate a data summary for each data record, to obtain from a source database the historical summaries corresponding to multiple historical records, and to compare each data summary with the stored historical summaries to obtain newly added data summaries; a data storage module, configured to write the data records corresponding to the newly added data summaries to a message queue; and a data synchronization module, configured to, upon receiving a data pull request sent by a first terminal, synchronize the data records in the message queue to a target database according to the data pull request.

In one embodiment, the apparatus further includes a resource allocation module, configured to: receive the Spark task and a corresponding parameter file submitted by a second terminal; read the resource allocation parameters of the Spark task from the parameter file and allocate physical resources according to the resource allocation parameters; execute the Spark task on the physical resources and monitor the execution efficiency of the Spark task; adjust the resource allocation parameters in the parameter file according to the monitoring result; and schedule the Spark task for execution on physical resources that match the adjusted resource allocation parameters.

A computer device, including a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program: obtaining business data generated by a Spark task, the business data including multiple data records; generating a data summary for each data record; obtaining from a source database the historical summaries corresponding to multiple historical records; comparing each data summary with the stored historical summaries to obtain newly added data summaries; writing the data records corresponding to the newly added data summaries to a message queue; and, upon receiving a data pull request sent by a first terminal, synchronizing the data records in the message queue to a target database according to the data pull request.

A computer-readable storage medium storing a computer program, wherein the computer program implements the following steps when executed by a processor: obtaining business data generated by a Spark task, the business data including multiple data records; generating a data summary for each data record; obtaining from a source database the historical summaries corresponding to multiple historical records; comparing each data summary with the stored historical summaries to obtain newly added data summaries; writing the data records corresponding to the newly added data summaries to a message queue; and, upon receiving a data pull request sent by a first terminal, synchronizing the data records in the message queue to a target database according to the data pull request.

With the above Spark-based data synchronization method, apparatus, computer device, and storage medium, a data summary can be generated for each of the data records in the business data produced by a Spark task. By comparing each data summary with the historical summaries of the historical records stored in the source database, the newly added data summaries can be obtained. Because the data records corresponding to the newly added data summaries are written to a message queue, a data pull request sent by a first terminal can be answered from the message queue, and only the data records in the message queue are synchronized to the target database of the business system. Since only the newly added portion of the large-scale business data is synchronized to the target database, rather than all of the generated business data, the volume of data to be synchronized is reduced and the synchronization efficiency for large-scale data improves. In addition, building a data summary for each data record and screening for newly added records on the basis of the summaries reduces the volume of data that must be compared, which improves the efficiency of comparing business data and, in turn, the efficiency of data synchronization.

Brief description of the drawings

Fig. 1 is an application scenario diagram of the Spark-based data synchronization method in one embodiment;

Fig. 2 is a flow diagram of the Spark-based data synchronization method in one embodiment;

Fig. 3 is a flow diagram of the physical resource allocation step for a Spark task in one embodiment;

Fig. 4 is a structural block diagram of the Spark-based data synchronization apparatus in one embodiment;

Fig. 5 is an internal structure diagram of the computer device in one embodiment.

Detailed description

To make the purpose, technical solution, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and are not intended to limit it.

The Spark-based data synchronization method provided by this application can be applied in the application environment shown in Fig. 1. A first terminal 102 communicates with a server 104 over a network, and a second terminal 106 likewise communicates with the server 104 over a network. The first terminal 102 and the second terminal 106 may each be, but are not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device, and the server 104 may be implemented as a server cluster composed of multiple servers. One or more business systems are deployed on the first terminal 102. The first terminal 102 and the second terminal 106 may be the same terminal or different terminals. The server 104 receives a Spark task submitted by the second terminal 106 and, by executing the Spark task, generates the business data needed by a business system. The business data includes multiple data records. The server 104 generates a data summary for each data record. The server 104 hosts the database corresponding to the Spark task (hereinafter the "source database"), which stores the historical summaries corresponding to multiple historical records. The server 104 compares the data summary of each data record with the stored historical summaries to obtain the newly added data summaries. Based on the newly added data summaries, the server 104 obtains the corresponding data records, creates a message queue, and writes the obtained data records to the message queue. When a user needs the business data, the corresponding business system on the first terminal 102 can send a data pull request to the server 104. The business system has a corresponding database (hereinafter the "target database"). The data pull request carries a system identifier and a user identifier. The server 104 extracts the corresponding data records from the message queue according to the user identifier and synchronizes the extracted data records to the target database corresponding to the system identifier. In this data synchronization process, only the newly added portion of the large-scale business data is synchronized to the target database, rather than all of the generated business data, which reduces the volume of data to be synchronized and thus improves the synchronization efficiency for large-scale data.

In one embodiment, as shown in Fig. 2, a Spark-based data synchronization method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:

Step 202: obtain the business data generated by the Spark task; the business data includes multiple data records.

The server cluster composed of multiple servers includes a master node (Master) and multiple worker nodes (Worker). Task scheduling personnel submit Spark tasks to the master node from the second terminal using the spark-submit command. A task scheduling platform is deployed on the master node to schedule and execute the Spark tasks submitted by multiple second terminals. The task scheduling platform starts one Driver process for each Spark task, launches the Spark task through the Driver process, and allocates physical resources such as memory and CPU to the Spark task. In other words, the task scheduling platform starts a certain number of Executor processes on the worker nodes of the cluster and executes the Spark task through those Executor processes.

The Spark task generates business data according to preset business logic, for example related product push information or birthday greetings. The business data has a corresponding system identifier, which indicates which business system the business data applies to, that is, which business system is authorized to use it. The business data includes multiple data records. Each data record has a corresponding user identifier, which indicates which user of the business system the data record applies to, that is, which user is authorized to use it.

Step 204: generate a data summary for each data record.

The business data generated by a Spark task is typically large in scale. To make the business data easier to search and analyze, the server generates a data summary for each data record. A data summary is brief information that identifies the corresponding data record; it may be a hash value, a keyword index, or the like.

In one embodiment, generating a data summary for each data record includes: extracting multiple keywords from the data record; calculating the hash value of each extracted keyword; and applying a preset logical operation to the hash values of the keywords and using the result as the data summary of the data record. The preset logical operation may be a hash operation or an arithmetic operation, among others.
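The hash-based summary can be sketched in Python as follows. The application does not name a hash function or a specific combining operation, so the use of SHA-256 truncated to 64 bits and of XOR as the "preset logical operation" are assumptions chosen only to make the example concrete; the function names are hypothetical.

```python
import hashlib
from functools import reduce
from operator import xor

def keyword_hash(keyword: str) -> int:
    """Hash one keyword to a 64-bit integer (truncated SHA-256; an assumed choice)."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def data_summary(keywords: list[str]) -> int:
    """Combine the keyword hashes with XOR (assumed) into one summary value."""
    return reduce(xor, (keyword_hash(k) for k in keywords), 0)

# The summary does not depend on the order in which keywords were extracted.
print(data_summary(["alice", "birthday"]) == data_summary(["birthday", "alice"]))
```

XOR makes the summary order-independent, but a keyword that appears twice cancels itself out, so this sketch presumes the extracted keywords are distinct.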

Step 206: obtain from the source database the historical summaries corresponding to multiple historical records.

The Spark task has a corresponding source database, which may be a Hive database. The business data generated by the Spark task is stored in the source database. Data records already stored in the source database are historical records, and their data summaries are historical summaries. Historical summaries may be generated in the manner described above. As is readily understood, the source database stores all of the business data that the Spark task has generated at different times.

Step 208: compare each data summary with the stored historical summaries to obtain the newly added data summaries.

The server compares each data summary in the business data with each historical summary in the source database one by one, and treats any data summary that matches no historical summary as a newly added data summary. A match means that the content of the historical summary and the data summary is the same or similar. To improve comparison efficiency, the historical summaries in the source database can be sorted in advance according to past comparison results. For example, historical summaries that have matched data summaries many times in the past can be placed first in the comparison order.
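A minimal sketch of this screening step in Python is given below. It handles exact matches only and replaces the one-by-one comparison with a set lookup, which is a substitution made for brevity rather than the procedure described above; similarity matching and priority ordering are left out. The function and parameter names are hypothetical.

```python
def find_new_records(records, historical_summaries, summarize):
    """Return (summary, record) pairs whose summary matches no historical summary.

    summarize is a caller-supplied function that maps a record to a hashable
    summary. Only exact matches are detected in this sketch.
    """
    seen = set(historical_summaries)
    new_pairs = []
    for record in records:
        summary = summarize(record)
        if summary not in seen:
            new_pairs.append((summary, record))
            seen.add(summary)  # also filters duplicates within the current batch
    return new_pairs

# Records are plain strings here and the summary is the string itself.
print(find_new_records(["a", "bb", "a", "ccc"], {"a"}, lambda r: r))
# [('bb', 'bb'), ('ccc', 'ccc')]
```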

Step 210: write the data records corresponding to the newly added data summaries to the message queue.

The Spark task has a corresponding message queue, which is responsible for receiving, storing, and forwarding business data. The server obtains the data records corresponding to the newly added data summaries and stores them in the message queue. It also stores the newly added data summaries and their data records in the source database, that is, it performs a full or incremental update of the historical records and historical summaries stored in the source database.

Step 212: upon receiving a data pull request sent by the first terminal, synchronize the data records in the message queue to the target database according to the data pull request.

When a user needs the business data, the corresponding business system on the first terminal can send a data pull request to the server. The data pull request carries a system identifier and a user identifier. The business system has a corresponding target database, which may be a SQL Server, Oracle, or MySQL database. The server extracts the corresponding data records from the message queue according to the user identifier and synchronizes the extracted data records to the target database corresponding to the system identifier.
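The pull handling can be illustrated with a small in-memory Python sketch. A `deque` stands in for the message queue and a dictionary of lists stands in for the target databases; the class name `SyncServer`, its methods, and the decision to remove records from the queue once they are synchronized are all assumptions, since the application does not specify them.

```python
from collections import defaultdict, deque

class SyncServer:
    """In-memory stand-in for the server side of the pull-based synchronization."""

    def __init__(self):
        self.queue = deque()                  # stand-in for the message queue
        self.target_dbs = defaultdict(list)   # system identifier -> synced records

    def enqueue(self, user_id, record):
        """Write one newly added data record to the message queue."""
        self.queue.append({"user_id": user_id, "record": record})

    def handle_pull(self, system_id, user_id):
        """Move the user's records from the queue to the system's target database.

        Returns the number of records synchronized.
        """
        matched = [m for m in self.queue if m["user_id"] == user_id]
        if not matched:
            return 0
        # Assumption: synchronized records are removed from the queue.
        self.queue = deque(m for m in self.queue if m["user_id"] != user_id)
        self.target_dbs[system_id].extend(m["record"] for m in matched)
        return len(matched)

server = SyncServer()
server.enqueue("u1", "birthday greeting")
server.enqueue("u2", "product push")
print(server.handle_pull("crm", "u1"), server.target_dbs["crm"])
# 1 ['birthday greeting']
```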

In this embodiment, a data summary can be generated for each of the data records in the business data produced by the Spark task. By comparing each data summary with the historical summaries of the historical records stored in the source database, the newly added data summaries can be obtained. Because the data records corresponding to the newly added data summaries are written to the message queue, a data pull request sent by the first terminal can be answered from the message queue, and only the data records in the message queue are synchronized to the target database of the business system. Since only the newly added portion of the large-scale business data is synchronized to the target database, rather than all of the generated business data, the volume of data to be synchronized is reduced and the synchronization efficiency for large-scale data improves. Building a data summary for each data record and screening for newly added records on the basis of the summaries reduces the volume of data that must be compared, which improves the efficiency of comparing business data and, in turn, the efficiency of data synchronization.

In one embodiment, before the business data generated by the Spark task is obtained, the method further includes a step of allocating physical resources to the Spark task. As shown in Fig. 3, this step includes:

Step 302: receive the Spark task and the corresponding parameter file submitted by the second terminal.

The business logic script of the Spark task includes a Shell script. Task scheduling personnel record the resource allocation parameters of the Spark task in a parameter file and preset, in the Shell script, a callback function for the parameter file. The resource allocation parameters may be estimated in advance by the task scheduling personnel from the task amount of the Spark task. The task scheduling personnel submit the Spark task and the corresponding configuration file to the master node from the second terminal using the spark-submit command. The task scheduling platform deployed on the master node stores the parameter file separately, independent of the Spark task, and starts one Driver process for each Spark task. Depending on the preset deploy mode (deploy-mode), the Driver process starts the Spark task either locally or on a worker node in the cluster.

Step 304: read the resource allocation parameters of the Spark task from the parameter file, and allocate physical resources according to the resource allocation parameters.

The task scheduling platform launches the Spark task through the Driver process and allocates physical resources to the Spark task. Specifically, the Driver process calls the Shell script of the Spark task, generates a callback instruction for the configuration file, and reads the resource allocation parameters in the parameter file according to the callback instruction. Based on the resource allocation parameters it has read, the Driver process requests from the cluster manager the physical resources needed to run the Spark task. The cluster manager may be a Spark Standalone cluster, a YARN resource management cluster, or the like. Physical resources refer to memory, CPU, and so on. According to the resource allocation parameters, the cluster manager starts a certain number of Executor processes on the worker nodes of the cluster. As is readily understood, the Driver process and each Executor process themselves also occupy some physical resources.
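To show how externally stored resource allocation parameters might be turned into a resource request, the following Python sketch reads a parameter file and builds a `spark-submit` command line from it. The JSON file format and the key names are assumptions; the application does not define the parameter file layout, and it describes the Driver process reading the file through a Shell callback rather than a wrapper like this one. The flags used (`--num-executors`, `--executor-memory`, `--executor-cores`, `--driver-memory`) are standard `spark-submit` options.

```python
import json

# Assumed mapping from parameter-file keys to spark-submit flags.
FLAG_BY_PARAM = {
    "num_executors": "--num-executors",
    "executor_memory": "--executor-memory",
    "executor_cores": "--executor-cores",
    "driver_memory": "--driver-memory",
}

def build_submit_command(param_text: str, app_path: str) -> list[str]:
    """Build a spark-submit argument list from the text of a JSON parameter file."""
    params = json.loads(param_text)
    command = ["spark-submit"]
    for key, flag in FLAG_BY_PARAM.items():
        if key in params:
            command += [flag, str(params[key])]
    command.append(app_path)
    return command

print(build_submit_command('{"num_executors": 4, "executor_memory": "2g"}', "job.py"))
# ['spark-submit', '--num-executors', '4', '--executor-memory', '2g', 'job.py']
```

Because the parameters live outside the task's own script, changing them only requires editing the file and rebuilding the command.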

Step 306: execute the Spark task on the physical resources, and monitor the execution efficiency of the Spark task.

After the physical resources needed to execute the Spark task have been obtained, the task scheduling platform begins scheduling and executing the Spark task through the Driver process. Specifically, the Driver process splits the Spark task into multiple asynchronously executed task groups (stages), each of which contains multiple basic tasks (tasks) that are executed asynchronously and/or concurrently. The Driver process assigns the tasks of one stage to multiple Executor processes for execution. A task is the smallest unit of execution. The result of each task is stored in the memory of its Executor process or in a disk file on the worker node where it runs. When all tasks of the current stage have finished, the Driver process writes the intermediate results to local disk files on each worker node and schedules the next stage. This cycle repeats until the entire Spark task has been executed.

While the Spark task is executing, the task scheduling platform monitors its execution efficiency through the Driver process, that is, it calculates the execution speed of the tasks. As is readily understood, the execution speed of a task is directly related to physical resources such as the number of CPU cores of the corresponding Executor process. In general, one CPU core executes one thread at a time. When physical resources are sufficient and multiple tasks are assigned to an Executor process, multiple threads can be used to execute those tasks concurrently, which improves the execution efficiency of the Spark task.

Step 308: adjust the resource allocation parameters in the parameter file according to the monitoring result.

The task scheduling platform compares, through the Driver process, whether the execution efficiency is below a threshold. The threshold can be set freely according to actual needs and may also change dynamically; no limitation is placed on it. If the execution efficiency is below the threshold, the current Spark task is at risk of having insufficient physical resources. The task scheduling platform generates a stop instruction and sends it to the corresponding worker nodes to terminate the corresponding Driver process and Executor processes. The task scheduling platform then estimates the physical resources that need to be added and adjusts the resource allocation parameters recorded in the parameter file of the Spark task according to the estimate. If the execution efficiency is greater than or equal to the threshold, the current Spark task has no risk, or only a low risk, of insufficient physical resources. The task scheduling platform then determines whether the physical resources allocated to the Spark task include idle resources, estimates the physical resources that need to be released, and adjusts the resource allocation parameters recorded in the parameter file of the Spark task according to the estimate.

Step 310: schedule the Spark task for execution on physical resources that match the adjusted resource allocation parameters.

Based on the adjusted resource allocation parameters, the task scheduling platform starts a new Driver process for the Spark task and uses the newly started Driver process to allocate physical resources to the Spark task again in the manner described above, that is, to restart a certain number of Executor processes on multiple worker nodes of the cluster. The Driver process schedules the Spark task for execution on the physical resources that match the adjusted resource allocation parameters, that is, it sends the tasks into which the Spark task is split to the newly allocated Executor processes for execution. The task scheduling platform continues to monitor the execution efficiency of the Spark task through the Driver process and to adjust the resource allocation parameters according to the execution efficiency until the Spark task has finished executing.

Traditionally, resource allocation parameters are hard-coded in the Shell script of the Spark task, so they can be changed only when the Spark task undergoes a version update. This makes the resource allocation parameters inconvenient to modify, which in turn affects the running efficiency and results of the Spark task.

In this embodiment, because the resource allocation parameters are stored separately in a parameter file, independent of the Spark task itself, they can be changed freely and flexibly without being tied to version updates of the Spark task. The execution efficiency of the Spark task is monitored in real time, and the allocated physical resources are adjusted dynamically according to the execution efficiency, so that they match the actual physical resource demand of the Spark task and the execution efficiency of the Spark task can be improved.

In one embodiment, adjusting the resource allocation parameters in the configuration file according to the monitoring result includes: comparing the execution efficiency against a threshold; calculating the total task amount and the task duration of the Spark task; if the execution efficiency is below the threshold, calculating the remaining task amount from the total task amount and the amount of task already executed, calculating the remaining duration from the task duration and the current time point, and estimating the physical resources that need to be added from the remaining task amount and the remaining duration; otherwise, calculating the resource utilization from the resource usage information recorded in the run information for two adjacent time points, and estimating the physical resources that need to be released from the resource utilization; and adjusting the resource allocation parameters according to the estimate.

The task scheduling platform can adjust the resource allocation parameters automatically, through the Driver process, according to the monitoring result for the execution efficiency of the Spark task. Specifically, the Driver process compares whether the execution efficiency is below the threshold. If so, the Driver process calculates the remaining task amount from the estimated total task amount of the Spark task and the amount already executed, and calculates the remaining duration from the estimated task duration needed to execute the Spark task and the current time point. From the remaining task amount and the remaining duration, the Driver process calculates the target execution efficiency of the Spark task. The Driver process reads the resource allocation parameters recorded in the configuration file and, from the monitored actual execution efficiency of the Spark task at the current time and the physical resources currently allocated, determines the target physical resources needed to reach the target execution efficiency. As is readily understood, the difference between the target physical resources and the allocated physical resources is the physical resources that need to be added. The Driver process adjusts the resource allocation parameters recorded in the parameter file according to the target physical resources.
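The estimate of resources to add can be worked through with a short Python sketch. The application does not state how target physical resources follow from the target and actual execution efficiency; the sketch assumes that efficiency scales linearly with the number of executors, and it counts physical resources in whole executors. The function name and parameters are hypothetical.

```python
import math

def executors_to_add(total_tasks, done_tasks, task_duration, elapsed,
                     actual_efficiency, allocated_executors):
    """Estimate how many executors to add so the task finishes on time.

    Efficiency is measured in tasks per unit time. Linear scaling of
    efficiency with executor count is an assumption of this sketch.
    """
    remaining_tasks = total_tasks - done_tasks
    remaining_time = task_duration - elapsed
    if remaining_time <= 0 or actual_efficiency <= 0:
        raise ValueError("remaining time and actual efficiency must be positive")
    target_efficiency = remaining_tasks / remaining_time
    target_executors = math.ceil(
        allocated_executors * target_efficiency / actual_efficiency
    )
    return max(0, target_executors - allocated_executors)

# 800 tasks remain with 40 time units left, so the target efficiency is 20 per unit.
# At an actual efficiency of 5 per unit on 4 executors, 16 executors are needed.
print(executors_to_add(1000, 200, 100, 60, 5.0, 4))  # 12
```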

The operation information of the Spark task collected by the preset task running monitoring component also includes resource usage information of the Spark task, such as CPU usage and remaining memory capacity. If the execution efficiency is greater than or equal to the threshold, the Driver process calculates the resource utilization of the physical resources from the resource usage information collected at two adjacent time nodes, and judges from the resource utilization whether the allocated physical resources include idle physical resources. The Driver process reads the resource allocation parameters recorded in the configuration file and determines, from the resource allocation parameters and the resource utilization, the idle physical resources that need to be released. The Driver process adjusts the resource allocation parameters recorded in the parameter file according to the idle physical resources.

In this embodiment, the resource allocation parameters are adjusted automatically according to the result of monitoring the execution efficiency of the Spark task. Physical resources are added promptly when the execution efficiency is less than the threshold, which safeguards the execution efficiency and the execution success rate of the Spark task; physical resources are released promptly when the execution efficiency is greater than or equal to the threshold, which improves physical resource utilization and reduces waste of physical resources.
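By way of illustration only, the two branches of this adjustment can be sketched in Python. The specification gives no formulas, so everything concrete here is an assumption of the sketch: the names `Snapshot` and `plan_adjustment`, the choice of executors as the unit of physical resource, tasks per second as the unit of execution efficiency, and above all the premise that execution efficiency grows linearly with the number of executors.

```python
import math
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Monitoring data for one Spark task at the current time node (hypothetical shape)."""
    total_tasks: int        # estimated total task amount
    done_tasks: int         # task amount executed so far
    task_duration_s: float  # estimated duration the whole task should take
    elapsed_s: float        # time elapsed up to the current time node
    efficiency: float       # measured execution efficiency in tasks per second (> 0)
    executors: int          # physical resources currently allocated (> 0)
    utilization: float      # utilization between two adjacent time nodes, 0..1


def plan_adjustment(s: Snapshot, threshold: float) -> int:
    """Return the change in executor count: positive to add, negative to release."""
    if s.efficiency < threshold:
        remaining_tasks = s.total_tasks - s.done_tasks
        remaining_s = max(s.task_duration_s - s.elapsed_s, 1.0)
        target_efficiency = remaining_tasks / remaining_s
        # Assumption: efficiency scales linearly with the number of executors.
        per_executor = s.efficiency / s.executors
        target_executors = math.ceil(target_efficiency / per_executor)
        # Resources to add = target resources minus allocated resources.
        return max(target_executors - s.executors, 0)
    # Efficiency is sufficient: release the executors that utilization shows to be idle,
    # always keeping at least one.
    busy = max(math.ceil(s.executors * s.utilization), 1)
    return -(s.executors - busy)
```

With 800 tasks left and 40 seconds remaining, a task running at 5 tasks per second on 4 executors would need 16 executors under the linear assumption, so the sketch asks for 12 more. A real scheduler would write the result back into the resource allocation parameters rather than return a bare number.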

In one embodiment, generating the data summary corresponding to each data record includes: extracting one or more current keywords from the data record to form a current keyword set; obtaining, from the source database, the historical keyword sets respectively corresponding to a plurality of historical records; identifying whether there is a historical keyword set that matches the current keyword set; if so, extracting a supplementary keyword from the data record; and establishing a keyword index from the extracted supplementary keyword and the current keywords, and using the keyword index as the data summary of the data record.

In the traditional approach, data are compared one by one against the reference data, which lowers comparison efficiency when the volume of reference data is large. To solve this technical problem, the server in this embodiment establishes a corresponding keyword index for each data record in the business data and compares data on the basis of the keyword index. Specifically, the server extracts one or more current keywords from the data record to form the current keyword set corresponding to each data record. The source database stores the keyword indexes respectively corresponding to a plurality of historical records; the keyword index corresponding to a historical record is its historical keyword set. The server identifies whether there is a historical keyword set that matches the current keyword set. If there is, the server extracts a supplementary keyword from the data record so that the record can be distinguished from the historical keyword set. A supplementary keyword may be a word in the data record other than the current keywords. The server establishes the keyword index from the extracted supplementary keyword and the current keywords, and uses the keyword index as the data summary of the data record.

In this embodiment, a keyword index is built for each data record and newly added data records are screened on the basis of the keyword index, which reduces the amount of data that has to be compared and thereby improves the comparison efficiency for business data; building the keyword index from the supplementary keyword and the extracted current keywords helps ensure that the keyword index identifies its data record.
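A minimal Python sketch of this keyword-index construction follows. It is illustrative only: the specification does not say how keywords are chosen, so the toy extractor (the first distinct words of the record), the function names `extract_keywords` and `build_keyword_index`, and the representation of a keyword set as a `frozenset` are all assumptions.

```python
import re


def extract_keywords(record: str, limit: int) -> list[str]:
    """Toy extractor (assumption): the first `limit` distinct words, in order of appearance."""
    seen: list[str] = []
    for word in re.findall(r"\w+", record.lower()):
        if word not in seen:
            seen.append(word)
        if len(seen) == limit:
            break
    return seen


def build_keyword_index(record: str, history: set[frozenset[str]], base: int = 2) -> frozenset[str]:
    """Build a keyword index that differs from every historical keyword set where possible."""
    index = frozenset(extract_keywords(record, base))
    if index in history:
        # A historical keyword set matches: add supplementary keywords, i.e. words of
        # the record that are not current keywords, until the index becomes distinct.
        for word in extract_keywords(record, limit=10**6):
            if word in index:
                continue
            index = index | {word}
            if index not in history:
                break
    return index
```

For example, if the history already holds the set {"birthday", "greeting"}, the record "Birthday greeting for Alice" first produces that same set and is then extended with the supplementary keyword "for", whereas "Product push for Bob" is indexed by its two current keywords alone.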

In one embodiment, the data pull request carries a system identifier and a user identifier, and synchronizing the data records in the message queue to the target database according to the data pull request includes: detecting, according to the user identifier, whether corresponding data records exist in the message queue; if so, calling a data synchronization script, the data synchronization script including a plurality of labels; obtaining the configuration file corresponding to the system identifier, and replacing the labels in the data synchronization script on the basis of the configuration file so as to update the data synchronization script; and, by executing the updated data synchronization script, synchronizing the data records in the message queue that correspond to the user identifier to the target database.

In the traditional approach, before a data synchronization tool is used to synchronize data between different business systems, the user has to write a different data synchronization script for each business system in advance. In fact, the data synchronization scripts of different business systems are similar; if many business systems need data synchronization, the user has to repeat a large amount of work, which wastes manpower and lowers data synchronization efficiency. To reduce user operations, this embodiment writes a general data synchronization script in advance and stores it on the server. The general data synchronization script includes a first synchronization script and a second synchronization script.

The first synchronization script includes at least one label in a preset format. The preset format means that a preset mark is placed on at least one side of the label. The preset mark may be "#", "@", "*" and so on; accordingly, a label in the preset format may be "#ABC#", "@DEF", "GHI*" and so on. When a user needs to synchronize data, the user may set configuration information for the business system on the first terminal by means of page configuration; the first terminal generates a configuration file from the configuration information and sends the configuration file to the server. The configuration information includes a plurality of labels and their corresponding replacement information. The replacement information includes the Spark task identifier, the user identifier, the system identifier, the connection string of the target database, and so on.

The server calls the general data synchronization script, identifies the labels in the first synchronization script by the preset mark, and looks up the replacement information corresponding to each label in the configuration information. The server obtains the connection string, verification identifier and so on of the message queue corresponding to the Spark task identifier. The verification identifier may be a user name and password. According to the configuration information, the server replaces each label with the corresponding replacement information so as to update the first synchronization script.
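The label replacement described above can be sketched in Python as follows. This is a hedged illustration, not the applicant's script: the regular expression used to recognise labels, the function name `render_script`, and the sample template and configuration values are all hypothetical.

```python
import re

# A label carries a preset mark ("#", "@" or "*") on at least one side,
# e.g. "#ABC#", "@DEF" or "GHI*". This pattern is an assumption of the sketch.
LABEL = re.compile(r"[#@*]\w+[#@*]?|\w+[#@*]")


def render_script(template: str, config: dict[str, str]) -> str:
    """Replace every label found in the template with its replacement information."""
    def substitute(match: re.Match) -> str:
        label = match.group(0)
        if label not in config:
            raise KeyError(f"no replacement configured for label {label!r}")
        return config[label]

    return LABEL.sub(substitute, template)


# Hypothetical first synchronization script and configuration information.
template = "sync --queue #QUEUE# --user @USER --target DSN*"
config = {"#QUEUE#": "spark_task_42", "@USER": "u1001", "DSN*": "db://target"}
updated = render_script(template, config)
```

Failing loudly on a label with no configured replacement is a design choice of the sketch; it prevents a half-updated script from being executed against the target database.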

The second synchronization script includes a table-building script and a synchronization script. The server establishes a connection with the target database according to the connection string of the target database and reads the field information of the data records in the message queue corresponding to the Spark task identifier. Different field information calls for different table-building and synchronization scripts. The server generates the corresponding table-building statements and synchronization statements from the field information, writes the generated table-building statements into the corresponding table-building script, and writes the generated synchronization statements into the corresponding synchronization script, so as to update the second synchronization script. By executing the updated data synchronization script, the server synchronizes the business data from the message queue to the target database.
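As a rough sketch of generating table-building and synchronization statements from field information, the Python below uses an in-memory SQLite database as a stand-in for the target database. The specification names no SQL dialect, so the statement shapes, the function names `build_statements` and `sync_records`, and the sample table and fields are assumptions.

```python
import sqlite3


def build_statements(table: str, fields: dict[str, str]) -> tuple[str, str]:
    """Generate a table-building statement and a parameterised synchronization statement.

    `table` and the field names are interpolated directly, so they must come from
    trusted configuration, never from user input.
    """
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in fields.items())
    create = f"CREATE TABLE IF NOT EXISTS {table} ({columns})"
    placeholders = ", ".join("?" for _ in fields)
    insert = f"INSERT INTO {table} ({', '.join(fields)}) VALUES ({placeholders})"
    return create, insert


def sync_records(conn: sqlite3.Connection, table: str,
                 fields: dict[str, str], records: list[dict]) -> int:
    """Create the table if needed, write the records, and return the row count."""
    create, insert = build_statements(table, fields)
    conn.execute(create)
    conn.executemany(insert, [tuple(r[name] for name in fields) for r in records])
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Hypothetical field information read from the data records in the message queue.
fields = {"user_id": "TEXT", "message": "TEXT"}
records = [{"user_id": "u1001", "message": "happy birthday"},
           {"user_id": "u1002", "message": "related product"}]
conn = sqlite3.connect(":memory:")
rows = sync_records(conn, "push_info", fields, records)
```

Because both statements are derived from the same field mapping, a change in the field information changes the table definition and the insert statement together.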

In this embodiment, a general data synchronization script is written in advance and a corresponding configuration file is added for each business system that needs data synchronization. When business data have to be synchronized to different business systems, the general data synchronization script only needs to be updated according to the corresponding configuration file; executing the updated script synchronizes the business data from the message queue to the target database. This reduces user operations and improves data synchronization efficiency.

In one embodiment, the data synchronization script includes a splitting script, and synchronizing the data records in the message queue that correspond to the user identifier to the target database includes: calculating the data volume of the data records corresponding to the user identifier; detecting whether the data volume exceeds a target data volume; if so, calling the splitting script to split the data records corresponding to the user identifier into a plurality of data groups; and calling multiple threads to synchronize the plurality of data groups to the target database.

The second synchronization script further includes a splitting script. The server calculates the data volume of the data records in the message queue that correspond to the user identifier and compares whether the data volume exceeds a threshold. If so, the server calls the splitting script to split the data records into a plurality of data groups; in other words, it groups the data records. Specifically, the server obtains the target data volume. The target data volume may be preset, or may be generated on the fly from the result of monitoring the server's current load. The server determines the split position of each data group according to the target data volume. For example, if the target data volume is 80M, the position at 80M is marked as the first split position, the position at 160M as the second split position, and so on.

The server detects whether each split position falls between adjacent separators. When a split position is exactly at a separator, the server splits at that position; when a split position falls between adjacent separators, the server splits at either of the adjacent separators, that is, at the preceding separator or at the following separator, and thereby obtains a plurality of data groups. The server calls multiple threads to synchronize the data groups to the target database, which improves the synchronization efficiency for business data.

In this embodiment, business data with a large data volume are split into a plurality of data groups that are synchronized by multiple threads, which improves data synchronization efficiency; determining the split positions from the target data volume and the separators avoids splitting one data record across different data groups and so preserves data integrity.
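The separator-aware splitting and the multithreaded synchronization can be sketched in Python as below. Several simplifications are assumptions of the sketch rather than statements of the specification: records are bytes delimited by a newline separator, a split position that falls between separators is always moved to the following separator, each split position is measured from the start of the current group rather than as an absolute multiple of the target data volume, and `write_group` is a hypothetical callable standing in for the write to the target database.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def split_groups(data: bytes, target: int, sep: bytes = b"\n") -> list[bytes]:
    """Split `data` into groups of roughly `target` bytes, cutting only after a separator."""
    groups, start = [], 0
    while len(data) - start > target:
        # Look for a separator that ends at or after the nominal split position.
        cut = data.find(sep, start + target - len(sep))
        if cut == -1:
            break  # no further separator: the rest forms the last group
        cut += len(sep)
        groups.append(data[start:cut])
        start = cut
    if start < len(data):
        groups.append(data[start:])
    return groups


def sync_groups(groups: list[bytes], write_group: Callable[[bytes], int],
                workers: int = 4) -> list[int]:
    """Synchronize the data groups concurrently; results keep the order of `groups`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(write_group, groups))


# Four 5-byte records; a target of 8 bytes falls inside the second record,
# so the cut moves to the separator that follows it.
data = b"aaaa\nbbbb\ncccc\ndddd\n"
groups = split_groups(data, target=8)
written = sync_groups(groups, len)  # `len` stands in for a real database write
```

No record is ever divided between two groups, because every cut lands immediately after a separator; concatenating the groups reproduces the original data.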

It should be understood that although the steps in the flowcharts of Fig. 2 and Fig. 3 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in Fig. 2 and Fig. 3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in Fig. 4, a Spark-based data synchronization apparatus is provided, including a data screening module 402, a data storage module 404 and a data synchronization module 406, wherein:

The data screening module 402 is configured to obtain the business data generated by a Spark task, the business data including a plurality of data records; generate the data summary corresponding to each data record; obtain, from the source database, the historical summaries respectively corresponding to a plurality of historical records; and compare each data summary with the stored historical summaries to obtain the newly added data summaries.

The data storage module 404 is configured to write the data records corresponding to the newly added data summaries into a message queue.

The data synchronization module 406 is configured to, when a data pull request sent by a first terminal is received, synchronize the data records in the message queue to a target database according to the data pull request.
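The cooperation of the three modules can be illustrated with a short, single-process Python sketch. It is not the claimed apparatus: a SHA-256 hash stands in for the data summary, `queue.Queue` stands in for the message queue, a plain list stands in for the target database, and the function names are hypothetical.

```python
import hashlib
import queue


def summary(record: str) -> str:
    """Data summary of one record; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def screen(records: list[str], historical: set[str]) -> list[str]:
    """Data screening: keep the records whose summary is not a historical summary."""
    return [r for r in records if summary(r) not in historical]


def store(new_records: list[str], mq: queue.Queue) -> None:
    """Data storage: write the newly added records into the message queue."""
    for record in new_records:
        mq.put(record)


def synchronize(mq: queue.Queue, target: list[str]) -> int:
    """Data synchronization: on a pull request, drain the queue into the target."""
    moved = 0
    while True:
        try:
            target.append(mq.get_nowait())
        except queue.Empty:
            return moved
        moved += 1


historical = {summary("order-1")}        # summaries already held by the source database
mq: queue.Queue = queue.Queue()
store(screen(["order-1", "order-2", "order-3"], historical), mq)
target: list[str] = []
moved = synchronize(mq, target)          # triggered by a data pull request
```

Only the two records without a matching historical summary reach the queue, which is the point of the screening step: the target database receives the increment rather than the full business data.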

In one embodiment, the apparatus further includes a resource allocation module 408, configured to receive a Spark task and a corresponding parameter file submitted by a second terminal; read the resource allocation parameters of the Spark task from the parameter file, and allocate physical resources according to the resource allocation parameters; execute the Spark task on the physical resources and monitor the execution efficiency of the Spark task; adjust the resource allocation parameters in the parameter file according to the monitoring result; and schedule the Spark task for execution on the physical resources that match the adjusted resource allocation parameters.

In one embodiment, the resource allocation module 408 is further configured to compare whether the execution efficiency is less than a threshold; calculate the total task amount and the task duration corresponding to the Spark task; if so, calculate the remaining task amount from the total task amount and the executed task amount; calculate the remaining duration from the task duration and the current time node; estimate, from the remaining task amount and the remaining duration, the physical resources that need to be added; otherwise, calculate the resource utilization from the resource usage information that the operation information records for two adjacent time nodes; estimate, from the resource utilization, the physical resources that need to be released; and adjust the resource allocation parameters according to the estimation result.

In one embodiment, the data screening module 402 is further configured to extract one or more current keywords from the data record to form a current keyword set; obtain, from the source database, the historical keyword sets respectively corresponding to a plurality of historical records; identify whether there is a historical keyword set that matches the current keyword set; if so, extract a supplementary keyword from the data record; and establish a keyword index from the extracted supplementary keyword and the current keywords, and use the keyword index as the data summary of the data record.

In one embodiment, the data pull request carries a system identifier and a user identifier, and the data synchronization module 406 is further configured to detect, according to the user identifier, whether corresponding data records exist in the message queue; if so, call a data synchronization script, the data synchronization script including a plurality of labels; obtain the configuration file corresponding to the system identifier, and replace the labels in the data synchronization script on the basis of the configuration file so as to update the data synchronization script; and, by executing the updated data synchronization script, synchronize the data records in the message queue that correspond to the user identifier to the target database.

In one embodiment, the data synchronization script includes a splitting script, and the data synchronization module 406 is further configured to calculate the data volume of the data records corresponding to the user identifier; detect whether the data volume exceeds a target data volume; if so, call the splitting script to split the data records corresponding to the user identifier into a plurality of data groups; and call multiple threads to synchronize the plurality of data groups to the target database.

For the specific limitations of the Spark-based data synchronization apparatus, refer to the limitations of the Spark-based data synchronization method above; they are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 5. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores historical records and the corresponding historical summaries. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer program implements a Spark-based data synchronization method.

Those skilled in the art will understand that the structure shown in Fig. 5 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor implements the following steps when executing the computer program: obtaining business data generated by a Spark task, the business data including a plurality of data records; generating the data summary corresponding to each data record; obtaining, from the source database, the historical summaries respectively corresponding to a plurality of historical records; comparing each data summary with the stored historical summaries to obtain the newly added data summaries; writing the data records corresponding to the newly added data summaries into a message queue; and, when a data pull request sent by a first terminal is received, synchronizing the data records in the message queue to a target database according to the data pull request.

In one embodiment, the processor further implements the following steps when executing the computer program: receiving a Spark task and a corresponding parameter file submitted by a second terminal; reading the resource allocation parameters of the Spark task from the parameter file, and allocating physical resources according to the resource allocation parameters; executing the Spark task on the physical resources and monitoring the execution efficiency of the Spark task; adjusting the resource allocation parameters in the parameter file according to the monitoring result; and scheduling the Spark task for execution on the physical resources that match the adjusted resource allocation parameters.

In one embodiment, the processor further implements the following steps when executing the computer program: comparing whether the execution efficiency is less than a threshold; calculating the total task amount and the task duration corresponding to the Spark task; if so, calculating the remaining task amount from the total task amount and the executed task amount; calculating the remaining duration from the task duration and the current time node; estimating, from the remaining task amount and the remaining duration, the physical resources that need to be added; otherwise, calculating the resource utilization from the resource usage information that the operation information records for two adjacent time nodes; estimating, from the resource utilization, the physical resources that need to be released; and adjusting the resource allocation parameters according to the estimation result.

In one embodiment, the processor further implements the following steps when executing the computer program: extracting one or more current keywords from the data record to form a current keyword set; obtaining, from the source database, the historical keyword sets respectively corresponding to a plurality of historical records; identifying whether there is a historical keyword set that matches the current keyword set; if so, extracting a supplementary keyword from the data record; and establishing a keyword index from the extracted supplementary keyword and the current keywords, and using the keyword index as the data summary of the data record.

In one embodiment, the data pull request carries a system identifier and a user identifier, and the processor further implements the following steps when executing the computer program: detecting, according to the user identifier, whether corresponding data records exist in the message queue; if so, calling a data synchronization script, the data synchronization script including a plurality of labels; obtaining the configuration file corresponding to the system identifier, and replacing the labels in the data synchronization script on the basis of the configuration file so as to update the data synchronization script; and, by executing the updated data synchronization script, synchronizing the data records in the message queue that correspond to the user identifier to the target database.

In one embodiment, the data synchronization script includes a splitting script, and the processor further implements the following steps when executing the computer program: calculating the data volume of the data records corresponding to the user identifier; detecting whether the data volume exceeds a target data volume; if so, calling the splitting script to split the data records corresponding to the user identifier into a plurality of data groups; and calling multiple threads to synchronize the plurality of data groups to the target database.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining business data generated by a Spark task, the business data including a plurality of data records; generating the data summary corresponding to each data record; obtaining, from the source database, the historical summaries respectively corresponding to a plurality of historical records; comparing each data summary with the stored historical summaries to obtain the newly added data summaries; writing the data records corresponding to the newly added data summaries into a message queue; and, when a data pull request sent by a first terminal is received, synchronizing the data records in the message queue to a target database according to the data pull request.

In one embodiment, the computer program further implements the following steps when executed by the processor: receiving a Spark task and a corresponding parameter file submitted by a second terminal; reading the resource allocation parameters of the Spark task from the parameter file, and allocating physical resources according to the resource allocation parameters; executing the Spark task on the physical resources and monitoring the execution efficiency of the Spark task; adjusting the resource allocation parameters in the parameter file according to the monitoring result; and scheduling the Spark task for execution on the physical resources that match the adjusted resource allocation parameters.

In one embodiment, the computer program further implements the following steps when executed by the processor: comparing whether the execution efficiency is less than a threshold; calculating the total task amount and the task duration corresponding to the Spark task; if so, calculating the remaining task amount from the total task amount and the executed task amount; calculating the remaining duration from the task duration and the current time node; estimating, from the remaining task amount and the remaining duration, the physical resources that need to be added; otherwise, calculating the resource utilization from the resource usage information that the operation information records for two adjacent time nodes; estimating, from the resource utilization, the physical resources that need to be released; and adjusting the resource allocation parameters according to the estimation result.

In one embodiment, the computer program further implements the following steps when executed by the processor: extracting one or more current keywords from the data record to form a current keyword set; obtaining, from the source database, the historical keyword sets respectively corresponding to a plurality of historical records; identifying whether there is a historical keyword set that matches the current keyword set; if so, extracting a supplementary keyword from the data record; and establishing a keyword index from the extracted supplementary keyword and the current keywords, and using the keyword index as the data summary of the data record.

In one embodiment, the data pull request carries a system identifier and a user identifier, and the computer program further implements the following steps when executed by the processor: detecting, according to the user identifier, whether corresponding data records exist in the message queue; if so, calling a data synchronization script, the data synchronization script including a plurality of labels; obtaining the configuration file corresponding to the system identifier, and replacing the labels in the data synchronization script on the basis of the configuration file so as to update the data synchronization script; and, by executing the updated data synchronization script, synchronizing the data records in the message queue that correspond to the user identifier to the target database.

In one embodiment, the data synchronization script includes a splitting script, and the computer program further implements the following steps when executed by the processor: calculating the data volume of the data records corresponding to the user identifier; detecting whether the data volume exceeds a target data volume; if so, calling the splitting script to split the data records corresponding to the user identifier into a plurality of data groups; and calling multiple threads to synchronize the plurality of data groups to the target database.

Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A Spark-based data synchronization method, the method comprising:

obtaining business data generated by a Spark task, the business data including a plurality of data records;

generating a data summary corresponding to each data record;

obtaining, from a source database, historical summaries respectively corresponding to a plurality of historical records;

comparing each data summary with the plurality of historical summaries to obtain newly added data summaries;

writing the data records corresponding to the newly added data summaries into a message queue; and

when a data pull request sent by a first terminal is received, synchronizing the data records in the message queue to a target database according to the data pull request.

2. The method according to claim 1, characterized in that, before the obtaining of the business data generated by the Spark task, the method further comprises:

receiving the Spark task and a corresponding parameter file submitted by a second terminal;

reading resource allocation parameters of the Spark task from the parameter file, and allocating physical resources according to the resource allocation parameters;

executing the Spark task on the physical resources, and monitoring the execution efficiency of the Spark task;

adjusting the resource allocation parameters in the parameter file according to the monitoring result; and

scheduling the Spark task for execution on the physical resources that match the adjusted resource allocation parameters.

3. The method according to claim 2, characterized in that the adjusting of the resource allocation parameters in the parameter file according to the monitoring result comprises:

comparing whether the execution efficiency is less than a threshold;

calculating the total task amount and the task duration corresponding to the Spark task;

if so, calculating the remaining task amount from the total task amount and the executed task amount; calculating the remaining duration from the task duration and the current time node; and estimating, from the remaining task amount and the remaining duration, the physical resources that need to be added;

otherwise, calculating the resource utilization from the resource usage information that the operation information records for two adjacent time nodes; and estimating, from the resource utilization, the physical resources that need to be released; and

adjusting the resource allocation parameters according to the estimation result.

4. The method according to claim 1, characterized in that generating a data summary corresponding to each data record comprises:

Extracting one or more current keywords from the data record to form a current keyword set;

Obtaining, from the source database, the historical keyword sets corresponding to a plurality of historical records;

Identifying whether there is a historical keyword set that matches the current keyword set;

If so, extracting a supplementary keyword from the data record;

Building a keyword index from the extracted supplementary keyword and the current keywords, and using the keyword index as the data summary of the data record.
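One possible reading of claim 4 is sketched below. The assumptions are: a data record is a dictionary, keywords are `field=value` strings taken from configured fields, and supplementary keywords are added one field at a time until the keyword set no longer matches any historical set. The names `key_fields`, `extra_fields` and `history_sets` are hypothetical, and hashing the index with SHA-256 is an optional convenience rather than something the claim requires.

```python
import hashlib


def build_keyword_index(record, key_fields, extra_fields, history_sets):
    """Build a keyword index string for one data record.

    history_sets is a set of frozensets of keywords from historical records.
    """
    current = {f"{field}={record[field]}" for field in key_fields}
    keywords = set(current)
    if frozenset(current) in history_sets:
        # A historical record has the same keywords: add supplementary
        # keywords until the set is distinguishable (or fields run out).
        for field in extra_fields:
            keywords.add(f"{field}={record[field]}")
            if frozenset(keywords) not in history_sets:
                break
    return "|".join(sorted(keywords))


def digest_of(index):
    """Optionally compress the keyword index to a fixed-length digest."""
    return hashlib.sha256(index.encode("utf-8")).hexdigest()
```

Sorting the keywords before joining makes the index independent of extraction order, so the same record always yields the same summary.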

5. The method according to claim 1, characterized in that the data pull request carries a system identifier and a user identifier; and synchronizing the data records in the message queue to the target database according to the data pull request comprises:

Detecting, according to the user identifier, whether a corresponding data record exists in the message queue;

If so, invoking a data synchronization script, the data synchronization script including a plurality of labels;

Obtaining the configuration file corresponding to the system identifier, and replacing the labels in the data synchronization script based on the configuration file, so as to update the data synchronization script;

Executing the updated data synchronization script to synchronize the data records in the message queue that correspond to the user identifier to the target database corresponding to the system identifier.
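The label replacement in claim 5 can be sketched with Python's `string.Template`. All specifics are assumptions for illustration: the synchronization script is a single SQL statement with one `${target_table}` label, the per-system configuration is an in-memory dictionary (`CONFIGS`) rather than a file, the message queue is a plain list of dictionaries, and the target database is SQLite.

```python
import sqlite3
from string import Template

# Hypothetical synchronization script; ${target_table} is the label to replace.
SYNC_SCRIPT = Template(
    "INSERT INTO ${target_table} (user_id, payload) VALUES (:user_id, :payload)"
)

# Hypothetical per-system configuration, keyed by system identifier.
CONFIGS = {
    "ecommerce": {"target_table": "product_push"},
    "social": {"target_table": "birthday_greetings"},
}


def sync_for_user(queued_records, system_id, user_id, conn: sqlite3.Connection):
    """Synchronize one user's queued records; return how many were written."""
    records = [r for r in queued_records if r["user_id"] == user_id]
    if not records:
        return 0  # nothing queued for this user, so the script is not invoked
    sql = SYNC_SCRIPT.substitute(CONFIGS[system_id])  # replace the labels
    conn.executemany(sql, records)
    conn.commit()
    return len(records)
```

Keeping one generic script and substituting labels per system avoids maintaining a separate script for every target database.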

6. The method according to claim 5, characterized in that the data synchronization script includes a splitting script, and synchronizing the data records in the message queue that correspond to the user identifier to the target database comprises:

Calculating the data volume of the data records corresponding to the user identifier;

Detecting whether the data volume exceeds a target data volume;

If so, invoking the splitting script to split the data records corresponding to the user identifier into a plurality of data groups;

Invoking multiple threads to synchronize the plurality of data groups to the target database.
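The split-and-parallelize step in claim 6 can be sketched with a thread pool. The assumptions are: data volume is measured as the number of records, groups have a fixed size, and `write_group` is a hypothetical callable that writes one group to the target database.

```python
from concurrent.futures import ThreadPoolExecutor


def split_records(records, group_size):
    """Split records into consecutive groups of at most group_size items."""
    return [records[i:i + group_size] for i in range(0, len(records), group_size)]


def sync_records(records, write_group, target_volume, group_size, max_workers=4):
    """Write records to the target, in parallel when the volume is large.

    Returns the number of groups written.
    """
    if len(records) <= target_volume:
        write_group(records)  # small enough: a single write
        return 1
    groups = split_records(records, group_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces iteration so that worker exceptions are raised here.
        list(pool.map(write_group, groups))
    return len(groups)
```

In practice `write_group` would need its own database connection or a thread-safe pool, since many database drivers do not allow one connection to be shared across threads.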

7. A Spark-based data synchronization device, characterized in that the device comprises:

A data screening module, configured to obtain business data generated by a Spark task, the business data including a plurality of data records; generate a data summary corresponding to each data record; obtain, from a source database, the historical summaries corresponding to a plurality of historical records; and compare each data summary with the plurality of stored historical summaries to obtain newly added data summaries;

A data storage module, configured to write the data records corresponding to the newly added data summaries into a message queue;

A data synchronization module, configured to, upon receiving a data pull request sent by a first terminal, synchronize the data records in the message queue to a target database according to the data pull request.
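The screening and storage modules of claim 7 can be sketched together as follows. As a stand-in for the data summary, this sketch hashes a canonical JSON form of the whole record, which is an assumption and not the keyword-index summary of claim 4. The message queue is modeled with the standard-library `queue.Queue`; a production system would more likely use an external message broker.

```python
import hashlib
import json
import queue


def record_summary(record):
    """Stand-in summary: SHA-256 over the record's canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def enqueue_new_records(records, historical_summaries, message_queue):
    """Queue only records whose summary is not already known.

    Returns the number of records written to the queue.
    """
    seen = set(historical_summaries)
    added = 0
    for record in records:
        summary = record_summary(record)
        if summary not in seen:
            message_queue.put(record)
            seen.add(summary)  # also suppresses duplicates within this batch
            added += 1
    return added
```

Comparing fixed-length summaries instead of full records keeps the comparison against history cheap, which is the point of screening before synchronization.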

8. The device according to claim 7, characterized in that the device further comprises a resource allocation module, configured to: receive the Spark task and a corresponding parameter file submitted by a second terminal; read resource allocation parameters of the Spark task from the parameter file, and allocate physical resources according to the resource allocation parameters; execute the Spark task on the physical resources, and monitor the execution efficiency of the Spark task; adjust the resource allocation parameters in the parameter file according to the monitoring result; and schedule the Spark task to execute on physical resources that match the adjusted resource allocation parameters.

9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.