CN110008242A

CN110008242A - Spark-based streaming program generator and program data processing method

Info

Publication number: CN110008242A
Application number: CN201910186601.9A
Authority: CN
Inventors: 郑文康; 冯达
Original assignee: Guangzhou Yamei Information Science & Technology Co Ltd
Current assignee: Yamei Zhilian Data Technology Co ltd
Priority date: 2019-03-12
Filing date: 2019-03-12
Publication date: 2019-07-12

Abstract

The present invention discloses a Spark-based streaming program generator and a program data processing method. The generator comprises a Spark program initialization module and a processing module. The Spark program initialization module obtains configuration information entered on a Web-based Spark information configuration page and, once the configuration information passes verification, generates a configuration file from it. The processing module generates a Spark stream computing program from the configuration file produced by the Spark program initialization module. The technical solution lowers the barrier to entry for business development, reduces development cost, and improves project processing efficiency.

Description

Spark-based streaming program generator and program data processing method

Technical field

The present invention relates to the technical field of big data computing, and in particular to a Spark-based streaming program generator and a program data processing method.

Background art

As technologies such as the Internet of Things, social networks, and cloud computing become part of everyday life, and as computing power, storage capacity, and network bandwidth grow rapidly, the data accumulated in fields such as the Internet, communications, finance, commerce, and healthcare keeps increasing. With the Internet serving as a platform for spreading and regenerating information, phenomena such as "information overload" and "data explosion" have appeared, and the sheer volume of data makes it hard for people to make choices quickly.

To cope with large data volumes and strict real-time requirements, the prior art introduced Spark. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Mainstream servers today typically have hundreds of gigabytes or even several terabytes of memory, which has made in-memory databases practical, and Spark was designed to exploit exactly this kind of memory resource. Spark Streaming is the Spark module for processing streaming data. It is an extension of the Spark core API (Application Programming Interface) that provides high-throughput, fault-tolerant processing of real-time data streams. It can ingest data from multiple data sources and, once the data is obtained, apply high-level functions to carry out complex algorithms.

There are many examples in the prior art of applying Spark Streaming to real-time big data analysis, but current use of the framework remains superficial. For example, it stops at calling framework functions, without a deep understanding of how the functions work internally or what each parameter means, and multiple functions have not been wrapped in code to form a common platform that non-developers such as business staff can use. Large volumes of data can be processed with Spark Streaming, but different developers must write different Spark Streaming programs for different real-time computing tasks. Because developers within an enterprise differ in coding ability, understanding, and experience, the Spark Streaming programs they write can fail at runtime or lose data. As a result, the prior art sometimes requires secondary development on top of the original technology to meet production needs, for example providing the Spark configuration by modifying code or files.

The prior art therefore requires developers to carry out secondary development of Spark real-time computing programs, and business staff can process data only with the help of developers, which makes project development costly and project processing inefficient.

Summary of the invention

In view of this, an object of the present invention is to provide a Spark-based streaming program generator and a program data processing method that lower the barrier to entry for business development, reduce development cost, and improve project processing efficiency.

According to one aspect of the present invention, a Spark-based streaming program generator is provided, comprising a Spark program initialization module and a processing module, wherein:

the Spark program initialization module is configured to obtain configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generate a configuration file from the configuration information; and

the processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module.

Preferably, the Spark program initialization module includes:

an obtaining module, configured to obtain the configuration information entered on the Web-based Spark information configuration page;

a verification module, configured to verify the configuration information obtained by the obtaining module; and

an execution module, configured to return an error prompt to the Web page when verification by the verification module fails, and to generate a configuration file from the configuration information when verification by the verification module succeeds.

Preferably, the Spark program initialization module further includes:

an authentication module, configured to verify the permissions of a designated account when the obtaining module obtains configuration information entered through that account;

wherein the execution module generates the configuration file from the configuration information after the permission verification by the authentication module succeeds.

Preferably, the processing module includes:

a Spark configuration file obtaining module, configured to obtain the configuration file of the Spark program initialization module;

a Spark program generation module, configured to generate a Spark stream computing program from the configuration file obtained by the Spark configuration file obtaining module; and

a Spark program submission module, configured to submit the Spark stream computing program generated by the Spark program generation module to a big data cluster for processing.

Preferably, the Spark program submission module includes:

a Spark program sending module, configured to send the Spark stream computing program generated by the Spark program generation module to the big data cluster; and

a Spark program runtime optimization module, configured to collect the running status of the Spark stream computing program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.

Preferably, the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following:

process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.

According to another aspect of the present invention, a program data processing method is provided, comprising:

obtaining, by the Spark program initialization module of the program generator, configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file from the configuration information; and

obtaining, by the processing module of the program generator, the configuration file from the Spark program initialization module and generating a Spark stream computing program from the configuration file.

Preferably, obtaining the configuration information entered on the Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file from the configuration information comprises:

obtaining the configuration information entered on the Web-based Spark information configuration page;

verifying the obtained configuration information; and

returning an error prompt to the Web page when verification fails, and generating a configuration file from the configuration information when verification succeeds.

Preferably, the method further comprises:

verifying the permissions of a designated account when the configuration information is entered through that account; and

generating the configuration file from the configuration information after the permission verification succeeds.

Preferably, after generating the Spark stream computing program from the configuration file, the method further comprises: submitting the generated Spark stream computing program to a big data cluster for processing; or,

after generating the Spark stream computing program from the configuration file, the method further comprises: submitting the generated Spark stream computing program to a big data cluster for processing, collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.

As can be seen from the above, the solution provided by the embodiments of the present invention is a Spark-based streaming program generator comprising a Spark program initialization module and a processing module. The Spark program initialization module obtains configuration information entered on a Web (World Wide Web) based Spark information configuration page and, after the configuration information passes verification, generates a configuration file from it. The processing module generates a Spark stream computing program from the configuration file of the Spark program initialization module. The present invention thus offers the Spark information configuration page as a Web service and verifies the configuration information, so that all kinds of information can be configured conveniently. Business staff can process data according to their own needs and no longer depend on real-time computing developers to do so, which lowers the barrier to entry for business development and improves project processing efficiency. In addition, because a single general-purpose Spark-based streaming program generator is used, different developers no longer need to write different Spark Streaming programs, and developers no longer need to provide the Spark configuration by modifying code or files, which reduces project development cost.

Further, the present invention can verify the obtained configuration information, return an error prompt to the Web page when verification fails, and generate a configuration file from the configuration information when verification succeeds.

Further, when the configuration information is entered through a designated account, the present invention can verify the permissions of that account and generate the configuration file from the configuration information after the permission verification succeeds.

Further, the present invention can send the generated Spark stream computing program to a big data cluster. It can also optimize further, for example by collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.

Brief description of the drawings

The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.

Fig. 1 is a schematic diagram of the structural framework of a Spark-based streaming program generator according to an embodiment of the present invention;

Fig. 2 is another schematic diagram of the structural framework of a Spark-based streaming program generator according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a program data processing method according to an embodiment of the present invention;

Fig. 4 is another schematic flowchart of a program data processing method according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

Although preferred embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.

The present invention provides a Spark-based streaming program generator that lowers the barrier to entry for business development, reduces development cost, and improves project processing efficiency.

The technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of the structural framework of a Spark-based streaming program generator according to an embodiment of the present invention.

Referring to Fig. 1, a Spark-based streaming program generator of the present invention comprises a Spark program initialization module 10 and a processing module 20, wherein:

the Spark program initialization module 10 is configured to obtain configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generate a configuration file from the configuration information; and

the processing module 20 is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module 10.

The configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following: process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.

The configuration information further includes at least one of the following: the name of the Spark stream computing program, the window size, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL (Structured Query Language) information.

As this embodiment shows, the embodiment of the present invention provides a Spark-based streaming program generator comprising a Spark program initialization module and a processing module. The Spark program initialization module obtains configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generates a configuration file from it. The processing module generates a Spark stream computing program from the configuration file of the Spark program initialization module. The present invention thus offers the Spark information configuration page as a Web service and verifies the configuration information, so that all kinds of information can be configured conveniently. Business staff can process data according to their own needs and no longer depend on real-time computing developers to do so, which lowers the barrier to entry for business development and improves project processing efficiency. In addition, because a single general-purpose Spark-based streaming program generator is used, different developers no longer need to write different Spark Streaming programs, and developers no longer need to provide the Spark configuration by modifying code or files, which reduces project development cost.

Fig. 2 is another schematic diagram of the structural framework of a Spark-based streaming program generator according to an embodiment of the present invention. Compared with Fig. 1, Fig. 2 describes the structural framework of the generator in more detail.

The present invention provides a general-purpose Spark-based streaming program generator (also referred to as a real-time computing generator) that can perform real-time computation for many different real-time processing logics and different data types. The generator simplifies the real-time computing model and lowers the barrier to entry for business development, and the resulting real-time computation is low-latency, high-performance, distributed, scalable, and fault-tolerant.

The solution of the present invention offers the Spark information configuration page as a Web service and verifies the configuration information, so that all kinds of information can be configured conveniently and business staff can process data according to their own needs. At present it is hard for enterprises to recruit professional data mining staff, who must also learn related technologies such as Hadoop (a distributed system infrastructure developed by the Apache Foundation) and Spark, and then combine these open-source components into a working solution, which is difficult. Business staff, meanwhile, can process data only with the help of real-time computing developers. With the solution of the present invention, real-time stream computing programs can be generated more conveniently, which lowers the knowledge threshold and the threshold for data developers, lowers the barrier to entry for business development, reduces development cost, and improves development efficiency.

Referring to Fig. 2, the Spark-based streaming program generator provided by the present invention mainly comprises a Spark program initialization module 10 and a processing module 20.

The Spark program initialization module 10 may further include an obtaining module 101, a verification module 102, an execution module 103, and an authentication module 104.

The processing module 20 may further include a Spark configuration file obtaining module 201, a Spark program generation module 202, and a Spark program submission module 203. The Spark program submission module 203 may in turn include a Spark program sending module 2031 and a Spark program runtime optimization module 2032.

The obtaining module 101 is configured to obtain the configuration information entered on the Web-based Spark information configuration page.

The verification module 102 is configured to verify the configuration information obtained by the obtaining module 101.

The execution module 103 is configured to return an error prompt to the Web page when verification by the verification module 102 fails, and to generate a configuration file from the configuration information when verification by the verification module 102 succeeds.

The authentication module 104 is configured to verify the permissions of a designated account when the obtaining module 101 obtains configuration information entered through that account. The execution module 103 generates the configuration file from the configuration information after the permission verification by the authentication module 104 succeeds.
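The interplay of modules 101 to 104 can be sketched as a small pipeline. The following Python sketch is illustrative only: the account whitelist `AUTHORIZED_ACCOUNTS`, the required keys in `REQUIRED_KEYS`, the function names, and the choice of a JSON file named after the program are all assumptions, since the patent does not specify how permissions are stored or what the configuration file looks like.

```python
import json
from pathlib import Path

# Hypothetical account whitelist; the patent does not say how permissions are stored.
AUTHORIZED_ACCOUNTS = {"analyst01"}

# Hypothetical set of keys that every configuration must carry.
REQUIRED_KEYS = ("name", "source_topic", "target_topic", "sql")


def verify_permission(account: str) -> bool:
    """Stand-in for authentication module 104: check the designated account."""
    return account in AUTHORIZED_ACCOUNTS


def validate(config: dict) -> list:
    """Stand-in for verification module 102: return error messages (empty if valid)."""
    return [f"missing field: {key}" for key in REQUIRED_KEYS if not config.get(key)]


def initialize(account: str, config: dict, out_dir: Path) -> dict:
    """Stand-in for execution module 103: return an error prompt or write the file."""
    if not verify_permission(account):
        return {"ok": False, "errors": ["permission denied"]}
    errors = validate(config)
    if errors:
        # In the patent, this result would be shown on the Web page as an error prompt.
        return {"ok": False, "errors": errors}
    path = out_dir / f"{config['name']}.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return {"ok": True, "config_file": str(path)}
```

Returning a result dictionary rather than raising an exception keeps the "error prompt to the Web page" path explicit; a real Web backend might map it to an HTTP response instead.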

Specifically, the configuration information that the Spark program initialization module 10 obtains from the Web-based Spark information configuration page may include at least one of the following: process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.

That is, through the Web-based Spark information configuration page, the Spark program initialization module 10 can configure, for example, process data information, table metadata information, select (select-type field) information, groupby (group-type field) information, sum (sum-type field) information, count (count-type field) information, where (filter condition on fields) information, multi-table Join (the names of the tables to be joined) information, and sort (the fields to sort by) information. Business staff can thus enter all of this configuration manually in the Web-based Spark information configuration page, simply and conveniently, according to their own business needs.
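One way to picture how the select, groupby, sum, count, where, Join, and sort entries could combine is to assemble them into a single SQL statement. The sketch below is an assumption about how such a mapping might look; the dictionary keys, the `build_sql` function, and the alias naming are hypothetical and are not taken from the patent.

```python
def build_sql(cfg: dict) -> str:
    """Assemble a SQL statement from hypothetical configuration-page entries."""
    # Plain selected fields first, then the aggregate fields.
    columns = list(cfg.get("select", []))
    columns += [f"SUM({f}) AS sum_{f}" for f in cfg.get("sum", [])]
    columns += [f"COUNT({f}) AS count_{f}" for f in cfg.get("count", [])]
    sql = f"SELECT {', '.join(columns)} FROM {cfg['table']}"
    # Each join entry names the table to join and the join condition.
    for join in cfg.get("join", []):
        sql += f" JOIN {join['table']} ON {join['on']}"
    if cfg.get("where"):
        sql += f" WHERE {cfg['where']}"
    if cfg.get("groupby"):
        sql += f" GROUP BY {', '.join(cfg['groupby'])}"
    if cfg.get("sort"):
        sql += f" ORDER BY {', '.join(cfg['sort'])}"
    return sql


example = {
    "table": "orders",
    "select": ["city"],
    "sum": ["amount"],
    "count": ["order_id"],
    "where": "amount > 0",
    "groupby": ["city"],
    "sort": ["city"],
}
print(build_sql(example))
```

Because the statement is built by string concatenation, a sketch like this is only safe when every field and table name has already been checked against known metadata, which is one reason the verification step matters.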

The configuration information may also include at least one of the following: the name of the Spark stream computing program, the window size, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL information.

The Spark configuration file obtaining module 201 is configured to obtain the configuration file of the Spark program initialization module 10.

The Spark program generation module 202 is configured to generate a Spark stream computing program from the configuration file obtained by the Spark configuration file obtaining module 201. The Spark program generation module 202 may generate the Spark stream computing program after the generate button on the Spark information configuration page is operated.

That is, the present invention can generate a Spark stream computing program with one click. For example, pressing the generate button on the Web-based Spark information configuration page triggers a background program, and once the Spark program generation module 202 detects that the button has been operated, it generates the Spark stream computing program. One-click generation makes the generator easier for business staff to use and greatly lowers the barrier to entry for business development.

The Spark program submission module 203 is configured to submit the Spark stream computing program generated by the Spark program generation module 202 to a big data cluster for processing. Here, the big data cluster refers to the YARN resource management system.
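Submission to a YARN cluster is commonly done with the `spark-submit` command-line tool. The sketch below only builds such a command and does not run it. The patent does not state how submission is implemented, so the function name, the jar path, the main class, and the convention of passing the configuration file name as the first application argument are all assumptions.

```python
import os
import shlex


def build_submit_command(app_name: str, jar_path: str, main_class: str, config_file: str) -> list:
    """Build a spark-submit argument list targeting YARN (all paths are hypothetical)."""
    return [
        "spark-submit",
        "--master", "yarn",           # run on the YARN resource manager
        "--deploy-mode", "cluster",   # driver runs inside the cluster
        "--name", app_name,
        "--class", main_class,
        "--files", config_file,       # ship the configuration file with the job
        jar_path,
        os.path.basename(config_file),  # assumed convention: config name as first app argument
    ]


cmd = build_submit_command(
    app_name="daily_sales",
    jar_path="/opt/generator/stream-job.jar",
    main_class="com.example.StreamJob",
    config_file="/tmp/daily_sales.json",
)
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run(cmd, check=True)`.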

The Spark program sending module 2031 in the Spark program submission module 203 is configured to send the Spark stream computing program generated by the Spark program generation module 202 to the big data cluster.

The Spark program runtime optimization module 2032 in the Spark program submission module 203 is configured to collect the running status of the Spark stream computing program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.
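The patent does not describe which metrics module 2032 collects or how the execution plan is adjusted. As one possible reading, the sketch below treats "distribution" as the number of executors and "data processing volume" as a per-batch record limit, and applies a simple feedback rule based on how long recent batches took relative to the batch interval. The `Plan` type, the thresholds, and the doubling rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    executors: int               # assumed proxy for distribution in the cluster
    max_records_per_batch: int   # assumed proxy for data processing volume


def adjust_plan(plan: Plan, batch_interval_s: float, processing_times_s: list,
                max_executors: int = 16) -> Plan:
    """Return an adjusted plan from recent batch processing times (hypothetical rule)."""
    avg = sum(processing_times_s) / len(processing_times_s)
    load = avg / batch_interval_s  # above 1.0 means batches take longer than the interval
    if load > 1.0:
        if plan.executors < max_executors:
            # Falling behind with headroom left: spread the work over more executors.
            return Plan(min(max_executors, plan.executors * 2), plan.max_records_per_batch)
        # No headroom left: reduce the amount of data taken per batch.
        return Plan(plan.executors, max(1, int(plan.max_records_per_batch / load)))
    if load < 0.5 and plan.executors > 1:
        # Mostly idle: release one executor.
        return Plan(plan.executors - 1, plan.max_records_per_batch)
    return plan
```

The rule scales out before throttling input, on the assumption that dropping throughput is the less desirable response; a different deployment might reasonably reverse that order.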

The Spark-based streaming program generator of the embodiments of the present invention has been described in detail above. The program data processing method that uses this generator is introduced below.

Fig. 3 is a schematic flowchart of a program data processing method according to an embodiment of the present invention.

Referring to Fig. 3, the method of the present invention includes the following steps.

In step 301, the Spark program initialization module of the program generator obtains configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generates a configuration file from the configuration information.

This step may include:

verifying the obtained configuration information.

It should be noted that, when the configuration information is entered through a designated account, the present invention can also verify the permissions of that account and generate the configuration file from the configuration information after the permission verification succeeds.

In this step, the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following: process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.

The configuration information may also include at least one of the following: the name of the Spark stream computing program, the window size, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL information.

The present invention offers the Spark information configuration page as a Web service, so that all kinds of information can be configured conveniently and business staff can process data according to their own needs.

In step 302, the processing module of the program generator obtains the configuration file from the Spark program initialization module and generates a Spark stream computing program from the configuration file.

In this step, the Spark configuration file obtaining module in the processing module may obtain the configuration file of the Spark program initialization module, and the Spark program generation module in the processing module may generate the Spark stream computing program from the configuration file obtained by the Spark configuration file obtaining module.

It should be noted that, after generating the Spark stream computing program from the configuration file, the method may further include submitting the generated Spark stream computing program to a big data cluster for processing. Alternatively, after generating the Spark stream computing program from the configuration file, the method may further include submitting the generated Spark stream computing program to a big data cluster for processing, collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.

As this embodiment shows, the present invention offers the Spark information configuration page as a Web service and verifies the configuration information, so that all kinds of information can be configured conveniently. Business staff can process data according to their own needs and no longer depend on real-time computing developers to do so, which lowers the barrier to entry for business development and improves project processing efficiency. In addition, because a single general-purpose Spark-based streaming program generator is used, different developers no longer need to write different Spark Streaming programs, and developers no longer need to provide the Spark configuration by modifying code or files, which reduces project development cost.

Fig. 4 is another schematic flowchart of a program data processing method according to an embodiment of the present invention. Compared with Fig. 3, Fig. 4 describes the program data processing method of the present invention in more detail.

Referring to Fig. 4, the method of the present invention includes the following steps.

In step 401, configuration information entered on a Web-based Spark information configuration page is obtained and, after both the configuration information verification and the permission verification of the designated account pass, a configuration file is generated from the configuration information.

In this step, the obtaining module in the Spark program initialization module obtains the configuration information entered on the Web-based Spark information configuration page; the verification module in the Spark program initialization module verifies the configuration information obtained by the obtaining module; the authentication module in the Spark program initialization module verifies the permissions of the designated account; and the execution module in the Spark program initialization module generates a configuration file from the configuration information after both the verification by the verification module and the permission verification by the authentication module succeed. In addition, when verification by the verification module fails, an error prompt can be returned to the Web page.

In this step, configuration information can be entered on the Web-based Spark information configuration page through the designated account, for example process data information, table metadata information, select (select-type field) information, groupby (group-type field) information, sum (sum-type field) information, count (count-type field) information, where (filter condition on fields) information, multi-table Join (the names of the tables to be joined) information, and sort (the fields to sort by) information.

In addition, configuration information such as the name of the Spark real-time computation (that is, the name of the Spark stream computing program), the window size, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL information can also be entered. The data source topic may be a Kafka topic to consume from, and the target topic may be a Kafka topic to produce to. Kafka is an open-source stream processing platform developed by the Apache Software Foundation. Consuming and producing are opposite actions: consuming means reading data from a Kafka topic, and producing means writing the processed data into a Kafka topic.

In the step, configuration information verification can be carried out to the configuration information of the above-mentioned various inputs of acquisition.

For example, the configuration information can be stored as Json (JavaScript Object Notation, a lightweight data interchange format) and parsed and verified with Gson (a Java library provided by Google for mapping between Java objects and Json data, which can convert a Json string into a Java object or a Java object into a Json string).
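The parse-then-verify idea can be sketched in a few lines. The sketch below uses Python's standard `json` module as a stand-in for Gson, which is a substitution made only for illustration. The key names in `REQUIRED_KEYS` are hypothetical, because the description lists the configuration items but not their Json names.

```python
import json

# Hypothetical key names; the description does not specify the Json layout.
REQUIRED_KEYS = ("name", "source_topic", "target_topic", "sql")


def parse_config(text):
    """Parse a Json configuration string and report missing required keys.

    Returns (config, errors); config is None when the text is not valid Json.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, ["invalid JSON: " + exc.msg]
    if not isinstance(config, dict):
        return None, ["configuration must be a JSON object"]
    errors = ["missing key: " + key for key in REQUIRED_KEYS if key not in config]
    return config, errors
```

An empty error list means the configuration can proceed to the later checks; a non-empty list corresponds to the error prompt returned to the Web page.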

The configuration information verification of the present invention can include, but is not limited to, the following:

The value of the name attribute of the Spark program is read, and a database (for example, a mysql database) is queried to determine whether that name already exists. If it exists, the name is judged illegal (verification fails) and an error prompt is returned to the Web page to tell the user that the input is wrong; if it does not exist, the name is judged legal (verification passes).

If the name of the Spark program does not exist and is judged legal, it is further determined whether the field names inside configuration items such as the data source topic, target topic, select, groupby, sum, count, where and sort exist. If they exist, they are judged legal; if they do not exist, they are judged illegal.

If the field names inside the above configuration information exist and are judged legal, the sliding window attribute value and the window time attribute value are read, and it is determined whether the sliding time is less than the window time and whether the sliding time is greater than a set threshold, for example 1 hour. If both checks are satisfied, the final verification is judged to have passed and the data is stored in the database; at the same time, an information configuration file in excel format can be generated for business personnel to review.
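The first two checks in this sequence (name uniqueness, then field existence) can be sketched as follows. This is a minimal illustration, not the patented implementation: in-memory sets stand in for the database lookups, and the function name and argument layout are assumptions.

```python
def validate_names(program_name, existing_names, config_fields, known_fields):
    """Return a list of error messages; an empty list means both checks passed.

    existing_names and known_fields stand in for database queries (assumption).
    config_fields maps a configuration item (e.g. "select") to its field names.
    """
    # Check 1: a program name that is already registered is rejected.
    if program_name in existing_names:
        return ["program name already exists: " + program_name]
    # Check 2: every field referenced by the configuration must be known.
    errors = []
    for section, fields in config_fields.items():
        for field in fields:
            if field not in known_fields:
                errors.append("unknown field in " + section + ": " + field)
    return errors
```

The early return mirrors the described order: field names are examined only after the program name has been judged legal.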

It should be noted that the present invention can also perform permission verification on the set account when obtaining the configuration information input by that account. For example, different personnel are given different login accounts, and different accounts are given different operating permissions as needed. When configuration information is input through a set login account, the database can therefore be queried to verify the permissions of that account: if the database query finds the corresponding account and the account has the required operating permission, the verification is considered to have passed.

To sum up, the Spark program initialization module obtains the configuration information from the various inputs above and performs initialization. After initialization is complete, a configuration file is generated and saved on the linux system. The generated configuration file can be a text-format configuration file whose content can be, but is not limited to, entries of the form configuration name=configuration content, separated by commas.
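The text format described above can be sketched as a pair of helper functions. The sketch assumes that the commas separate whole "name=content" entries and that names and values themselves contain no commas; the description does not say how a value containing a comma (such as an sql statement) would be escaped, so the sketch simply rejects it. The function names are hypothetical.

```python
def render_config(entries):
    """Serialize a dict as 'name=content' pairs joined by commas."""
    for key, value in entries.items():
        # Assumption: separator characters inside an entry are not supported.
        if "," in key or "=" in key or "," in str(value):
            raise ValueError("separator character in entry: " + key)
    return ",".join(key + "=" + str(value) for key, value in entries.items())


def parse_config_line(line):
    """Inverse of render_config; all values are returned as strings."""
    result = {}
    for pair in line.split(","):
        key, _, value = pair.partition("=")
        result[key] = value
    return result
```

For example, `render_config({"name": "job1", "window_seconds": 60})` yields the single line `name=job1,window_seconds=60`.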

In step 402, the configuration file is obtained, and the processing logic sql in the configuration information of the configuration file is checked.

The Spark configuration file obtaining module in the processing module can obtain the configuration file from the Spark program initialization module and can query the database for the field information of the data source topic in the configuration information. It then determines whether the fields in the processing logic sql are correct with respect to the field information of the data source topic. It should be noted that every field in the processing logic sql must be present in the field information of the data source topic, but a field of the data source topic need not appear in the processing logic sql.

If a field name in the processing logic sql does not match the field information stored in the database, the field in the processing logic sql is judged incorrect and the user is reminded that the input is wrong. If the field names match the field information stored in the database, the fields are judged correct, and it is then further determined whether the syntax of the processing logic sql is wrong.
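The field check is a one-directional subset test, which the following minimal sketch makes explicit. The function name is hypothetical, and the field lists are passed in directly rather than queried from a database.

```python
def check_sql_fields(sql_fields, topic_fields):
    """Return the sql fields that are absent from the data source topic's fields.

    An empty result means the sql fields are correct. Topic fields that the sql
    never uses are not an error, so the test runs in one direction only.
    """
    known = set(topic_fields)
    return [field for field in sql_fields if field not in known]
```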

The syntax check on the processing logic sql includes the following: a field or attribute that appears in the select clause but not inside an aggregate function must be placed in the groupby clause; if it is not placed in the groupby clause, the sql is considered to have a syntax error. Conversely, a field or attribute that does not appear in the groupby clause may appear only inside an aggregate function; if it appears outside an aggregate function, the sql is considered to have a syntax error.
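The rule can be sketched on a simplified representation of the select clause. The sketch assumes the sql has already been broken into (field, aggregate) pairs, so no sql parsing is shown. It also assumes that a query with neither an aggregate function nor a groupby clause is exempt from the rule, which the description does not state explicitly.

```python
def check_group_by(select_items, group_by):
    """Check the select/groupby rule and return a list of error messages.

    select_items: list of (field, aggregate) pairs, where aggregate is a name
    such as "sum" or "count", or None for a field outside any aggregate.
    group_by: list of field names in the groupby clause.
    """
    has_aggregate = any(aggregate is not None for _, aggregate in select_items)
    if not has_aggregate and not group_by:
        # Assumption: a plain, non-aggregating query is not subject to the rule.
        return []
    grouped = set(group_by)
    return [
        field + " must appear in group by or inside an aggregate function"
        for field, aggregate in select_items
        if aggregate is None and field not in grouped
    ]
```

For `select city, sum(amount) ... group by city` the input would be `[("city", None), ("amount", "sum")]` with `["city"]`, which produces no errors.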

If the processing logic sql has no syntax error, it is considered free of problems (that is, it has neither field errors nor syntax errors), and it is further determined whether the window time and the sliding time exceed a threshold. The sliding time in the configuration information is generally required to be less than the window time, and the threshold for the sliding time and the window time can be, but is not limited to, 1 hour.

Finally, the sql is optimized according to set rules based on, for example, the sql syntax check result, the sizes of the joined tables and the data volume per second, and the resulting optimized sql statement is stored in the database.
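The window check can be sketched as below. It treats "less than" strictly and applies the same 1 hour threshold to both values; both choices are assumptions, since the description says only that the sliding time is "generally" required to be smaller and that the threshold "can be" 1 hour. Times are expressed in seconds for illustration.

```python
def check_window(window_seconds, slide_seconds, threshold_seconds=3600):
    """Return a list of error messages for the window configuration."""
    errors = []
    if slide_seconds >= window_seconds:
        errors.append("sliding time must be less than window time")
    if window_seconds > threshold_seconds:
        errors.append("window time exceeds threshold")
    if slide_seconds > threshold_seconds:
        errors.append("sliding time exceeds threshold")
    return errors
```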

In step 403, the execution configuration file Shell script of the Spark stream calculation program is generated.

In this step, the Spark program generating module in the processing module can generate the execution configuration file Shell script of the Spark stream calculation program.

In this step, the following can be read from the database for the corresponding Spark stream calculation program: its name, the window time size, whether a window is needed, whether a sliding window is needed, the sliding window time, the data source topic, the target topic and the generated sql statement. The sizes of driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of CPU cores per executor) and memoryOverhead (off-heap memory) are calculated from the table size and the data volume. Fixed parameters such as the timeout and the queue name are obtained, together with parameters such as the degree of parallelism, the memory fraction and the JVM (Java Virtual Machine) settings. The execution configuration file Shell script of the corresponding Spark stream calculation program is then generated. The Shell script here is a shell script for the linux system; usually a resource configuration file is generated first and the Shell script is generated afterwards.

When the window time and the sliding time in the configuration information are judged not to exceed the threshold, the database can be queried to check whether the name of the current Spark stream calculation program is a duplicate.

If the name of the current Spark stream calculation program is not a duplicate, the driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of executor cores) and memoryOverhead (off-heap memory) needed by the current Spark stream calculation program can be calculated with the formulas below.
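A script of this kind could look like the output of the following sketch, which renders a `spark-submit` command line from the values read and calculated in this step. The options used (`--master`, `--deploy-mode`, `--name`, `--queue`, `--driver-memory`, `--executor-memory`, `--executor-cores`, `--conf`, `--class`) are standard `spark-submit` options, and `spark.executor.memoryOverhead` is the property name used by Spark 2.3 and later. The dictionary keys, the jar path and the main class are hypothetical, as the description does not give the script's actual content.

```python
import shlex


def build_submit_script(job):
    """Render a shell script that calls spark-submit with the given resources.

    job is a dict with hypothetical keys; every value except executor_cores
    is expected to be a string.
    """
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", job["name"],
        "--queue", job["queue"],
        "--driver-memory", job["driver_memory"],
        "--executor-memory", job["executor_memory"],
        "--executor-cores", str(job["executor_cores"]),
        "--conf", "spark.executor.memoryOverhead=" + job["memory_overhead"],
        "--class", job["main_class"],  # hypothetical entry point
        job["jar"],                    # hypothetical jar path
        job["program_args"],           # comma-separated program parameters
    ]
    # shlex.quote protects any argument that contains shell metacharacters.
    return "#!/bin/sh\n" + " ".join(shlex.quote(arg) for arg in args) + "\n"
```

Quoting each argument keeps a program name or sql fragment with spaces from being split by the shell when the script is later executed.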

The driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of executor cores) and memoryOverhead (off-heap memory) needed by the current Spark stream calculation program can be calculated with the following formulas:

executor-cores required by the Spark stream calculation program = (number of data source records per second / 20,000) * number of CPU cores required per 20,000 records per second

driver-memory required by the Spark stream calculation program = (number of data source records per second / 20,000) * driver memory required per 20,000 records per second

executor-memory required by the Spark stream calculation program = (number of data source records per second / 20,000) * executor memory required per 20,000 records per second

memoryOverhead required by the Spark stream calculation program = (number of data source records per second / 20,000) * off-heap memory required per 20,000 records per second

Here, the data source can refer to the Kafka topic being consumed.
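All four formulas share the same scaling factor, the number of records per second divided by 20,000, so they can be sketched as one function. The per-20,000-records requirements passed in `per_unit` are hypothetical inputs; the description does not give their values, and it does not say how fractional results are rounded.

```python
def scale_resources(records_per_second, per_unit):
    """Scale each per-20,000-records requirement by records_per_second / 20000.

    per_unit maps a resource name to the amount needed per 20,000 records
    per second (hypothetical values supplied by the caller).
    """
    factor = records_per_second / 20000
    return {name: factor * amount for name, amount in per_unit.items()}
```

For instance, at 60,000 records per second the factor is 3, so a requirement of 2 cores per 20,000 records per second scales to 6 cores.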

In step 404, the execution configuration file Shell script of the Spark stream calculation program is executed to generate the Spark stream calculation program.

In this step, the Spark program generating module in the processing module can execute the execution configuration file Shell script of the Spark stream calculation program to generate the Spark stream calculation program.

This step can execute the execution configuration file Shell script of the Spark stream calculation program remotely over the SSH2 (Secure Shell version 2.0) protocol. The parameters are passed to the Spark composer, which receives them, splits them on commas, passes each parameter to the corresponding function and generates the corresponding Spark stream calculation program.

In step 405, the Spark stream calculation program is submitted to the big data cluster for processing.

In this step, the Spark program sending module in the Spark program submitting module can submit the Spark stream calculation program to the big data cluster for processing.

In this step, the Spark program sending module can submit the Spark stream calculation program to the big data cluster, where the big data cluster refers to the yarn resource management system.

In step 406, the running status of the Spark stream calculation program in the big data cluster is collected, the execution plan is adjusted according to that running status, and the distribution and data processing volume of the Spark stream calculation program in the big data cluster are adjusted according to the adjusted execution plan.

In this step, the Spark program running optimization module in the Spark program submitting module can collect the running status of the Spark stream calculation program in the big data cluster, for example the data volume processed by reduce tasks, the data volume pulled by the shuffle (which passes map output to the reducers as input), the distribution and size of the data in each execution stage, and the data volume of each partition. The execution plan is adjusted accordingly, and the distribution and data processing volume of the Spark stream calculation program in the big data cluster are adjusted according to the adjusted execution plan, for example by handling data skew or adjusting the number of partitions. The processed data is then written back to the target topic.
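One way to turn per-partition data volumes into the two adjustments mentioned (handling data skew and changing the number of partitions) is sketched below. The median-based skew rule, the factor of 3 and the target records per partition are illustrative assumptions; the description does not specify how the execution plan is adjusted.

```python
import math
import statistics


def find_skewed(partition_sizes, factor=3.0):
    """Return indexes of partitions larger than factor times the median size."""
    median = statistics.median(partition_sizes)
    return [i for i, size in enumerate(partition_sizes) if size > factor * median]


def suggest_partitions(partition_sizes, target_per_partition):
    """Partitions needed so each holds about target_per_partition records."""
    total = sum(partition_sizes)
    return max(1, math.ceil(total / target_per_partition))
```

A partition flagged by `find_skewed` is a candidate for skew handling, while `suggest_partitions` gives a partition count sized to the collected data volume.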

In summary, the solution of the present invention has the following advantages:

The scheme provided by the invention overcomes the difficulty that a company traditionally has to recruit real-time computing developers to write Spark real-time computing programs, and that business personnel can process data only through those developers. The present invention provides a Spark real-time computation generator (the Spark streaming-based program generator) that makes it easier for business personnel to process data according to their own needs. It also reduces problems such as non-standard development by real-time computing developers and excessive or unreasonable resource requests at submission, and it greatly improves project development efficiency and shortens project schedules.

The technical solution of the present invention has been described in detail above with reference to the accompanying drawings.

In addition, the method according to the present invention may also be implemented as a computer program or computer program product that includes computer program code instructions for executing the steps defined in the above method of the present invention.

Alternatively, the present invention may be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by a processor of an electronic device (or computing device, server, etc.), it causes the processor to execute each step of the above method according to the present invention.

Those skilled in the art will also understand that the various example logic blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.

The flowcharts and block diagrams in the drawings show the possible architecture, functions and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment or a part of code, and the module, program segment or part of code includes one or more executable instructions for implementing the specified logic function. It should also be noted that in some alternative implementations the functions marked in the boxes can occur in an order different from the one marked in the drawings. For example, two consecutive boxes can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Various embodiments of the present invention have been described above. The description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terms used herein were chosen to best explain the principles of each embodiment, their practical application or their improvement over technologies in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A Spark streaming-based program generator, characterized in that:

it comprises a Spark program initialization module and a processing module; wherein,

the Spark program initialization module is configured to obtain configuration information input from a Web-based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file according to the configuration information;

the processing module is configured to generate a Spark stream calculation program according to the configuration file of the Spark program initialization module.

2. The Spark streaming-based program generator according to claim 1, characterized in that the Spark program initialization module comprises:

an execution module, configured to return an error prompt to the Web page after the check by the verification module fails, and to generate a configuration file according to the configuration information after the check by the verification module succeeds.

3. The Spark streaming-based program generator according to claim 2, characterized in that the Spark program initialization module further comprises:

an authentication module, configured to perform permission verification on a set account when the acquisition module obtains the configuration information input by the set account;

the execution module generates the configuration file according to the configuration information after the permission verification by the authentication module succeeds.

4. The Spark streaming-based program generator according to claim 1, characterized in that the processing module comprises:

a Spark program generating module, configured to generate a Spark stream calculation program according to the configuration file obtained by the Spark configuration file obtaining module;

a Spark program submitting module, configured to submit the Spark stream calculation program generated by the Spark program generating module to a big data cluster for processing.

5. The Spark streaming-based program generator according to claim 4, characterized in that the Spark program submitting module comprises:

a Spark program sending module, configured to send the Spark stream calculation program generated by the Spark program generating module to the big data cluster;

a Spark program running optimization module, configured to collect the running status of the Spark stream calculation program in the big data cluster, to adjust an execution plan according to the running status of the Spark stream calculation program, and to adjust the distribution and data processing volume of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.

6. The Spark streaming-based program generator according to any one of claims 1 to 5, characterized in that:

the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following:

process data information, table metadata information, select field information, group-by field information, sum field information and count field information.

7. A program data processing method, characterized by comprising:

obtaining, by a Spark program initialization module of a program generator, configuration information input from a Web-based Spark information configuration page, and generating a configuration file according to the configuration information after the configuration information passes verification;

obtaining, by a processing module of the program generator, the configuration file from the Spark program initialization module, and generating a Spark stream calculation program according to the configuration file.

8. The method according to claim 7, characterized in that obtaining the configuration information input from the Web-based Spark information configuration page and generating a configuration file according to the configuration information after the configuration information passes verification comprises:

verifying the obtained configuration information;

returning an error prompt to the Web page after the verification fails, and generating a configuration file according to the configuration information after the verification succeeds.

9. The method according to claim 8, characterized in that the method further comprises:

10. The method according to any one of claims 7 to 9, characterized in that:

after generating the Spark stream calculation program according to the configuration file, the method further comprises: submitting the generated Spark stream calculation program to a big data cluster for processing; or,

after generating the Spark stream calculation program according to the configuration file, the method further comprises: submitting the generated Spark stream calculation program to a big data cluster for processing, collecting the running status of the Spark stream calculation program in the big data cluster, adjusting an execution plan according to the running status of the Spark stream calculation program, and adjusting the distribution and data processing volume of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.