CN110008242A - Spark Streaming-based program generator and program data processing method - Google Patents

Info

Publication number
CN110008242A
Authority
CN
China
Prior art keywords
spark
program
module
information
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910186601.9A
Other languages
Chinese (zh)
Inventor
郑文康
冯达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamei Zhilian Data Technology Co ltd
Original Assignee
Guangzhou Yamei Information Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yamei Information Science & Technology Co Ltd
Priority to CN201910186601.9A
Publication of CN110008242A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention discloses a Spark Streaming-based program generator and a program data processing method. The program generator comprises a Spark program initialization module and a processing module. The Spark program initialization module obtains configuration information entered through a Web-based Spark information configuration page and, after the configuration information passes validation, generates a configuration file from it. The processing module generates a Spark stream computing program from the configuration file produced by the Spark program initialization module. The technical solution provided by the invention can lower the barrier to entry for business development, reduce development cost, and improve project processing efficiency.

Description

Spark Streaming-based program generator and program data processing method
Technical field
The present invention relates to the technical field of computer big data, and in particular to a Spark Streaming-based program generator and a program data processing method.
Background art
As technologies such as the Internet of Things, social networks, and cloud computing become part of everyday life, and as computing power, storage capacity, and network bandwidth develop rapidly, the data accumulated by humanity keeps growing in the Internet, communications, finance, commerce, healthcare, and many other fields. With the Internet serving as a platform for propagating and regenerating information, phenomena such as "information overload" and "data explosion" have appeared, and the sheer volume of data makes it hard for people to make choices quickly.
To cope with large data volumes and strict real-time requirements, the prior art introduces Spark. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Mainstream servers today typically have hundreds of GB or even several TB of memory; this growth in memory made in-memory databases feasible, and Spark is likewise designed to exploit this computing resource. Spark Streaming is the Spark module for processing streaming data. It is an extension of the Spark core API (Application Programming Interface) that provides high-throughput, fault-tolerant processing of real-time data streams. It supports ingesting data from multiple data sources, and once the data has been ingested, complex algorithms can be expressed with various high-level functions.
There are many prior-art examples of applying Spark Streaming to real-time big data analysis, but current use of the Spark Streaming framework remains at the level of simple usage. For example, it stops at calling framework functions without a deep understanding of how the functions work internally or what each parameter means, and the functions are not wrapped in code into a common platform that non-developers such as business personnel can use. Large volumes of data can be processed with Spark Streaming, but different developers must write different Spark Streaming programs for different real-time computing tasks. Because developers within an enterprise differ in coding ability, understanding, and experience, the Spark Streaming programs they write may fail at run time or lose data. As a result, the prior art sometimes requires secondary development on top of the original technology to meet actual production needs, for example providing the Spark information configuration by modifying code or files.
The prior-art approach therefore requires developers to perform secondary development of Spark real-time computing programs, and business personnel can process data only by relying on developers, which makes project development costly and project processing inefficient.
Summary of the invention
In view of this, an object of the present invention is to provide a Spark Streaming-based program generator and a program data processing method that can lower the barrier to entry for business development, reduce development cost, and improve project processing efficiency.
According to one aspect of the present invention, a Spark Streaming-based program generator is provided, comprising a Spark program initialization module and a processing module, wherein:
The Spark program initialization module is configured to obtain configuration information entered through a Web-based Spark information configuration page and, after the configuration information passes validation, to generate a configuration file from the configuration information;
The processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module.
Preferably, the Spark program initialization module comprises:
An acquisition module, configured to obtain the configuration information entered through the Web-based Spark information configuration page;
A check module, configured to validate the configuration information obtained by the acquisition module;
An execution module, configured to return an error prompt to the Web page after validation by the check module fails, and to generate a configuration file from the configuration information after validation by the check module succeeds.
Preferably, the Spark program initialization module further comprises:
An authentication module, configured to perform permission verification on a designated account when the acquisition module obtains configuration information entered through that account;
The execution module generates the configuration file from the configuration information after permission verification by the authentication module succeeds.
Preferably, the processing module comprises:
A Spark configuration file acquisition module, configured to obtain the configuration file of the Spark program initialization module;
A Spark program generation module, configured to generate a Spark stream computing program from the configuration file obtained by the Spark configuration file acquisition module;
A Spark program submission module, configured to submit the Spark stream computing program generated by the Spark program generation module to a big data cluster for processing.
Preferably, the Spark program submission module comprises:
A Spark program sending module, configured to send the Spark stream computing program generated by the Spark program generation module to the big data cluster;
A Spark program run optimization module, configured to collect the running status of the Spark stream computing program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.
Preferably, the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following:
Process data information, table metadata information, select-type field information, group-by-type field information, sum-type field information, and count-type field information.
According to another aspect of the present invention, a program data processing method is provided, comprising:
Obtaining, by the Spark program initialization module of a program generator, configuration information entered through a Web-based Spark information configuration page, and generating a configuration file from the configuration information after the configuration information passes validation;
Obtaining, by the processing module of the program generator, the configuration file from the Spark program initialization module, and generating a Spark stream computing program from the configuration file.
Preferably, obtaining the configuration information entered through the Web-based Spark information configuration page and generating a configuration file from the configuration information after the configuration information passes validation comprises:
Obtaining the configuration information entered through the Web-based Spark information configuration page;
Validating the obtained configuration information;
Returning an error prompt to the Web page after validation fails, and generating a configuration file from the configuration information after validation succeeds.
Preferably, the method further comprises:
Performing permission verification on a designated account when configuration information entered through that account is obtained;
Generating the configuration file from the configuration information after the permission verification succeeds.
Preferably, after generating the Spark stream computing program from the configuration file, the method further comprises: submitting the generated Spark stream computing program to a big data cluster for processing; or,
After generating the Spark stream computing program from the configuration file, the method further comprises: submitting the generated Spark stream computing program to a big data cluster for processing, collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.
As can be seen from the above, the solution provided by the embodiments of the present invention is a Spark Streaming-based program generator comprising a Spark program initialization module and a processing module. The Spark program initialization module is configured to obtain configuration information entered through a Web (World Wide Web) based Spark information configuration page and, after the configuration information passes validation, to generate a configuration file from it; the processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module. The present invention thus provides the Spark information configuration page as a Web service and can validate the configuration information, so that all kinds of information can be configured conveniently. Business personnel can process data according to their own needs and no longer have to rely on real-time computing developers to do so, which lowers the barrier to entry for business development and improves project processing efficiency. In addition, because a single general-purpose Spark Streaming-based program generator is used, different developers no longer need to write different Spark Streaming programs, and developers no longer need to provide the Spark information configuration by modifying code or files, which reduces project development cost.
Further, the present invention can validate the obtained configuration information, return an error prompt to the Web page after validation fails, and generate a configuration file from the configuration information after validation succeeds.
Further, when configuration information entered through a designated account is obtained, the present invention can perform permission verification on that account and generate the configuration file from the configuration information after the permission verification succeeds.
Further, the present invention can send the generated Spark stream computing program to a big data cluster. It can also optimize further, for example by collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.
Brief description of the drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention;
Fig. 2 is another schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention;
Fig. 3 is a flow diagram of a program data processing method according to an embodiment of the present invention;
Fig. 4 is another flow diagram of a program data processing method according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The present invention provides a Spark Streaming-based program generator that can lower the barrier to entry for business development, reduce development cost, and improve project processing efficiency.
The technical solutions of the embodiments of the present invention are described in detail below with reference to the drawings.
Fig. 1 is a schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention.
Referring to Fig. 1, a Spark Streaming-based program generator of the present invention comprises a Spark program initialization module 10 and a processing module 20, wherein:
The Spark program initialization module 10 is configured to obtain configuration information entered through a Web-based Spark information configuration page and, after the configuration information passes validation, to generate a configuration file from the configuration information;
The processing module 20 is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module 10.
The configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following: process data information, table metadata information, select-type field information, group-by-type field information, sum-type field information, and count-type field information.
The configuration information further includes at least one of the following: the name of the Spark stream computing program, the window duration, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL (Structured Query Language) information.
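As an illustration of how the listed configuration items might be held in memory and written out as a configuration file, the following is a minimal sketch. The patent does not define a schema or a file format, so the class name StreamJobConfig, every field name, and the choice of JSON are assumptions made only for this example.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class StreamJobConfig:
    # All field names are hypothetical; the patent lists the items
    # but does not define a schema for them.
    program_name: str
    source_topic: str
    target_topic: str
    sql: str
    need_window: bool = False
    window_seconds: int = 0
    need_sliding_window: bool = False
    slide_seconds: int = 0
    select_fields: List[str] = field(default_factory=list)
    group_by_fields: List[str] = field(default_factory=list)
    sum_fields: List[str] = field(default_factory=list)
    count_fields: List[str] = field(default_factory=list)


def to_config_file_text(config: StreamJobConfig) -> str:
    """Serialize the configuration as JSON text for a configuration file."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


config = StreamJobConfig(
    program_name="order_totals",
    source_topic="orders",
    target_topic="order_totals",
    sql="SELECT shop_id, SUM(amount) AS total FROM source GROUP BY shop_id",
    need_window=True,
    window_seconds=60,
    group_by_fields=["shop_id"],
    sum_fields=["amount"],
)
config_text = to_config_file_text(config)
```

Keeping the configuration in one typed record means that the later modules (generation, submission) can read the same structure back from the file instead of parsing the Web form again.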
As can be seen from this embodiment, the embodiment of the present invention provides a Spark Streaming-based program generator comprising a Spark program initialization module and a processing module. The Spark program initialization module is configured to obtain configuration information entered through a Web-based Spark information configuration page and, after the configuration information passes validation, to generate a configuration file from it; the processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module. The present invention thus provides the Spark information configuration page as a Web service and can validate the configuration information, so that all kinds of information can be configured conveniently. Business personnel can process data according to their own needs and no longer have to rely on real-time computing developers to do so, which lowers the barrier to entry for business development and improves project processing efficiency. In addition, because a single general-purpose Spark Streaming-based program generator is used, different developers no longer need to write different Spark Streaming programs, and developers no longer need to provide the Spark information configuration by modifying code or files, which reduces project development cost.
Fig. 2 is another schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention. Compared with Fig. 1, Fig. 2 describes the structural framework of the Spark Streaming-based program generator in more detail.
The present invention provides a general-purpose Spark Streaming-based program generator (which may also be called a real-time computing generator) that can perform real-time computation for a wide range of real-time processing logic and data types. The generator simplifies the real-time computing model and lowers the barrier to entry for business development, and the real-time computation is low-latency, high-performance, distributed, scalable, and fault-tolerant.
The solution of the present invention provides the Spark information configuration page as a Web service and can validate the configuration information, so that all kinds of information can be configured conveniently and business personnel can process data according to their own needs. At present, enterprises find it hard to recruit professional data mining personnel, and data mining personnel also have to learn various related technologies such as Hadoop (a distributed system infrastructure developed by the Apache Foundation) and Spark. They then have to combine these open-source components into a single solution, which is itself difficult. Business personnel, meanwhile, can process data only by relying on real-time computing developers. With the solution of the present invention, real-time stream computing programs can be generated more conveniently, which lowers the knowledge threshold and the threshold for data developers, lowers the barrier to entry for business development, reduces development cost, and improves development efficiency.
Referring to Fig. 2, the Spark Streaming-based program generator provided by the present invention mainly comprises a Spark program initialization module 10 and a processing module 20.
The Spark program initialization module 10 may further comprise an acquisition module 101, a check module 102, an execution module 103, and an authentication module 104.
The processing module 20 may further comprise a Spark configuration file acquisition module 201, a Spark program generation module 202, and a Spark program submission module 203; the Spark program submission module 203 may in turn comprise a Spark program sending module 2031 and a Spark program run optimization module 2032.
The acquisition module 101 is configured to obtain the configuration information entered through the Web-based Spark information configuration page.
The check module 102 is configured to validate the configuration information obtained by the acquisition module 101.
The execution module 103 is configured to return an error prompt to the Web page after validation by the check module 102 fails, and to generate a configuration file from the configuration information after validation by the check module 102 succeeds.
The authentication module 104 is configured to perform permission verification on a designated account when the acquisition module 101 obtains configuration information entered through that account; the execution module 103 generates the configuration file from the configuration information after permission verification by the authentication module 104 succeeds.
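The interplay of the check, authentication, and execution modules can be sketched as follows. This is only an illustration under stated assumptions: the patent does not say which validation rules are applied or how permissions are stored, so REQUIRED_KEYS, the window rule, the allow-list, and all function names are hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

# Hypothetical rule set; the patent does not list the validation rules.
REQUIRED_KEYS = ("program_name", "source_topic", "target_topic", "sql")


def validate(config: Dict[str, Any]) -> List[str]:
    """Check module: return error messages; an empty list means valid."""
    errors = [f"missing required item: {key}"
              for key in REQUIRED_KEYS if not config.get(key)]
    if config.get("need_window") and int(config.get("window_seconds", 0)) <= 0:
        errors.append("window_seconds must be positive when a window is needed")
    return errors


def is_authorized(account: str, allowed_accounts: Set[str]) -> bool:
    """Authentication module: an allow-list stands in for permission verification."""
    return account in allowed_accounts


def handle_submission(account: str, config: Dict[str, Any],
                      allowed_accounts: Set[str]) -> Tuple[bool, Any]:
    """Execution module: (True, config to write) or (False, error prompts)."""
    if not is_authorized(account, allowed_accounts):
        return False, ["account is not permitted to configure Spark programs"]
    errors = validate(config)
    if errors:
        return False, errors  # returned to the Web page as error prompts
    return True, dict(config)  # content of the configuration file
```

Returning all validation errors at once, rather than stopping at the first, lets the Web page show every problem in a single round trip.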
Specifically, the configuration information that the Spark program initialization module 10 obtains from the Web-based Spark information configuration page may include at least one of the following: process data information, table metadata information, select-type field information, group-by-type field information, sum-type field information, and count-type field information.
That is, through the Web-based Spark information configuration page, the Spark program initialization module 10 can be used to configure, for example, process data information, table metadata information, select (select-type field) information, groupby (group-by-type field) information, sum (sum-type field) information, count (count-type field) information, where (field filter operation) information, multi-table Join (selecting the table names to be joined) information, and sort (fields to sort by) information. In this way, business personnel can simply and conveniently enter the various configuration items by hand in the Web-based Spark information configuration page according to their own business needs.
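To make the role of the select, groupby, sum, count, where, and sort items concrete, the sketch below assembles them into one SQL statement. The patent does not describe how the items are combined, so the function build_sql, its parameters, and the generated column aliases are assumptions; the multi-table Join item is left out for brevity.

```python
from typing import List, Optional


def build_sql(table: str,
              select: List[str],
              group_by: Optional[List[str]] = None,
              sum_fields: Optional[List[str]] = None,
              count_fields: Optional[List[str]] = None,
              where: Optional[str] = None,
              sort: Optional[List[str]] = None) -> str:
    """Assemble a single SQL query from the configured field lists."""
    columns = list(select)
    columns += [f"SUM({name}) AS sum_{name}" for name in (sum_fields or [])]
    columns += [f"COUNT({name}) AS count_{name}" for name in (count_fields or [])]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    if sort:
        sql += f" ORDER BY {', '.join(sort)}"
    return sql


example_sql = build_sql(
    "orders",
    select=["shop_id"],
    group_by=["shop_id"],
    sum_fields=["amount"],
    count_fields=["order_id"],
    where="amount > 0",
    sort=["shop_id"],
)
```

The sketch does no quoting or escaping, so a real generator would need to check every field and table name against the configured table metadata before building the statement.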
The configuration information may further include at least one of the following: the name of the Spark stream computing program, the window duration, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL information.
The Spark configuration file acquisition module 201 is configured to obtain the configuration file of the Spark program initialization module 10.
The Spark program generation module 202 is configured to generate a Spark stream computing program from the configuration file obtained by the Spark configuration file acquisition module 201. The Spark program generation module 202 may generate the Spark stream computing program after the generate key in the Spark information configuration page is operated.
That is, the present invention can generate a Spark stream computing program with one click. For example, pressing the generate key in the Web-based Spark information configuration page triggers a background program; after the Spark program generation module 202 detects that the generate key has been operated, it generates the Spark stream computing program. One-click generation is thus achieved, which is more convenient for business personnel and greatly lowers the barrier to entry for business development.
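One way to picture the generation step is template filling: the generator substitutes configured values into a fixed program skeleton. The sketch below is an assumption-laden illustration, not the patent's method. It assumes Kafka as the data source and target (the patent only mentions a data source topic and a target topic), it emits a PySpark Structured Streaming job rather than a DStream-based Spark Streaming job, and the configuration keys brokers and checkpoint_dir are hypothetical. Running the generated job would also require the Spark Kafka connector package on the cluster.

```python
from string import Template
from typing import Dict

# Skeleton of the generated job; every $name is filled from the configuration.
PROGRAM_TEMPLATE = Template('''\
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName($app_name).getOrCreate()

source = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", $brokers)
          .option("subscribe", $source_topic)
          .load())
source.selectExpr("CAST(value AS STRING) AS value").createOrReplaceTempView("source")

result = spark.sql($sql)

query = (result.selectExpr("to_json(struct(*)) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", $brokers)
         .option("topic", $target_topic)
         .option("checkpointLocation", $checkpoint)
         .outputMode("update")
         .start())
query.awaitTermination()
''')


def generate_program(config: Dict[str, str]) -> str:
    """Render the job source; repr() turns each value into a quoted literal."""
    return PROGRAM_TEMPLATE.substitute(
        app_name=repr(config["program_name"]),
        brokers=repr(config["brokers"]),
        source_topic=repr(config["source_topic"]),
        target_topic=repr(config["target_topic"]),
        sql=repr(config["sql"]),
        checkpoint=repr(config["checkpoint_dir"]),
    )


program_text = generate_program({
    "program_name": "order_totals",
    "brokers": "broker1:9092",
    "source_topic": "orders",
    "target_topic": "order_totals",
    "sql": "SELECT value FROM source",
    "checkpoint_dir": "/tmp/checkpoints/order_totals",
})
```

Because the values are inserted with repr(), SQL text containing quotes still yields syntactically valid Python in the generated file.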
The Spark program submission module 203 is configured to submit the Spark stream computing program generated by the Spark program generation module 202 to a big data cluster for processing. Here, the big data cluster refers to the YARN resource management system.
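Submission to YARN is commonly done through the spark-submit command-line tool. The sketch below only builds the command as an argument list; the function name, the default executor count, and the choice of cluster deploy mode are assumptions, since the patent does not say how the submission is performed.

```python
from typing import List


def build_submit_command(program_path: str, program_name: str,
                         num_executors: int = 2) -> List[str]:
    """Build a spark-submit invocation that targets a YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", program_name,
        "--num-executors", str(num_executors),
        program_path,
    ]


command = build_submit_command("generated_job.py", "order_totals")
```

On a machine with Spark installed, the list could be passed to subprocess.run(command, check=True); building it as a list rather than a shell string avoids quoting problems with program names.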
The Spark program sending module 2031 in the Spark program submission module 203 is configured to send the Spark stream computing program generated by the Spark program generation module 202 to the big data cluster.
The Spark program run optimization module 2032 in the Spark program submission module 203 is configured to collect the running status of the Spark stream computing program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.
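The patent does not specify what running status is collected or how the execution plan is adjusted. Purely as an illustration, the sketch below treats the plan as two numbers (how many executors the program is spread over, and how many records it may process per batch) and adjusts them by comparing the average batch processing time with the batch interval. The ExecutionPlan class, the adjust_plan function, and all thresholds are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExecutionPlan:
    num_executors: int          # how widely the program is spread over the cluster
    max_records_per_batch: int  # cap on the data processed in each batch


def adjust_plan(plan: ExecutionPlan, batch_interval_s: float,
                avg_processing_s: float, max_executors: int = 16) -> ExecutionPlan:
    """Return an adjusted plan based on observed batch processing time."""
    if avg_processing_s > batch_interval_s:
        # The job is falling behind: spread it wider first, then cap the input.
        if plan.num_executors < max_executors:
            wider = min(max_executors, plan.num_executors * 2)
            return replace(plan, num_executors=wider)
        smaller = max(1, plan.max_records_per_batch // 2)
        return replace(plan, max_records_per_batch=smaller)
    if avg_processing_s < 0.5 * batch_interval_s and plan.num_executors > 1:
        # Plenty of headroom: release one executor.
        return replace(plan, num_executors=plan.num_executors - 1)
    return plan
```

Spark itself offers related built-in mechanisms, such as dynamic resource allocation and backpressure for Spark Streaming, which a real optimization module might configure instead of applying its own rule.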
The Spark Streaming-based program generator of the embodiments of the present invention has been described in detail above. The program data processing method that uses this program generator is described next.
Fig. 3 is a flow diagram of a program data processing method according to an embodiment of the present invention.
Referring to Fig. 3, the method of the present invention comprises:
In step 301, the Spark program initialization module of the program generator obtains configuration information entered through the Web-based Spark information configuration page and, after the configuration information passes validation, generates a configuration file from the configuration information.
This step may comprise:
Obtaining the configuration information entered through the Web-based Spark information configuration page;
Validating the obtained configuration information;
Returning an error prompt to the Web page after validation fails, and generating a configuration file from the configuration information after validation succeeds.
It should be noted that, when configuration information entered through a designated account is obtained, the present invention can also perform permission verification on that account and generate the configuration file from the configuration information after the permission verification succeeds.
In this step, the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following: process data information, table metadata information, select-type field information, group-by-type field information, sum-type field information, and count-type field information.
The configuration information may further include at least one of the following: the name of the Spark stream computing program, the window duration, whether a window is needed, whether a sliding window is needed, the sliding window interval, the data source topic, the target topic, and the processing-logic SQL information.
The present invention provides the Spark information configuration page as a Web service, so that all kinds of information can be configured conveniently and business personnel can process data according to their own needs.
In step 302, the processing module of the program generator obtains the configuration file from the Spark program initialization module and generates a Spark stream computing program from the configuration file.
In this step, the Spark configuration file acquisition module in the processing module may obtain the configuration file of the Spark program initialization module, and the Spark program generation module in the processing module may generate the Spark stream computing program from the configuration file obtained by the Spark configuration file acquisition module.
It should be noted that, after generating the Spark stream computing program from the configuration file, the method may further comprise: submitting the generated Spark stream computing program to a big data cluster for processing. Alternatively, after generating the Spark stream computing program from the configuration file, the method may further comprise: submitting the generated Spark stream computing program to a big data cluster for processing, collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream computing program in the big data cluster according to the adjusted execution plan.
From the embodiment it can be found that the present invention is to provide the Spark information configuration page by way of Web service and can To carry out configuration information verification, the configuration of various information can be easily carried out, preferably facilitates business personnel according to certainly Oneself demand carries out data processing, and business personnel no longer can only just can be carried out data processing by the developer calculated in real time, So as to reduce business development access difficulty, project treatment effeciency is improved;In addition unify using exploitation it is general based on Spark streaming program generator does not need different developers and writes different Spark Streaming programs, do not need yet Developer provides Spark information configuration by modification code or file, so as to reduce project development cost.
Fig. 4 is another schematic flowchart of a program data processing method according to an embodiment of the present invention. Compared with Fig. 3, Fig. 4 describes the program data processing method of the present invention in more detail.
Referring to Fig. 4, the method of the present invention includes:
In step 401, the configuration information input from the Web-based Spark information configuration page is obtained, and after the configuration information verification and the permission verification of the designated account both pass, a configuration file is generated according to the configuration information.
In this step, the acquisition module in the Spark program initialization module obtains the configuration information input from the Web-based Spark information configuration page; the verification module in the Spark program initialization module verifies the configuration information obtained by the acquisition module; the authentication module in the Spark program initialization module performs permission verification on the designated account; and the execution module in the Spark program initialization module generates the configuration file according to the configuration information after the verification by the verification module and the permission verification by the authentication module have both succeeded. In addition, after verification by the verification module fails, an error prompt may be returned to the Web page.
In this step, the designated account can input configuration information on the Web-based Spark information configuration page, for example: the processing data information, the table metadata information, select (select-type field) information, groupby (group-type field) information, sum (sum-type field) information, count (count-type field) information, where (field filtering operation) information, multi-table Join (the names of the tables to be joined) information, and sort (the fields to be sorted) information.
In addition, the following configuration information can be input: the name of the Spark real-time computation (i.e. the name of the Spark stream calculation program), the window time size, whether a window is needed, whether a sliding window is needed, the sliding window time, the data source topic, the target topic, and the processing logic sql information. The data source topic may be the Kafka topic that is consumed, and the target topic may be the Kafka topic that is produced to. Kafka is an open-source stream processing platform developed by the Apache Software Foundation. Consuming and producing are opposite actions: consuming means reading data from a Kafka topic, and producing means writing the processed data into a Kafka topic.
In this step, the various items of input configuration information obtained above can be verified.
For example, the above configuration information may be stored as Json (JavaScript Object Notation, a lightweight data interchange format), and parsed and verified with Gson (a Java class library provided by Google for mapping between Java objects and Json data, which can convert a Json string into a Java object, or a Java object into a Json string).
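By way of non-limiting illustration, the parse-then-verify pattern described above can be sketched as follows. The description names Gson on the Java side; this sketch substitutes Python's standard json module, and the key names (name, sourceTopic, targetTopic, sql) are assumptions chosen for the example, since the description lists the kinds of information but not the keys.

```python
import json

# Hypothetical key names; the description does not specify them.
REQUIRED_KEYS = {"name", "sourceTopic", "targetTopic", "sql"}


def parse_config(raw: str) -> dict:
    """Parse a Json configuration string and check that required keys exist."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"configuration is not valid Json: {exc}") from exc
    if not isinstance(config, dict):
        raise ValueError("configuration must be a Json object")
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing configuration keys: {sorted(missing)}")
    return config


example = (
    '{"name": "order_stats", "sourceTopic": "orders", '
    '"targetTopic": "order_stats_out", '
    '"sql": "select city, count(*) from orders group by city"}'
)
config = parse_config(example)
```

A parse failure or a missing key raises ValueError, which a Web layer could translate into the error prompt returned to the page.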
The configuration information verification process of the present invention may include, but is not limited to, the following:
The value of the name attribute of the Spark program is read, and a database (for example a mysql database) is queried to determine whether that name already exists. If it exists, the name is judged illegal (verification fails) and an error prompt is returned to the Web page, that is, the Web page is told that the user's input is wrong; if it does not exist, the name is judged legal (verification passes).
When the name of the Spark program does not exist and is judged legal, it is further determined whether the field names in configuration items such as the data source topic, target topic, select, groupby, sum, count, where and sort exist. If they exist, they are judged legal; if they do not exist, they are judged illegal.
When the field names in the above configuration information exist and are judged legal, the sliding window attribute value and the window time attribute value are read, and the sliding time is checked against the window time (the sliding time should be less than the window time) and against a set threshold, for example 1 hour. If all checks are satisfied, the final verification is judged to pass and the data is stored in the database; an information configuration file in excel format can also be generated for business personnel to review.
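By way of non-limiting illustration, the three checks above can be sketched as one function that stops at the first failing stage. The function and parameter names are hypothetical, the database lookups are replaced by in-memory collections, and the threshold is treated as an upper bound on both times, which is an interpretation based on the later statement that the window time and sliding time must not exceed the threshold.

```python
ONE_HOUR_SECONDS = 3600  # the example threshold given in the description


def validate_config(name, existing_names, referenced_fields, known_fields,
                    window_seconds, slide_seconds,
                    threshold_seconds=ONE_HOUR_SECONDS):
    """Return a list of error messages; an empty list means verification passed."""
    # Check 1: the program name must not already exist.
    if name in existing_names:
        return [f"program name already exists: {name}"]
    # Check 2: every referenced field must be a known field.
    unknown = sorted(set(referenced_fields) - set(known_fields))
    if unknown:
        return [f"unknown fields: {unknown}"]
    # Check 3: window and sliding-window constraints.
    errors = []
    if slide_seconds >= window_seconds:
        errors.append("sliding time must be less than window time")
    # Assumption: the threshold is an upper bound on both values.
    if window_seconds > threshold_seconds or slide_seconds > threshold_seconds:
        errors.append("window or sliding time exceeds the threshold")
    return errors
```

For instance, `validate_config("new_job", {"old_job"}, ["city"], ["city", "amount"], 600, 60)` returns an empty list, while reusing the name "old_job" returns a single name error without running the later checks.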
It should be noted that the present invention can also perform permission verification on the designated account when obtaining the configuration information input through that account. For example, different login accounts are set up for different personnel, and different operation permissions are assigned to different accounts as needed. Therefore, when configuration information is input through a designated login account, the database can be queried to verify the account's permissions: if the database query finds the corresponding account and the operation falls within that account's operation permissions, verification is considered to pass. In summary, the Spark program initialization module obtains the various items of input configuration information described above and performs initialization. After initialization is complete, the configuration file is generated and saved on the linux system. The generated configuration file may be a text-format configuration file, and its content may take, without limitation, the following form: configuration name=configuration content, with entries separated by commas.
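By way of non-limiting illustration, the text format described above (configuration name=configuration content, separated by commas) can be written and read back as follows. The function names are hypothetical. The description does not say how a value that itself contains a comma (such as a sql statement) is escaped, so this sketch simply rejects such values rather than guessing at an escaping scheme.

```python
def dump_config(items: dict) -> str:
    """Serialize items as 'name=content' entries separated by commas."""
    entries = []
    for key, value in items.items():
        text = str(value)
        # No escaping scheme is described, so ambiguous entries are rejected.
        if "," in key or "=" in key or "," in text:
            raise ValueError(f"cannot serialize entry without escaping: {key}")
        entries.append(f"{key}={text}")
    return ",".join(entries)


def load_config(text: str) -> dict:
    """Parse 'name=content' entries separated by commas into a dict of strings."""
    result = {}
    if not text.strip():
        return result
    for entry in text.split(","):
        key, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"entry has no '=': {entry!r}")
        result[key.strip()] = value.strip()
    return result
```

Note that values come back as strings, so a numeric setting such as a window size has to be converted by the caller.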
In step 402, the configuration file is obtained, and the processing logic sql in the configuration information of the configuration file is checked.
The present invention can obtain the configuration file of the Spark program initialization module through the Spark configuration file acquisition module in the processing module, and can query the database for the field information of the data source topic in the configuration information; it then determines whether the fields in the processing logic sql information are correct with respect to the field information of the data source topic. It should be noted that the fields in the processing logic sql must be present in the field information of the data source topic, but the fields of the data source topic need not all appear in the processing logic sql.
If the field information stored in the database does not match the names of the fields in the processing logic sql information, the fields in the processing logic sql information are judged incorrect and the user is told that the input is wrong. If the field information stored in the database matches the names of the fields in the processing logic sql information, the fields are judged correct, and it is then further determined whether the syntax of the processing logic sql is wrong.
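By way of non-limiting illustration, the field check above is a one-directional containment test: every field used in the processing logic sql must be a field of the data source topic, but not the reverse. The function name is hypothetical, and the field lists are assumed to have been extracted already (from the sql and from the database respectively).

```python
def unknown_sql_fields(sql_fields, topic_fields):
    """Return the sql fields that are not fields of the data source topic."""
    known = set(topic_fields)
    # Preserve the order in which the fields appear in the sql.
    return [field for field in sql_fields if field not in known]


topic_fields = ["order_id", "city", "amount", "created_at"]
ok = unknown_sql_fields(["city", "amount"], topic_fields)
bad = unknown_sql_fields(["city", "price"], topic_fields)
```

An empty result means the fields are judged correct and the syntax check can proceed; a non-empty result lists the names to report back to the user.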
The determination of whether the syntax of the processing logic sql is wrong includes the following: a field or attribute that appears in the select clause but is not inside an aggregate function must be placed in the groupby clause; if it is not placed in the groupby clause, a syntax error is reported. Conversely, a field or attribute that does not appear in the groupby clause may appear only inside an aggregate function; if it does not appear inside an aggregate function, a syntax error is reported.
If the processing logic sql has no syntax error, the processing logic sql is considered to have no problems (that is, it has neither field errors nor syntax errors), and it is further determined whether the window time and the sliding time exceed a threshold. The sliding time in the configuration information generally needs to be less than the window time, and the threshold for the sliding time and the window time may be, without limitation, 1 hour.
Finally, the sql is optimized according to set rules based on, for example, the sql syntax check result, the sizes of the joined tables, and the data volume per second, and the resulting optimized sql statement is stored in the database.
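By way of non-limiting illustration, the select/groupby rule above can be checked on an already-parsed select list. The representation (a list of field and aggregate pairs) and the function name are assumptions for the example; parsing the sql text itself is out of scope. The sketch also assumes the rule applies only when the query has an aggregate function or a groupby clause, since a plain select with neither is ordinarily valid.

```python
from typing import List, Optional, Tuple

# (field name, aggregate function name or None if the field is not aggregated)
SelectItem = Tuple[str, Optional[str]]


def check_group_by(select_items: List[SelectItem], group_by: List[str]) -> List[str]:
    """Return syntax errors for non-aggregated select fields missing from groupby."""
    grouped = set(group_by)
    has_aggregate = any(agg is not None for _, agg in select_items)
    # Assumption: without any aggregate or groupby clause, the rule does not apply.
    if not has_aggregate and not grouped:
        return []
    errors = []
    for field, agg in select_items:
        if agg is None and field not in grouped:
            errors.append(f"field '{field}' must be in groupby or in an aggregate function")
    return errors
```

For example, selecting `city` together with `sum(amount)` passes when `city` is in the groupby clause and yields one error when the groupby clause is empty.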
In step 403, the execution configuration file Shell script of the Spark stream calculation program is generated.
In this step, the Spark program generating module in the processing module can generate the execution configuration file Shell script of the Spark stream calculation program.
In this step, the following can be read from the database: the name of the corresponding Spark stream calculation program, the window time size, whether a window is needed, whether a sliding window is needed, the sliding window time, the data source topic, the target topic, and the generated sql statement. The sizes of driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of executor CPU cores) and memoryOverhead (off-heap memory) are calculated according to the table size and the data volume. Some fixed parameters, such as the timeout and the queue name, are also obtained, as are parameters such as the degree of parallelism, the memory fraction, and JVM (Java Virtual Machine) parameters. The execution configuration file Shell script of the corresponding Spark stream calculation program is then generated. The Shell script here refers to a script for the shell of the linux system; typically the resource configuration file is generated first and the Shell script is generated afterwards.
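By way of non-limiting illustration, a Shell script of this kind could be rendered from the computed parameters as follows. The description does not show the script, so the layout is an assumption: the sketch uses the standard spark-submit options for YARN (--master, --deploy-mode, --name, --queue, --class, --driver-memory, --executor-memory, --executor-cores) and the spark.executor.memoryOverhead property, and the function name, class name, and jar path are hypothetical.

```python
import shlex


def build_submit_script(name, main_class, jar_path, queue, driver_memory,
                        executor_memory, executor_cores, memory_overhead_mb,
                        app_args):
    """Render a one-command shell script that submits the program to YARN."""
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", name,
        "--queue", queue,
        "--class", main_class,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        # Property name as of Spark 2.3; older releases used
        # spark.yarn.executor.memoryOverhead.
        "--conf", f"spark.executor.memoryOverhead={memory_overhead_mb}",
        jar_path,
        # Program parameters are passed as one comma-separated argument.
        ",".join(app_args),
    ]
    return "#!/bin/sh\n" + " ".join(shlex.quote(a) for a in args) + "\n"


script = build_submit_script(
    name="order_stats", main_class="com.example.StreamJob",
    jar_path="/opt/jobs/stream-job.jar", queue="default",
    driver_memory="2g", executor_memory="4g", executor_cores=4,
    memory_overhead_mb=1024,
    app_args=["orders", "order_stats_out", "600", "60"],
)
```

Quoting every argument with shlex.quote keeps names or paths that contain spaces from breaking the generated command line.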
When the window time and the sliding time in the configuration information are determined not to exceed the threshold, the present invention can query the database to check whether the name of the current Spark stream calculation program is a duplicate.
If the name of the current Spark stream calculation program is not a duplicate, the driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of executor cores) and memoryOverhead (off-heap memory) required by the current Spark stream calculation program can be obtained with the formulas below.
The formulas used by the present invention to calculate the driver-memory, executor-memory, executor-cores and memoryOverhead required by the current Spark stream calculation program may be as follows:
executor-cores (number of executor cores) required by the Spark stream calculation program = (number of data source records per second / 20,000) * number of CPU cores required per 20,000 records per second
driver-memory (driver memory) required by the Spark stream calculation program = (number of data source records per second / 20,000) * driver memory required per 20,000 records per second
executor-memory (executor memory) required by the Spark stream calculation program = (number of data source records per second / 20,000) * executor memory required per 20,000 records per second
memoryOverhead (off-heap memory) required by the Spark stream calculation program = (number of data source records per second / 20,000) * off-heap memory required per 20,000 records per second
Here, the data source may refer to the Kafka being consumed.
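By way of non-limiting illustration, the four formulas share one scale factor (records per second divided by 20,000) and can be computed together. The per-20,000-record baselines are inputs that the description does not quantify, so the values in the usage line are made up for the example, and rounding fractional results up is an assumption of this sketch.

```python
import math

RECORDS_PER_UNIT = 20_000  # the unit used in the formulas above


def required_resources(records_per_second, cores_per_unit, driver_mb_per_unit,
                       executor_mb_per_unit, overhead_mb_per_unit):
    """Scale the per-20,000-records baselines by the observed input rate."""
    scale = records_per_second / RECORDS_PER_UNIT
    # Assumption: fractional results are rounded up so resources are not undersized.
    return {
        "executor_cores": math.ceil(scale * cores_per_unit),
        "driver_memory_mb": math.ceil(scale * driver_mb_per_unit),
        "executor_memory_mb": math.ceil(scale * executor_mb_per_unit),
        "memory_overhead_mb": math.ceil(scale * overhead_mb_per_unit),
    }


# Hypothetical baselines: 2 cores, 1024 MB driver, 2048 MB executor, 512 MB off-heap.
resources = required_resources(50_000, 2, 1024, 2048, 512)
```

With 50,000 records per second the scale factor is 2.5, so the example yields 5 cores, 2560 MB of driver memory, 5120 MB of executor memory, and 1280 MB of off-heap memory.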
In step 404, the execution configuration file Shell script of the Spark stream calculation program is executed to generate the Spark stream calculation program.
In this step, the Spark program generating module in the processing module can execute the execution configuration file Shell script of the Spark stream calculation program to generate the Spark stream calculation program.
In this step, the execution configuration file Shell script of the Spark stream calculation program can be executed remotely over the SSH2 (Secure Shell 2, version 2.0 of the Secure Shell protocol) protocol, passing the parameters into the Spark builder. The Spark builder receives the parameters, splits them on commas, passes them to the corresponding functions, and generates the corresponding Spark stream calculation program.
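By way of non-limiting illustration, the builder's handling of its comma-separated parameter string can be sketched as follows. The description does not give the parameter order or names, so PARAM_ORDER and the function name are hypothetical, and the remote SSH2 execution itself is not shown.

```python
# Hypothetical positional order of the parameters inside the single argument.
PARAM_ORDER = ["name", "source_topic", "target_topic",
               "window_seconds", "slide_seconds"]


def parse_builder_args(raw: str) -> dict:
    """Split the comma-separated argument and convert numeric parameters."""
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) != len(PARAM_ORDER):
        raise ValueError(
            f"expected {len(PARAM_ORDER)} parameters, got {len(parts)}")
    params = dict(zip(PARAM_ORDER, parts))
    params["window_seconds"] = int(params["window_seconds"])
    params["slide_seconds"] = int(params["slide_seconds"])
    return params


params = parse_builder_args("order_stats,orders,order_stats_out,600,60")
```

The resulting dictionary can then be handed to whichever functions build the windowing, source, and sink parts of the program.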
In step 405, the Spark stream calculation program is submitted to the big data cluster for processing.
In this step, the Spark program sending module in the Spark program submitting module can submit the Spark stream calculation program to the big data cluster for processing.
In this step, the Spark program sending module can submit the Spark stream calculation program to the big data cluster, where the big data cluster refers to the yarn resource management system.
In step 406, the running status of the Spark stream calculation program in the big data cluster is collected, the execution plan is adjusted according to that running status, and the distribution and data processing volume of the Spark stream calculation program in the big data cluster are adjusted according to the adjusted execution plan.
In this step, the Spark program running optimization module in the Spark program submitting module can collect the running status of the Spark stream calculation program in the big data cluster. For example, the execution plan is adjusted according to the data volume processed by reduce tasks, the data volume pulled by the shuffle (which transfers map output to the reducers as input), the data distribution and size in each execution stage, and the data volume of each partition. The distribution and data processing volume of the Spark stream calculation program in the big data cluster are then adjusted according to the adjusted execution plan, for example by handling data skew or adjusting the number of partitions, and the processed data is written back to the target topic.
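By way of non-limiting illustration, two of the adjustments mentioned above (adjusting the number of partitions and spotting data skew) can be derived from per-partition data volumes. The description does not give concrete rules, so the target size per partition, the skew factor, and the function names are all assumptions of this sketch.

```python
import math
from statistics import median


def suggest_partitions(partition_bytes, target_bytes_per_partition):
    """Suggest a partition count so each partition holds about the target size."""
    total = sum(partition_bytes)
    return max(1, math.ceil(total / target_bytes_per_partition))


def skewed_partitions(partition_bytes, factor=3.0):
    """Return indexes of partitions much larger than the median partition."""
    if not partition_bytes:
        return []
    typical = median(partition_bytes)
    if typical <= 0:
        return []
    return [i for i, size in enumerate(partition_bytes) if size > factor * typical]


sizes = [100, 110, 90, 1000]  # observed data volume per partition
suggested = suggest_partitions(sizes, target_bytes_per_partition=200)
skewed = skewed_partitions(sizes)
```

Using the median rather than the mean as the "typical" size keeps one very large partition from masking its own skew.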
In summary, the solution of the present invention has the following advantages:
The solution provided by the present invention overcomes the predicament in which a traditional company has to recruit real-time computing developers to develop Spark real-time computing programs, and business personnel can process data only through those developers. The present invention provides a Spark real-time computing generator (the Spark-streaming-based program generator), which makes it easier for business personnel to process data according to their own needs, while also reducing problems such as non-standard development by real-time computing developers and excessive or unreasonable resource requests at submission. It also greatly improves the development efficiency of the project and shortens the project schedule.
The technical solutions of the present invention have been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention may also be implemented as a computer program or computer program product that includes computer program code instructions for executing the steps defined in the above method of the present invention.
Alternatively, the present invention may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) storing executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to execute the steps of the above method according to the present invention.
Those skilled in the art will also understand that the various exemplary logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show the possible architecture, functions and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two consecutive boxes may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present invention have been described above. The above description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A Spark-streaming-based program generator, characterized in that:
it comprises a Spark program initialization module and a processing module; wherein,
the Spark program initialization module is configured to obtain configuration information input from a Web-based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file according to the configuration information;
the processing module is configured to generate a Spark stream calculation program according to the configuration file of the Spark program initialization module.
2. The Spark-streaming-based program generator according to claim 1, characterized in that the Spark program initialization module comprises:
an acquisition module, configured to obtain the configuration information input from the Web-based Spark information configuration page;
a verification module, configured to verify the configuration information obtained by the acquisition module;
an execution module, configured to return an error prompt to the Web page after verification by the verification module fails, and to generate a configuration file according to the configuration information after verification by the verification module succeeds.
3. The Spark-streaming-based program generator according to claim 2, characterized in that the Spark program initialization module further comprises:
an authentication module, configured to perform permission verification on a designated account when the acquisition module obtains the configuration information input through the designated account;
the execution module generates the configuration file according to the configuration information after the permission verification by the authentication module succeeds.
4. The Spark-streaming-based program generator according to claim 1, characterized in that the processing module comprises:
a Spark configuration file acquisition module, configured to obtain the configuration file of the Spark program initialization module;
a Spark program generating module, configured to generate the Spark stream calculation program according to the configuration file obtained by the Spark configuration file acquisition module;
a Spark program submitting module, configured to submit the Spark stream calculation program generated by the Spark program generating module to a big data cluster for processing.
5. The Spark-streaming-based program generator according to claim 4, characterized in that the Spark program submitting module comprises:
a Spark program sending module, configured to send the Spark stream calculation program generated by the Spark program generating module to the big data cluster;
a Spark program running optimization module, configured to collect the running status of the Spark stream calculation program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution and data processing volume of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.
6. The Spark-streaming-based program generator according to any one of claims 1 to 5, characterized in that:
the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following:
processing data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.
7. A program data processing method, characterized by comprising:
obtaining, by a Spark program initialization module of a program generator, configuration information input from a Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file according to the configuration information;
obtaining, by a processing module of the program generator, the configuration file from the Spark program initialization module, and generating a Spark stream calculation program according to the configuration file.
8. The method according to claim 7, characterized in that obtaining the configuration information input from the Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file according to the configuration information comprises:
obtaining the configuration information input from the Web-based Spark information configuration page;
verifying the obtained configuration information;
after verification fails, returning an error prompt to the Web page, and after verification succeeds, generating a configuration file according to the configuration information.
9. The method according to claim 8, characterized in that the method further comprises:
when obtaining the configuration information input through a designated account, performing permission verification on the designated account;
after the permission verification succeeds, generating the configuration file according to the configuration information.
10. The method according to any one of claims 7 to 9, characterized in that:
after generating the Spark stream calculation program according to the configuration file, the method further comprises: submitting the generated Spark stream calculation program to a big data cluster for processing; or,
after generating the Spark stream calculation program according to the configuration file, the method further comprises: submitting the generated Spark stream calculation program to a big data cluster for processing, collecting the running status of the Spark stream calculation program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing volume of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.
CN201910186601.9A 2019-03-12 2019-03-12 One kind being based on Spark streaming program generator and program data processing method Pending CN110008242A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910186601.9A CN110008242A (en) 2019-03-12 2019-03-12 One kind being based on Spark streaming program generator and program data processing method

Publications (1)

Publication Number Publication Date
CN110008242A true CN110008242A (en) 2019-07-12

Family

ID=67166866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910186601.9A Pending CN110008242A (en) 2019-03-12 2019-03-12 One kind being based on Spark streaming program generator and program data processing method

Country Status (1)

Country Link
CN (1) CN110008242A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625269A (en) * 2020-05-14 2020-09-04 中电工业互联网有限公司 Web-based universal Spark task submission system and method
CN112612514A (en) * 2020-12-31 2021-04-06 青岛海尔科技有限公司 Program development method and device, storage medium and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407472A (en) * 2016-11-01 2017-02-15 广西电网有限责任公司电力科学研究院 Visual editing and management system for big data analysis and calculation task of order model
CN106777101A (en) * 2016-12-14 2017-05-31 深圳天源迪科信息技术股份有限公司 Data processing engine
CN108037919A (en) * 2017-12-01 2018-05-15 北京博宇通达科技有限公司 A kind of visualization big data workflow configuration method and system based on WEB

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625269A (en) * 2020-05-14 2020-09-04 中电工业互联网有限公司 Web-based universal Spark task submission system and method
CN112612514A (en) * 2020-12-31 2021-04-06 青岛海尔科技有限公司 Program development method and device, storage medium and electronic device
CN112612514B (en) * 2020-12-31 2023-11-28 青岛海尔科技有限公司 Program development method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20230420
Address after: Room 101, No. 227 Gaotang Road, Tianhe District, Guangzhou City, Guangdong Province, 510000 (location: Room 601) (office only)
Applicant after: Yamei Zhilian Data Technology Co.,Ltd.
Address before: 510000 self compiled h, Room 201, No. 1, Hanjing Road, Tianhe District, Guangzhou, Guangdong Province
Applicant before: GUANGZHOU YAME INFORMATION TECHNOLOGY Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20190712