CN110008242A - Spark Streaming-based program generator and program data processing method
- Publication number: CN110008242A (CN201910186601.9A)
- Authority: CN (China)
- Prior art keywords: spark, program, module, information, configuration
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G: PHYSICS
- G06: COMPUTING; CALCULATING OR COUNTING
- G06F: ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24: Querying
- G06F16/245: Query processing
  - G06F16/2455: Query execution
  - G06F16/2457: Query processing with adaptation to user needs
    - G06F16/24578: Query processing with adaptation to user needs using ranking
  - G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    - G06F16/2471: Distributed queries
Abstract
The present invention discloses a Spark Streaming-based program generator and a program data processing method. The program generator comprises a Spark program initialization module and a processing module. The Spark program initialization module obtains configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generates a configuration file from it. The processing module generates a Spark stream computing program from the configuration file of the Spark program initialization module. The technical solution provided by the invention can lower the barrier to onboarding business development, reduce development cost, and improve project processing efficiency.
Description
Technical field
The present invention relates to the field of computer big data technology, and in particular to a Spark Streaming-based program generator and a program data processing method.
Background
At present, as technologies such as the Internet of Things, social networks, and cloud computing become part of everyday life, and as computing power, storage space, and network bandwidth grow rapidly, the data accumulated by humankind keeps growing in many fields, including the Internet, communications, finance, commerce, and healthcare. As a platform on which information is propagated and regenerated, the Internet has given rise to phenomena such as "information overload" and "data explosion", and the sheer volume of data makes it difficult for people to make choices quickly.
To cope with large data volumes and strict real-time requirements, the prior art introduces Spark. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Mainstream servers today typically have hundreds of gigabytes or several terabytes of memory, which has made in-memory databases practical; Spark is likewise designed to exploit this memory resource. Spark Streaming is the Spark module for processing streaming data and is an extension of the Spark core API (Application Programming Interface). It provides high-throughput, fault-tolerant processing of real-time data streams and can obtain data from multiple data sources; once data has been obtained from a source, complex algorithms can be expressed with various high-level functions.
There are many examples in the prior art of applying Spark Streaming to real-time big data analysis, but current use of the Spark Streaming framework remains superficial. For example, users merely call framework functions without a deep understanding of how the functions work internally or what their parameters mean, and the functions have not been packaged in code into a common platform that non-developers such as business personnel can use. Large volumes of data can be processed with Spark Streaming, but different developers have to write different Spark Streaming programs for different real-time computing tasks. Because developers within an enterprise differ in coding ability, comprehension, and experience, the Spark Streaming programs they write may fail at run time or lose data. As a result, the prior art sometimes requires secondary development on top of the original technology to meet actual production needs, for example supplying the Spark configuration by modifying code or files.
Therefore, prior-art solutions require developers to carry out secondary development of Spark real-time computing programs, and business personnel can process data only by relying on developers, which makes project development cost high and project processing efficiency low.
Summary of the invention
In view of this, the object of the present invention is to provide a Spark Streaming-based program generator and a program data processing method that can lower the barrier to onboarding business development, reduce development cost, and improve project processing efficiency.
According to one aspect of the present invention, a Spark Streaming-based program generator is provided, comprising a Spark program initialization module and a processing module, wherein:
The Spark program initialization module is configured to obtain configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file from the configuration information;
The processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module.
Preferably, the Spark program initialization module comprises:
An obtaining module, configured to obtain the configuration information entered on the Web-based Spark information configuration page;
A verification module, configured to verify the configuration information obtained by the obtaining module;
An execution module, configured to return an error prompt to the Web page if verification by the verification module fails, and to generate a configuration file from the configuration information if verification by the verification module succeeds.
Preferably, the Spark program initialization module further comprises:
An authentication module, configured to verify the permissions of a designated account when the obtaining module obtains configuration information entered through that account;
The execution module generates the configuration file from the configuration information after the permission verification by the authentication module succeeds.
Preferably, the processing module comprises:
A Spark configuration file obtaining module, configured to obtain the configuration file of the Spark program initialization module;
A Spark program generation module, configured to generate a Spark stream computing program from the configuration file obtained by the Spark configuration file obtaining module;
A Spark program submission module, configured to submit the Spark stream computing program generated by the Spark program generation module to a big data cluster for processing.
Preferably, the Spark program submission module comprises:
A Spark program sending module, configured to send the Spark stream computing program generated by the Spark program generation module to the big data cluster;
A Spark program run optimization module, configured to collect the running status of the Spark stream computing program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution of the Spark stream computing program in the big data cluster and its data processing volume according to the adjusted execution plan.
Preferably, the configuration information entered on the Web-based Spark information configuration page and obtained by the Spark program initialization module includes at least one of the following:
Process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.
According to another aspect of the present invention, a program data processing method is provided, comprising:
Obtaining, through the Spark program initialization module of the program generator, configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file from the configuration information;
Obtaining, through the processing module of the program generator, the configuration file from the Spark program initialization module, and generating a Spark stream computing program from the configuration file.
Preferably, obtaining the configuration information entered on the Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file from the configuration information comprises:
Obtaining the configuration information entered on the Web-based Spark information configuration page;
Verifying the obtained configuration information;
Returning an error prompt to the Web page if verification fails, and generating a configuration file from the configuration information if verification succeeds.
Preferably, the method further comprises:
Verifying the permissions of a designated account when configuration information entered through that account is obtained;
Generating the configuration file from the configuration information after the permission verification succeeds.
Preferably, after the Spark stream computing program is generated from the configuration file, the method further comprises: submitting the generated Spark stream computing program to a big data cluster for processing; or,
After the Spark stream computing program is generated from the configuration file, the method further comprises: submitting the generated Spark stream computing program to a big data cluster for processing, collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution of the Spark stream computing program in the big data cluster and its data processing volume according to the adjusted execution plan.
From the above it can be seen that the solution provided by the embodiments of the present invention provides a Spark Streaming-based program generator comprising a Spark program initialization module and a processing module. The Spark program initialization module is configured to obtain configuration information entered on a Web (World Wide Web) based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file from it; the processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module. The invention thus provides the Spark information configuration page as a Web service and can verify the configuration information, so that all kinds of information can be configured conveniently and business personnel can process data according to their own needs rather than only through real-time computing developers; this lowers the barrier to onboarding business development and improves project processing efficiency. In addition, because a single general-purpose Spark Streaming-based program generator is used throughout, different developers no longer need to write different Spark Streaming programs, and developers no longer need to supply the Spark configuration by modifying code or files, which reduces project development cost.
Further, the present invention can verify the obtained configuration information, return an error prompt to the Web page if verification fails, and generate a configuration file from the configuration information if verification succeeds.
Further, the present invention can verify the permissions of a designated account when configuration information entered through that account is obtained, and generate the configuration file from the configuration information after the permission verification succeeds.
Further, the present invention can send the generated Spark stream computing program to a big data cluster. It can also optimize further, for example by collecting the running status of the Spark stream computing program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution of the Spark stream computing program in the big data cluster and its data processing volume according to the adjusted execution plan.
Brief description of the drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention;
Fig. 2 is another schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention;
Fig. 3 is a flow diagram of a program data processing method according to an embodiment of the present invention;
Fig. 4 is another flow diagram of a program data processing method according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The present invention provides a Spark Streaming-based program generator that can lower the barrier to onboarding business development, reduce development cost, and improve project processing efficiency.
The technical solutions of the embodiments of the present invention are described in detail below with reference to the drawings.
Fig. 1 is a schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention.
Referring to Fig. 1, a Spark Streaming-based program generator of the present invention comprises a Spark program initialization module 10 and a processing module 20, wherein:
The Spark program initialization module 10 is configured to obtain configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file from the configuration information;
The processing module 20 is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module 10.
The configuration information entered on the Web-based Spark information configuration page and obtained by the Spark program initialization module includes at least one of the following: process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.
The configuration information further includes at least one of the following: the name of the Spark stream computing program, the window duration, whether a window is needed, whether a sliding window is needed, the sliding-window interval, the data source topic, the target topic, and the processing-logic SQL (Structured Query Language) information.
As this embodiment shows, the embodiments of the present invention provide a Spark Streaming-based program generator comprising a Spark program initialization module and a processing module. The Spark program initialization module is configured to obtain configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file from it; the processing module is configured to generate a Spark stream computing program from the configuration file of the Spark program initialization module. The invention thus provides the Spark information configuration page as a Web service and can verify the configuration information, so that all kinds of information can be configured conveniently and business personnel can process data according to their own needs rather than only through real-time computing developers; this lowers the barrier to onboarding business development and improves project processing efficiency. In addition, because a single general-purpose Spark Streaming-based program generator is used throughout, different developers no longer need to write different Spark Streaming programs, and developers no longer need to supply the Spark configuration by modifying code or files, which reduces project development cost.
Fig. 2 is another schematic diagram of the structural framework of a Spark Streaming-based program generator according to an embodiment of the present invention. Compared with Fig. 1, Fig. 2 shows the structural framework of the Spark Streaming-based program generator in more detail.
The present invention provides a general-purpose Spark Streaming-based program generator (which may also be called a real-time computing generator) that can perform real-time computation for many different real-time processing logics and different data types. The Spark Streaming-based program generator provided by the invention simplifies the real-time computing model and lowers the barrier to onboarding business development, and its real-time computation has characteristics such as low latency, high performance, distribution, scalability, and fault tolerance.
The solution of the present invention provides the Spark information configuration page as a Web service and can verify the configuration information, so that all kinds of information can be configured conveniently and business personnel can process data according to their own needs. At present it is difficult for enterprises to recruit professional data mining personnel, and such personnel must also learn a range of related technologies such as Hadoop (a distributed system infrastructure developed by the Apache Foundation) and Spark. These open-source components must in addition be combined into a single solution, which is itself difficult; and business personnel can process data only through real-time computing developers. With the solution of the present invention, real-time stream computing programs can be generated more conveniently, which lowers the knowledge barrier and the barrier for data developers, lowers the barrier to onboarding business development, reduces development cost, and improves development efficiency.
Referring to Fig. 2, the Spark Streaming-based program generator provided by the present invention mainly comprises a Spark program initialization module 10 and a processing module 20.
The Spark program initialization module 10 may further comprise an obtaining module 101, a verification module 102, an execution module 103, and an authentication module 104.
The processing module 20 may further comprise a Spark configuration file obtaining module 201, a Spark program generation module 202, and a Spark program submission module 203; the Spark program submission module 203 may in turn comprise a Spark program sending module 2031 and a Spark program run optimization module 2032.
The obtaining module 101 is configured to obtain the configuration information entered on the Web-based Spark information configuration page.
The verification module 102 is configured to verify the configuration information obtained by the obtaining module 101.
The execution module 103 is configured to return an error prompt to the Web page if verification by the verification module 102 fails, and to generate a configuration file from the configuration information if verification by the verification module 102 succeeds.
The authentication module 104 is configured to verify the permissions of a designated account when the obtaining module 101 obtains configuration information entered through that account; the execution module 103 generates the configuration file from the configuration information after the permission verification by the authentication module 104 succeeds.
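As a non-limiting illustration, the verify-then-generate behaviour of modules 101 to 103 might be sketched in Python as follows. The required field names (`job_name`, `source_topic`, `target_topic`, `sql`) and the rule that each must be a non-empty string are assumptions made for this example; the disclosure does not fix a schema or a file format, and JSON is used here only for convenience.

```python
import json

# Hypothetical required fields; the disclosure does not define a schema.
REQUIRED_FIELDS = ("job_name", "source_topic", "target_topic", "sql")


def verify_config(config):
    """Return a list of error messages; an empty list means verification passed."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = config.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {field}")
    return errors


def execute(config):
    """Return an error prompt for the Web page, or the configuration file text."""
    errors = verify_config(config)
    if errors:
        return {"ok": False, "errors": errors}
    # Serialise deterministically so the same input always yields the same file.
    return {"ok": True, "config_file": json.dumps(config, sort_keys=True, indent=2)}
```

Returning every failed check at once, rather than stopping at the first, lets the Web page show the user all problems in a single round trip.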
Specifically, the configuration information entered on the Web-based Spark information configuration page and obtained by the Spark program initialization module 10 may include at least one of the following: process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.
That is, through the Web-based Spark information configuration page, the Spark program initialization module 10 can configure, for example, process data information, table metadata information, select (select-type field) information, groupby (group-type field) information, sum (sum-type field) information, count (count-type field) information, where (fields on which filtering is performed) information, multi-table Join (the names of the tables to be joined) information, and sort (the fields to be sorted) information. In this way, business personnel can simply and conveniently enter the various configuration items by hand on the Web-based Spark information configuration page according to their own business needs.
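By way of example only, the select, groupby, sum, count, where, Join, and sort items described above could be combined into a single SQL statement roughly as sketched below. The dictionary keys and the helper name `build_sql` are hypothetical and do not come from the disclosure.

```python
def build_sql(cfg):
    """Assemble a SQL string from the configured field lists (illustrative only)."""
    columns = list(cfg.get("select", []))
    columns += [f"SUM({f}) AS sum_{f}" for f in cfg.get("sum", [])]
    columns += [f"COUNT({f}) AS count_{f}" for f in cfg.get("count", [])]
    if not columns:
        raise ValueError("at least one select, sum, or count field is required")

    sql = f"SELECT {', '.join(columns)} FROM {cfg['table']}"
    for join in cfg.get("join", []):
        # Each join entry names the table to join and the join condition.
        sql += f" JOIN {join['table']} ON {join['on']}"
    if cfg.get("where"):
        sql += f" WHERE {cfg['where']}"
    if cfg.get("groupby"):
        sql += f" GROUP BY {', '.join(cfg['groupby'])}"
    if cfg.get("sort"):
        sql += f" ORDER BY {', '.join(cfg['sort'])}"
    return sql
```

Because the statement is built by string concatenation, a real system would need to validate the table and field names against the configured table metadata before using the result.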
The configuration information may further include at least one of the following: the name of the Spark stream computing program, the window duration, whether a window is needed, whether a sliding window is needed, the sliding-window interval, the data source topic, the target topic, and the processing-logic SQL information.
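As a hedged sketch, these job-level settings could be held in a small record whose consistency is checked before a configuration file is written. All field names below are assumptions. The `batch_seconds` parameter is also an assumption: it stands for the Spark Streaming batch interval, which the disclosure does not mention, but Spark Streaming requires window and slide durations to be multiples of it.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class StreamJobSettings:
    """Hypothetical container for the job-level configuration items."""
    name: str
    source_topic: str
    target_topic: str
    sql: str
    use_window: bool = False
    window_seconds: Optional[int] = None
    use_sliding_window: bool = False
    slide_seconds: Optional[int] = None

    def problems(self, batch_seconds):
        """List inconsistencies; batch_seconds is an assumed batch interval."""
        found = []
        if self.use_window:
            if not self.window_seconds or self.window_seconds <= 0:
                found.append("window_seconds must be positive when a window is used")
            elif self.window_seconds % batch_seconds:
                found.append("window_seconds must be a multiple of the batch interval")
        if self.use_sliding_window:
            if not self.use_window:
                found.append("a sliding window requires use_window")
            if not self.slide_seconds or self.slide_seconds <= 0:
                found.append("slide_seconds must be positive when sliding is used")
            elif self.slide_seconds % batch_seconds:
                found.append("slide_seconds must be a multiple of the batch interval")
        return found
```

`asdict(settings)` turns the record into a plain dictionary that can be serialised as the configuration file once `problems` returns an empty list.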
The Spark configuration file obtaining module 201 is configured to obtain the configuration file of the Spark program initialization module 10.
The Spark program generation module 202 is configured to generate a Spark stream computing program from the configuration file obtained by the Spark configuration file obtaining module 201. The Spark program generation module 202 may generate the Spark stream computing program after the generate button on the Spark information configuration page is operated.
That is, the present invention can generate a Spark stream computing program with one click. For example, pressing the generate button on the Web-based Spark information configuration page triggers a background program; once the Spark program generation module 202 detects that the generate button has been operated, it generates the Spark stream computing program. One-click generation of the Spark stream computing program is thereby achieved, which is more convenient for business personnel and greatly lowers the barrier to onboarding business development.
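One possible way to turn a configuration file into program text is template substitution, sketched below under several assumptions that are not part of the disclosure: the configuration file is JSON with the keys `name`, `brokers`, `source_topic`, `target_topic`, `checkpoint`, and `sql`; the generated program uses PySpark Structured Streaming with a Kafka source and sink rather than the DStream-based Spark Streaming API; and incoming records are exposed to the SQL as a view named `events`. The template has not been exercised against a cluster and is illustrative only.

```python
import json
from string import Template

# Illustrative template only. Structured Streaming, Kafka, and the "events"
# view name are assumptions of this sketch, not details from the disclosure.
PROGRAM_TEMPLATE = Template('''\
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName($name).getOrCreate()
source = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", $brokers)
          .option("subscribe", $source_topic)
          .load())
source.selectExpr("CAST(value AS STRING) AS value").createOrReplaceTempView("events")
result = spark.sql($sql)
query = (result.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", $brokers)
         .option("topic", $target_topic)
         .option("checkpointLocation", $checkpoint)
         .start())
query.awaitTermination()
''')


def generate_program(config_text):
    """Render program source from the JSON text of a configuration file."""
    cfg = json.loads(config_text)
    # repr() turns each value into a safely quoted Python string literal.
    return PROGRAM_TEMPLATE.substitute({key: repr(str(value)) for key, value in cfg.items()})
```

Quoting every value with `repr()` keeps quotation marks inside the configured SQL from breaking the generated source. Note that a Kafka sink expects the query result to contain a `value` column, so the configured SQL would have to produce one.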
The Spark program submission module 203 is configured to submit the Spark stream computing program generated by the Spark program generation module 202 to a big data cluster for processing. Here, the big data cluster refers to the YARN resource management system.
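As an illustration of submission to YARN, the command line for the standard `spark-submit` launcher could be assembled as sketched below. The helper name `build_submit_command` and the choice of cluster deploy mode are assumptions of this example; the disclosure does not state how the submission is performed.

```python
def build_submit_command(app_path, job_name, conf=None):
    """Build the argument list for submitting a generated program to YARN."""
    command = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
               "--name", job_name]
    # Sort the settings so the resulting command is reproducible.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app_path)
    return command
```

On a host with a Spark client installed, the returned list could be passed to `subprocess.run(command, check=True)`; building a list rather than a shell string avoids quoting problems in job names and paths.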
The Spark program sending module 2031 in the Spark program submission module 203 is configured to send the Spark stream computing program generated by the Spark program generation module 202 to the big data cluster.
The Spark program run optimization module 2032 in the Spark program submission module 203 is configured to collect the running status of the Spark stream computing program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution of the Spark stream computing program in the big data cluster and its data processing volume according to the adjusted execution plan.
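The disclosure does not specify how the execution plan is adjusted. Purely as a hypothetical illustration, a feedback rule could compare the average batch processing time with the batch interval and then change the number of executors (distribution) and the ingest rate (data processing volume). The plan keys, metric keys, and thresholds below are all invented for this sketch; in a real deployment they might map onto settings such as `spark.executor.instances` and `spark.streaming.kafka.maxRatePerPartition`.

```python
def adjust_plan(plan, metrics):
    """Return a new plan based on observed batch timings (hypothetical heuristic).

    plan:    {"executors": int, "max_rate_per_partition": int}
    metrics: {"batch_seconds": float, "avg_processing_seconds": float}
    """
    load = metrics["avg_processing_seconds"] / metrics["batch_seconds"]
    new_plan = dict(plan)
    if load > 1.0:
        # Falling behind: spread work over more executors and ingest less per batch.
        new_plan["executors"] = plan["executors"] + 1
        new_plan["max_rate_per_partition"] = max(1, int(plan["max_rate_per_partition"] / load))
    elif load < 0.5:
        # Plenty of headroom: allow 20% more records per batch.
        new_plan["max_rate_per_partition"] = plan["max_rate_per_partition"] * 6 // 5
    return new_plan
```

The function returns a copy instead of mutating its input, so the previous plan remains available for comparison or rollback.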
The Spark Streaming-based program generator of the embodiments of the present invention has been described in detail above; the program data processing method of the present invention that uses the Spark Streaming-based program generator is introduced correspondingly below.
Fig. 3 is a flow diagram of a program data processing method according to an embodiment of the present invention.
Referring to Fig. 3, the method of the present invention comprises:
In step 301, the Spark program initialization module of the program generator obtains configuration information entered on a Web-based Spark information configuration page and, after the configuration information passes verification, generates a configuration file from the configuration information.
This step may comprise:
Obtaining the configuration information entered on the Web-based Spark information configuration page;
Verifying the obtained configuration information;
Returning an error prompt to the Web page if verification fails, and generating a configuration file from the configuration information if verification succeeds.
It should be noted that the present invention can also verify the permissions of a designated account when configuration information entered through that account is obtained, and generate the configuration file from the configuration information after the permission verification succeeds.
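The order of operations in step 301, with the optional permission check, might look like the following minimal sketch. The in-memory permission table, the `create_job` permission name, the single SQL check, and the JSON output file are assumptions made for illustration and are not taken from the disclosure.

```python
import json
import os
import tempfile

# Hypothetical permission table; a real deployment would query its account system.
ALLOWED_ACCOUNTS = {"analyst": {"create_job"}}


def handle_submission(account, config, out_dir=None):
    """Run step 301: check permission, verify, then write the configuration file."""
    if "create_job" not in ALLOWED_ACCOUNTS.get(account, set()):
        return {"ok": False, "error": "account is not permitted to create jobs"}
    if not config.get("sql"):
        return {"ok": False, "error": "processing-logic SQL is required"}
    # Default to the system temporary directory when no output directory is given.
    out_dir = out_dir or tempfile.gettempdir()
    path = os.path.join(out_dir, f"{config.get('name', 'job')}.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return {"ok": True, "config_path": path}
```

Checking the account before verifying the configuration means an unauthorised caller learns nothing about which fields are valid.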
In this step, the configuration information entered on the Web-based Spark information configuration page and obtained by the Spark program initialization module includes at least one of the following: process data information, table metadata information, select-type field information, group-type field information, sum-type field information, and count-type field information.
The configuration information may further include at least one of the following: the name of the Spark stream computing program, the window duration, whether a window is needed, whether a sliding window is needed, the sliding-window interval, the data source topic, the target topic, and the processing-logic SQL information.
The present invention provides the Spark information configuration page as a Web service, so that all kinds of information can be configured conveniently and business personnel can process data according to their own needs.
In step 302, the processing module of the program generator obtains the configuration file from the Spark program initialization module and generates a Spark stream calculation program according to the configuration file.
In this step, the Spark configuration file obtaining module in the processing module may obtain the configuration file of the Spark program initialization module, and the Spark program generating module in the processing module may generate the Spark stream calculation program according to the configuration file obtained by the Spark configuration file obtaining module.
It should be noted that, after generating the Spark stream calculation program according to the configuration file, the present invention may further include: submitting the generated Spark stream calculation program to a big data cluster for processing. Alternatively, it may further include: submitting the generated Spark stream calculation program to the big data cluster for processing, collecting the running status of the Spark stream calculation program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing amount of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.
It can be seen from this embodiment that the present invention provides the Spark information configuration page as a Web service and can verify the configuration information, so that all kinds of information can be configured conveniently and business personnel can more easily carry out data processing according to their own needs. Business personnel no longer have to rely on real-time computing developers to carry out data processing, which lowers the barrier to business development access and improves project processing efficiency. In addition, because a single general-purpose program generator based on Spark streaming is used, different developers do not need to write different Spark Streaming programs, and developers do not need to provide the Spark information configuration by modifying code or files, which reduces project development cost.
Fig. 4 is another flow diagram of a program data processing method according to an embodiment of the present invention. Compared with Fig. 3, Fig. 4 describes the program data processing method of the present invention in more detail.
Referring to Fig. 4, the method of the present invention includes:
In step 401, the configuration information input from the Web-based Spark information configuration page is obtained; after the configuration information verification and the permission verification of the designated account both pass, a configuration file is generated according to the configuration information.
In this step, the obtaining module in the Spark program initialization module may obtain the configuration information input from the Web-based Spark information configuration page; the verification module in the Spark program initialization module verifies the configuration information obtained by the obtaining module; the authentication module in the Spark program initialization module performs permission verification on the designated account; and the execution module in the Spark program initialization module generates the configuration file according to the configuration information after both the check by the verification module and the permission verification by the authentication module succeed. In addition, if the check by the verification module fails, an error prompt may be returned to the Web page.
In this step, configuration information may be input on the Web-based Spark information configuration page through the designated account, for example: processing data information, table metadata information, select (select-type field) information, groupby (grouping-type field) information, sum (sum-type field) information, count (count-type field) information, where (fields used for filtering) information, multi-table Join (the names of the tables to be joined) information, and sort (fields to be sorted) information.
In addition, the following configuration information may be input: the name of the real-time Spark computation (that is, the name of the Spark stream calculation program), the window time size, whether a window is needed, whether a sliding window is needed, the sliding window time, the data source topic, the target topic, and the processing logic SQL information. The data source topic may be the Kafka topic to consume from, and the target topic may be the Kafka topic to produce to. Kafka is an open-source stream processing platform developed by the Apache Software Foundation. Consuming and producing are opposite actions: consuming means reading data from a Kafka topic, and producing means writing the processed data into a Kafka topic.
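To make the window time and sliding time settings concrete, the following sketch counts events per window in plain Python. It is only an illustration under stated assumptions: windows are aligned at multiples of the sliding time and cover half-open intervals, and the function name `sliding_window_counts` is hypothetical. It is not how Spark Streaming itself computes windows over micro-batches.

```python
from collections import Counter


def sliding_window_counts(event_times, window, slide):
    """Count events per window for windows of length `window` that start every
    `slide` time units. Window k covers [k * slide, k * slide + window)."""
    if window <= 0 or slide <= 0:
        raise ValueError("window and slide must be positive")
    counts = Counter()
    for t in event_times:
        # Index of the latest window whose start is <= t.
        k = int(t // slide)
        # Walk back over every earlier window that still contains t.
        while k >= 0 and k * slide + window > t:
            counts[k * slide] += 1
            k -= 1
    return dict(sorted(counts.items()))
```

With a window of 20 and a slide of 10, an event at time 12 falls into the windows starting at 0 and at 10. When the slide equals the window length, no sliding window is in effect and each event falls into exactly one window.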
In this step, the obtained configuration information described above may be verified.
For example, the configuration information may be stored as JSON (JavaScript Object Notation, a lightweight data interchange format) and parsed and verified with Gson (a Java library provided by Google for mapping between Java objects and JSON data; it can convert a JSON string into a Java object, or a Java object into a JSON string).
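The parse-then-verify idea can be sketched with Python's standard `json` module as an analogue of the Gson step named above. The key names in `REQUIRED_KEYS` and the function name `parse_config` are assumptions made for illustration; the source lists the kinds of information but does not define a schema.

```python
import json

# Hypothetical key names; the source does not define a JSON schema.
REQUIRED_KEYS = ("name", "source_topic", "target_topic", "sql")


def parse_config(raw):
    """Parse the JSON text from the configuration page and check required keys."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"configuration is not valid JSON: {exc.msg}") from exc
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing configuration items: " + ", ".join(missing))
    return config
```

Raising a `ValueError` with a readable message gives the Web layer something to return to the page as the error prompt.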
The configuration information verification of the present invention may include, but is not limited to, the following:
The value of the name attribute of the Spark program is read, and a database (for example, a MySQL database) is queried to judge whether that program name already exists. If it exists, the input is judged illegal (verification fails) and an error prompt is returned to the Web page, that is, the user is told on the Web page that the input is wrong; if it does not exist, the input is judged legal (verification passes).
If the name of the Spark program does not exist and is judged legal, it is further judged whether the field names in configuration items such as data source topic, target topic, select, groupby, sum, count, where, and sort exist; if they exist, the input is judged legal, and if not, it is judged illegal.
If the field names in the above configuration items exist and are judged legal, the sliding window attribute value and the window time attribute value are read, and it is judged whether the sliding time is less than the window time and whether the sliding time is greater than a set threshold, for example 1 hour. If all conditions are met, final verification passes and the data is stored in the database; an information configuration file in Excel format may also be generated for business personnel to review.
It should be noted that, when the configuration information is input through a designated account, the present invention may also perform permission verification on that account. For example, different login accounts are provided to different personnel, and different operating permissions are assigned to different accounts as needed. When configuration information is input through a login account, the database can therefore be queried to verify the permissions of that account: if the query finds the account and the operation is within its operating permissions, verification passes.
In summary, the Spark program initialization module obtains the various configuration inputs described above and performs initialization. After initialization is complete, a configuration file is generated and saved on a Linux system. The generated configuration file may be in text format, and its content may take, but is not limited to, the following form: configuration name=configuration content, with entries separated by commas.
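A minimal sketch of the "configuration name=configuration content" text format follows. The function names are hypothetical. Because the source does not describe any escaping, the sketch simply rejects names or values that contain the comma separator; content such as a SQL statement with commas would need an escaping rule that is not specified here.

```python
def dump_config(config):
    """Serialize a dict as 'name=content' entries separated by commas."""
    for key, value in config.items():
        if "," in f"{key}{value}" or "=" in str(key):
            raise ValueError(f"cannot store {key!r}: ',' and '=' are reserved separators")
    return ",".join(f"{key}={value}" for key, value in config.items())


def load_config(text):
    """Parse text produced by dump_config; all values come back as strings."""
    if not text.strip():
        return {}
    config = {}
    for entry in text.split(","):
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = entry.partition("=")
        config[key.strip()] = value.strip()
    return config
```

Writing and then reading a configuration returns the same names and string values, which is enough for the processing module to pick the file up later.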
In step 402, the configuration file is obtained, and the processing logic SQL in the configuration information of the configuration file is checked.
The Spark configuration file obtaining module in the processing module may obtain the configuration file of the Spark program initialization module, and the field information of the data source topic in the configuration information may be queried from the database. It is then judged whether the fields in the processing logic SQL information are correct with respect to the field information of the data source topic. It should be noted that every field in the processing logic SQL must be present in the field information of the data source topic, but a field of the data source topic does not necessarily appear in the processing logic SQL.
If a field name in the processing logic SQL information does not match the field information stored in the database, the fields in the processing logic SQL information are judged incorrect and the user is reminded that the input is wrong. If the field names match the field information stored in the database, they are judged correct, and it is then further judged whether the syntax of the processing logic SQL is wrong.
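The field check amounts to a one-directional containment test, sketched below. The function name is hypothetical, and the sketch assumes the field names used by the SQL have already been extracted into a list.

```python
def check_sql_fields(sql_fields, source_fields):
    """Return the fields used by the processing logic SQL that the data source
    topic does not have. An empty list means the field check passes; the source
    topic may carry extra fields that the SQL never uses."""
    known = set(source_fields)
    return [field for field in sql_fields if field not in known]
```

Returning the offending names, rather than only pass or fail, gives the reminder shown to the user something specific to report.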
The syntax check of the processing logic SQL includes the following: a field or attribute that appears in the select clause but not inside an aggregate function must appear in the groupby clause; if it does not, this is considered a syntax error. Conversely, a field or attribute that does not appear in the groupby clause may only appear inside an aggregate function; if it appears outside an aggregate function, this is considered a syntax error.
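The select and groupby rule can be sketched on an already-parsed query. Two assumptions are made here: the select list is supplied as (field, aggregate) pairs rather than parsed from SQL text, and the rule is only applied when the query actually groups or aggregates, so that a plain select without groupby is not flagged. The function name is hypothetical.

```python
def check_group_by(select_items, group_by_fields):
    """Check the select/groupby rule.

    select_items: list of (field, aggregate) pairs, where aggregate is None for
                  a plain field or a function name such as "sum" or "count".
    Returns a list of error messages (empty when the rule holds).
    """
    grouped = set(group_by_fields)
    has_aggregate = any(agg is not None for _, agg in select_items)
    errors = []
    for field, agg in select_items:
        # A plain field in a grouping or aggregating query must be grouped on.
        if agg is None and (grouped or has_aggregate) and field not in grouped:
            errors.append(f"'{field}' must appear in groupby or inside an aggregate function")
    return errors
```

For example, selecting `city` together with `sum(amount)` passes only when `city` is listed in the groupby fields.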
If the processing logic SQL has no syntax error, the processing logic SQL is considered free of problems (that is, it has neither field errors nor syntax errors), and it is further judged whether the window time and the sliding time exceed a threshold. The sliding time in the configuration information generally needs to be less than the window time, and the threshold for the sliding time and the window time may be, but is not limited to, 1 hour.
Finally, the SQL is optimized according to set rules based on, for example, the SQL syntax check result, the size of the joined tables, and the data volume per second, and the resulting optimized SQL statement is stored in the database.
In step 403, the execution configuration file Shell script of the Spark stream calculation program is generated.
In this step, the Spark program generating module in the processing module may generate the execution configuration file Shell script of the Spark stream calculation program.
In this step, the following may be read from the database: the name of the corresponding Spark stream calculation program, the window time size, whether a window is needed, whether a sliding window is needed, the sliding window time, the data source topic, the target topic, and the generated SQL statement. The sizes of driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of executor CPU cores), and memoryOverhead (off-heap memory) are calculated from the table size and the data volume. Fixed parameter settings such as the timeout and the queue name are obtained, as are parameters such as the degree of parallelism, the memory fraction, and JVM (Java Virtual Machine) parameters. The execution configuration file Shell script of the corresponding Spark stream calculation program is then generated. The Shell script here refers to a Linux shell script; usually the resource configuration file is generated first and the Shell script is generated afterwards.
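A generated script of this kind might look like the output of the sketch below, which renders a `spark-submit` command line from computed settings. The function name, the jar name, and the argument layout are assumptions made for illustration. The flags used (`--master`, `--name`, `--queue`, `--driver-memory`, `--executor-memory`, `--executor-cores`, `--conf`) are standard `spark-submit` options, and `spark.executor.memoryOverhead` is the property name in Spark 2.3 and later (older releases used `spark.yarn.executor.memoryOverhead`).

```python
import shlex


def build_submit_script(name, queue, resources, app_jar, app_args):
    """Render a spark-submit Shell script from the computed settings.

    resources: dict with driver_memory, executor_memory, executor_cores and
               memory_overhead (memory values as strings such as "2g").
    app_args:  program parameters, joined into one comma-separated argument.
    """
    command = [
        "spark-submit",
        "--master", "yarn",
        "--name", name,
        "--queue", queue,
        "--driver-memory", resources["driver_memory"],
        "--executor-memory", resources["executor_memory"],
        "--executor-cores", str(resources["executor_cores"]),
        "--conf", f"spark.executor.memoryOverhead={resources['memory_overhead']}",
        app_jar,
        ",".join(app_args),
    ]
    # shlex.quote protects values such as job names that contain spaces.
    return "#!/bin/sh\n" + " ".join(shlex.quote(part) for part in command) + "\n"
```

Keeping the command as a list until the final join means each value is quoted individually, so an unusual job name cannot change the meaning of the script.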
When the window time and the sliding time in the configuration information are judged not to exceed the threshold, the present invention may query the database to check whether the name of the current Spark stream calculation program is a duplicate.
If the name of the current Spark stream calculation program is not a duplicate, the driver-memory (driver memory), executor-memory (executor memory), executor-cores (number of executor cores), and memoryOverhead (off-heap memory) required by the current Spark stream calculation program may be calculated with the following formulas:
executor-cores (number of executor cores) required by the Spark stream calculation program = (number of data source records per second / 20,000) × number of CPU cores required for 20,000 records per second
driver-memory (driver memory) required by the Spark stream calculation program = (number of data source records per second / 20,000) × driver memory required for 20,000 records per second
executor-memory (executor memory) required by the Spark stream calculation program = (number of data source records per second / 20,000) × executor memory required for 20,000 records per second
memoryOverhead (off-heap memory) required by the Spark stream calculation program = (number of data source records per second / 20,000) × off-heap memory required for 20,000 records per second
Here, the data source may refer to the Kafka topic being consumed.
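The four formulas share one scale factor, the data rate divided by 20,000 records per second, which the sketch below applies to a set of per-unit requirements. The per-unit values in the usage note are invented for illustration, and rounding up to whole units is an assumption of this sketch that the source does not state.

```python
import math

BASELINE_RECORDS_PER_SECOND = 20_000  # unit of load used by the formulas


def estimate_resources(records_per_second, per_unit):
    """Scale the requirements for 20,000 records per second to the actual rate.

    per_unit: resources needed for 20,000 records per second, for example
              executor_cores, driver_memory_mb, executor_memory_mb and
              memory_overhead_mb.
    """
    scale = records_per_second / BASELINE_RECORDS_PER_SECOND
    # Rounding up is an assumption of this sketch, not stated in the source.
    return {key: math.ceil(scale * value) for key, value in per_unit.items()}
```

With hypothetical per-unit requirements of 2 cores and 2048 MB of executor memory, a source delivering 50,000 records per second gives a scale factor of 2.5, that is 5 cores and 5120 MB.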
In step 404, the execution configuration file Shell script of the Spark stream calculation program is executed to generate the Spark stream calculation program.
In this step, the Spark program generating module in the processing module may execute the execution configuration file Shell script of the Spark stream calculation program to generate the Spark stream calculation program.
In this step, the execution configuration file Shell script of the Spark stream calculation program may be executed remotely over SSH2 (Secure Shell 2, version 2.0 of the Secure Shell protocol), and the parameters are passed to the Spark composer. The Spark composer receives the parameters, splits them on commas, and passes each parameter to the corresponding function to generate the corresponding Spark stream calculation program.
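The split-and-dispatch behaviour of the Spark composer can be sketched as follows. The `name=value` layout of each parameter, the handler table, and the function name `build_program` are assumptions made for illustration; the source only states that the parameters are separated by commas and passed to corresponding functions.

```python
def build_program(argument, handlers):
    """Split one comma-separated argument and pass each parameter to its handler.

    handlers: dict mapping a parameter name to the function that processes it.
    Returns a dict of the processed settings.
    """
    program = {}
    for item in argument.split(","):
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"malformed parameter: {item!r}")
        if key not in handlers:
            raise ValueError(f"no handler for parameter {key!r}")
        program[key] = handlers[key](value.strip())
    return program
```

A handler table such as `{"name": str, "window": int}` turns the argument `name=job1,window=60` into typed settings, and an unknown parameter name is rejected instead of being silently ignored.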
In step 405, the Spark stream calculation program is submitted to the big data cluster for processing.
In this step, the Spark program sending module in the Spark program submitting module may submit the Spark stream calculation program to the big data cluster for processing. The big data cluster here refers to the YARN resource management system.
In step 406, the running status of the Spark stream calculation program in the big data cluster is collected, the execution plan is adjusted according to that running status, and the distribution and data processing amount of the Spark stream calculation program in the big data cluster are adjusted according to the adjusted execution plan.
In this step, the Spark program running optimization module in the Spark program submitting module may collect the running status of the Spark stream calculation program in the big data cluster, for example the data volume processed by reduce tasks, the data volume pulled by shuffle (which transfers map output to the reducers as input), the data distribution and size in each execution stage, and the data volume of each partition. It adjusts the execution plan accordingly and, according to the adjusted execution plan, adjusts the distribution and data processing amount of the Spark stream calculation program in the big data cluster, for example by handling data skew or adjusting the number of partitions. The processed data is then written back to the target topic.
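One way to turn collected partition sizes into an adjustment is sketched below. The heuristic is hypothetical and is not taken from the source: a partition is flagged as skewed when it holds more than `skew_factor` times the mean, and the suggested partition count aims at roughly `target_per_partition` records each.

```python
import math


def adjust_plan(partition_sizes, target_per_partition, skew_factor=2.0):
    """Flag skewed partitions and suggest a partition count from observed sizes.

    partition_sizes: number of records observed in each partition.
    """
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    total = sum(partition_sizes)
    mean = total / len(partition_sizes)
    skewed = [i for i, size in enumerate(partition_sizes) if size > skew_factor * mean]
    # Enough partitions that each holds about target_per_partition records.
    suggested = max(1, math.ceil(total / target_per_partition))
    return {"skewed_partitions": skewed, "suggested_partitions": suggested}
```

For partition sizes of 100, 100, 100, and 700 records with a target of 250 per partition, the last partition is flagged as skewed and four partitions are suggested.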
In summary, the solution of the present invention has the following advantages:
The solution provided by the present invention breaks out of the traditional predicament in which a company must recruit real-time computing developers to develop Spark real-time computing programs, and business personnel can carry out data processing only through those developers. The present invention provides a Spark real-time computing generator (the program generator based on Spark streaming), which makes it easier for business personnel to carry out data processing according to their own needs. It also reduces problems such as non-standard development by real-time computing developers and excessive or unreasonable resource requests, greatly improves project development efficiency, and shortens project duration.
The technical solution of the present invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention may also be implemented as a computer program or computer program product that includes computer program code instructions for executing the steps defined in the above method of the present invention.
Alternatively, the present invention may be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by a processor of an electronic device (or computing device, server, etc.), the processor is caused to execute the steps of the above method according to the present invention.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show the possible architecture, functions, and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A program generator based on Spark streaming, characterized in that:
it comprises a Spark program initialization module and a processing module; wherein
the Spark program initialization module is configured to obtain configuration information input from a Web-based Spark information configuration page and, after the configuration information passes verification, to generate a configuration file according to the configuration information;
the processing module is configured to generate a Spark stream calculation program according to the configuration file of the Spark program initialization module.
2. The program generator based on Spark streaming according to claim 1, characterized in that the Spark program initialization module comprises:
an obtaining module, configured to obtain the configuration information input from the Web-based Spark information configuration page;
a verification module, configured to verify the configuration information obtained by the obtaining module;
an execution module, configured to return an error prompt to the Web page after the check by the verification module fails, and to generate a configuration file according to the configuration information after the check by the verification module succeeds.
3. The program generator based on Spark streaming according to claim 2, characterized in that the Spark program initialization module further comprises:
an authentication module, configured to perform permission verification on a designated account when the obtaining module obtains configuration information input through the designated account;
the execution module generates the configuration file according to the configuration information after the permission verification by the authentication module succeeds.
4. The program generator based on Spark streaming according to claim 1, characterized in that the processing module comprises:
a Spark configuration file obtaining module, configured to obtain the configuration file of the Spark program initialization module;
a Spark program generating module, configured to generate a Spark stream calculation program according to the configuration file obtained by the Spark configuration file obtaining module;
a Spark program submitting module, configured to submit the Spark stream calculation program generated by the Spark program generating module to a big data cluster for processing.
5. The program generator based on Spark streaming according to claim 4, characterized in that the Spark program submitting module comprises:
a Spark program sending module, configured to send the Spark stream calculation program generated by the Spark program generating module to the big data cluster;
a Spark program running optimization module, configured to collect the running status of the Spark stream calculation program in the big data cluster, adjust the execution plan according to that running status, and adjust the distribution and data processing amount of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.
6. The program generator based on Spark streaming according to any one of claims 1 to 5, characterized in that:
the configuration information that the Spark program initialization module obtains from the Web-based Spark information configuration page includes at least one of the following:
processing data information, table metadata information, select-type field information, grouping-type field information, sum-type field information, and count-type field information.
7. A program data processing method, characterized by comprising:
obtaining, by a Spark program initialization module of a program generator, configuration information input from a Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file according to the configuration information;
obtaining, by a processing module of the program generator, the configuration file from the Spark program initialization module, and generating a Spark stream calculation program according to the configuration file.
8. The method according to claim 7, characterized in that obtaining the configuration information input from the Web-based Spark information configuration page and, after the configuration information passes verification, generating a configuration file according to the configuration information comprises:
obtaining the configuration information input from the Web-based Spark information configuration page;
verifying the obtained configuration information;
if verification fails, returning an error prompt to the Web page; if verification succeeds, generating a configuration file according to the configuration information.
9. The method according to claim 8, characterized in that the method further comprises:
when configuration information input through a designated account is obtained, performing permission verification on the designated account;
after permission verification succeeds, generating the configuration file according to the configuration information.
10. The method according to any one of claims 7 to 9, characterized in that:
after generating the Spark stream calculation program according to the configuration file, the method further comprises: submitting the generated Spark stream calculation program to a big data cluster for processing; or,
after generating the Spark stream calculation program according to the configuration file, the method further comprises: submitting the generated Spark stream calculation program to the big data cluster for processing, collecting the running status of the Spark stream calculation program in the big data cluster, adjusting the execution plan according to that running status, and adjusting the distribution and data processing amount of the Spark stream calculation program in the big data cluster according to the adjusted execution plan.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910186601.9A CN110008242A (en) | 2019-03-12 | 2019-03-12 | One kind being based on Spark streaming program generator and program data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110008242A true CN110008242A (en) | 2019-07-12 |
Family
ID=67166866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910186601.9A Pending CN110008242A (en) | 2019-03-12 | 2019-03-12 | One kind being based on Spark streaming program generator and program data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110008242A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407472A (en) * | 2016-11-01 | 2017-02-15 | 广西电网有限责任公司电力科学研究院 | Visual editing and management system for big data analysis and calculation task of order model |
CN106777101A (en) * | 2016-12-14 | 2017-05-31 | 深圳天源迪科信息技术股份有限公司 | Data processing engine |
CN108037919A (en) * | 2017-12-01 | 2018-05-15 | 北京博宇通达科技有限公司 | A kind of visualization big data workflow configuration method and system based on WEB |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111625269A (en) * | 2020-05-14 | 2020-09-04 | 中电工业互联网有限公司 | Web-based universal Spark task submission system and method |
CN112612514A (en) * | 2020-12-31 | 2021-04-06 | 青岛海尔科技有限公司 | Program development method and device, storage medium and electronic device |
CN112612514B (en) * | 2020-12-31 | 2023-11-28 | 青岛海尔科技有限公司 | Program development method and device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20230420. Address after: Room 101, No. 227 Gaotang Road, Tianhe District, Guangzhou City, Guangdong Province, 510000 (location: Room 601) (office only). Applicant after: Yamei Zhilian Data Technology Co.,Ltd. Address before: 510000 self compiled h, Room 201, No. 1, Hanjing Road, Tianhe District, Guangzhou, Guangdong Province. Applicant before: GUANGZHOU YAME INFORMATION TECHNOLOGY Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190712 |