
CN109086038B - Spark-based big data development method and device, and terminal

Info

Publication number: CN109086038B
Application number: CN201810755408.8A
Authority: CN
Inventors: 刘霄峰
Original assignee: Qianxun Spatial Intelligence Inc
Current assignee: Qianxun Spatial Intelligence Inc
Priority date: 2018-07-10
Filing date: 2018-07-10
Publication date: 2022-05-31
Anticipated expiration: 2038-07-10
Also published as: CN109086038A

Abstract

The invention belongs to the technical field of big data development and provides a Spark-based big data development method, device and terminal. The method comprises the following steps: installing an integrated development environment so that a template project can be imported; downloading the latest template project, then compiling and packaging it to generate a software development kit; adding the software development kit to the integrated development environment to form a development template; and creating a new big data development project and carrying out big data development using the development template. In the invention, the template-based development approach provides not only encapsulated classes and methods but also directly runnable development templates, thereby improving development efficiency, lowering the barrier to entry and accelerating development in the simplest and most effective way.

Description

Spark-based big data development method and device, and terminal

Technical Field

The invention belongs to the technical field of big data development, and particularly relates to a big data development method and device based on Spark, and a terminal.

Background

In recent years, a growing number of development toolkits have made development tasks much more convenient: a developer packages dependencies and utility methods into a self-built package, which others then use by referencing it. This is currently the most common way of sharing technology and functionality, but it has drawbacks: it is unfriendly to beginners, its encapsulation of Spark development and packaging is incomplete, and it cannot be adopted quickly by many people.

Existing development toolkits provide only encapsulated methods or parent classes, which are used through inheritance and reference. A user must therefore read the internal methods to some extent before being able to use them well, and must learn about Spark development from other sources before task development can really begin. This makes getting started slow and development difficult, and adds extra development cost.

Disclosure of Invention

Embodiments of the invention provide a Spark-based big data development method, device and terminal, aiming to solve the problems that prior-art development approaches are incompletely encapsulated and cannot be adopted quickly.

A Spark-based big data development method comprises the following steps:

installing an integrated development environment so that a template project can be imported;

downloading the latest template project, then compiling and packaging it to generate a software development kit;

adding the software development kit to the integrated development environment to form a development template;

and creating a new big data development project, and carrying out big data development using the development template.

Preferably, after installing the integrated development environment, the method further includes: installing a Maven repository and the Maven plug-in of the IDE.

Preferably, the development template includes at least one of a general template, a data cleaning template, and a Spark operator template.

Preferably, the development template contains the reading and normalization of input parameters, the input and output of data, and the selection of intermediate cleaning methods.

Preferably, the step of creating a new big data development project and carrying out big data development using the development template includes:

modifying the code of the development template as required to complete big data development, or

continuing to extend the development template, thereby simplifying the development process and sharing a code architecture.

Preferably, the development template is quickly runnable code with detailed comments, and the step of carrying out big data development using the development template includes:

following the comments, selecting the required data source write method, a suitable RDD operator, and the required data source input method;

modifying or deleting the code as needed.

The invention also provides a Spark-based big data development device, which is characterized by comprising:

the installation unit is used for installing an integrated development environment so that a template project can be imported;

the compiling unit is used for downloading the latest template project, then compiling and packaging it to generate a software development kit;

the adding unit is used for adding the software development kit to the integrated development environment to form a development template;

and the development unit is used for creating a new big data development project and carrying out big data development using the development template.

Preferably, the installation unit is further used for installing a Maven repository and the Maven plug-in of the IDE.

The invention also provides a memory storing a computer program executed by a processor to perform the steps of:

adding the software development kit into the integrated development environment to form a development template;

The invention also provides a terminal, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the following steps:

In the embodiment of the invention, the template-based development approach provides not only encapsulated classes and methods but also directly runnable development templates, thereby improving development efficiency, lowering the barrier to entry and accelerating development in the simplest and most effective way.

Drawings

Fig. 1 is a flowchart of a Spark-based big data development method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a preferred mode of a Spark-based big data development method according to a first embodiment of the present invention;

Fig. 3 is a structural diagram of a Spark-based big data development device according to a second embodiment of the present invention;

Fig. 4 is a structural diagram of a terminal according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In an embodiment of the present invention, a Spark-based big data development method includes: installing an integrated development environment so that a template project can be imported; downloading the latest template project, then compiling and packaging it to generate a software development kit; adding the software development kit to the integrated development environment to form a development template; and creating a new big data development project and carrying out big data development using the development template.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

The first embodiment is as follows:

Fig. 1 is a flowchart of a Spark-based big data development method according to the first embodiment of the present invention. The method includes:

Step S1, installing an Integrated Development Environment (IDE) so that a template project can be imported;

An IDE typically includes a code editor, a compiler, a debugger, and graphical user interface tools. The IDE in the embodiment of the invention may be a development environment such as IDEA or Eclipse. In step S1, after the integrated development environment is installed, a Maven repository and the Maven plug-in of the IDE also need to be installed so that the development tool can import the template project.

Step S2, downloading the latest template project, then compiling and packaging it to generate a Software Development Kit (SDK);

The SDK in the embodiments of the present invention supports multiple versions. It encapsulates the Spark dependencies of each version together with general-purpose methods, such as access to various databases and common data preprocessing. Take developing a simple Spark task as an example: ordinarily, this requires acquiring background knowledge of Spark, locating dependencies, setting up a development environment, becoming familiar with the interfaces, carrying out customized development according to the Spark programming conventions, and understanding Spark's existing RDD-based methods.

Step S3, adding the software development kit into the integrated development environment to form a development template;

Specifically, the development template includes at least one of a general template, a data cleaning template, and a Spark operator template. The development template depends on an existing development tool and its related plug-ins: it must be used together with a dependency management plug-in, and it relies on the template-authoring functionality of the existing development tool.

All template-related dependencies are integrated into the SDK. When the SDK is imported, every dependency of the template classes is pulled in, so dependency configuration is completed in a single step. The dependencies of the development template are thus supplied by the SDK, while the development template itself is not a compiled toolkit but quickly runnable code with detailed comments. There are three types of development template: a general template, a data cleaning template and a Spark operator template. The general template integrates code for reading a data source, simple data processing code and data storage code; the data cleaning template adds ETL steps such as filtering, de-duplication and merging on top of the general template; and the Spark operator template adds usage examples of complex Spark operators such as Aggregate on top of the data cleaning template.
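The layering of the three template types can be illustrated with a minimal sketch. It is not the template code of the invention: the class names, the `id` and `value` fields, and the use of plain Python lists in place of Spark RDDs are all assumptions made so that the example runs without a cluster.

```python
from typing import Callable, Iterable, List

Record = dict


class GeneralTemplate:
    """Read -> process -> write skeleton (hypothetical stand-in for the general template)."""

    def __init__(self, reader: Callable[[], Iterable[Record]],
                 writer: Callable[[List[Record]], None]) -> None:
        self.reader = reader
        self.writer = writer

    def process(self, records: List[Record]) -> List[Record]:
        # Simple processing step: pass the records through unchanged.
        return records

    def run(self) -> List[Record]:
        records = self.process(list(self.reader()))
        self.writer(records)
        return records


class DataCleaningTemplate(GeneralTemplate):
    """Adds filtering and de-duplication on top of the general flow."""

    key = "id"  # hypothetical field used to detect duplicates

    def process(self, records: List[Record]) -> List[Record]:
        seen = set()
        cleaned = []
        for rec in records:
            if rec.get(self.key) is None:   # filter: drop records without a key
                continue
            if rec[self.key] in seen:       # de-duplicate: keep the first occurrence
                continue
            seen.add(rec[self.key])
            cleaned.append(rec)
        return cleaned


class OperatorTemplate(DataCleaningTemplate):
    """Adds an aggregation example on top of the cleaning flow."""

    def process(self, records: List[Record]) -> List[Record]:
        cleaned = super().process(records)
        total = sum(rec.get("value", 0) for rec in cleaned)
        return [{"count": len(cleaned), "total": total}]


rows = [{"id": 1, "value": 10}, {"id": 1, "value": 10},
        {"id": None, "value": 5}, {"id": 2, "value": 7}]
sink: List[Record] = []
result = OperatorTemplate(lambda: rows, sink.extend).run()
```

Each subclass reuses the flow of the template beneath it and overrides only `process`, which mirrors the way the data cleaning template builds on the general template and the Spark operator template builds on the data cleaning template.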

Step S4, creating a new big data development project, and carrying out big data development using the development template;

Specifically, a new big data development project is created, and once the SDK has been imported and a development template has been created, all functions in the SDK are available. The template supports secondary development and code contributions, so a Spark task can be developed at minimal time cost. Importing the SDK completes dependency import, and importing the template class completes deployment of the whole development template. These two steps are all that is needed to configure a Spark task development project, which is very friendly to new users and can be compiled and run directly.

In this embodiment, the template-based development approach provides not only encapsulated classes and methods but also a directly runnable development template, which improves development efficiency, lowers the barrier to entry, and accelerates development in the simplest and most effective manner.

The overall big data development flow is shown in Table 1 below:

First, the IDE is installed, together with a Maven repository and the Maven plug-in of the IDE, so that the template project can be imported. Next, the dependency management tool downloads the latest template project, which is compiled and packaged to generate the SDK. The SDK is then added to the IDE and a new Spark task project is created; once the SDK has been imported and a development template created, development can begin and all functions in the SDK are available. The various development templates provide runnable and varied sample programs: a developer can modify the template code directly to complete a development task, or continue to extend the development template, which simplifies the development flow and shares a code architecture.

Data source adaptation covers writing to and output from data sources such as MongoDB, HDFS, Hive, HBase and MySQL. The general methods include time normalization helpers such as the start-of-day timestamp, the start-of-week timestamp and the start-of-five-minute-interval timestamp, as well as regular-expression checks on strings, NULL-value checks and dynamic switching of data sources. Configuration management is also provided for dynamic parameters, including association with local configuration, HDFS configuration and key-value store configuration. All three templates contain the reading and normalization of input parameters, the input and output of data, and the selection of intermediate cleaning methods.
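A minimal sketch of what such general methods might look like is given here. The function names are hypothetical, timestamps are assumed to be epoch seconds interpreted in UTC, and weeks are assumed to start on Monday; the source does not specify any of these details.

```python
import re
from datetime import datetime, timedelta, timezone


def day_start(ts: int) -> int:
    """Epoch seconds of 00:00:00 UTC on the day containing ts."""
    return ts - ts % 86400


def five_minute_start(ts: int) -> int:
    """Epoch seconds of the start of the five-minute interval containing ts."""
    return ts - ts % 300


def week_start(ts: int) -> int:
    """Epoch seconds of Monday 00:00:00 UTC in the week containing ts."""
    day = datetime.fromtimestamp(day_start(ts), tz=timezone.utc)
    monday = day - timedelta(days=day.weekday())
    return int(monday.timestamp())


def matches(pattern: str, value) -> bool:
    """Regular-expression check: True only if value is a string fully matching pattern."""
    return isinstance(value, str) and re.fullmatch(pattern, value) is not None


def is_null(value) -> bool:
    """NULL check: treats None, empty strings and the literal text 'null' as missing."""
    return value is None or (isinstance(value, str)
                             and value.strip().lower() in ("", "null"))


ts = 1531226096  # 2018-07-10 12:34:56 UTC
buckets = (day_start(ts), week_start(ts), five_minute_start(ts))
```

Bucketing a timestamp this way gives every record in the same day, week or five-minute window an identical key, which is what a grouping or de-duplication step in a cleaning template would need.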

The invention provides a template-based big data development approach that supplies not only encapsulated classes and methods but also directly runnable development templates, each customized for a different scenario. A user only needs to import a template to run it directly, can modify the available methods following the template samples, and can adjust parameters according to the comments. The development template is immediately usable and consists of readable, thoroughly commented code. Each time a Spark task is created, it can be created directly from a template; an existing template can also be improved and extended to create the user's own template.

The method provided by the embodiment of the invention further improves developer efficiency, lowers the barrier to entry, sets up a Spark big data development environment in one stop, provides easy-to-use methods, and adds support for a variety of data sources.

In a preferred implementation of this embodiment (see Fig. 2), the step of carrying out big data development using the development template includes:

Step S5, following the comments, selecting the required data source write method, a suitable RDD operator, and the required data source input method;

Take developing a simple Spark task as an example: ordinarily, this requires acquiring background knowledge of Spark, locating dependencies, setting up a development environment, becoming familiar with the interfaces, carrying out customized development according to the Spark programming conventions, and understanding Spark's existing RDD-based methods. With the development method of the invention, a Spark task can be debugged simply by creating a Maven project, downloading and importing the SDK, and importing the template along with it. The developer picks one of the three categories of template class to create, and then, following the comments, selects the required data source write method, a suitable RDD operator, and the required data source input method.
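The selection described in step S5 can be pictured as choosing three interchangeable pieces by name. In the sketch below, the in-memory `STORE` dictionary stands in for real connectors, and the source names, operator names and `run_task` function are hypothetical; real templates would present these choices as commented code rather than as a lookup table.

```python
from typing import Callable, Dict, List

Rows = List[dict]

# In-memory stand-ins for real data sources (assumption: no actual connectors are used).
STORE: Dict[str, Rows] = {
    "mysql": [{"id": 1, "v": 3}, {"id": 2, "v": -1}],
    "hdfs": [],
}


def make_reader(source: str) -> Callable[[], Rows]:
    """Return a function that reads all rows from the named source."""
    if source not in STORE:
        raise KeyError(f"unknown data source: {source}")
    return lambda: list(STORE[source])


def make_writer(target: str) -> Callable[[Rows], None]:
    """Return a function that appends rows to the named target."""
    if target not in STORE:
        raise KeyError(f"unknown data source: {target}")
    return lambda rows: STORE[target].extend(rows)


# Hypothetical operator choices; each maps a list of rows to a new list of rows.
OPERATORS: Dict[str, Callable[[Rows], Rows]] = {
    "filter_positive": lambda rows: [r for r in rows if r["v"] > 0],
    "double": lambda rows: [{**r, "v": r["v"] * 2} for r in rows],
}


def run_task(source: str, operator: str, target: str) -> Rows:
    """Read from source, apply the chosen operator, and write the result to target."""
    rows = OPERATORS[operator](make_reader(source)())
    make_writer(target)(rows)
    return rows


out = run_task("mysql", "filter_positive", "hdfs")
```

Because the source and target are resolved by name at run time, switching a task from one data source to another changes only the arguments to `run_task`, not the processing code.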

Step S6, modifying or deleting the code as needed.

The developer does not need to attend to low-level details: all data source operations and RDD operator operations are presented as code in the template class, and the developer completes development simply by modifying or deleting code as needed. An experienced developer can choose the general template, saving the cost of establishing code conventions and programming data input and output. Users can also customize their own development templates, building new templates from the three template types; for collaborative development among several people, only the template needs to be shared.

The functions of the entire SDK are shown in Table 2 below:

The above is only a simple example. In practice, a high barrier to entry and repetitive work have long been major problems troubling developers, and the template-based development approach solves exactly these problems.

The big data development method of the embodiment of the invention builds on an existing IDE and only requires importing the SDK directly, so it is convenient to use. The template classes of the invention are directly runnable and presented as code, making them easy to modify and extend, and the template development classes of the invention are thereby established. The invention improves development efficiency, lowers the barrier to entry, suits both solo development and multi-person collaboration, and accelerates development in the simplest and most effective manner. The big data development method and platform provided by the invention therefore have broad application prospects in fields such as big data development. It should be noted that implementing the invention requires the support of existing development tools. The invention covers a wide range of data sources, including MongoDB, HDFS, Hive, HBase, MySQL and Kafka, and the supported operators include all RDD operators listed on the official Spark website, with their method types and usage examples.

The second embodiment is as follows:

Fig. 3 is a structural diagram of a Spark-based big data development device according to the second embodiment of the present invention. The device includes an installation unit 1, a compiling unit 2 connected to the installation unit 1, an adding unit 3 connected to the compiling unit 2, and a development unit 4 connected to the adding unit 3, wherein:

The installation unit 1 is used for installing an integrated development environment so that a template project can be imported;

An IDE typically includes a code editor, a compiler, a debugger, and graphical user interface tools. The IDE in the embodiment of the present invention may be a development environment such as IDEA or Eclipse. As in step S1 of the first embodiment, after the integrated development environment is installed, a Maven repository and the Maven plug-in of the IDE also need to be installed so that the development tool can import the template project.

The compiling unit 2 is used for downloading the latest template project, then compiling and packaging it to generate a software development kit;

The adding unit 3 is used for adding the software development kit to the integrated development environment to form a development template;

All template-related dependencies are integrated into the SDK. When the SDK is imported, every dependency of the template classes is pulled in, so dependency configuration is completed in a single step. The dependencies of the development template are thus supplied by the SDK, while the development template itself is not a compiled toolkit but quickly runnable code with detailed comments. There are three types of development template: a general template, a data cleaning template and a Spark operator template. The general template integrates code for reading a data source, simple data processing code and data storage code; the data cleaning template adds ETL steps such as filtering, de-duplication and merging on top of the general template; and the Spark operator template adds usage examples of complex Spark operators such as Aggregate on top of the data cleaning template.
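As an illustration of the kind of usage example an operator template might carry, the sketch below emulates the semantics of Spark's RDD `aggregate` operator in plain Python: a sequence function folds each partition starting from a zero value, and a combine function merges the per-partition results. The nested list standing in for a partitioned RDD is an assumption; on a real cluster the same two functions would be passed to the RDD's `aggregate` method.

```python
from functools import reduce
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def aggregate(partitions: List[List[T]], zero: U,
              seq_op: Callable[[U, T], U],
              comb_op: Callable[[U, U], U]) -> U:
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Two "partitions" of numbers; compute (sum, count) in one pass, then the mean.
partitions = [[1, 2, 3], [4, 5]]
SumCount = Tuple[int, int]
total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seq_op: add one element to a partial
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # comb_op: merge two partials
)
mean = total / count
```

The accumulator type (a sum and count pair) differs from the element type (a number), which is the case `aggregate` is designed for and the reason it takes two separate functions.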

The development unit 4 is used for creating a new big data development project and carrying out big data development using the development template;

Specifically, a new big data development project is created, and once the SDK has been imported and a development template has been created, all functions in the SDK are available. The template supports secondary development and code contributions, so a Spark task can be developed at minimal time cost.

The SDK includes data source adaptation, general methods, configuration methods and templates. Data source adaptation covers writing to and output from data sources such as MongoDB, HDFS, Hive, HBase and MySQL. The general methods include time normalization helpers such as the start-of-day timestamp, the start-of-week timestamp and the start-of-five-minute-interval timestamp, as well as regular-expression checks on strings, NULL-value checks and dynamic switching of data sources. Configuration management is also provided for dynamic parameters, including association with local configuration, HDFS configuration and key-value store configuration. All three templates contain the reading and normalization of input parameters, the input and output of data, and the selection of intermediate cleaning methods.
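Reading and normalizing input parameters might, for instance, look like the following sketch. The `key=value` argument format, the parameter names in `DEFAULTS` and the function name are all assumptions for illustration; the source does not describe the parameter format used by the templates.

```python
from typing import Dict, List

# Hypothetical parameters with their default values.
DEFAULTS: Dict[str, str] = {"source": "hdfs", "target": "hive", "date": ""}


def normalize_params(argv: List[str]) -> Dict[str, str]:
    """Parse key=value arguments into a dict with lower-cased keys and trimmed values."""
    params = dict(DEFAULTS)
    for arg in argv:
        if "=" not in arg:
            raise ValueError(f"expected key=value, got {arg!r}")
        key, value = arg.split("=", 1)
        key = key.strip().lstrip("-").lower()   # accept "--Source" as "source"
        if key not in DEFAULTS:
            raise ValueError(f"unknown parameter: {key}")
        params[key] = value.strip()
    return params


params = normalize_params(["--Source=MySQL ", "date=2018-07-10"])
```

Normalizing parameters once at the start of a task means the reading, cleaning and writing steps can all rely on the same keys being present, with defaults filled in for anything the caller omitted.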

Importing the SDK completes dependency import, and importing the template class completes deployment of the whole development template. These two steps are all that is needed to configure a Spark task development project, which is very friendly to new users and can be compiled and run directly.

In a preferred implementation of this embodiment, the development unit 4 is further configured to:

modify or delete the code as needed.

Take developing a simple Spark task as an example: ordinarily, this requires acquiring background knowledge of Spark, locating dependencies, setting up a development environment, becoming familiar with the interfaces, carrying out customized development according to the Spark programming conventions, and understanding Spark's existing RDD-based methods. With the development method of the invention, a Spark task can be debugged simply by creating a Maven project, downloading and importing the SDK, and importing the template along with it. The developer picks one of the three categories of template class to create, and then, following the comments, selects the required data source write method, a suitable RDD operator, and the required data source input method.

The big data development method of the embodiment of the invention builds on an existing IDE and only requires importing the SDK directly, so it is convenient to use. The template classes of the invention are directly runnable and presented as code, making them easy to modify and extend, and the template development classes of the invention are thereby established. The invention improves development efficiency, lowers the barrier to entry, suits both solo development and multi-person collaboration, and accelerates development in the simplest and most effective manner. The big data development method and platform provided by the invention therefore have broad application prospects in fields such as big data development. It should be noted that implementing the invention requires the support of existing development tools. The invention covers a wide range of data sources, including MongoDB, HDFS, Hive, HBase, MySQL and Kafka, and the supported operators include all RDD operators listed on the official Spark website, with their method types and usage examples. Although others have encapsulated method packages of this kind, to date none has formed templates, none supports this many data sources, and none provides the one-stop, template-based rapid development approach of the invention.

The third embodiment is as follows:

Fig. 4 shows a structural diagram of a terminal according to the third embodiment of the present invention. The terminal includes a memory 41, a processor 42 and a bus 43, wherein the processor 42 and the memory 41 communicate with each other via the bus 43.

The memory 41 is used for storing various data;

Specifically, the memory 41 stores various data, such as parameters and code used in the big data development process, without limitation here; the memory also stores a plurality of computer programs.

The processor 42 is configured to call various computer programs in the memory 41 to execute a Spark-based big data development method provided in the first embodiment, for example:

The present invention further provides a memory, where the memory stores a plurality of computer programs, and the computer programs are called by the processor to execute a Spark-based big data development method according to the first embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution.

Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A Spark-based big data development method is characterized by comprising the following steps:

installing an integrated development environment for introducing template engineering;

newly building a big data development project, and developing the big data by applying the development template;

the step of newly building big data development engineering and applying the development template to develop big data comprises the following steps:

correspondingly changing according to the codes of the development template to complete big data development;

continuing to expand the development template, simplifying the development process and sharing a code architecture;

the development template is a code with comments and capable of running, and the step of applying the development template to develop big data further comprises the following steps:

modifying or deleting the code as required;

the big data development further comprises: data source adaptation, regular-expression checks on strings, NULL-value checks, dynamic switching of data sources, configuration management of dynamic parameters, local configuration association, HDFS configuration association and key-value store configuration association; the dependencies of the development template are provided by the SDK, and the development template comprises at least one of a general template, a data cleaning template and a Spark operator template; the development template comprises reading and normalizing input parameters, inputting and outputting data, and selecting an intermediate cleaning method.

2. The big data development method according to claim 1, wherein after installing the integrated development environment, the method further comprises: installing a Maven repository and the Maven plug-in of the IDE.

3. A Spark-based big data development device is characterized by comprising:

the installation unit is used for installing an integrated development environment and introducing template engineering;

the development unit is used for newly building a big data development project and applying the development template to develop big data;

the development unit is further to:

the development template is code with comments and capable of running, and the development unit is further configured to:

modifying or deleting the code as required;

4. The big data development device according to claim 3, wherein the installation unit is further used for: installing a Maven repository and the Maven plug-in of the IDE.

5. A memory for Spark-based big data development, the memory storing a computer program, the computer program being executable by a processor to perform the steps of:

the step of newly building the big data development project and applying the development template to develop the big data comprises the following steps:

modifying or deleting the code as required;

6. A terminal for Spark-based big data development, comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of the Spark-based big data development method according to any one of claims 1 to 2 when executing the computer program.