CN118445309A

CN118445309A - Spark engine-based data processing method, device and equipment

Info

Publication number: CN118445309A
Application number: CN202410902911.7A
Authority: CN
Inventors: 吴华夫; 陈辟; 姚诗成; 徐晓兰; 黄志坚
Original assignee: Guangzhou Smart Software Co ltd
Current assignee: Guangzhou Smart Software Co ltd
Priority date: 2024-07-08
Filing date: 2024-07-08
Publication date: 2024-08-06
Anticipated expiration: 2044-07-08
Also published as: CN118445309B

Abstract

The application relates to the technical field of data processing, and provides a Spark engine-based data processing method, device and equipment, wherein the method comprises the following steps: acquiring a data source node selected from a data processing interface, a combined query node and a node circulation relation formed by the data source node and the combined query node through a workflow line, and generating a data processing workflow; responding to the triggering operation of the data source node, acquiring service data from a service database corresponding to the data source node through a Spark engine, and storing the service data into a storage space of a server where the Spark engine is located; responding to the triggering operation of the combined query node, and displaying a combined query interface; responding to trigger operation of selecting and determining a target field from various fields, and acquiring data corresponding to the target field from a storage space through a Spark engine; and displaying the data corresponding to the target field on the combined query interface, thereby improving the data processing efficiency and reducing the data processing cost.

Description

Spark engine-based data processing method, device and equipment

Technical Field

The embodiment of the application relates to the technical field of data processing, in particular to a Spark engine-based data processing method, device and equipment.

Background

Large volumes of service data often contain incomplete, inconsistent, or abnormal records, so processing that data is particularly important. Data processing typically includes data sampling, data splitting, combined queries, merging of multiple data sets, conditional filtering, and data cleansing, among others.

Taking a combined query as an example, the combined query refers to selecting a field from an input data set, and performing aggregate calculation on data corresponding to the field. In the related art, when a user performs a combined query operation on a system interface, because the data volume of the query is huge or the performance of a service library is poor, the system writes service data in a data set into a cache library, then reads fields of the service data and service data corresponding to the fields from the cache library, and performs aggregate calculation on the service data corresponding to the fields.

However, writing service data into the cache library takes time, which reduces data processing efficiency. Maintaining the cache library also incurs additional cost.

Disclosure of Invention

The embodiment of the application provides a Spark engine-based data processing method, device and equipment, which can improve the data processing efficiency and reduce the data processing cost, and the technical scheme is as follows:

In a first aspect, an embodiment of the present application provides a data processing method based on Spark engine, including the steps of:

Acquiring a data source node and a combined query node selected from a data processing interface, together with a node circulation relation formed by connecting the data source node and the combined query node with a workflow line; generating a data processing workflow according to the data source node, the combined query node and the node circulation relation;

Responding to the triggering operation of the data source node, acquiring service data from a service database corresponding to the data source node through a Spark engine, and storing the service data into a storage space of a server where the Spark engine is located;

Responding to the triggering operation of the combined query node, acquiring each field in the service data from the storage space according to the data processing workflow, and displaying a combined query interface; wherein, each field in the service data is displayed on the combined query interface;

Responding to trigger operation of selecting and determining a target field from various fields, and acquiring data corresponding to the target field from a storage space through a Spark engine; and displaying the data corresponding to the target field on the combined query interface.

In a second aspect, an embodiment of the present application provides a data processing apparatus based on Spark engine, including:

The data processing workflow generating module is used for acquiring the data source node, the combined query node and the node circulation relation formed by the data source node and the combined query node through the workflow line, which are selected from the data processing interface; generating a data processing workflow according to the data source nodes, the combined query nodes and the node circulation relation;

The service data storage module is used for responding to the triggering operation of the data source node, acquiring service data from a service database corresponding to the data source node through the Spark engine and storing the service data into a storage space of a server where the Spark engine is located;

The combined query interface display module is used for responding to the triggering operation of the combined query node, acquiring each field in the service data from the storage space according to the data processing workflow, and displaying the combined query interface; wherein, each field in the service data is displayed on the combined query interface;

The data display module is used for responding to the triggering operation of selecting and determining the target field from the fields and acquiring data corresponding to the target field from the storage space through the Spark engine; and displaying the data corresponding to the target field on the combined query interface.

In a third aspect, embodiments of the present application provide a computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect when executing the computer program.

According to the embodiment of the application, a data source node and a combined query node selected from the data processing interface are acquired, together with the node circulation relation formed by connecting them with a workflow line; a data processing workflow is generated according to the data source node, the combined query node and the node circulation relation; in response to a triggering operation on the data source node, service data is acquired through a Spark engine from the service database corresponding to the data source node and stored in a storage space of the server where the Spark engine is located; in response to a triggering operation on the combined query node, each field in the service data is acquired from the storage space according to the data processing workflow, and a combined query interface displaying those fields is shown; in response to a triggering operation that selects and confirms a target field from the fields, the data corresponding to the target field is acquired from the storage space through the Spark engine and displayed on the combined query interface. Because the Spark engine stores the service data of the data source in the storage space of the server where it is located and reads the service data directly from that storage space, the service data does not need to be written into a cache library. This improves data processing efficiency, removes the dependence on a cache library, and reduces data processing cost.

For a better understanding and implementation, the technical solution of the present application is described in detail below with reference to the accompanying drawings.

Drawings

Fig. 1 is a flow chart of a Spark engine-based data processing method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a data processing apparatus based on Spark engine according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.

The Spark engine-based data processing method can be applied to a data processing scene of a business intelligence (Business Intelligence, BI for short) system. The Spark engine-based data processing method provided by the embodiment of the application can be executed by Spark engine-based data processing equipment, the Spark engine-based data processing equipment can be realized in a software and/or hardware mode, and the Spark engine-based data processing equipment can be formed by two or more physical entities or one physical entity. The data processing device based on the Spark engine can be any electronic device for installing data processing software, and the electronic device can be intelligent devices such as a computer, a mobile phone or a tablet.

Referring to fig. 1, fig. 1 is a flowchart of a Spark engine-based data processing method according to a first embodiment of the present application, where the method includes the following steps:

S10: acquiring a data source node and a combined query node selected from a data processing interface, together with a node circulation relation formed by connecting the data source node and the combined query node with a workflow line; and generating a data processing workflow according to the data source node, the combined query node and the node circulation relation.

The data processing interface is used to design a custom data processing workflow. It comprises a node resource area, a canvas area and a toolbar. The node resource area displays a plurality of data source nodes and data processing nodes. The data source nodes include, but are not limited to, example data source nodes, relational data source nodes, data sets and Excel files; each data source node corresponds to a service database in which service data is stored. For example, if the data source node is an Excel file, the service database corresponding to that node stores a large number of service data tables in Excel format. Data processing nodes include, but are not limited to, combined queries, data sampling, data splitting, merging of multiple data sets, data filtering, and data cleansing. The canvas area is used for displaying the data processing workflow, and the toolbar is used for operating on the data processing workflow.

The combined query refers to selecting a field from an input dataset, displaying service data corresponding to the field, and supporting aggregation calculation of the field.

The node circulation relation is used for indicating the execution sequence of each node in the data processing workflow.

In the embodiment of the application, the data processing interface is an ETL flow customization interface and the data processing workflow is an ETL workflow. A user selects a data source node (for example, an example data source node) and a combined query node from the node resource area of the ETL flow customization interface, drags both nodes to the canvas area, and connects them with a workflow line to generate the ETL workflow. The node circulation relation in this ETL workflow is that the data source node flows to the combined query node.
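As a rough illustration of S10, the nodes and workflow lines dragged onto the canvas can be modelled as a small directed graph, with the node circulation relation giving the execution order. The following is only a minimal sketch under stated assumptions: the class name `Workflow`, the node type strings and the identifiers `wf-1`, `src-1` and `query-1` are hypothetical and do not come from the application.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical model of a data processing workflow: nodes plus workflow lines."""
    workflow_id: str
    nodes: dict = field(default_factory=dict)  # node_id -> node type
    lines: list = field(default_factory=list)  # (from_node_id, to_node_id)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def connect(self, src, dst):
        # A workflow line may only join nodes that are already on the canvas.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both ends of a workflow line must be existing nodes")
        self.lines.append((src, dst))

    def execution_order(self):
        """Return node ids in an order that respects every workflow line."""
        indegree = {node_id: 0 for node_id in self.nodes}
        for _, dst in self.lines:
            indegree[dst] += 1
        ready = deque(n for n, d in indegree.items() if d == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for src, dst in self.lines:
                if src == node:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.nodes):
            raise ValueError("workflow lines form a cycle")
        return order


wf = Workflow("wf-1")
wf.add_node("src-1", "data_source")
wf.add_node("query-1", "combined_query")
wf.connect("src-1", "query-1")
print(wf.execution_order())
```

With a single workflow line from the data source node to the combined query node, the order places the data source node first, which matches the circulation relation described above.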

S20: in response to the triggering operation on the data source node, acquiring service data from a service database corresponding to the data source node through the Spark engine, and storing the service data into a storage space of a server where the Spark engine is located.

The Spark engine is a fast and general-purpose computing engine designed for large-scale data processing.

In the embodiment of the application, the ETL workflow and the Spark engine have a binding relation, and the ETL workflow is executed through the Spark engine. The user clicks the data source node in the canvas area, the BI system detects the operation of clicking the data source node by the user, the BI system calls the Spark engine, service data are obtained from a service database corresponding to the data source node, and the service data are stored in a storage space of a server where the Spark engine is located.

S30: responding to the triggering operation of the combined query node, acquiring each field in the service data from the storage space according to the data processing workflow, and displaying a combined query interface; wherein, the combined inquiry interface displays each field in the service data.

In the embodiment of the application, a user clicks a combined query node in the canvas area, and the BI system detects the click. The BI system generates a Spark data table according to the data processing workflow, acquires each field in the service data through the Spark engine from the storage space of the server where the Spark engine is located according to the Spark data table, and displays a combined query interface according to the fields. The combined query interface can be displayed as a popup window on the ETL flow customization interface, or as a new interface reached by navigating away from the ETL flow customization interface.

S40: responding to trigger operation of selecting and determining a target field from various fields, and acquiring data corresponding to the target field from a storage space through a Spark engine; and displaying the data corresponding to the target field on the combined query interface.

Wherein the target field is at least one of the respective fields.

In the embodiment of the application, the combined query interface comprises a data selection area, a data display area and a toolbar. The data selection area displays the fields of the service data, the names of the data sources and the table names of the Spark tables. The data display area is used for displaying service data. The toolbar is used for operating the combined query interface and comprises function buttons such as save, refresh and view SQL. The save button saves the current combined query, the refresh button refreshes the current combined query data, and the view SQL button shows the SQL statements executed by the current combined query.

The user selects a target field from a plurality of fields in the data selection area, the BI system detects the selection operation of the user on the target field, the data corresponding to the target field is obtained from the storage space of the server where the Spark engine is located through the Spark engine, and the target field and the data corresponding to the target field are displayed in the data display area. Specifically, the data display area is displayed in a table form, the target field is the table head of each column in the table, and the data corresponding to the target field is specific data of each column in the table.

By applying the embodiment of the application, a data source node and a combined query node selected from the data processing interface are acquired, together with the node circulation relation formed by connecting them with a workflow line; a data processing workflow is generated according to the data source node, the combined query node and the node circulation relation; in response to a triggering operation on the data source node, service data is acquired through a Spark engine from the service database corresponding to the data source node and stored in a storage space of the server where the Spark engine is located; in response to a triggering operation on the combined query node, each field in the service data is acquired from the storage space according to the data processing workflow, and a combined query interface displaying those fields is shown; in response to a triggering operation that selects and confirms a target field from the fields, the data corresponding to the target field is acquired from the storage space through the Spark engine and displayed on the combined query interface. Because the Spark engine stores the service data of the data source in the storage space of the server where it is located and reads the service data directly from that storage space, the service data does not need to be written into a cache library. This improves data processing efficiency, removes the dependence on a cache library, and reduces data processing cost.

In an alternative embodiment, step S20 includes S201 to S203, which is specifically as follows:

S201: in response to the triggering operation on the data source node, acquiring the identification of the data source node and the identification of the data processing workflow where the data source node is located.

Wherein the identification of the data source node is used to uniquely identify the data source node. In particular, the identification of the data source node may be a number, letter, or the like.

The identification of the data processing workflow in which the data source node is located is used to uniquely identify the data processing workflow. Specifically, this identification may likewise be a number, a letter, or the like.

In an embodiment of the application, the canvas area may have a plurality of ETL workflows, each of which may have a corresponding data source node. And clicking a data source node in one ETL workflow by a user in the canvas area, and acquiring the identification of the data source node and the identification of the ETL workflow where the data source node is located.

S202: generating a storage address on the server where the Spark engine is located according to the identification of the data source node and the identification of the data processing workflow.

In the embodiment of the application, a preset storage address generation algorithm is utilized to generate the storage address of the server where the Spark engine is located according to the identification of the data source node and the identification of the data processing workflow.

S203: acquiring service data from a service database corresponding to the data source node through the Spark engine, and storing the service data into a storage space corresponding to the storage address.

In the embodiment of the application, the Spark engine stores the service data acquired from the service database corresponding to the data source node into the storage space corresponding to the storage address, specifically, the service data can be stored into the storage space corresponding to the storage address in a preset file format, and the service data can be compressed to store the compressed file into the storage space corresponding to the storage address.

Based on the identification of the data source node and the identification of the data processing workflow, a storage address of the business data in the storage space is generated, and as the BI system and the Spark engine can acquire the identification of the data source node and the identification of the data processing workflow, the Spark engine is not required to inform the BI system of the storage address of the business data when a combined query interface is displayed subsequently, so that the data processing flow is saved, and the data processing efficiency is further improved. In addition, because the identification of the data source node and the identification of the data processing workflow are unique, the generated storage address is also unique according to the identification of the data source node and the identification of the data processing workflow, and the disorder of the storage address of the service data can be avoided.
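The application does not disclose the preset storage address generation algorithm. The following is only a minimal sketch of one deterministic scheme, assuming the address is a base directory followed by the workflow identification, the node identification and a fixed `dataset` suffix. The function name `build_storage_address`, the base directory `/data/event` and the identifiers are hypothetical.

```python
from pathlib import PurePosixPath


def build_storage_address(base_dir, workflow_id, node_id):
    """Derive a storage address from the two identifications (assumed layout)."""
    for part in (workflow_id, node_id):
        # Reject empty values and path separators so that two different
        # identification pairs can never map to the same directory.
        if not part or "/" in part or "\\" in part or part in (".", ".."):
            raise ValueError(f"invalid identifier: {part!r}")
    return str(PurePosixPath(base_dir) / workflow_id / node_id / "dataset")


# Both sides compute the address independently from the same identifications,
# so neither has to send the address to the other.
bi_side = build_storage_address("/data/event", "wf-1", "src-1")
spark_side = build_storage_address("/data/event", "wf-1", "src-1")
print(bi_side)
print(bi_side == spark_side)
```

Because the function is pure and depends only on the two identifications, the BI system and the Spark engine arrive at the same address, and distinct identification pairs give distinct addresses.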

In an alternative embodiment, step S30 includes S301 to S304, which are specifically as follows:

S301: in response to the triggering operation on the combined query node, acquiring the identification of the data source node according to the node circulation relation of the data processing workflow.

In the embodiment of the application, considering the situation that a plurality of data source nodes and a plurality of data processing nodes exist in the data processing workflow, after a user clicks a combined query node, the BI system acquires the identification of the data source node of the data processing workflow where the combined query node is located according to the node circulation relationship of the data processing workflow where the combined query node is located.

S302: acquiring an identification of a data processing workflow; and generating a first address according to the identification of the data processing workflow and the identification of the data source node.

In the embodiment of the application, the BI system acquires the identification of the data processing workflow where the combined query node is located, and generates the first address according to the identification of the data processing workflow and the identification of the data source node by adopting a preset address generation algorithm.

S303: generating data table metadata according to the preset data table name, the preset data format and the first address.

The preset data table names can be automatically generated according to preset rules, and preset data formats can be set manually according to actual requirements. Preset data formats include, but are not limited to PARQUET and ORC formats.

In the embodiment of the present application, the BI system generates the data table metadata tableSchema according to the preset data table name, the preset data format and the first address, and examples of the data table metadata tableSchema are as follows:

{
  "tableName": "table1",
  "storageFormat": "PARQUET",
  "location": "file:///D:\\SmartbiMining\\data\\event\\I8a8a9f280189f2c0f2c00b180189f86193fe006a\\c82f011a84b40e0a4c1a0c29866420d9/dataset"
}

Here, table1 is the preset data table name, PARQUET is the preset data format, and file:///D:\\SmartbiMining\\data\\event\\I8a8a9f280189f2c0f2c00b180189f86193fe006a\\c82f011a84b40e0a4c1a0c29866420d9/dataset is the first address.

S304: according to the data table metadata, acquiring each field of the service data from the storage space through the Spark engine, and displaying a combined query interface.

In the embodiment of the application, the BI system accesses the Spark engine through the JDBC driver, sends the data table metadata to the Spark engine, and the Spark engine acquires each field of the service data from the storage space according to the preset data format and the first address in the data table metadata and displays the combined query interface according to the preset data table name and each field.

The Spark engine can automatically and quickly acquire service data from the storage space based on the data table metadata generated by the BI system, so that the data processing efficiency is improved.
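Steps S301 to S303 can be sketched as follows: walk the workflow lines upstream from the combined query node to find the data source node, build the first address from the two identifications, and assemble the data table metadata. This is a minimal sketch under stated assumptions. The key names `tableName`, `storageFormat` and `location` follow the tableSchema example in this description, while the function names, node type strings, address layout and identifiers are hypothetical.

```python
import json


def find_source_nodes(lines, node_types, query_node_id):
    """Follow workflow lines upstream from the query node; collect data source nodes."""
    found, stack, seen = [], [query_node_id], set()
    while stack:
        current = stack.pop()
        for src, dst in lines:
            if dst == current and src not in seen:
                seen.add(src)
                if node_types[src] == "data_source":
                    found.append(src)
                else:
                    stack.append(src)  # keep walking through intermediate nodes
    return found


def build_table_schema(table_name, storage_format, location):
    """Assemble data table metadata with the keys used in the tableSchema example."""
    return {
        "tableName": table_name,
        "storageFormat": storage_format,
        "location": location,
    }


workflow_id = "wf-1"
node_types = {"src-1": "data_source", "filter-1": "data_filter", "query-1": "combined_query"}
lines = [("src-1", "filter-1"), ("filter-1", "query-1")]

source_ids = find_source_nodes(lines, node_types, "query-1")
# Assumed address layout: base directory / workflow id / node id / dataset.
first_address = f"file:///data/event/{workflow_id}/{source_ids[0]}/dataset"
schema = build_table_schema("table1", "PARQUET", first_address)
print(json.dumps(schema, indent=2))
```

The resulting JSON document is what would be sent to the Spark engine in S304; how the engine registers the table from it is not specified in this description.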

In an alternative embodiment, the step of storing the service data in a storage space of a server where the Spark engine is located includes:

S21: storing the service data, as a data file in a first data format, into a storage space of the server where the Spark engine is located.

In the embodiment of the application, the first data format is PARQUET, and the service data is stored in the storage space corresponding to the storage address of the server where the Spark engine is located in the file format of PARQUET through the Spark engine.

According to the data table metadata, acquiring each field of service data from a storage space through a Spark engine, and displaying a combined query interface, wherein the method comprises the following steps:

S31: according to the first address, acquiring a data file corresponding to the service data from the storage space through the Spark engine.

In the embodiment of the application, the first address and the storage address are generated according to the identification of the data source node and the identification of the data processing workflow, so that the first address and the storage address are consistent.

And according to the first address, acquiring a data file corresponding to the service data from a storage space corresponding to the storage address through the Spark engine.

S32: matching the preset data format against the first data format through the Spark engine, and, if the two formats match, parsing the data file to obtain the service data.

In the embodiment of the application, after the data file corresponding to the service data is obtained through the Spark engine, the preset data format must be matched against the first data format. If the two formats are consistent, the data file is parsed to obtain the service data. If they are inconsistent, a prompt is given indicating that the data formats are inconsistent and that parsing has failed.

S33: fields are identified from the business data, and the names and types of the respective fields are obtained.

In the embodiment of the application, the service data comprises a plurality of dimension fields and a plurality of measurement fields, and each field has a corresponding name and type. Specifically, field types include, but are not limited to integer, floating point, and string.

S34: generating a data table according to the names and types of the fields in the service data and the preset data table name.

In the embodiment of the application, the Spark engine generates a Spark data table according to the names and types of all fields in the service data and the preset data table names.

S35: displaying the combined query interface according to the data table.

In the embodiment of the application, the name of the data table displayed by the combined query interface is the name of the Spark data table, and each field displayed by the combined query interface is each field in the Spark data table.

By matching the preset data format against the first data format, the data file corresponding to the service data can be parsed successfully, so that each field of the service data is obtained and the combined query interface is displayed.
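The logic of S32 to S34 can be illustrated without a Spark installation. The sketch below checks that the preset format matches the stored format, then derives field names and types from already-parsed rows and builds a table description. It is only a stand-in under stated assumptions: a real Parquet or ORC file carries its own schema, so the type inference here is purely illustrative, and the function names and sample rows are hypothetical.

```python
def infer_type(values):
    """Classify a column as integer, float or string from its non-null values."""
    non_null = [v for v in values if v is not None]
    if non_null and all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "integer"
    if non_null and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "float"
    return "string"


def build_data_table(table_name, preset_format, file_format, rows):
    """Match the formats (S32), identify fields (S33) and build the table (S34)."""
    if preset_format.upper() != file_format.upper():
        raise ValueError(
            f"data format mismatch: expected {preset_format}, found {file_format}; parsing failed"
        )
    names = list(rows[0].keys()) if rows else []
    fields = [{"name": n, "type": infer_type([r.get(n) for r in rows])} for n in names]
    return {"tableName": table_name, "fields": fields}


rows = [
    {"region": "East", "orders": 12, "amount": 30.5},
    {"region": "West", "orders": 7, "amount": 18.0},
]
table = build_data_table("table1", "PARQUET", "parquet", rows)
for f in table["fields"]:
    print(f["name"], f["type"])
```

Here `region` plays the role of a dimension field and `orders` and `amount` the role of measurement fields; a format mismatch raises an error instead of attempting to parse the file.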

In an alternative embodiment, after step S304, steps S3041 to S3042 are included, which is specifically as follows:

S3041: storing the data table metadata to the first storage space.

In the embodiment of the application, after generating the data table metadata, the BI system stores the data table metadata to a preset position of a first storage space, wherein the first storage space is a storage space of a server where the BI system is located.

S3042: if the process of the data processing workflow is restarted, recovering the data table metadata from the first storage space.

In the embodiment of the application, the process for executing the ETL workflow may be abnormal, so that the process needs to be restarted. When the process of the ETL workflow is restarted, the data table metadata can be directly retrieved from the first storage space by using a recovery tool.

By storing the data table metadata into the first storage space, the data table metadata can be recovered in time after the process is restarted, the data table metadata do not need to be regenerated, and the data processing efficiency is improved.
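Steps S3041 and S3042 amount to persisting the metadata and reading it back after a restart. The application does not describe the recovery tool, so the following is only a minimal sketch that assumes the metadata is kept as a JSON file. The function names and the file name `table1.schema.json` are hypothetical.

```python
import json
import os
import tempfile


def save_table_schema(schema, path):
    """Write the metadata to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(schema, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written file behind


def recover_table_schema(path):
    """Return the stored metadata, or None if nothing was saved."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "table1.schema.json")
    schema = {
        "tableName": "table1",
        "storageFormat": "PARQUET",
        "location": "file:///data/event/wf-1/src-1/dataset",
    }
    save_table_schema(schema, path)
    recovered = recover_table_schema(path)  # what a restarted process would do
    missing = recover_table_schema(os.path.join(workdir, "absent.json"))

print(recovered == schema)
```

Writing to a temporary file and renaming it is a design choice of this sketch: a process that crashes mid-write then leaves either the old metadata or none, never a truncated file.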

In an alternative embodiment, step S40 includes steps S401 to S402, which are specifically as follows:

S401: in response to a trigger operation that selects and confirms a target field from among the fields, generating a data query statement.

In the embodiment of the application, the user selects the target field in the combined query interface, and the BI system automatically generates the data query statement according to the target field selected by the user. Wherein, the data query statement is an SQL statement.

S402: sending the data query statement to the Spark engine, so that the Spark engine acquires the data corresponding to the target field from the storage space according to the data query statement.

In the embodiment of the application, the BI system accesses the Spark engine through the JDBC driver and sends the data query statement to the Spark engine. According to the data query statement, the Spark engine acquires the data corresponding to the target field in the service data from the storage space corresponding to the storage address on the server where the Spark engine is located. The JDBC driver is a self-developed driver used by the BI system to access the Spark engine.

By means of the Spark engine, the BI system can acquire service data from the storage space, a cache library is not needed to be relied on, data processing efficiency is improved, and data processing cost is reduced.
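The statement generation in S401 can be sketched as building a SELECT over the selected target fields. The application does not show the generated SQL, so the shape below is an assumption, and the function names `quote_identifier` and `build_query` are hypothetical. Identifiers are wrapped in backticks, the quoting style Spark SQL accepts, with embedded backticks doubled.

```python
def quote_identifier(name):
    """Quote a table or column name, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_query(table_name, target_fields):
    """Build a SELECT statement for the target fields chosen by the user."""
    if not target_fields:
        raise ValueError("at least one target field is required")
    columns = ", ".join(quote_identifier(f) for f in target_fields)
    return f"SELECT {columns} FROM {quote_identifier(table_name)}"


sql = build_query("table1", ["region", "amount"])
print(sql)
```

Quoting every identifier keeps field names that contain spaces or reserved words from breaking the statement. Sending the statement over JDBC is left out because the driver described here is not publicly specified.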

In an alternative embodiment, the Spark engine-based data processing method further includes:

S50: in response to a triggering operation on the target field, displaying a plurality of aggregation processing modes for the target field.

In embodiments of the present application, a variety of aggregation processing modes may be provided for the target field, including, but not limited to, total (sum), maximum, minimum, average, count, and distinct count. The available aggregation processing modes can be set according to the field type of the target field. Specifically, if the field type is a character type, the aggregation processing modes include count and distinct count. If the field type is integer or floating point, the aggregation processing modes include total (sum), maximum, minimum, average, count, and distinct count.

S60: responding to a selection instruction aiming at a target aggregation processing mode, and carrying out aggregation processing on data corresponding to the target field to obtain an aggregation processing result; displaying the aggregation processing result on the combined query interface; wherein the target aggregation processing mode is one selected from the plurality of aggregation processing modes.

In the embodiment of the application, the user can click a target field in the data display area of the combined query interface to open an operation drop-down box for that field. The drop-down box displays the aggregation processing modes available for the target field. After the user selects one of them as the target aggregation processing mode, the data corresponding to the target field is aggregated according to that mode to obtain an aggregation processing result, which is displayed on the combined query interface.

By setting a plurality of aggregation processing modes for the target field, a user can aggregate the query data according to the requirements.

In an alternative embodiment, step S60 includes steps S601 to S602, which are specifically as follows:

S601: determining an aggregation function corresponding to the target aggregation processing mode in response to the selection instruction aiming at the target aggregation processing mode.

In the embodiment of the application, each aggregation processing mode corresponds to one aggregation function, and after a user selects a target aggregation processing mode, the BI system acquires the aggregation function corresponding to the target aggregation mode.
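One way to picture the one-to-one correspondence between aggregation processing modes and aggregation functions is a table of SQL aggregate expressions. The mode labels and the helper `aggregate_expression` are assumptions for illustration; the SQL functions themselves (SUM, MAX, MIN, AVG, COUNT, COUNT(DISTINCT ...)) are standard, but the application does not state which functions it actually uses.

```python
# Hypothetical mapping from aggregation mode to a SQL aggregate template.
AGGREGATE_SQL = {
    "sum": "SUM({col})",
    "max": "MAX({col})",
    "min": "MIN({col})",
    "avg": "AVG({col})",
    "count": "COUNT({col})",
    "count_distinct": "COUNT(DISTINCT {col})",
}


def aggregate_expression(mode: str, column: str) -> str:
    """Return the SQL aggregate expression for a mode applied to a column."""
    try:
        template = AGGREGATE_SQL[mode]
    except KeyError:
        raise ValueError(f"unsupported aggregation mode: {mode}") from None
    # Backtick-quote the column name, doubling any embedded backtick.
    quoted = "`" + column.replace("`", "``") + "`"
    return template.format(col=quoted)


print(aggregate_expression("avg", "amount"))
# prints: AVG(`amount`)
```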

S602: according to the aggregation function, carrying out an aggregation operation on the data corresponding to the target field to obtain an aggregation operation result.

In the embodiment of the application, the BI system can rapidly calculate the aggregation operation result of the data corresponding to the target field according to the aggregation function corresponding to the target aggregation mode.
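For a small in-memory illustration of the aggregation operation itself, the sketch below applies a function chosen by mode to the values of one field. It assumes the values have already been fetched as a Python list and that null values are skipped, as SQL aggregates do; the names `AGGREGATE_FUNCTIONS` and `aggregate` are hypothetical, and nothing here is taken from the application's actual implementation.

```python
from statistics import mean

# Hypothetical mapping from aggregation mode to an in-memory function.
AGGREGATE_FUNCTIONS = {
    "sum": sum,
    "max": max,
    "min": min,
    "avg": mean,
    "count": len,
    "count_distinct": lambda values: len(set(values)),
}


def aggregate(mode, values):
    """Apply the aggregation function for `mode`, ignoring None values."""
    non_null = [v for v in values if v is not None]
    # Note: "max", "min" and "avg" raise an error on an empty list.
    return AGGREGATE_FUNCTIONS[mode](non_null)


amounts = [120, 80, None, 80]
print(aggregate("sum", amounts))             # 280
print(aggregate("count", amounts))           # 3
print(aggregate("count_distinct", amounts))  # 2
```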

By setting corresponding aggregation functions for various aggregation processing modes, a user can carry out aggregation operation on query data according to requirements.

Referring to fig. 2, fig. 2 is a schematic structural diagram of a Spark engine-based data processing device according to the present application. The Spark engine-based data processing device 5 provided by the embodiment of the application comprises:

The data processing workflow generating module 51 is configured to obtain a data source node, a combined query node, and a node circulation relationship formed by the data source node and the combined query node through a workflow line, which are selected from the data processing interface; generating a data processing workflow according to the data source nodes, the combined query nodes and the node circulation relation;

The service data storage module 52 is configured to obtain service data from a service database corresponding to the data source node through the Spark engine in response to a triggering operation on the data source node, and store the service data in a storage space of a server where the Spark engine is located;

The combined query interface display module 53 is configured to obtain, according to a data processing workflow, each field in the service data from the storage space in response to a triggering operation on the combined query node, and display a combined query interface; wherein, each field in the service data is displayed on the combined query interface;

The data display module 54 is configured to obtain, by means of a Spark engine, data corresponding to the target field from the storage space in response to a trigger operation for selecting and determining the target field from the fields; and displaying the data corresponding to the target field on the combined query interface.

Fig. 3 is a schematic structural diagram of a computer device according to the present application. As shown in fig. 3, the computer device 21 may include: a processor 210, a memory 211, and a computer program 212 stored in the memory 211 and executable on the processor 210, for example: a Spark engine based data processing program; the processor 210, when executing the computer program 212, implements the steps of the embodiments described above.

The processor 210 may include one or more processing cores. The processor 210 connects the various parts of the computer device 21 through various interfaces and lines, and performs the various functions of the computer device 21 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 211 and by invoking the data stored in the memory 211. Optionally, the processor 210 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 210 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing the content to be displayed on the touch display screen; the modem handles wireless communications. It will be appreciated that the modem may also not be integrated into the processor 210 and may instead be implemented by a separate chip.

The memory 211 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 211 includes a non-transitory computer-readable storage medium. The memory 211 may be used to store instructions, programs, code sets, or instruction sets. The memory 211 may include a storage program area and a storage data area, wherein the storage program area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions, etc.), instructions for implementing the above-described various method embodiments, etc.; the storage data area may store data or the like referred to in the above respective method embodiments. The memory 211 may optionally also be at least one storage device located remotely from the aforementioned processor 210.

The embodiment of the present application further provides a computer storage medium, which may store a plurality of instructions suitable for being loaded by a processor to execute the method steps of the foregoing embodiments. For the specific execution process, refer to the description of the foregoing embodiments; details are not repeated herein.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may also be implemented by a computer program instructing related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the steps of each method embodiment described above may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form, etc.

The present invention is not limited to the above-described embodiments. If various modifications or variations of the present invention do not depart from its spirit and scope and fall within the scope of the claims and their equivalents, the present invention is intended to include such modifications and variations.

Claims

1. A Spark engine-based data processing method, the method comprising the steps of:

acquiring a data source node selected from a data processing interface, a combined query node and a node circulation relation formed by the data source node and the combined query node through a workflow line; generating a data processing workflow according to the data source node, the combined query node and the node circulation relation;

responding to the triggering operation of the data source node, acquiring service data from a service database corresponding to the data source node through the Spark engine, and storing the service data into a storage space of a server where the Spark engine is located;

responding to trigger operation of selecting and determining a target field from the fields, and acquiring data corresponding to the target field from the storage space through the Spark engine; and displaying the data corresponding to the target field on the combined query interface.

2. The Spark engine-based data processing method of claim 1, wherein:

the step of acquiring service data from a service database corresponding to the data source node through the Spark engine and storing the service data into a storage space of a server where the Spark engine is located in response to a triggering operation on the data source node includes:

responding to the triggering operation of the data source node, and acquiring the identification of the data source node and the identification of the data processing workflow where the data source node is located;

generating a storage address of a server where a Spark engine is located according to the identification of the data source node and the identification of the data processing workflow;

And acquiring service data from a service database corresponding to the data source node through the Spark engine, and storing the service data into a storage space corresponding to the storage address.

3. The Spark engine-based data processing method of claim 1, wherein:

Responding to the triggering operation of the combined query node, acquiring each field in the service data from the storage space according to the data processing workflow, and displaying a combined query interface, wherein the step comprises the following steps:

Responding to the triggering operation of the combined query node, and acquiring the identification of a data source node according to the node circulation relation of the data processing workflow;

acquiring an identification of the data processing workflow; generating a first address according to the identification of the data processing workflow and the identification of the data source node;

generating data table metadata according to a preset data table name, a preset data format and the first address;

and according to the data table metadata, acquiring each field of the service data from the storage space through the Spark engine, and displaying a combined query interface.

4. A Spark engine based data processing method as claimed in claim 3, wherein:

the step of storing the service data in a storage space of a server where the Spark engine is located includes:

Storing the business data in a data file of a first data format into a storage space of a server where the Spark engine is located;

the step of obtaining each field of the service data from the storage space through the Spark engine according to the data table metadata and displaying a combined query interface comprises the following steps:

according to the first address, acquiring a data file corresponding to the service data from the storage space through the Spark engine;

Matching the preset data format with the first data format through the Spark engine, and analyzing the data file if the matching is consistent to obtain the service data;

identifying fields from the service data, and obtaining the names and types of the fields;

generating a data table according to the names and types of the fields in the service data and the preset data table names;

and displaying a combined query interface according to the data table.

5. A Spark engine based data processing method as claimed in claim 3, wherein:

After the step of generating the data table metadata according to the preset data table name, the preset data format and the first address, the method comprises the following steps:

Storing the data table metadata into a first storage space;

and if the process of the data processing workflow is restarted, recovering the data table metadata from the first storage space.

6. The Spark engine-based data processing method according to any one of claims 1 to 5, characterized in that:

The step of acquiring, by the Spark engine, data corresponding to the target field from the storage space in response to a trigger operation of selecting and determining the target field from the fields includes:

generating a data query statement in response to a trigger operation of selecting and determining a target field from the fields;

and sending the data query statement to the Spark engine so that the Spark engine acquires the data corresponding to the target field from the storage space according to the data query statement.

7. The Spark engine-based data processing method according to any one of claims 1 to 5, further comprising:

responding to the triggering operation of the target field, and displaying a plurality of aggregation processing modes of the target field;

responding to a selection instruction aiming at a target aggregation processing mode, and carrying out aggregation processing on data corresponding to the target field to obtain an aggregation processing result; displaying the aggregation processing result on the combined query interface; wherein the target aggregation processing mode is one selected from the plurality of aggregation processing modes.

8. The Spark engine-based data processing method of claim 7, wherein:

The step of responding to the selection instruction aiming at the target aggregation processing mode to aggregate the data corresponding to the target field to obtain an aggregation processing result comprises the following steps:

Responding to a selection instruction aiming at a target aggregation processing mode, and determining an aggregation function corresponding to the target aggregation processing mode;

and according to the aggregation function, carrying out aggregation operation on the data corresponding to the target field to obtain an aggregation operation result.

9. A Spark engine-based data processing apparatus, comprising:

the data processing workflow generating module is used for acquiring a data source node selected from a data processing interface, a combined query node and a node circulation relation formed by the data source node and the combined query node through a workflow line; generating a data processing workflow according to the data source node, the combined query node and the node circulation relation;

The combined query interface display module is used for responding to the triggering operation of the combined query node, acquiring each field in the service data from the storage space according to the data processing workflow, and displaying a combined query interface; wherein, each field in the service data is displayed on the combined query interface;

10. A computer device, comprising: a processor, a memory and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 8 when the computer program is executed.