CN112487036A - Data processing method and device

Info

Publication number: CN112487036A
Application number: CN202011397493.9A
Authority: CN
Inventors: 季振宇; 顾晨波; 赵文杰
Original assignee: Guotai Epoint Software Co Ltd
Current assignee: Guotai Epoint Software Co Ltd
Priority date: 2020-12-04
Filing date: 2020-12-04
Publication date: 2021-03-12

Abstract

The application relates to a data processing method and device in the technical field of computers. The method comprises: displaying a visual configuration page; performing data modeling through a data import control in the configuration page to import data to be processed into a hive library; configuring a data processing model according to data processing requirements through a data processing control in the configuration page to obtain target SQL; when the execution requirement of the target SQL is online execution, executing the target SQL using presto; and when the execution requirement of the target SQL is timing execution (that is, scheduled execution), executing the target SQL using hive. This solves the problem that executing only hive SQL offers a single data processing mode that cannot meet users' data processing requirements: hive SQL and presto SQL are switched automatically to meet those requirements.

Description

Data processing method and device

Technical Field

The application relates to a data processing method and device, and belongs to the technical field of computers.

Background

Hive is a data warehouse tool based on Hadoop. It is used for data extraction, transformation and loading, and provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. The hive data warehouse tool can map structured data files to database tables, provide an SQL query function, and convert SQL statements into MapReduce tasks for execution.

However, hive is not suitable for online transaction processing and does not provide a real-time query function, so it cannot meet real-time online data processing requirements.

Disclosure of Invention

The application provides a data processing method and device that can solve the problem that executing only hive SQL offers a single data processing mode that cannot meet users' data processing requirements. The application provides the following technical solutions:

In a first aspect, a data processing method is provided, the method including:

displaying a visual configuration page;

performing data modeling through a data import control in the configuration page to import data to be processed into a hive library;

configuring a data processing model according to data processing requirements through a data processing control in the configuration page to obtain target SQL;

when the execution requirement of the target SQL is online execution, executing the target SQL by using presto;

when the execution requirement of the target SQL is timing execution, using hive to execute the target SQL.

Optionally, performing data modeling through the data import control in the configuration page includes:

configuring a database connection through the data import control, and synchronizing the table structure into platform metadata by querying the database over a direct connection;

and/or,

configuring a mapping relation through the data import control, and importing data from a relational library to the hive library;

and/or,

configuring a mapping relation through the data import control, and importing data from an unstructured file to the hive library;

and/or,

importing a library table resource through the data import control; and when the library table resource supports configuring a timing task, synchronizing data to the hive library on a schedule.

Optionally, configuring the data processing model according to data processing requirements through the data processing control in the configuration page to obtain the target SQL includes:

displaying fields in an input stream in the configuration page; and receiving the user's selection operation on the fields to obtain target SQL comprising the selected fields;

and/or,

receiving an association operation performed on two input tables in the configuration page to obtain the target SQL;

and/or,

when the default functions of the hive library do not support the data processing requirement, acquiring a custom function package, wherein the custom function package comprises a custom function supporting the data processing requirement; and

registering the custom function package to the hive library, and triggering execution of the step of configuring a data processing model according to data processing requirements through the data processing control in the configuration page to obtain the target SQL, wherein the data processing control corresponds to the functions supported by the hive library.

In a second aspect, there is provided a data processing apparatus, the apparatus comprising:

the page display module is used for displaying a visual configuration page;

the data modeling module is used for carrying out data modeling through the data import control in the configuration page so as to import the data to be processed into the hive library;

the model establishing module is used for configuring a data processing model according to data processing requirements through the data processing control in the configuration page to obtain target SQL;

the first execution module is used for executing the target SQL by using presto when the execution requirement of the target SQL is online execution;

and the second execution module is used for executing the target SQL by using hive when the execution requirement of the target SQL is timing execution.

Optionally, the data modeling module is configured to:

and/or,

Optionally, the model building module is configured to:

The beneficial effects of this application are as follows. A visual configuration page is displayed; data modeling is performed through a data import control in the configuration page to import data to be processed into a hive library; a data processing model is configured according to data processing requirements through a data processing control in the configuration page to obtain target SQL; when the execution requirement of the target SQL is online execution, the target SQL is executed using presto; and when the execution requirement of the target SQL is timing execution, the target SQL is executed using hive. This solves the problem that executing only hive SQL offers a single data processing mode that cannot meet users' data processing requirements. The target SQL can be executed using presto when the model is previewed or run directly, and execution switches automatically to hive when the model is executed on a schedule after being released. Automatic switching between hive SQL and presto SQL is thus achieved to meet users' data processing requirements.

In addition, a visual web interface lets the user quickly arrange and generate SQL and preview data in real time, so the user does not need to write complex SQL statements and the operation difficulty is reduced.

The foregoing is only an overview of the technical solutions of the present application. To make these technical solutions clearer and to enable them to be implemented according to the content of the description, preferred embodiments of the present application are described in detail below with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a data processing method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an interface for data import according to an embodiment of the present application;

FIG. 3 is a schematic interface diagram of a hive library provided by one embodiment of the present application;

FIG. 4 is a schematic diagram of a column select interface provided in one embodiment of the present application;

FIG. 5 is a schematic interface diagram of table associations provided by one embodiment of the present application;

FIG. 6 is an interface diagram of all data models provided by one embodiment of the present application;

FIG. 7 is an interface diagram of a data model provided by an embodiment of the present application;

FIG. 8 is a block diagram of a data processing apparatus provided in one embodiment of the present application;

FIG. 9 is a block diagram of a data processing apparatus according to an embodiment of the present application.

Detailed Description

The following detailed description of embodiments of the present application will be described in conjunction with the accompanying drawings and examples. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.

First, several terms referred to in the present application will be described.

Structured Query Language (SQL): a special-purpose programming language for database query and programming, used to access data and to query, update and manage relational database systems.

hive: a data warehouse analysis system built on Hadoop. It provides a rich SQL query approach for analyzing data stored in the Hadoop distributed file system. hive can map structured data files to database tables and provide a complete SQL query function. hive can also convert SQL statements into MapReduce tasks to run, so that the required content is queried and analyzed through its own SQL; this SQL is called Hive SQL for short.

Presto: a distributed SQL query engine designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.

Presto's operating model is fundamentally different from that of Hive or MapReduce. hive translates a query into multi-stage MapReduce tasks that run one after another; each task reads its input data from disk and writes intermediate results to disk. The Presto engine, by contrast, does not use MapReduce. It uses a custom query and execution engine with operators designed to support SQL semantics. In addition to an improved scheduling algorithm, all data processing is done in memory, and the different processing stages form a processing pipeline across the network. This avoids unnecessary disk reads and writes and the latency they add. The pipelined execution model runs multiple processing stages at the same time and passes data from one stage to the next as soon as it becomes available, which greatly reduces the end-to-end response time of many kinds of queries.
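The difference between the two execution models can be illustrated with a small single-process analogy. This is only a sketch: the two segments `scan` and `double` and the log format are hypothetical, and neither engine is actually invoked. The staged runner materializes the full output of the first segment before the second starts, while the pipelined runner hands each row to the next segment as soon as it is produced.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """First processing segment: emits rows one at a time."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def double(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Second processing segment: transforms each row it receives."""
    for row in rows:
        log.append(f"double {row}")
        yield row * 2


def run_staged(rows: List[int], log: List[str]) -> List[int]:
    """Staged model: segment 1 finishes and materializes its output first."""
    intermediate = list(scan(rows, log))  # stands in for an intermediate result on disk
    return list(double(intermediate, log))


def run_pipelined(rows: List[int], log: List[str]) -> List[int]:
    """Pipelined model: each row flows to segment 2 as soon as it is produced."""
    return list(double(scan(rows, log), log))
```

Both runners return the same result; only the order of the log entries differs, which is the point of the analogy.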

Metadata: data that describes data (data about data), also called intermediate data or relay data. It is mainly information describing data properties and supports functions such as indicating storage location, history data, resource searching, and file recording. Metadata can be regarded as an electronic catalog.

Optionally, the embodiments of the present application are described taking a computer device as the execution subject. The computer device may be a desktop computer, a notebook computer, a tablet computer, a mobile phone, or the like; this embodiment does not limit the device type of the computer device.

FIG. 1 is a flowchart of a data processing method according to an embodiment of the present application. The method comprises at least the following steps:

Step 101, displaying a visual configuration page.

The configuration page is a visual web interface that lets the user quickly arrange and generate SQL and preview data in real time. The user therefore does not need to write complex SQL statements, which reduces the operation difficulty.

Step 102, performing data modeling through a data import control in the configuration page to import the data to be processed into the hive library.

In one example, performing data modeling through the data import control in the configuration page includes: configuring a database connection through the data import control, and synchronizing the table structure into platform metadata by querying the database over a direct connection; and/or, configuring a mapping relation through the data import control, and importing data from a relational library to the hive library; and/or, configuring a mapping relation through the data import control, and importing data from an unstructured file to the hive library; and/or, importing the data into the library table resource through the data import control, and when the library table resource supports configuring a timing task, synchronizing data to the hive library on a schedule.
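As one possible illustration of the mapping relation between a relational source table and a hive table, the following sketch builds a hive `CREATE TABLE` statement from a list of source columns. The type mapping `TYPE_MAP`, the function name `build_hive_ddl`, and the table name used in the usage note are all hypothetical; the application does not specify how the mapping is stored or applied.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from relational column types to hive column types.
TYPE_MAP: Dict[str, str] = {
    "varchar": "STRING",
    "int": "INT",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "datetime": "TIMESTAMP",
}


def build_hive_ddl(hive_table: str, columns: List[Tuple[str, str]]) -> str:
    """Build a hive CREATE TABLE statement from (column name, source type) pairs.

    Source types missing from TYPE_MAP fall back to STRING (an assumption).
    """
    parts = []
    for name, source_type in columns:
        hive_type = TYPE_MAP.get(source_type.lower(), "STRING")
        parts.append(f"`{name}` {hive_type}")
    return f"CREATE TABLE IF NOT EXISTS {hive_table} ({', '.join(parts)})"
```

For example, `build_hive_ddl("ods.water_usage", [("user_code", "VARCHAR"), ("water_usage", "DOUBLE")])` yields a statement with one `STRING` column and one `DOUBLE` column.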

Optionally, referring to the configuration page shown in FIG. 2, the data import control in the configuration page includes a data source classification selection control 21, a data source type selection control 22, a data source selection control 23 and a table selection control 24; data import is carried out through these controls.

Referring to the hive library shown in FIG. 3, the hive library provides a data search function: a data name input area 31 and a search control 32 are displayed in the configuration page, and the user searches by entering the name of the data to be searched in the data name input area 31 and clicking the search control 32. In addition, the hive library displayed in the configuration page includes file resources 33 and library table resources 34. FIG. 3 takes the library table resources 34 as an example; they include information such as a sequence number, a source table name, the database to which the table belongs, a hive table name, an import state, an update state, the number of records, an import time, a creator, and a scheduling configuration. In practical implementations the library table resources 34 may include more or less information; this embodiment does not limit their content.

Step 103, configuring a data processing model according to data processing requirements through a data processing control in the configuration page to obtain the target SQL.

Optionally, configuring the data processing model according to data processing requirements through the data processing control in the configuration page to obtain the target SQL includes: when the default functions of the hive library do not support the data processing requirement, acquiring a custom function package, wherein the custom function package comprises a custom function supporting the data processing requirement; and registering the custom function package to the hive library, and triggering execution of the step of configuring a data processing model according to data processing requirements through the data processing control in the configuration page to obtain the target SQL, wherein the data processing control corresponds to the functions supported by the hive library.

Illustratively, a user can define a custom hive function (UDF) by writing Java code, thereby meeting complex data processing requirements. In this case, the user uploads the compiled UDF jar package to HDFS, creates a UDF component, and associates it with the UDF function, which is then automatically registered with the hive library.
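The application does not state how the registration is carried out. One way it might be done is with hive's `CREATE FUNCTION ... USING JAR` statement, pointing at the jar that was uploaded to HDFS. The sketch below only builds that statement; the helper name `build_register_udf_sql` and the function name, class name and jar path in the usage note are hypothetical.

```python
def build_register_udf_sql(function_name: str, class_name: str, jar_uri: str) -> str:
    """Build a hive statement that registers a permanent UDF from a jar on HDFS.

    The check on the URI scheme reflects the assumption that the jar was
    uploaded to HDFS before registration.
    """
    if not jar_uri.startswith("hdfs://"):
        raise ValueError("expected an HDFS URI for the uploaded jar")
    return f"CREATE FUNCTION {function_name} AS '{class_name}' USING JAR '{jar_uri}'"
```

For example, `build_register_udf_sql("mask_code", "com.example.udf.MaskCode", "hdfs:///udf/mask-code.jar")` returns a statement that could be submitted to hive once the jar is in place.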

In one example, configuring, by a data processing control in a configuration page, a data processing model according to a data processing requirement to obtain a target SQL, includes: displaying fields in the input stream in a configuration page; and receiving the selection operation of the user on the fields to obtain the target SQL comprising the selected fields.

Referring to FIG. 4, a schematic interface diagram for obtaining the target SQL through column selection, the configuration page displays a model step name input area 41 and the fields 42 in the input stream; after the user selects the "user code" field and the "water usage" field, the target SQL for the "column select" model step is generated.
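A minimal sketch of how a "column select" step might turn the selected fields into target SQL is given below. The function name, the table name `water_bills`, and the column identifiers `user_code` and `water_usage` are hypothetical stand-ins for the fields shown in the interface. Identifiers are left unquoted here because hive and presto use different identifier quoting, and the same target SQL is meant to run on either engine.

```python
from typing import List


def build_column_select_sql(input_table: str, selected_fields: List[str]) -> str:
    """Build the target SQL for a column-select step from the chosen fields."""
    if not selected_fields:
        raise ValueError("at least one field must be selected")
    columns = ", ".join(selected_fields)
    return f"SELECT {columns} FROM {input_table}"
```

For example, `build_column_select_sql("water_bills", ["user_code", "water_usage"])` returns `SELECT user_code, water_usage FROM water_bills`.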

In another example, configuring the data processing model according to data processing requirements through the data processing control in the configuration page to obtain the target SQL includes: receiving an association operation performed on two input tables in the configuration page to obtain the target SQL.

Referring to FIG. 5, a schematic interface diagram for obtaining the target SQL through table association, the configuration page displays a model step name input area 51, an association type selection control 52, and the two tables 53 selected by the user. After the association operation "left join" input by the user is received, the target SQL is obtained; it includes the two associated tables 53.
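The table association step can be sketched in the same way. The following example assumes that the association type chosen in control 52 maps to a SQL join keyword (the "left connection" in the interface corresponding to a left join) and that the user also supplies one key column per table; the function name, the `JOIN_KEYWORDS` table, and all table and column names are hypothetical.

```python
from typing import Dict

# Hypothetical mapping from the association type chosen in the interface
# to the SQL join keyword.
JOIN_KEYWORDS: Dict[str, str] = {
    "left": "LEFT JOIN",
    "right": "RIGHT JOIN",
    "inner": "INNER JOIN",
    "full": "FULL OUTER JOIN",
}


def build_join_sql(left_table: str, right_table: str, join_type: str,
                   left_key: str, right_key: str) -> str:
    """Build the target SQL that associates two input tables on one key each."""
    if join_type not in JOIN_KEYWORDS:
        raise ValueError(f"unsupported association type: {join_type}")
    keyword = JOIN_KEYWORDS[join_type]
    return (f"SELECT * FROM {left_table} a {keyword} {right_table} b "
            f"ON a.{left_key} = b.{right_key}")
```

For example, `build_join_sql("users", "water_bills", "left", "user_code", "user_code")` produces a left join of the two tables on `user_code`.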

FIG. 6 shows all of the data processing models displayed in the configuration page, and FIG. 7 shows the detail page of any one of the data processing models.

Step 104, when the execution requirement of the target SQL is online execution, executing the target SQL using presto.

For example, when the target SQL needs to be previewed or run directly, presto is used to execute the target SQL.

Step 105, when the execution requirement of the target SQL is timing execution, executing the target SQL using hive.

For example, after the model is released and executed on a schedule, hive is used to execute the target SQL.
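Steps 104 and 105 amount to choosing an engine from the execution requirement. The sketch below shows one way this dispatch might look. The enum `ExecutionRequirement`, the function names, and the `runners` dictionary are hypothetical; the runners are injected callables standing in for real presto and hive clients, which are not shown.

```python
from enum import Enum
from typing import Callable, Dict


class ExecutionRequirement(Enum):
    ONLINE = "online"   # preview or direct run
    TIMING = "timing"   # scheduled run after the model is released


def choose_engine(requirement: ExecutionRequirement) -> str:
    """Return the engine name for the given execution requirement."""
    if requirement is ExecutionRequirement.ONLINE:
        return "presto"
    return "hive"


def execute_target_sql(sql: str, requirement: ExecutionRequirement,
                       runners: Dict[str, Callable[[str], object]]) -> object:
    """Run the target SQL with the runner registered for the chosen engine."""
    engine = choose_engine(requirement)
    return runners[engine](sql)
```

Passing the runners in as callables keeps the switching logic independent of any particular client library, so the same target SQL can be previewed on one engine and scheduled on the other.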

In summary, the data processing method provided in this embodiment,

FIG. 8 is a block diagram of a data processing apparatus according to an embodiment of the present application. The apparatus comprises at least the following modules: a page display module 810, a data modeling module 820, a model building module 830, a first execution module 840, and a second execution module 850.

The page display module 810 is configured to display a visual configuration page.

The data modeling module 820 is configured to perform data modeling through the data import control in the configuration page to import the data to be processed into the hive library.

The model building module 830 is configured to configure a data processing model according to data processing requirements through the data processing control in the configuration page to obtain the target SQL.

The first execution module 840 is configured to execute the target SQL using presto when the execution requirement of the target SQL is online execution.

The second execution module 850 is configured to execute the target SQL using hive when the execution requirement of the target SQL is timing execution.

Optionally, the data modeling module 820 is configured to:

and/or,

Optionally, the model building module 830 is configured to:

For relevant details reference is made to the above-described method embodiments.

It should be noted that the division into the functional modules above is only an example of how the data processing apparatus of the above embodiment performs data processing. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the data processing apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.

FIG. 9 is a block diagram of a data processing apparatus according to an embodiment of the present application. The apparatus comprises at least a processor 901 and a memory 902.

The processor 901 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the data processing methods provided by the method embodiments herein.

In some embodiments, the data processing apparatus may further include: a peripheral interface and at least one peripheral. The processor 901, memory 902 and peripheral interfaces may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface via a bus, signal line, or circuit board. Illustratively, peripheral devices include, but are not limited to: radio frequency circuit, touch display screen, audio circuit, power supply, etc.

Of course, the data processing apparatus may also include fewer or more components, which is not limited in this embodiment.

Optionally, the present application further provides a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the data processing method of the above-mentioned method embodiment.

Optionally, the present application further provides a computer product, which includes a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the data processing method of the above-mentioned method embodiment.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of data processing, the method comprising:

displaying a visual configuration page;

2. The method of claim 1, wherein the data modeling via a data import control in the configuration page comprises:

and/or,

3. The method of claim 1, wherein configuring, by the data processing control in the configuration page, the data processing model according to the data processing requirement to obtain the target SQL comprises:

4. The method of claim 1, wherein configuring, by the data processing control in the configuration page, the data processing model according to the data processing requirement to obtain the target SQL comprises:

5. The method of claim 1, wherein configuring, by the data processing control in the configuration page, the data processing model according to the data processing requirement to obtain the target SQL comprises:

6. A data processing apparatus, characterized in that the apparatus comprises:

the page display module is used for displaying a visual configuration page;

7. The apparatus of claim 6, wherein the data modeling module is configured to:

and/or,

8. The apparatus of claim 6, wherein the model building module is configured to:

9. The apparatus of claim 6, wherein the model building module is configured to:

10. The apparatus of claim 6, wherein the model building module is configured to: