CN111767048B

CN111767048B - Data computing processing method, device and system

Info

Publication number: CN111767048B
Application number: CN202010621799.1A
Authority: CN
Inventors: 徐雅光; 陈世强
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2020-07-01
Filing date: 2020-07-01
Publication date: 2023-09-22
Anticipated expiration: 2040-07-01
Also published as: CN111767048A

Abstract

Embodiments of the application provide a data processing method, device and system. A first data set and first configuration information corresponding to the first data set are acquired. The first configuration information is used for determining the processing logic for the first data set and is determined according to a first script and a first parameter, where the first parameter can be obtained through configuration. The first data set is then processed according to the first configuration information to obtain a second data set. In other words, the processing logic for the first data set is configurable, and different processing logic can be set by configuring different first parameters. This enables reuse of the data processing process, reduces the number of times scripts must be written or modified, improves the flexibility of the data system, and better meets actual requirements.

Description

Data computing processing method, device and system

Technical Field

The present application relates to the field of computers, and in particular, to a data computing method, device and system.

Background

Data processing broadly refers to the processing that every data system must implement to meet business requirements. Functional data processing refers to the iterative computation of data sets through row and column operations. At present, the program for a functional data processing process is fixed and cannot be replaced dynamically, so a data system cannot reuse the process when performing functional data processing. As a result, the data system is insufficiently flexible and cannot adequately meet actual demands.

Disclosure of Invention

In order to solve the technical problems in the prior art, the application provides a data processing method, a device and a system, which can improve the flexibility of a data system.

The application provides a data processing method, which comprises the following steps:

acquiring a first data set and first configuration information corresponding to the first data set; the first configuration information is used for determining processing logic for the first data set; the first configuration information is determined according to a first script and a first parameter, and the first parameter is obtained through configuration;

and processing the first data set according to the first configuration information to obtain a second data set.

Optionally, the first configuration information includes at least one of the following information: column selection configuration information, row filtering configuration information, table association configuration information, grouping column configuration information, and aggregation configuration information.

Optionally, the first data set is obtained by processing a third data set according to second configuration information, the second configuration information is determined according to a second script and a second parameter, and the second parameter is obtained through configuration.

Optionally, the first script is obtained through a functional programming abstraction implementation.

Optionally, the method further comprises:

storing the second data set.

The embodiment of the application also provides a data processing device, which comprises:

the data acquisition unit is used for acquiring a first data set and first configuration information corresponding to the first data set; the first configuration information is used for determining processing logic for the first data set; the first configuration information is determined according to a first script and a first parameter, and the first parameter is obtained through configuration;

and the data processing unit is used for processing the first data set according to the first configuration information to obtain a second data set.

Optionally, the apparatus further includes:

a storage unit, used for storing the second data set.

The embodiment of the application also provides a data processing system, which comprises a plurality of data processing devices, wherein the data processing devices respectively process data to realize distributed data processing.

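The technical scheme provided by the application has at least the following advantages:

the processing logic for the first data set is configurable, and different processing logic can be set by configuring different first parameters, so that the data processing process can be reused, the number of times scripts must be written or modified is reduced, the flexibility of the data system is improved, and actual requirements are better met.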

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a data processing method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a data processing process according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a data processing device according to an embodiment of the present application.

Detailed Description

Functional data processing is performed iteratively through row and column operations. For example, rows are selected first and the data is then filtered; data sets may also be associated (joined) with one another and aggregated by group.

However, at present the program for a functional data processing process is fixed, is defined in the form of a script, and cannot be changed dynamically. A data system therefore cannot reuse the process when performing functional data processing. For example, when data requirements change during iterative data processing, or when new data service requirements are added, the process code in the script must be modified or extended. As a result, the data system is insufficiently flexible and cannot adequately meet actual requirements.

In order to solve the above technical problems, embodiments of the present application provide a data processing method, apparatus and system. A first data set and first configuration information corresponding to the first data set are acquired. The first configuration information is used for determining the processing logic for the first data set and is determined according to a first script and a first parameter, where the first parameter can be obtained through configuration. The first data set is then processed according to the first configuration information to obtain a second data set. In other words, the processing logic for the first data set is configurable, and different processing logic can be set by configuring different first parameters. This enables reuse of the data processing process, reduces the number of times scripts must be written or modified, improves the flexibility of the data system, and better meets actual requirements.

In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Embodiment one:

The embodiment of the application provides a data processing method, which is described below with reference to the accompanying drawings. Referring to FIG. 1, which is a flowchart of a data processing method according to an embodiment of the present application, the method may include:

S101, acquiring a first data set and first configuration information corresponding to the first data set.

In an embodiment of the present application, the first data set may be a data set to be processed. Specifically, the first data set may be read directly from a data source, for example from a text file, a Hive library or an HDFS file. The first data set may also be a data set that has already been processed; for example, it may be obtained by processing a third data set in a manner similar to the processing of the first data set.

The first configuration information corresponding to the first data set is used for determining the processing logic for the first data set. It can be determined according to the first script and the first parameter, and the first parameter can be obtained through configuration.

The data processing method provided by the embodiment of the application can be applied to a big data cluster environment based on Spark and Hadoop. Hadoop is a big data framework, that is, computer cluster software for distributed storage, distributed computation and the like. Spark is a distributed big data computing engine that supports multiple computing frameworks such as batch computing, stream computing, graph computing and machine learning; it integrates closely with components such as Hadoop and Hive, and together they provide functions such as data storage and computation. In distributed data processing, the data is split into different data blocks before processing, and a plurality of machines are then scheduled through a coordination mechanism to process the different data blocks simultaneously.

The first data set may serve as a data block in the distributed data processing; that is, the processing of the first data set is an intermediate step of the distributed data processing. The processing logic for the first data set is therefore related to the overall processing task of the distributed data, and the first configuration information may be determined according to that overall task. In this way, each first data set has corresponding first configuration information, so that a plurality of first data sets can be distributed across a plurality of computers for cooperative processing, which improves data processing efficiency. The individual pieces of first configuration information can also be used to sort out the processing logic or to perform analyses such as impact analysis and data lineage analysis.

Therefore, before the first data set is processed, basic data can be obtained. The basic data may include product information, batch task information, service function information and other information related to service requirements. While the basic data is managed, the processing library tables and field metadata information are also maintained. The first data set and its processing logic can be obtained based on the basic data; the processing logic is then decomposed to determine the structured first script and the configurable first parameter.

The first script can be implemented through functional programming abstraction. Functional programming is a programming paradigm, a form of structured programming, whose main idea is to write the computation as far as possible as a series of nested function calls. The first script can thus be combined with the first parameter to obtain a complete callable program, which yields the processing logic for the first data set. Specifically, the first configuration information can be obtained by dynamically splicing the first script and the first parameter.
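As a non-limiting illustration of combining a script with a configured parameter to obtain a callable program, the following minimal Python sketch binds parameters to a reusable function. The function `row_filter`, the parameter keys `column` and `keyword`, and the sample rows are hypothetical and are not taken from the application.

```python
from functools import partial

def row_filter(rows, column, keyword):
    """Hypothetical first script: keep the rows whose `column` equals `keyword`."""
    return [row for row in rows if row[column] == keyword]

# Hypothetical first parameter, e.g. entered through a configuration interface.
first_parameter = {"column": "C6", "keyword": "Siemens"}

# Splicing script and parameter yields a complete callable program.
processing_logic = partial(row_filter, **first_parameter)

# Hypothetical first data set.
first_data_set = [
    {"C1": 1, "C6": "Siemens"},
    {"C1": 2, "C6": "Other"},
]
second_data_set = processing_logic(first_data_set)
print(second_data_set)
```

Under these assumptions, configuring a different `first_parameter` changes the processing logic without any change to `row_filter` itself.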

The first parameter may be obtained through configuration. Specifically, a configuration interface may be provided for a user, and the first parameter is determined according to the user's input. The first parameter may be data that changes during data processing, such as filter keywords and the number of columns selected.

The first configuration information may include at least one of the following information: column selection configuration information, row filtering configuration information, table association configuration information, grouping column configuration information, aggregation configuration information, and the like. The first configuration information may be stored in a relational database, such as MySQL.
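As a non-limiting illustration of storing configuration information in a relational database, the following minimal Python sketch uses the standard-library `sqlite3` module as a stand-in for MySQL. The table name `configuration`, its columns, and the helper functions are hypothetical; the application does not specify a schema.

```python
import json
import sqlite3

# In-memory SQLite database used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE configuration ("
    " data_set TEXT, step INTEGER, kind TEXT, script TEXT, parameter TEXT)"
)

def save_configuration(data_set, step, kind, script, parameter):
    """Store one piece of configuration; the parameter is serialized as JSON."""
    conn.execute(
        "INSERT INTO configuration VALUES (?, ?, ?, ?, ?)",
        (data_set, step, kind, script, json.dumps(parameter)),
    )

def load_configuration(data_set):
    """Return the (kind, script, parameter) entries of a data set in step order."""
    cursor = conn.execute(
        "SELECT kind, script, parameter FROM configuration"
        " WHERE data_set = ? ORDER BY step",
        (data_set,),
    )
    return [(kind, script, json.loads(parameter)) for kind, script, parameter in cursor]

# Hypothetical row filtering configuration for a data set named "dataset_02".
save_configuration(
    "dataset_02", 1, "row_filter", "WHERE {column} = ?",
    {"column": "C6", "keyword": "Siemens"},
)
print(load_configuration("dataset_02"))
```

Keeping the script and the parameter in separate columns, as assumed here, lets the parameter be reconfigured without touching the script.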

S102, processing the first data set according to the first configuration information to obtain a second data set.

In the embodiment of the application, after the first data set and the first configuration information corresponding to the first data set are acquired, the first data set can be processed according to the first configuration information to obtain the second data set. The processing of the first data set may specifically be at least one of the following: column selection, row filtering, table association, column grouping, aggregation, and the like.
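As a non-limiting illustration of processing a data set according to configuration information, the following minimal Python sketch maps each configuration entry to a function and applies the resulting functions in sequence. The operation names, the configuration layout (`kind`, `parameter`), and the sample data are hypothetical; table association is omitted for brevity.

```python
from functools import reduce

def select_columns(columns):
    """Column selection: keep only the named columns."""
    return lambda rows: [{c: row[c] for c in columns} for row in rows]

def filter_rows(column, keyword):
    """Row filtering: keep the rows whose column equals the keyword."""
    return lambda rows: [row for row in rows if row[column] == keyword]

def group_sum(group_column, value_column):
    """Grouping and aggregation: sum value_column for each group_column value."""
    def run(rows):
        totals = {}
        for row in rows:
            key = row[group_column]
            totals[key] = totals.get(key, 0) + row[value_column]
        return [{group_column: k, "sum": v} for k, v in totals.items()]
    return run

# Hypothetical mapping from configuration kind to script.
OPERATIONS = {
    "column_select": select_columns,
    "row_filter": filter_rows,
    "group_sum": group_sum,
}

def build_processing_logic(configuration):
    """Turn a list of configuration entries into one callable."""
    steps = [OPERATIONS[item["kind"]](*item["parameter"]) for item in configuration]
    return lambda rows: reduce(lambda data, step: step(data), steps, rows)

first_configuration = [
    {"kind": "row_filter", "parameter": ["C6", "Siemens"]},
    {"kind": "group_sum", "parameter": ["C2", "C4"]},
]
first_data_set = [
    {"C1": 1, "C2": "east", "C4": 10, "C6": "Siemens"},
    {"C1": 2, "C2": "east", "C4": 5, "C6": "Siemens"},
    {"C1": 3, "C2": "west", "C4": 7, "C6": "Other"},
    {"C1": 4, "C2": "west", "C4": 3, "C6": "Siemens"},
]
second_data_set = build_processing_logic(first_configuration)(first_data_set)
print(second_data_set)
```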

Referring to FIG. 2, which is a schematic diagram of a data processing process according to an embodiment of the present application, the first data set may be data set 02, the second data set may be data set 03, and the third data set may be data set 01. Specifically, the first data set is filtered by rows to obtain the second data set. Here the data processing logic filters all rows whose C6 value is "Siemens"; a Where statement can be used as the first script, and C6 and "Siemens" can be used as the first parameters.

Similarly, the processing logic for processing the third data set to obtain the first data set may be represented by second configuration information, which may be obtained from a second script and a second parameter. The second script is similar to the first script, and the second parameter is similar to the first parameter. Referring to FIG. 2, columns are selected from the third data set to obtain the first data set. Here the data processing logic selects (Select) C1, C2, C4 and C6; a Select statement can be used as the second script, and C1, C2, C4 and C6 can be used as the second parameters.
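As a non-limiting illustration of the two steps described for FIG. 2, the following minimal Python sketch runs a Select statement and a Where statement against an in-memory SQLite database. The table names, the sample rows, and the statement templates are hypothetical, and the sketch assumes that row filtering keeps the rows whose C6 value is "Siemens".

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical third data set (data set 01) with columns C1..C6.
conn.execute("CREATE TABLE dataset_01 (C1, C2, C3, C4, C5, C6)")
conn.executemany(
    "INSERT INTO dataset_01 VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "a", "x", 10, "p", "Siemens"),
        (2, "b", "y", 20, "q", "Other"),
        (3, "c", "z", 30, "r", "Siemens"),
    ],
)

# Second script (Select statement) spliced with the second parameter.
SECOND_SCRIPT = "CREATE TABLE dataset_02 AS SELECT {columns} FROM dataset_01"
second_parameter = ["C1", "C2", "C4", "C6"]
conn.execute(SECOND_SCRIPT.format(columns=", ".join(second_parameter)))

# First script (Where statement) spliced with the first parameter.
# The column name is spliced into the text; the keyword is bound as a value.
FIRST_SCRIPT = "SELECT * FROM dataset_02 WHERE {column} = ?"
column, keyword = ("C6", "Siemens")
dataset_03 = conn.execute(FIRST_SCRIPT.format(column=column), (keyword,)).fetchall()
print(dataset_03)
```

In this sketch, changing `second_parameter` or the `(column, keyword)` pair changes the result without editing either statement template.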

When the second data set is the data set obtained by final processing, it may also be stored, for example in a text file, a relational database, a distributed file system, a NoSQL database, a Hive library or an HDFS file. For example, the second data set may be logged, and the logs may include a file log reflecting the second data set and a structured log reflecting the first script and the first parameter.

The embodiment of the application provides a data processing method in which a first data set and first configuration information corresponding to the first data set are acquired. The first configuration information is used for determining the processing logic for the first data set and is determined according to a first script and a first parameter, where the first parameter can be obtained through configuration. The first data set is then processed according to the first configuration information to obtain a second data set. In other words, the processing logic for the first data set is configurable, and different processing logic can be set by configuring different first parameters. This enables reuse of the data processing process, reduces the number of times scripts must be written or modified, lessens the impact on data processing work of problems such as version deployment and slow updates that arise from relying solely on hard coding, improves the flexibility of the data system, and better meets actual requirements.

Embodiment two:

Based on the data processing method provided by the above embodiment, the second embodiment of the present application further provides a data processing device, which is described below with reference to the accompanying drawings. Referring to FIG. 3, a schematic diagram of a data processing apparatus according to an embodiment of the present application is shown.

The device of the embodiment of the application comprises:

a data acquisition unit 110, configured to acquire a first data set and first configuration information corresponding to the first data set; the first configuration information is used for determining processing logic for the first data set; the first configuration information is determined according to a first script and a first parameter, and the first parameter is obtained through configuration;

and a data processing unit 120, configured to process the first data set according to the first configuration information, so as to obtain a second data set.

Optionally, the first script is obtained through a functional programming abstraction implementation.

Optionally, the apparatus further includes:

a storage unit, configured to store the second data set.

The embodiment of the application also provides a data processing system which comprises a plurality of data processing devices, wherein the data processing devices can respectively process data so as to realize distributed data processing. The data processing devices may perform their data processing processes in parallel or in series. When the data processing devices operate in series, the output data set of a previous data processing device may be used as the input data set of the next data processing device. For example, a first data processing device may process a third data set to obtain a first data set, and a second data processing device may process the first data set to obtain a second data set.
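As a non-limiting illustration of devices operating in series and in parallel, the following minimal Python sketch models each device as an object holding configured processing logic. The class `DataProcessingDevice`, the helper functions, and the sample data are hypothetical, and a local thread pool stands in for a cluster of machines.

```python
from concurrent.futures import ThreadPoolExecutor

class DataProcessingDevice:
    """Hypothetical device: wraps configured processing logic as a callable."""

    def __init__(self, logic):
        self.logic = logic

    def process(self, data_set):
        return self.logic(data_set)

def run_in_series(devices, data_set):
    """Feed the output data set of each device into the next device."""
    for device in devices:
        data_set = device.process(data_set)
    return data_set

def run_in_parallel(device, blocks):
    """Process several data blocks at the same time; results keep block order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(device.process, blocks))

select_device = DataProcessingDevice(
    lambda rows: [{c: row[c] for c in ("C1", "C6")} for row in rows]
)
filter_device = DataProcessingDevice(
    lambda rows: [row for row in rows if row["C6"] == "Siemens"]
)

# Hypothetical third data set.
third_data_set = [
    {"C1": 1, "C3": "x", "C6": "Siemens"},
    {"C1": 2, "C3": "y", "C6": "Other"},
]

# Series: third data set -> first data set -> second data set.
second_data_set = run_in_series([select_device, filter_device], third_data_set)

# Parallel: the same device logic applied to two data blocks.
blocks = [third_data_set[:1], third_data_set[1:]]
filtered_blocks = run_in_parallel(filter_device, blocks)
print(second_data_set, filtered_blocks)
```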

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or plural.

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference is made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative, and the units and modules illustrated as separate components may or may not be physically separate. In addition, some or all of the units and modules can be selected according to actual needs to achieve the purpose of the embodiment scheme. Those of ordinary skill in the art can understand and implement the present application without undue burden.

The foregoing is merely illustrative of the embodiments of this application and it will be appreciated by those skilled in the art that variations and modifications may be made without departing from the principles of the application, and it is intended to cover all modifications and variations as fall within the scope of the application.

Claims

1. A method of processing data, the method comprising:

processing the first data set according to the first configuration information to obtain a second data set;

the first configuration information includes at least one of the following information: column selection configuration information, row filtering configuration information, table association configuration information, grouping column configuration information and aggregation configuration information;

the first script is obtained through a functional programming abstraction implementation.

2. The method of claim 1, wherein the first data set is processed from a third data set based on second configuration information, the second configuration information being determined based on a second script and a second parameter, the second parameter being obtained by configuration.

3. The method according to any one of claims 1-2, further comprising:

the second data set is stored.

4. A data processing apparatus, comprising:

the data processing unit is used for processing the first data set according to the first configuration information to obtain a second data set;

5. The apparatus of claim 4, wherein the apparatus further comprises:

and the storage unit is used for storing the second data set.

6. A data processing system, comprising: a plurality of data processing apparatuses according to any one of claims 4 to 5, wherein the plurality of data processing apparatuses perform data processing processes, respectively, to achieve distributed data processing.