CN112052253A - Data processing method, electronic device and storage medium - Google Patents

Data processing method, electronic device and storage medium

Info

Publication number
CN112052253A
Authority
CN
China
Prior art keywords
data
data processing
metadata
time granularity
framework
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010808517.9A
Other languages
Chinese (zh)
Other versions
CN112052253B (en)
Inventor
何通庆
陈斌
连庆仁
吴琳炜
林鸿其
上官致钊
庄贤荣
Current Assignee
Wangsu Science and Technology Co Ltd
Original Assignee
Wangsu Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Wangsu Science and Technology Co Ltd
Priority to CN202010808517.9A
Publication of CN112052253A
Application granted
Publication of CN112052253B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method, an electronic device and a storage medium. In the invention, a task decomposition class provided by a predefined data processing framework divides a time interval into subintervals according to the time granularity extracted from the data processing instruction, and generates a subtask for each subinterval according to the extracted job type. Each subtask calls a data processing interface provided by the data processing framework to acquire the source data and configuration data to be processed, processes the source data and configuration data through that interface, and finally stores the processed data into a parquet file whose contents can be read on demand. Developers therefore do not need a deep understanding of Spark principles and the underlying technology. Meanwhile, because the parquet file associates the source data with the configuration data, subsequent development does not need to associate the configuration data again, or needs to associate less of it, which greatly simplifies subsequent service processing and effectively improves development efficiency.

Description

Data processing method, electronic device and storage medium
Technical Field
The embodiment of the invention relates to the technical field of big data programming, in particular to a data processing method, electronic equipment and a storage medium.
Background
Apache Spark is a fast, general-purpose engine designed for large-scale distributed in-memory data computation. It is an open-source, general-purpose parallel framework similar to Hadoop MapReduce, developed by the AMP Lab at the University of California, Berkeley. Because the intermediate output of a MapReduce job can be kept in memory, reading and writing HDFS (Hadoop Distributed File System) is no longer necessary, so Spark is better suited to MapReduce algorithms that require iteration, such as data mining and machine learning.
However, because specific services differ, a large number of configuration operations are often required in actual development, making the implementation process complicated and difficult. In addition, because Spark development is complex, developers carrying out big-data Spark development need a deep understanding of Spark principles and underlying technologies, such as broadcast variables (Broadcast) and RDD (Resilient Distributed Dataset) operators, which requires substantial labor cost to train dedicated Spark developers.
Disclosure of Invention
An object of embodiments of the present invention is to provide a data processing method, an electronic device and a storage medium, which aim to reduce labor cost, reduce the amount of code, and improve development efficiency.
In order to solve the above technical problem, an embodiment of the present invention provides a data processing method, including the following steps:
acquiring a data processing instruction, and extracting a job type, a time interval and a time granularity from the data processing instruction;
dividing the time interval into a plurality of subintervals according to the time granularity based on a task decomposition class provided by a predefined data processing framework, and generating a subtask corresponding to each subinterval according to the job type;
calling a data processing interface provided by the data processing framework through the subtask to acquire data to be processed, wherein the data to be processed comprises source data and configuration data;
and processing the source data and the configuration data based on a data processing interface provided by the data processing framework, and saving the processed data as a parquet file in a columnar storage format.
An embodiment of the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a data processing method as described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements a data processing method as described above.
Compared with the prior art, embodiments of the present invention extract three pieces of information, the job type, the time interval and the time granularity, from the acquired data processing instruction; then, based on a task decomposition class provided by a predefined data processing framework, divide the extracted time interval into a plurality of subintervals according to the extracted time granularity and generate a subtask corresponding to each subinterval according to the extracted job type; further acquire the source data and configuration data to be processed through a data processing interface provided by the predefined data processing framework and process them based on that interface; and finally save the processed data as a parquet file whose content can be read on demand. Developers can thus implement subsequent service processing without deeply understanding Spark principles and the underlying technology, which effectively reduces labor cost. Meanwhile, because the parquet file associates the source data with the configuration data, subsequent development does not need to associate the configuration data again, or needs to associate less of it, which greatly simplifies subsequent service processing and effectively improves development efficiency.
In addition, before the processing the source data and the configuration data based on the data processing interface provided by the data processing framework, the method further includes:
packaging the source data to obtain a Resilient Distributed Dataset (RDD) object that can be queried by SQL statements;
and packaging the configuration data to obtain a simple entity Bean object that can be queried by SQL statements.
In addition, the packaging of the source data to obtain a Resilient Distributed Dataset (RDD) object for SQL statement query includes:
acquiring predefined metadata of the source data according to a preset metadata name;
obtaining the metadata to be packaged according to the metadata and preset filter conditions;
reading the source data specified in the metadata as a string-typed RDD object;
converting the string-typed RDD object into a structured RDD object, using the metadata to be packaged as the filter condition;
converting the metadata to be packaged and the structured RDD object into a Dataset<Row> object;
and packaging the metadata to be packaged, the structured RDD object and the Dataset<Row> object in the same data object to obtain an RDD object for SQL statement query.
In addition, the packaging of the configuration data to obtain a simple entity Bean object for SQL statement query includes:
acquiring predefined metadata of the configuration data according to a preset metadata name;
obtaining the metadata to be packaged according to the metadata and preset filter conditions;
converting the configuration data specified in the metadata into a structured array, using the metadata to be packaged as the filter condition;
and packaging the metadata to be packaged and the structured array in the same data object to obtain a Bean object for SQL statement query.
In addition, the processing of the source data and the configuration data based on the data processing interface provided by the data processing framework, and the saving of the processed data as a parquet file in a columnar storage format, include:
associating the RDD object with the Bean object based on the data processing interface provided by the data processing framework to obtain an associated object in RDD format;
and saving the associated object in RDD format as a parquet file in a columnar storage format.
In addition, before the time interval is divided into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework, and a subtask corresponding to each subinterval is generated according to the job type, the method further includes:
detecting whether the time granularity conforms to a preset time granularity value rule, wherein the value rule specifies that the time granularity is an integer multiple of the generation granularity of the data to be processed;
if so, executing the step of dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework, and generating a subtask corresponding to each subinterval according to the job type;
otherwise, rounding the time granularity;
in which case the dividing of the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework, and the generating of a subtask corresponding to each subinterval according to the job type, include:
dividing the time interval into a plurality of subintervals according to the rounded time granularity based on the task decomposition class provided by the predefined data processing framework, and generating the subtask corresponding to each subinterval according to the job type.
In addition, before the time interval is divided into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework, and a subtask corresponding to each subinterval is generated according to the job type, the method further includes:
detecting whether the end time corresponding to the time interval is greater than the start time corresponding to the time interval;
if so, executing the step of dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework, and generating a subtask corresponding to each subinterval according to the job type;
otherwise, issuing an exception prompt.
In addition, before the data processing interface provided based on the predefined data processing framework is called through the subtask to acquire the data to be processed, the method further includes:
abstracting a task decomposition class and a data processing class based on a Spark framework;
constructing a data processing interface for the data processing class;
and packaging the environment initialization method of the Spark framework, the read-write method of the parquet file provided by Spark SQL, the task decomposition class and the data processing interface to obtain the data processing framework.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements; unless otherwise specified, the figures are not drawn to scale.
Fig. 1 is a detailed flowchart of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a detailed flowchart of a data processing method according to a second embodiment of the present invention;
FIG. 3 is a detailed flowchart of a data processing method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data processing apparatus according to a fifth embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate, however, that numerous technical details are set forth in the embodiments merely to help the reader better understand the present application; the claimed technical solution can still be implemented without these technical details, or with various changes and modifications based on the following embodiments. The embodiments are divided merely for convenience of description and should not be construed as limiting the specific implementation of the present invention; where there is no contradiction, the embodiments may be combined with and refer to one another.
The present embodiment relates to a data processing method applied to an electronic device, such as a personal computer, a tablet computer or a smartphone; the examples are not exhaustive, and the present embodiment is not limited in this respect.
Implementation details of the data processing method of the present embodiment are described below; they are provided only for ease of understanding and are not required for implementing the present solution.
The specific flow of the present embodiment is shown in fig. 1, and specifically includes the following steps:
step 101, acquiring a data processing instruction, and extracting a job type, a time interval and a time granularity from the data processing instruction.
Specifically, in practical applications, the data processing instruction may be triggered by a user, such as a developer, or triggered automatically by a timer when a certain system time is reached; in specific implementations, those skilled in the art may configure this as needed, and the present embodiment is not limited thereto.
Further, the job type extracted from the data processing instruction is the job name of the job to be processed, entered by the user or acquired from a preset area.
The time interval is determined based on the start time startTime and the end time endTime in the data processing instruction.
The time granularity is set according to service requirements, for example 1 hour, 1 day or 1 month.
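The patent does not specify the wire format of the data processing instruction; as a minimal sketch, assuming a dict-shaped instruction with hypothetical field names, extracting the three pieces of information might look like:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProcessingInstruction:
    """Hypothetical carrier for the three fields the method extracts."""
    job_type: str          # job name of the job to be processed
    start_time: date       # startTime: start of the time interval
    end_time: date         # endTime: end of the time interval
    granularity_days: int  # time granularity, here expressed in days

def extract_fields(raw: dict) -> ProcessingInstruction:
    # Extract job type, time interval and time granularity from a
    # dict-shaped instruction (the wire format is an assumption).
    return ProcessingInstruction(
        job_type=raw["type"],
        start_time=date.fromisoformat(raw["startTime"]),
        end_time=date.fromisoformat(raw["endTime"]),
        granularity_days=int(raw["granularity"]),
    )

instr = extract_fields({
    "type": "A",
    "startTime": "2020-07-01",
    "endTime": "2020-07-31",
    "granularity": "1",
})
print(instr.job_type, instr.start_time, instr.end_time, instr.granularity_days)
```

The field names `type`, `startTime`, `endTime` and `granularity` mirror the identifiers used later in the description but are assumptions about the instruction layout.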
And 102, dividing the time interval into a plurality of subintervals according to the time granularity based on a task decomposition class provided by a predefined data processing framework, and generating a subtask corresponding to each subinterval according to the job type.
It should be appreciated that in order to ensure that step 102 is performed successfully, the data processing framework needs to be packaged before step 102 is performed.
The data processing framework is specifically packaged as follows:
First, a task decomposition class and a data processing class are abstracted based on the Spark framework; for convenience of description, the task decomposition class is defined as the ParquetationJob class and the data processing class as the Parquetation class.
Next, a data processing interface is constructed for the data processing class.
And finally, encapsulating the environment initialization method of the Spark framework, the read-write method of the parquet file provided by Spark SQL, the task decomposition class and the data processing interface to obtain the data processing framework.
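Purely as an illustration (the framework described here is built on Spark, and every name below is a hypothetical stand-in), the relationship between the task decomposition class and the data processing class can be sketched in Python:

```python
from abc import ABC, abstractmethod

class Parquetation(ABC):
    """Data processing class: the data processing interface each concrete
    service class must implement (names are illustrative)."""
    @abstractmethod
    def get_smart_rdd(self, batch_start, batch_end):
        """Fetch and filter the source data for one subinterval."""

class ParquetationJob:
    """Task decomposition class: drives one subtask per subinterval
    through the data processing interface."""
    def __init__(self, processor: Parquetation):
        self.processor = processor

    def run(self, subintervals):
        results = []
        for batch_start, batch_end in subintervals:
            # One subtask per subinterval, executed via the interface.
            results.append(self.processor.get_smart_rdd(batch_start, batch_end))
        return results

class EchoParquetation(Parquetation):
    # Trivial stand-in service used only to exercise the skeleton.
    def get_smart_rdd(self, batch_start, batch_end):
        return (batch_start, batch_end)

job = ParquetationJob(EchoParquetation())
print(job.run([("d1", "d2"), ("d2", "d3")]))
```

The separation mirrors the packaging steps above: the abstract class plays the role of the data processing interface, and the job class owns the task decomposition.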
To facilitate understanding of the ParquetationJob class, it is described below in connection with part of its pseudo code:
Parquetation[] partitions = createParquetation(type); // create a Parquetation object by reflection; type is the job name entered by the user, corresponding to the specific service to be processed
(The remainder of the pseudo code appears only as images in the original document.)
In the first line of the pseudo code, the Parquetation class is created by reflection; type is the job name entered by the user and corresponds to the specific service to be processed; granularity is the time granularity; batchStart and batchEnd are the start and end times of a subinterval.
For ease of understanding, the operations in step 102 above are described below with reference to examples:
it is assumed that the start time startTime of the time interval extracted from the data processing instruction is 2020-07-01, the end time endTime is 2020-07-31, and the time granularity is 1 day.
The time interval 2020-07-01 to 2020-07-31 can be divided into 31 subintervals, i.e. one subinterval per day, based on the ParquetationJob class provided by the packaged data processing framework described above.
For the first subinterval, batchStart = 2020-07-01 and batchEnd = 2020-07-01 + 1 day = 2020-07-02.
Accordingly, for each following subinterval, batchStart is reassigned to the previous subinterval's batchEnd; that is, each subinterval starts where the previous one ended.
Accordingly, the batchEnd of each following subinterval is its (reassigned) batchStart + granularity.
It should be understood that the above is only an example, and the technical solution of the present embodiment is not specifically limited.
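The batchStart/batchEnd bookkeeping above can be sketched as follows (a plain-Python illustration; in the framework this division is performed by the task decomposition class):

```python
from datetime import date, timedelta

def split_interval(start: date, end: date, granularity_days: int):
    """Divide the time interval into subintervals of the given
    granularity: each subinterval's batchStart is the previous
    subinterval's batchEnd."""
    subintervals = []
    batch_start = start
    while batch_start <= end:
        batch_end = batch_start + timedelta(days=granularity_days)
        subintervals.append((batch_start, batch_end))
        batch_start = batch_end  # reassign batchStart for the next subinterval
    return subintervals

subs = split_interval(date(2020, 7, 1), date(2020, 7, 31), 1)
print(len(subs))  # 31 subintervals, one per day
```

With the interval and granularity from the example, the first subinterval is (2020-07-01, 2020-07-02) and 31 subintervals are produced in total.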
In addition, it is worth mentioning that in practical applications the data to be processed is generated with a certain period; that is, its generation granularity may be hours, days, months, and so on. In other words, data within the same generation granularity may be strongly correlated. Therefore, to avoid splitting such data across the subintervals divided by the extracted time granularity, it may be detected, before executing step 102, whether the time granularity conforms to the preset time granularity value rule.
Accordingly, if the detection determines that the extracted time granularity conforms to the preset time granularity value rule, the operation of step 102 is executed; otherwise, the extracted time granularity is processed first.
Specifically, in this embodiment, the time granularity value rule specifies that the time granularity must be an integer multiple of the generation granularity of the data to be processed.
For example, if the generation granularity of the data to be processed is 1 day, the time granularity extracted from the data processing instruction may be 1 day, 2 days or 3 days. If the extracted time granularity is 1.5 days, it does not conform to the time granularity value rule and therefore needs to be processed.
In this embodiment, the processing applied to a time granularity that does not conform to the value rule is specifically a rounding operation.
The rounding operation may round up or down: for example, rounding 1.5 days up yields a time granularity of 2 days, while rounding 1.5 days down yields 1 day. In a specific implementation, those skilled in the art may preset the rounding rule; this embodiment is not limited in this respect.
In addition, in practical applications, when the extracted time granularity does not conform to the value rule, a prompt may instead be shown on the user interface asking the user to re-enter a time granularity that meets the requirement.
Accordingly, after rounding a non-conforming time granularity, the operation executed in step 102 is specifically: dividing the time interval into a plurality of subintervals according to the rounded time granularity, based on the task decomposition class provided by the predefined data processing framework, and generating the subtask corresponding to each subinterval according to the job type.
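The rounding of a non-conforming time granularity can be sketched as follows (an illustrative helper; whether to round up or down is a rule preset by the developer):

```python
import math

def round_granularity(granularity: float, generation_granularity: float,
                      mode: str = "up") -> float:
    """Force the requested time granularity to an integer multiple of
    the generation granularity of the data to be processed."""
    multiple = granularity / generation_granularity
    if multiple == int(multiple):
        return granularity  # already conforms to the value rule
    rounded = math.ceil(multiple) if mode == "up" else max(1, math.floor(multiple))
    return rounded * generation_granularity

# 1.5 days against a 1-day generation granularity:
print(round_granularity(1.5, 1, "up"))    # rounds up to 2 days
print(round_granularity(1.5, 1, "down"))  # rounds down to 1 day
```

The `max(1, ...)` guard keeps a downward rounding from producing a zero granularity; this guard is an added assumption, not stated in the source.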
Further, to avoid as far as possible splitting data within the same generation granularity across the divided subintervals, the user may be told, before triggering data processing, which time granularity value rule the entered time granularity must satisfy.
In addition, to help ensure that step 102 proceeds smoothly, after checking the time granularity it may further be detected whether the end time corresponding to the time interval is greater than the start time corresponding to the time interval.
Accordingly, if so, the operation of step 102 is executed; otherwise, an exception prompt is issued.
Further, in practical applications, selection criteria for the start time and end time of the time interval may also be defined, for example: the end time must be greater than the start time, and start time + time granularity must not exceed the end of the time interval.
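The interval checks just described might be sketched as follows (illustrative only; the exact criteria are the developer's choice):

```python
from datetime import date

def validate_interval(start: date, end: date, granularity_days: int) -> None:
    """Sketch of the selection criteria: the end time must lie after the
    start time, and one granularity step must fit inside the interval."""
    if end <= start:
        raise ValueError("exception: end time must be greater than start time")
    if (end - start).days < granularity_days:
        raise ValueError("exception: granularity exceeds the time interval")

validate_interval(date(2020, 7, 1), date(2020, 7, 31), 1)  # passes silently
```

On failure, the `ValueError` plays the role of the exception prompt described above.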
Step 103, calling a data processing interface provided by the data processing framework through the subtask to acquire data to be processed, wherein the data to be processed comprises source data and configuration data.
Specifically, the main purpose of this embodiment is to associate the source data and the configuration data into one wide table, that is, to store source data and configuration data that were originally kept separately in a single table. Therefore, the data to be processed acquired through the data processing interface provided by the data processing framework must include both the source data and the configuration data related to it.
The source data, i.e. raw data, is data generated in real time, such as information on websites a user has visited or the user's order information.
The configuration data, i.e. the data described in a Conf or Config file, is time-insensitive attribute information that is updated infrequently, such as a user's sex, age and telephone number.
In addition, it is worth mentioning that, to let developers carry out big-data Spark development using only SQL statements, without deeply understanding Spark principles and the underlying technology, and thereby reduce labor cost, the source data and the configuration data may be packaged before step 103 is executed.
Specifically, the source data is packaged into a Resilient Distributed Dataset (RDD) object that can be queried by SQL statements, and the configuration data into a simple entity Bean object that can be queried by SQL statements.
Correspondingly, through the operation of step 103, the obtained source data is specifically an RDD object, and the configuration data is specifically a Bean object.
And 104, processing the source data and the configuration data based on the data processing interface provided by the data processing framework, and saving the processed data as a parquet file in a columnar storage format.
Specifically, when the source data is an RDD object and the configuration data a Bean object, the data processing operation performed in step 104 is specifically to associate the RDD object with the Bean object based on the data processing interface provided by the data processing framework to obtain an associated object in RDD format, and then store the associated object in RDD format as a parquet file in a columnar storage format.
Further, in practical applications, so that the resulting parquet file stores exactly the content the developer requires, before the RDD object and the Bean object are associated they may be filtered according to filter information entered by the developer or preset, after which the filtered content of the RDD object is associated with that of the Bean object.
As can be seen from the above, the acquired data to be processed are the RDD object and the Bean object, and the resulting associated object is in RDD format; therefore the Parquetation class in the data processing framework, and the data processing interface constructed for that class, must be able to process RDD objects and Bean objects.
Note that the RDD object in the present embodiment is different from the conventional RDD in the packaging principle, and is hereinafter referred to as SmartRDD for distinction.
Accordingly, since the Bean object in the present embodiment is different from a conventional Bean in the packaging principle, the Bean object in the present embodiment is hereinafter referred to as SmartBean for distinction.
To facilitate understanding of the Parquetation class and the data processing interface constructed for it, part of its pseudo code is described below:
(The pseudo code appears only as an image in the original document.)
It should be understood that the above shows part of the pseudo code for obtaining a SmartRDD. Because the pseudo code defines not only the interface for obtaining the SmartRDD to be associated, but also the storage location and storage granularity of the SmartRDD, and the time format and time boundaries to be observed when obtaining it, the subsequent service processing module does not need to know where or at what granularity the data is stored, nor handle time-format or time-boundary issues, which greatly simplifies implementation and further improves development efficiency.
Accordingly, part of the pseudo code for obtaining a SmartBean is roughly as follows:
(The pseudo code appears only as an image in the original document.)
In the pseudo code, savePath = getSavePath(type, granularity, batchStart) computes the location where the final parquet file is to be saved.
In addition, getSmartRDD(batchStart, batchEnd) is a function implemented by the developer. Taking service A as an example, and assuming the concrete service class implemented by the developer is AParquetation, the function corresponds to AParquetation.getSmartRDD.
Specifically, the AParquetation class inherits from the Parquetation class and mainly implements operations such as filtering and associating the data to be processed.
(The AParquetation pseudo code appears only as images in the original document.)
From the above description it can be seen that, in a specific implementation, the number of subintervals the time interval is divided into according to the time granularity determines at least how many parquet files are finally obtained.
Still taking a time granularity of 1 day and the time interval 2020-07-01 to 2020-07-31 as an example, at least 31 parquet files are finally obtained.
Furthermore, it should be noted that in practical applications the update frequency of the configuration data is relatively low, so the same configuration data can be shared by the source data within the same time interval. That is, 31 SmartRDDs need to be obtained while only 1 SmartBean is needed; when the 31 parquet files are produced through association, each SmartRDD is associated with the SmartBean to obtain a new SmartRDD, and the resulting 31 new SmartRDDs are then converted into the corresponding 31 parquet files for storage.
In order to facilitate understanding of the data processing method provided in the present embodiment, the following description is made with reference to an example:
Assume that, based on the predefined data processing framework and a data processing instruction, the SmartRDD obtained is as shown in Table 1 and the SmartBean obtained is as shown in Table 2. After the content of Table 1 is associated with that of Table 2 through the data processing interface provided by the data processing framework, the resulting associated object is as shown in Table 3.
TABLE 1 SmartRDD obtained
Name         Consumption amount
Zhang San    300
Li Si        200
TABLE 2 SmartBean obtained
Name         Gender    Native place    Department
Zhang San    Male      Fujian          CIM
Li Si        Female    Shanghai        CIM
TABLE 3 SmartRDD after association
Name         Consumption amount    Gender    Native place    Department
Zhang San    300                   Male      Fujian          CIM
Li Si        200                   Female    Shanghai        CIM
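The association illustrated in Tables 1 to 3 can be sketched as a simple join on the name column (a plain-Python illustration under assumed field names; the patent performs this through the framework's data processing interface on Spark objects):

```python
# Source data (SmartRDD contents, Table 1) and config data (SmartBean contents, Table 2).
source = [{"name": "Zhang San", "amount": 300},
          {"name": "Li Si", "amount": 200}]
config = {"Zhang San": {"gender": "male", "native_place": "Fujian", "department": "CIM"},
          "Li Si": {"gender": "female", "native_place": "Shanghai", "department": "CIM"}}

def associate(source_rows, config_by_name):
    """Left-join each source row with its config record, producing Table 3."""
    return [{**row, **config_by_name.get(row["name"], {})} for row in source_rows]

joined = associate(source, config)
```

Because the join is performed once and stored, later statistics can read the merged table directly instead of re-associating the two tables.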
Because the data processing method in this embodiment associates the source data with the configuration data, only one table needs to be processed when performing data read/write operations and when performing the shuffle operation (which redistributes data elements across partitions), which greatly reduces the consumption of device resources compared with the existing approach of processing the tables separately.
To compare more intuitively the resource consumption of the existing scheme (referred to as the old version) and the present scheme (referred to as the new version) when accessing the same data, the following comparison is made from the IO and shuffle perspectives:
TABLE 4 New version Job resource consumption
Submitted    Duration    Input    Shuffle Read    Shuffle Write
2020-03-11 14:04:44 19s 2.9GB 123.4MB
2020-03-11 14:04:37 3s 65KB
2020-03-11 14:04:37 0.3s 756.3KB 18.5KB
2020-03-11 14:04:35 0.6s 3.7KB 756.2KB
2020-03-11 14:04:36 0.3s 2.5KB 83B
2020-03-11 14:04:36 0.2s 74.9KB 2.5KB
2020-03-11 14:04:35 0.5s 3.7KB 74.9KB
2020-03-11 14:04:36 0.2s 441.6KB 42KB
2020-03-11 14:04:35 1s 3.7KB 441.6KB
2020-03-11 14:04:18 16s 1944.9MB 3.7MB
2020-03-11 14:04:07 4s
TABLE 5 old version Job resource consumption
(Table 5 appears only as an image in the original publication and is not reproduced here.)
It can be seen that the input difference between the new and old versions is huge: the input drops from 41 GB to 2.9 GB, and the shuffle volume is likewise reduced from tens of GB to a few MB.
In addition, by associating the source data and the configuration data and storing them together as a parquet file, although the number of data fields in the merged table increases by roughly half, only one table is occupied and it is finally converted into a parquet file; by comparison, data storage is reduced by about 75%.
Therefore, in the data processing method provided in this embodiment, three pieces of information, namely the job type, the time interval and the time granularity, are extracted from the acquired data processing instruction. Then, based on the task decomposition class provided by a predefined data processing framework, the extracted time interval is divided into a plurality of sub-intervals according to the extracted time granularity, and a subtask corresponding to each sub-interval is generated according to the extracted job type. The source data and configuration data to be processed are acquired through the data processing interface provided by the framework and processed based on that interface, and finally the processed data are stored as a parquet file whose columns can be read on demand, so that subsequent statistics only need to read the relevant columns and can skip the irrelevant ones. Meanwhile, because the parquet file associates the source data and the configuration data together, subsequent development does not need to associate the configuration data again, or needs to do so far less often, which greatly simplifies the processing of subsequent services and effectively improves development efficiency.
In addition, when the associated object is stored as a parquet file, the data can be compressed, which greatly reduces disk and memory usage and reduces IO consumption.
Meanwhile, because the associated object is stored as a parquet file, the data types stored in the file do not need to be converted when business processing is performed on it, which further improves processing speed.
In addition, because the associated data is encapsulated in advance into SmartRDD and SmartBean objects, developers can operate on it directly with SQL statements, so they do not need a deep understanding of the Spark principle and the underlying technology, which effectively reduces the investment of labor cost.
In addition, because the parquet files corresponding to different sub-intervals of the same time interval are related, the data is simplified during processing by being divided into a plurality of sub-intervals, each corresponding to one parquet file; these files can be quickly associated during subsequent statistics, which simplifies development and facilitates subsequent operation and maintenance work.
A second embodiment of the present invention relates to a data processing method. The second embodiment mainly encapsulates the source data, and then obtains the elastic distributed data set RDD object for SQL statement query.
As shown in fig. 2, the data processing method according to the second embodiment includes the steps of:
step 201, obtaining predefined metadata of source data according to a preset metadata name.
Specifically, the metadata in this embodiment is a description of the source data. In practical applications, the metadata mainly includes the path of the source data, the file name, the column names, the column types, column description information, column default values, and the like, which are not listed one by one here.
Furthermore, it should be understood that, in practical applications, in order to quickly locate and acquire the predefined metadata of the source data, each piece of metadata may be assigned a name that uniquely identifies it. When acquiring the predefined metadata of the source data, the preset metadata name is then matched directly against the names of the metadata held in storage, and the metadata whose name matches the input metadata name is screened out.
For convenience of explanation, the present embodiment stores information such as column names, column types, column description information, column default values, and the like included in metadata in the form of a table.
Correspondingly, the name used to identify the uniqueness of the metadata is the table name.
Accordingly, the preset metadata name may be composed of path + time + table name, such as /var/data/2020-05-22/10:00:00.test.
As is apparent from the above description, in this embodiment the table name is the truncated suffix of the metadata name. Thus, when the metadata name is "/var/data/2020-05-22/10:00:00.test", the metadata stored under /var/data/2020-05-22/10:00:00 with the table name "test" can be found based on the path in the metadata name and the truncated suffix.
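Parsing such a metadata name into its path, time and table-name components can be sketched as follows (a hypothetical illustration; the patent does not prescribe a concrete parsing routine):

```python
def parse_metadata_name(metadata_name: str):
    """Split a metadata name of the form path + time + table name,
    e.g. /var/data/2020-05-22/10:00:00.test, into its directory,
    time component, and table name (the truncated suffix)."""
    directory, filename = metadata_name.rsplit("/", 1)
    time_part, table_name = filename.rsplit(".", 1)
    return directory, time_part, table_name

d, t, table = parse_metadata_name("/var/data/2020-05-22/10:00:00.test")
```

The table name obtained this way is then used to look up the matching stored metadata.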
In addition, regarding the column names included in the metadata, in a specific implementation a person skilled in the art may define them in MySQL, for example setting the column names to ID, NAME, AGE and so on.
In addition, in practical applications, the preset metadata name may be stored in a designated storage area of the electronic device that implements the data processing method of this embodiment (for example, a developer stores the name in advance, before data encapsulation), may be stored in another electronic device able to communicate with it, or may be input in real time as needed while the method is being executed; this embodiment does not limit the choice.
To facilitate understanding of the specific form of the metadata, the following is described in connection with examples:
assume that the content recorded in the source data is:
1, Zhang San, 23
2, lie four, 14
3, wangwu, 89
Then, based on the above data to be processed, the predefined metadata is shown in Table 6:
TABLE 6 metadata of Source data
Column name    Type      Nullable    Default value
ID             INT       N           0
NAME           STRING    Y
AGE            INT       Y
Specifically, "ID" in Table 6 corresponds to the identification number of each user in the source data, such as 1, 2 and 3 above; "NAME" corresponds to each user's name, such as Zhang San, Li Si and Wang Wu; and "AGE" corresponds to each user's age, such as 23, 14 and 89. The metadata also specifies the type of each column (i.e., the column type mentioned above), such as the integer type INT for "ID" and "AGE" and the STRING type for "NAME"; whether the column may be null, with "Y" for nullable and "N" for not nullable (i.e., the column description information); and whether a default value (i.e., the column default value) needs to be set.
As shown in Table 6, the default-value portion may record some extended content according to actual service needs, for example the "0" recorded for "ID"; it may also record nothing, as in the default-value portions corresponding to "NAME" and "AGE".
It should be understood that the above is only an example and does not limit the technical solution of this embodiment in any way; in practical applications, a person skilled in the art may make settings as needed.
Step 202, obtaining metadata to be packaged according to the metadata and a preset filtering condition.
Specifically, in this embodiment, the filtering condition is specifically a preset column NAME, such as "ID, NAME".
As can be seen from the metadata example (Table 6) given in step 201, when the received preset filtering condition is "ID, NAME", the information (column name, type, column description information, column default value) corresponding to the two columns named "ID" and "NAME" is screened out of the metadata; that is, the metadata to be encapsulated is the information related to the column names "ID" and "NAME".
For ease of understanding, still taking the metadata given in Table 6 as an example, when the received preset filtering condition is "ID, NAME", the metadata to be encapsulated obtained from the metadata in Table 6 and the preset filtering condition "ID, NAME" is shown in Table 7.
Table 7 metadata to be packaged
Column name    Type      Nullable    Default value
ID             INT       N           0
NAME           STRING    Y
It should be understood that the above is only illustrative and does not limit the technical solution of this embodiment in any way.
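The screening of Table 7 out of Table 6 can be sketched as a simple filter over per-column metadata records (a hypothetical illustration of the step above, not the patent's actual implementation):

```python
# Metadata as in Table 6: one record per column of the source data.
metadata = [
    {"column": "ID",   "type": "INT",    "nullable": "N", "default": "0"},
    {"column": "NAME", "type": "STRING", "nullable": "Y", "default": None},
    {"column": "AGE",  "type": "INT",    "nullable": "Y", "default": None},
]

def filter_metadata(metadata, preset_columns):
    """Keep only the columns named in the preset filtering condition."""
    wanted = set(preset_columns)
    return [m for m in metadata if m["column"] in wanted]

to_encapsulate = filter_metadata(metadata, ["ID", "NAME"])
```

Applying the condition "ID, NAME" leaves exactly the two records shown in Table 7.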
Further, the preset column name may be stored in the electronic device in advance, similar to the preset metadata name in step 201, or may be input in real time according to business needs when implementing the data processing method in the present embodiment.
Step 203, marking the source data specified in the metadata as an elastic distributed data set character string type object.
As can be seen from the above description, the metadata includes a path and a file name of the source data. Therefore, when the source data specified in the metadata is marked as an elastic distributed data set string type object, specifically, a source data file recording the source data is determined according to the source data path recorded in the metadata and the file name of the source data, and then the source data in the source data file is marked as an elastic distributed data set string type object.
To facilitate understanding of the above elastic distributed data set string type object, this embodiment takes the Java programming language as an example: for Java, the elastic distributed data set string type object is specifically a JavaRDD&lt;String&gt; object.
That is to say, in practical application, the specific format corresponding to the elastic distributed data set string type object is named based on different programming languages, which is not limited in this embodiment, and those skilled in the art can set the format as needed.
And 204, converting the elastic distributed data set character string type object into an elastic distributed data set structured type object by taking the metadata to be packaged as a filtering condition.
Still taking the Java programming language as an example, when the elastic distributed dataset string type object is a JavaRDD&lt;String&gt; object, the elastic distributed dataset structured type object obtained by the conversion is likewise for Java; that is, in this embodiment it is specifically a JavaRDD&lt;Row&gt; object.
In addition, as can be seen from the above description, the metadata further includes information about the columns of the source data, such as column names, column types, column description information and column default values. Therefore, when the JavaRDD&lt;String&gt; object is converted into a JavaRDD&lt;Row&gt; object with the metadata to be encapsulated as the filtering condition, the conversion is performed according to the preset column names and the related column information. For example, based on an input column name, all the related information corresponding to that column is found in the JavaRDD&lt;String&gt; object and marked in Row form, thereby obtaining the JavaRDD&lt;Row&gt; object.
It should be understood that, since the JavaRDD&lt;String&gt; and JavaRDD&lt;Row&gt; objects are both common objects, their use and the conversion between them can be implemented by those skilled in the art with reference to the relevant materials, and the description is omitted here.
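The string-to-structured conversion of step 204 can be sketched in plain Python (the patent performs it as JavaRDD&lt;String&gt; to JavaRDD&lt;Row&gt; in Spark; the schema tuples and comma separator below are assumptions for illustration):

```python
schema = [("ID", int, 0), ("NAME", str, None), ("AGE", int, None)]  # (name, type, default)

def to_rows(lines, schema):
    """Convert raw comma-separated source lines into structured rows,
    applying the column types and default values from the schema."""
    rows = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        row = {}
        for (name, typ, default), value in zip(schema, fields):
            row[name] = typ(value) if value != "" else default
        rows.append(row)
    return rows

rows = to_rows(["1, Zhang San, 23", "2, Li Si, 14"], schema)
```

Each raw line of the source data from step 201's example becomes one typed row keyed by the column names in the metadata.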
In addition, it should be noted that, in practical applications, before the step 204 is executed, the metadata to be packaged may be converted into a schema format.
Correspondingly, when the elastic distributed dataset string type object (for Java, the JavaRDD&lt;String&gt; object) is converted into the elastic distributed dataset structured type object (for Java, the JavaRDD&lt;Row&gt; object) with the metadata to be encapsulated as the filtering condition, the conversion specifically uses the schema as the filtering condition.
Step 205, converting the metadata to be encapsulated and the elastic distributed data set structured type object into a Dataset < Row > object.
Namely, the converted Dataset < Row > object records the relationship between the metadata to be encapsulated and the elastic distributed data set structured type object.
Step 206, encapsulating the metadata to be encapsulated, the elastic distributed data set structured type object and the Dataset < Row > object in the same data object to obtain an RDD object for SQL statement query.
As can be seen from the above description, since the metadata to be packaged can be converted into schema, the operation in step 206 may specifically be: and encapsulating the schema, the elastic distributed data set structured type object and the Dataset < Row > object in the same data object, thereby obtaining the RDD object for SQL statement query.
In addition, in this embodiment, the JavaRDD&lt;String&gt;, JavaRDD&lt;Row&gt; and Dataset&lt;Row&gt; objects all store references to the corresponding data, not the data itself.
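The encapsulation of step 206 can be sketched as a small wrapper object (a toy stand-in for the SmartRDD; the real object wraps Spark's schema, JavaRDD&lt;Row&gt; and Dataset&lt;Row&gt;, and the `select` method here merely imitates an SQL query):

```python
class SmartRDD:
    """Minimal stand-in for the SmartRDD wrapper: it bundles the schema
    (metadata to be encapsulated), the structured rows, and the table
    name used for SQL-style queries, storing references rather than copies."""
    def __init__(self, schema, rows, table_name):
        self.schema = schema
        self.rows = rows          # a reference, not a copy of the data
        self.table_name = table_name

    def select(self, *columns):
        """A toy projection query, in place of a real SQL engine."""
        return [{c: r[c] for c in columns} for r in self.rows]

rdd = SmartRDD([("ID", "INT"), ("NAME", "STRING")],
               [{"ID": 1, "NAME": "Zhang San"}], "test")
names = rdd.select("NAME")
```

Bundling schema and data in one object is what lets a developer query by column name without touching the underlying Spark machinery.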
It is not difficult to see from the above description that the data processing method provided in this embodiment marks the source data as an elastic distributed data set string type object according to the metadata corresponding to the source data, filters the metadata to be encapsulated out of the metadata according to the preset filtering condition, converts the string type object into an elastic distributed data set structured type object with the metadata to be encapsulated as the filtering condition, converts the metadata to be encapsulated and the structured type object into a Dataset&lt;Row&gt; object, and encapsulates the metadata to be encapsulated, the structured type object and the Dataset&lt;Row&gt; object into a data object in RDD format that can be queried with SQL statements. A developer can therefore carry out big-data Spark development using only SQL statements, without a deep understanding of the Spark principle and underlying technology, further reducing the investment of labor cost.
A third embodiment of the present invention relates to a data processing method. The third embodiment mainly encapsulates the configuration data to obtain a simple entity Bean object for SQL statement query.
As shown in fig. 3, the data processing method according to the third embodiment includes the steps of:
step 301, obtaining predefined metadata of configuration data according to a preset metadata name.
Step 302, obtaining metadata to be encapsulated according to the metadata and a preset filtering condition.
It is to be understood that steps 301 and 302 in this embodiment are substantially the same as steps 201 and 202 in the second embodiment, and are not repeated here.
Step 303, converting the configuration data specified in the metadata into a structured array by using the metadata to be packaged as a filtering condition.
Specifically, in this embodiment the metadata includes the path of the configuration data and the file name of the configuration data. Therefore, when converting the configuration data specified in the metadata into a structured array with the metadata to be encapsulated as the filtering condition, a configuration data file recording the configuration data is first determined according to the path and file name recorded in the metadata; the configuration data recorded in the file is then read line by line, and each line read is split to obtain a string array; finally, the string array is converted into a structured array according to the number of elements to be encapsulated.
The operation of splitting each line of the read configuration data may, in practical applications, be based on a separator input by the user or on the system's default separator; a person skilled in the art may set this as needed, and this embodiment does not limit it.
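The line-by-line read, split and conversion of step 303 can be sketched as follows (a plain-Python illustration; the separator and element types are assumptions, and the patent performs this inside its configuration-data encapsulation):

```python
def parse_config(lines, separator=",", types=(int, str)):
    """Read configuration data line by line, split each line on the
    separator into a string array, then convert it to a structured
    array according to the number and types of elements to encapsulate."""
    structured = []
    for line in lines:
        parts = [p.strip() for p in line.split(separator)]
        structured.append(tuple(t(p) for t, p in zip(types, parts)))
    return structured

config = parse_config(["1, Zhang San", "2, Li Si"])
```

The resulting structured array is what then gets encapsulated, together with the metadata, into the Bean object of step 304.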
Furthermore, it is worth mentioning that, since the predefined metadata in practical applications usually includes column information for a plurality of columns, the columns may, for convenience of management, be stored in a set, so that the metadata to be encapsulated that satisfies the condition can be found from the column names input by the user.
Accordingly, in practical applications, if the screened metadata to be packaged includes column information of a plurality of columns, for convenience of management, the columns in the metadata to be packaged may also be stored in a set.
And 304, encapsulating the metadata to be encapsulated and the structured array in the same data object to obtain a Bean object for SQL statement query.
As can easily be seen from the above description, the data processing method provided in this embodiment encapsulates configuration data with a low update frequency into a SmartBean, and provides a custom interface to manage SmartBeans, thereby enabling their reuse: the same SmartBean can be used by different services, repeated development is avoided, and development efficiency is improved.
In addition, both the SmartBean and the SmartRDD store the metadata to be encapsulated screened out under the same filtering condition, which makes it convenient to associate them through the data processing interface provided by the data processing framework, reduces the consumption of device resources as far as possible without adding Spark jobs, and saves device cost.
The steps of the above methods are divided only for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as the same logical relationship is preserved. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing its core design also falls within the protection scope of this patent.
A fourth embodiment of the present invention relates to a data processing apparatus, as shown in fig. 4, including: an instruction acquisition module 401, a task decomposition module 402, a data acquisition module 403, and a data processing module 404.
The instruction obtaining module 401 is configured to obtain a data processing instruction and extract a job type, a time interval and a time granularity from it; the task decomposition module 402 is configured to divide the time interval into multiple sub-intervals according to the time granularity, based on a task decomposition class provided by a predefined data processing framework, and to generate a subtask corresponding to each sub-interval according to the job type; the data obtaining module 403 is configured to call, through the subtask, a data processing interface provided by the data processing framework and obtain the data to be processed, which includes source data and configuration data; and the data processing module 404 is configured to process the source data and the configuration data based on the data processing interface provided by the data processing framework, and to store the processed data as a parquet file in a column storage format.
In addition, in another example, the data processing apparatus further includes a source data encapsulation module and a configuration data encapsulation module.
Specifically, the source data encapsulation module is configured to encapsulate the source data to obtain an elastic distributed data set RDD object for SQL statement query.
Correspondingly, the configuration data packaging module is used for packaging the configuration data to obtain the simple entity Bean object for SQL statement query.
In addition, in another example, the source data encapsulation module is specifically configured to obtain metadata of predefined source data according to a preset metadata name; obtaining metadata to be packaged according to the metadata and preset filtering conditions; marking the source data specified in the metadata as an elastic distributed data set string type object; converting the elastic distributed data set character string type object into an elastic distributed data set structured type object by taking the metadata to be packaged as a filtering condition; converting the metadata to be packaged and the elastic distributed data set structured type object into a Dataset < Row > object; and encapsulating the metadata to be encapsulated, the elastic distributed data set structured type object and the Dataset < Row > object in the same data object to obtain an RDD object for SQL statement query.
In addition, in another example, the configuration data encapsulation module is specifically configured to obtain metadata of predefined configuration data according to a preset metadata name; obtaining metadata to be packaged according to the metadata and preset filtering conditions; converting the configuration data appointed in the metadata into a structured array by taking the metadata to be packaged as a filtering condition; and encapsulating the metadata to be encapsulated and the structured array in the same data object to obtain a Bean object for SQL statement query.
In addition, in another example, the data processing module 404 is specifically configured to associate the RDD object and the Bean object based on the data processing interface provided by the data processing framework to obtain an associated object in RDD format, and to save that associated object as a parquet file in a column storage format.
Further, in another example, the data processing apparatus further includes a time granularity detection module and a time granularity rounding module.
Specifically, the time granularity detection module is configured to detect whether the time granularity satisfies a preset time granularity value rule, where the rule specifies that the time granularity must be an integer multiple of the generation granularity of the data to be processed.
Correspondingly, if the rule is satisfied, the task decomposition module 402 is triggered to divide the time interval into a plurality of sub-intervals according to the time granularity, based on the task decomposition class provided by the predefined data processing framework, and to generate the subtask corresponding to each sub-interval according to the job type; otherwise, the time granularity rounding module is notified to round the time granularity.
Correspondingly, after the time granularity rounding module rounds the time granularity, the task decomposition module 402 is specifically configured to divide the time interval into a plurality of sub-intervals according to the rounded time granularity based on a task decomposition class provided by a predefined data processing framework, and generate a sub-task corresponding to each sub-interval according to the job type.
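The granularity check and rounding can be sketched as follows (an illustration only; the patent does not specify the rounding direction, so rounding up to the nearest valid multiple is an assumption here):

```python
def normalize_granularity(time_granularity: int, generation_granularity: int) -> int:
    """Enforce the time-granularity value rule: the time granularity must be
    an integer multiple of the granularity at which the data is generated.
    If it is not, round it up to the nearest valid multiple (assumed policy)."""
    if time_granularity % generation_granularity == 0:
        return time_granularity
    return ((time_granularity // generation_granularity) + 1) * generation_granularity

ok = normalize_granularity(60, 5)      # already a multiple: kept as-is
rounded = normalize_granularity(7, 5)  # not a multiple: rounded up to 10
```

Only after this normalization does the task decomposition module divide the time interval into sub-intervals.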
Further, in another example, the data processing apparatus further includes a time interval detection module and an abnormality prompting module.
Specifically, the time interval detection module is configured to detect whether the end time corresponding to the time interval is greater than the start time corresponding to the time interval.
Correspondingly, if the end time is greater, the task decomposition module 402 is triggered to divide the time interval into a plurality of sub-intervals according to the time granularity, based on the task decomposition class provided by the predefined data processing framework, and to generate the subtask corresponding to each sub-interval according to the job type; otherwise, the abnormality prompting module is notified to issue an abnormality prompt.
In addition, in another example, the data processing apparatus further comprises a data processing framework encapsulation module.
Specifically, the data processing framework encapsulation module is configured to abstract a task decomposition class and a data processing class based on a Spark framework; constructing a data processing interface for the data processing class; and packaging the environment initialization method of the Spark framework, the read-write method of the parquet file provided by Spark SQL, the task decomposition class and the data processing interface to obtain the data processing framework.
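The framework encapsulation described above can be sketched as a small skeleton (a toy stand-in: real versions of `init_environment`, the parquet read/write methods, and the handler classes would wrap Spark; all names here are hypothetical):

```python
class DataProcessingFramework:
    """Skeleton of the framework: environment initialization, a
    task-decomposition method, and a data-processing interface that
    delegates to a developer-supplied service class."""
    def init_environment(self):
        self.env = {"initialized": True}   # stands in for Spark session setup

    def decompose(self, start, end, step):
        """Divide the integer interval [start, end] into sub-intervals of width step."""
        return [(i, min(i + step - 1, end)) for i in range(start, end + 1, step)]

    def process(self, subtask, handler):
        """Data-processing interface: hand each subtask to the
        developer's handler (playing the role of an AParquetAction)."""
        return handler(subtask)

fw = DataProcessingFramework()
fw.init_environment()
subtasks = fw.decompose(1, 10, 4)
result = fw.process(subtasks[0], lambda t: f"processed {t}")
```

The developer supplies only the handler; decomposition, environment setup and file IO stay inside the framework.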
It will be appreciated that this embodiment is an apparatus embodiment corresponding to the first, second or third embodiment and that this embodiment may be implemented in conjunction with the first, second or third embodiment. The related technical details mentioned in the first, second, or third embodiment are still valid in this embodiment, and are not repeated here for the sake of reducing repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first, or second, or third embodiment.
It should be noted that each module referred to in this embodiment is a logical module, and in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not so closely related to solving the technical problems proposed by the present invention are not introduced in the present embodiment, but this does not indicate that other elements are not present in the present embodiment.
A fifth embodiment of the present invention relates to an electronic device, as shown in fig. 5, including at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501, so that the at least one processor 501 can execute the data processing method described in the first or second embodiment.
The memory 502 and the processor 501 are coupled by a bus, which may include any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 501 and the memory 502 together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 501 is transmitted over a wireless medium through an antenna, which further receives the data and transmits the data to the processor 501.
The processor 501 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 502 may be used to store data used by processor 501 in performing operations.
A sixth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described data processing method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the above embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A data processing method, comprising:
acquiring a data processing instruction, and extracting a job type, a time interval and a time granularity from the data processing instruction;
dividing the time interval into a plurality of subintervals according to the time granularity based on a task decomposition class provided by a predefined data processing framework, and generating a subtask corresponding to each subinterval according to the operation type;
calling a data processing interface provided by the data processing framework through the subtask to acquire data to be processed, wherein the data to be processed comprises source data and configuration data;
and processing the source data and the configuration data based on the data processing interface provided by the data processing framework, and saving the processed data as a parquet file in a column-oriented storage format.
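The decomposition step of claim 1 can be illustrated with a short sketch. The following Python is not from the patent: the function name, the subtask dictionary shape, and the use of `datetime`/`timedelta` are illustrative assumptions about how a time interval might be split by granularity into one subtask per subinterval.

```python
from datetime import datetime, timedelta

def decompose(start, end, granularity, job_type):
    """Split [start, end) into subintervals of width `granularity` and
    emit one subtask descriptor per subinterval (illustrative shape)."""
    subtasks = []
    cursor = start
    while cursor < end:
        sub_end = min(cursor + granularity, end)
        subtasks.append({"job_type": job_type, "start": cursor, "end": sub_end})
        cursor = sub_end
    return subtasks

tasks = decompose(datetime(2020, 8, 12, 0, 0),
                  datetime(2020, 8, 12, 3, 0),
                  timedelta(hours=1),
                  "traffic_aggregation")
print(len(tasks))  # 3 subtasks, one per hour
```

Each subtask would then independently call the framework's data processing interface for its own subinterval, which is what makes the per-granularity parallelism of the claim possible.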
2. The data processing method of claim 1, wherein before the processing the source data and the configuration data based on the data processing interface provided by the data processing framework, the method further comprises:
encapsulating the source data to obtain a resilient distributed dataset (RDD) object for SQL statement query;
and encapsulating the configuration data to obtain a simple entity Bean object for SQL statement query.
3. The data processing method according to claim 2, wherein the encapsulating the source data to obtain a resilient distributed dataset (RDD) object for SQL statement query comprises:
acquiring predefined metadata of the source data according to a preset metadata name;
obtaining metadata to be encapsulated according to the metadata and a preset filtering condition;
marking the source data specified in the metadata as a resilient distributed dataset string-type object;
converting the resilient distributed dataset string-type object into a resilient distributed dataset structured-type object by taking the metadata to be encapsulated as a filtering condition;
converting the metadata to be encapsulated and the resilient distributed dataset structured-type object into a Dataset<Row> object;
and encapsulating the metadata to be encapsulated, the resilient distributed dataset structured-type object, and the Dataset<Row> object in the same data object to obtain the RDD object for SQL statement query.
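The claim's pipeline (raw strings → metadata filter → structured records → one queryable wrapper object) relies on Spark's RDD and Dataset<Row> types, which cannot run outside a Spark session. The sketch below emulates only the data flow with plain Python lists and dicts; every name, the field layout, and the filtering condition are invented for illustration and are not the patent's implementation.

```python
# Hypothetical metadata: field name -> position of that field in a raw log line.
metadata = {"timestamp": 0, "domain": 1, "bytes": 2, "debug_flag": 3}
# Assumed preset filtering condition: drop internal fields from the schema.
wanted = {k: v for k, v in metadata.items() if not k.endswith("_flag")}

# Stand-in for the RDD[String] holding the raw source data.
raw_lines = ["1597190400 example.com 512 0",
             "1597190401 example.org 1024 1"]

def to_structured(line):
    """Keep only the fields named in the filtered metadata (stand-in for
    the string-type -> structured-type RDD conversion)."""
    parts = line.split()
    return {name: parts[pos] for name, pos in wanted.items()}

structured = [to_structured(l) for l in raw_lines]   # stand-in for Dataset<Row>
# Bundle schema and rows in one object, mirroring the final encapsulation step.
query_object = {"schema": list(wanted), "rows": structured}
print(query_object["rows"][0]["domain"])
```

In actual Spark code the analogous steps would be `sparkContext.textFile(...)`, a `map` to `Row` objects, and `createDataFrame`, after which the result can be registered for SQL queries.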
4. The data processing method according to claim 2, wherein the encapsulating the configuration data to obtain a simple entity Bean object for SQL statement query comprises:
acquiring predefined metadata of the configuration data according to a preset metadata name;
obtaining metadata to be encapsulated according to the metadata and a preset filtering condition;
converting the configuration data specified in the metadata into a structured array by taking the metadata to be encapsulated as a filtering condition;
and encapsulating the metadata to be encapsulated and the structured array in the same data object to obtain the Bean object for SQL statement query.
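The configuration-data path of claim 4 mirrors claim 3 but produces a simple entity (Bean) rather than an RDD. As a hedged illustration only, the sketch below uses a metadata map of field names to Python types as a stand-in for the predefined configuration metadata; the field names, filter set, and tuple layout are all assumptions.

```python
# Hypothetical configuration metadata: field name -> target type.
config_metadata = {"domain": str, "quota_gb": int, "region_code": str}
preset_filter = {"domain", "quota_gb"}            # assumed filtering condition
to_wrap = {k: t for k, t in config_metadata.items() if k in preset_filter}

# Raw configuration rows as they might arrive, column order as in metadata.
raw_config = [("example.com", "100", "CN-FJ"), ("example.org", "50", "CN-GD")]
fields = list(config_metadata)

# Convert the specified configuration data into a structured, typed array.
structured_array = [
    {name: to_wrap[name](row[fields.index(name)]) for name in to_wrap}
    for row in raw_config
]
# Bundle metadata and rows together: stand-in for the queryable Bean object.
bean = {"metadata": to_wrap, "rows": structured_array}
```

The point of pairing the filtered metadata with the structured array in one object is that a downstream SQL layer can resolve column names and types without re-reading the raw configuration source.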
5. The data processing method according to claim 2, wherein the processing the source data and the configuration data based on the data processing interface provided by the data processing framework and saving the processed data as a parquet file in a column-oriented storage format comprises:
associating the RDD object with the Bean object based on the data processing interface provided by the data processing framework to obtain an associated object in RDD format;
and saving the associated object in RDD format as a parquet file in a column-oriented storage format.
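The association step of claim 5 is, in effect, a join between the encapsulated source rows and the encapsulated configuration rows on a shared key. The sketch below emulates that join in plain Python; the field names and the choice of "domain" as the join key are illustrative assumptions, not the patent's schema.

```python
source_rows = [{"domain": "example.com", "bytes": 512},
               {"domain": "example.org", "bytes": 1024}]
config_rows = [{"domain": "example.com", "customer": "acme"}]

# Inner-join the two record sets on "domain" (the assumed shared key).
by_domain = {c["domain"]: c for c in config_rows}
joined = [{**s, **by_domain[s["domain"]]}
          for s in source_rows if s["domain"] in by_domain]
print(joined)  # one joined record, for example.com only
```

In Spark, the analogous operations are `Dataset.join` on the key column followed by `DataFrameWriter.parquet(path)`, which persists the associated result in the column-oriented parquet format named in the claim.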
6. The data processing method according to claim 1, wherein before the dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework and generating a subtask corresponding to each subinterval according to the job type, the method further comprises:
detecting whether the time granularity conforms to a preset time granularity value rule, wherein the time granularity value rule specifies that the time granularity is an integral multiple of the generation granularity of the data to be processed;
if so, executing the dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework and generating a subtask corresponding to each subinterval according to the job type;
otherwise, rounding the time granularity;
wherein the dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework and generating a subtask corresponding to each subinterval according to the job type comprises:
dividing the time interval into a plurality of subintervals according to the rounded time granularity based on the task decomposition class provided by the predefined data processing framework, and generating the subtask corresponding to each subinterval according to the job type.
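The value rule of claim 6 can be sketched as a small normalization helper. The claim only says the granularity is "rounded" when it is not an integral multiple of the generation granularity; rounding to the nearest non-zero multiple, as below, is one plausible reading and is an assumption, as are the minute-based units.

```python
def normalize_granularity(granularity_min, generation_min):
    """Enforce the value rule: the requested time granularity must be an
    integral multiple of the granularity at which the data to be processed
    is generated; otherwise round it to the nearest such multiple
    (rounding scheme assumed, not specified by the claim)."""
    if granularity_min % generation_min == 0:
        return granularity_min
    multiple = max(1, round(granularity_min / generation_min))
    return multiple * generation_min

print(normalize_granularity(10, 5))   # already a multiple -> 10
print(normalize_granularity(7, 5))    # rounded down to 5
```

Normalizing before decomposition guarantees every subinterval aligns with whole generation periods of the underlying data, so no subtask reads a partial data file.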
7. The data processing method according to claim 6, wherein before the dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework and generating a subtask corresponding to each subinterval according to the job type, the method further comprises:
detecting whether the ending time corresponding to the time interval is greater than the starting time corresponding to the time interval;
if so, executing the dividing the time interval into a plurality of subintervals according to the time granularity based on the task decomposition class provided by the predefined data processing framework and generating a subtask corresponding to each subinterval according to the job type;
otherwise, issuing an exception prompt.
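The precondition in claim 7 amounts to a sanity check that the interval's end comes after its start; only then is the interval decomposed, otherwise an exception prompt is raised. A minimal sketch (function name and exception type are illustrative):

```python
def check_interval(start_ts, end_ts):
    """Accept the interval only when its ending time is later than its
    starting time; otherwise raise, standing in for the claim's
    exception prompt."""
    if end_ts <= start_ts:
        raise ValueError(f"invalid interval: start={start_ts} end={end_ts}")
    return True
```

Performing this check before the check in claim 6 and before decomposition means malformed jobs fail fast, before any Spark resources are allocated.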
8. The data processing method according to any one of claims 1 to 7, wherein before the calling, through the subtask, the data processing interface provided by the data processing framework to acquire the data to be processed, the method further comprises:
abstracting a task decomposition class and a data processing class based on a Spark framework;
constructing a data processing interface for the data processing class;
and encapsulating the environment initialization method of the Spark framework, the read-write methods for parquet files provided by Spark SQL, the task decomposition class, and the data processing interface to obtain the data processing framework.
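The framework assembly of claim 8 (abstract a task decomposition class and a data processing class, expose an interface, bundle them with Spark setup and parquet I/O) can be sketched with abstract base classes. Everything here is illustrative: the class and method names are invented, and the Spark session initialization and parquet read/write are stubbed out since they need a live cluster.

```python
from abc import ABC, abstractmethod

class TaskDecomposer(ABC):
    """Abstracted task decomposition class of the framework."""
    @abstractmethod
    def decompose(self, start, end, granularity, job_type): ...

class DataProcessor(ABC):
    """Data processing interface a concrete job must implement."""
    @abstractmethod
    def load(self, subtask): ...
    @abstractmethod
    def process(self, source, config): ...

class Framework:
    """Bundles the decomposer and the processor; in a real deployment this
    would also own Spark environment initialization and parquet I/O."""
    def __init__(self, decomposer: TaskDecomposer, processor: DataProcessor):
        self.decomposer = decomposer
        self.processor = processor

    def run(self, start, end, granularity, job_type):
        results = []
        for sub in self.decomposer.decompose(start, end, granularity, job_type):
            source, config = self.processor.load(sub)
            results.append(self.processor.process(source, config))
        return results

# Toy concrete implementations to exercise the skeleton.
class SlotDecomposer(TaskDecomposer):
    def decompose(self, start, end, granularity, job_type):
        return [{"job": job_type, "slot": t}
                for t in range(start, end, granularity)]

class SumProcessor(DataProcessor):
    def load(self, subtask):
        return [subtask["slot"]] * 2, {"factor": 10}
    def process(self, source, config):
        return sum(source) * config["factor"]

fw = Framework(SlotDecomposer(), SumProcessor())
out = fw.run(0, 4, 2, "demo")
print(out)
```

The design rationale suggested by the claim is conventional inversion of control: jobs implement only `decompose`/`load`/`process`, while Spark bootstrap and parquet persistence live once in the framework.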
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1 to 8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 8.
CN202010808517.9A 2020-08-12 2020-08-12 Data processing method, electronic device and storage medium Active CN112052253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010808517.9A CN112052253B (en) 2020-08-12 2020-08-12 Data processing method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN112052253A true CN112052253A (en) 2020-12-08
CN112052253B CN112052253B (en) 2023-12-01

Family

ID=73602496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010808517.9A Active CN112052253B (en) 2020-08-12 2020-08-12 Data processing method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112052253B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032375A1 (en) * 2015-04-29 2018-02-01 Huawei Technologies Co., Ltd. Data Processing Method and Apparatus
CN109542889A (en) * 2018-10-11 2019-03-29 平安科技(深圳)有限公司 Stream data column storage method, device, equipment and storage medium
CN110147377A (en) * 2019-05-29 2019-08-20 大连大学 General polling algorithm based on secondary index under extensive spatial data environment
CN111104417A (en) * 2019-12-05 2020-05-05 苏宁云计算有限公司 Spark Sql external data source device, implementation method and system
CN111309463A (en) * 2020-02-05 2020-06-19 北京明略软件系统有限公司 Method and device for determining task execution time and readable storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612540A (en) * 2020-12-18 2021-04-06 北京达佳互联信息技术有限公司 Data model configuration method and device, electronic equipment and storage medium
CN112612540B (en) * 2020-12-18 2024-04-09 北京达佳互联信息技术有限公司 Data model configuration method, device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant