CN115221143A - Cross-type migration operator-based multi-source big data processing method - Google Patents
Cross-type migration operator-based multi-source big data processing method
- Publication number
- CN115221143A CN115221143A CN202210443573.6A CN202210443573A CN115221143A CN 115221143 A CN115221143 A CN 115221143A CN 202210443573 A CN202210443573 A CN 202210443573A CN 115221143 A CN115221143 A CN 115221143A
- Authority
- CN
- China
- Prior art keywords
- data
- source
- operator
- type
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/214—Database migration support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a multi-source big data processing method based on cross-type migration operators, which simplifies complex data-integration operations such as traditional data migration and realizes non-programming data operation across data-source types. The method comprises the following steps: step one, data operation operators are created according to the data-integration methods of the different types of data sources, the data sources in the operators are configured, and the metadata structure information of the input and output data is configured according to the structures of the data sources, yielding a data operation flow file; step two, the data in each data source is read with Spark as source data according to the flow file and its operator configuration information, and type conversion is performed on the read data set according to the source data; step three, following the data-processing flow drawn in the flow file, each data operation operator processes the data in flow order, and the resulting data set is written into the target data source.
Description
Technical Field
The invention belongs to the technical field of big data processing, and relates to a cross-type migration operator-based multi-source big data processing method.
Background
With the continuous expansion of business requirements, data scale grows increasingly complex and data sources increasingly diverse. In daily data development and practical application, data may come from multiple data sources, which are often stored in different types of databases according to the usage scenario. In this case, not only may field types be inconsistent across the heterogeneous source databases, but the structural differences between those databases must also be handled. This turns the use and migration of heterogeneous-source data into a vast undertaking: developers must spend a great deal of time on data-diversity and performance issues, which greatly hampers data analysis and processing.
Traditional data migration or data integration usually requires writing special-purpose programs for reading, writing, and processing the data. When multiple data sources are involved, both the matching of field types between sources and the differences between relational and non-relational storage structures must be considered, so the data-operation code becomes very complicated and rapid development for production use is impossible.
Disclosure of Invention
Aiming at the above defects, the invention provides a multi-source big data processing method based on cross-type migration operators, which simplifies complex data-integration operations such as traditional data migration and realizes non-programming data operation across data-source types.
The invention is realized by the following technical scheme.
A cross-type migration operator-based multi-source big data processing method comprises the following steps:
step one, establishing a data operation operator according to data integration methods of different types of data sources, configuring the data sources in the data operation operator, and configuring metadata structure information of input data and output data according to the structures of the data sources to obtain a data operation flow file;
step two, reading the data in the data source with Spark according to the data operation flow file and its data operation operator configuration information, and taking that data as source data; performing type conversion on the read data set according to the source data, finally obtaining the data stream to be operated on, and transmitting it through the output port defined by the operator;
step three, according to the data-processing flow drawn in the data operation flow file, performing data processing with each data operation operator in flow order, transmitting data between operators for adaptation through the ports defined by each operator, finally converting the processed operator data into a data set matching the field structure of the target data source, and writing that data set into the target data source.
The invention has the beneficial effects that:
Compared with the prior art, the method encapsulates the actual data-processing logic and code in operators, so that business personnel can obtain data-integration results in a simpler way; the processing flow feeds back the operation status in time and allows the operation results to be checked in real time, so users can easily control the data-processing process. The method also supports data-type conversion and data migration between heterogeneous data sources, avoiding the complex type-adaptation work of traditional heterogeneous data migration. In addition, the big-data processing strategy provided ensures that operations on large data volumes complete smoothly, finally realizing non-programming data operation across heterogeneous data-source types.
Detailed Description
The present invention will be described in further detail below.
The cross-type migration operator-based multi-source big data processing method comprises the following steps:
step one, data operation operators are created according to the data-integration methods of the different types of data sources, the data sources in the data operation operators are configured, and the metadata structure information of the input and output data is configured according to the structures of the data sources, obtaining the data operation flow file;
the input data refers to data read from the selected data source, and the output data refers to data to be stored in the target data source or data to be processed further; the data operation operators are adaptively connected through ports, and the ports identify the application type of the data circulating through the operators, the application types including data set (dataset), graphic data (graph), text data (text), structured data (json), and the like.
Step two, reading the data in the data source with Spark according to the data operation flow file and its data operation operator configuration information, and taking that data as source data; performing type conversion on the read data set according to the source data, finally obtaining the data stream to be operated on, and transmitting it through the output port defined by the operator;
in this embodiment, the configuration used when reading data from a data source with Spark includes the JDBC driver, database authorization information, database name, table name, data range, metadata information, and the like.
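As a hedged illustration, the configuration items listed above might be gathered into the options map that Spark's JDBC source accepts; all concrete values (driver, host, table, bounds) are invented for the example and are not taken from the patent:

```python
# Sketch of the JDBC read configuration items named above, arranged as the
# options map Spark's JDBC data source accepts. All concrete values are
# illustrative placeholders.
jdbc_options = {
    "driver": "com.mysql.cj.jdbc.Driver",       # JDBC driver
    "url": "jdbc:mysql://db-host:3306/sales",   # database name via the URL
    "user": "reader",                           # database authorization information
    "password": "secret",
    "dbtable": "orders",                        # table name
    "partitionColumn": "id",                    # data range for partitioned reads
    "lowerBound": "1",
    "upperBound": "1000000",
    "numPartitions": "8",
}

# In a live Spark session the read itself would look like:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```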
Step three, according to the data-processing flow drawn in the data operation flow file, performing data processing with each data operation operator in flow order, transmitting data between operators for adaptation through the ports defined by each operator, finally converting the processed operator data into a data set matching the field structure of the target data source, and writing that data set into the target data source.
In this embodiment, besides the input- and output-related operators, the data operation operators include preprocessing operators such as data filtering, deduplication, sorting, format conversion, and missing-value processing; statistical-analysis operators such as data retrieval and condition screening; machine-learning operators such as classification, clustering, and regression; and other data-service operators.
In this embodiment, the data processing using each data operator specifically includes the following steps:
step 1, configuring a data operation operator according to a data source type and a data field type of a data operation flow file, and determining a metadata structure of selected data;
in this embodiment, the data operation operator includes an operator configuration html file and a Jar package exported by an execution code entity. In specific implementation, the operator encapsulation can be performed on the database read-write operation of the data source type, including the relational database MySQL, the non-relational databases mnoogodb, the elastic search, the HBase, the HDFS, and the read-write operation encapsulation of various local documents.
Step 2, connecting the data operation operators to process and refine the data content, determining their execution order and data flow direction, generating the DAG (directed acyclic graph) corresponding to the data import/export flow together with the Json data corresponding to that DAG, and submitting an execution-flow request;
in the embodiment, the data content is perfected by connecting a data retrieval statistical operator with graphical operators such as bar charts and line graphs.
Step 3, the flow-execution service module receives the execution-flow request, obtains the DAG data, parses the operators selected in the flow together with their configuration parameters and execution order, executes each data operation operator in turn according to the parse result, performs type handling inside each operator according to the data-source structure type and the recorded metadata, reads or writes the data set, and finally completes the data-processing flow.
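The flow parsing in steps 2 and 3 above amounts to deriving an operator execution sequence from the DAG. A minimal sketch using Kahn's topological sort, with hypothetical operator names and a flow structure invented for illustration (the patent's actual Json layout is not specified):

```python
from collections import deque

# Hypothetical flow description: operators as nodes, data-flow edges as a DAG.
flow = {
    "operators": ["mysql_read", "dedup", "filter", "es_write"],
    "edges": [("mysql_read", "dedup"), ("dedup", "filter"), ("filter", "es_write")],
}

def execution_order(flow):
    """Derive the operator execution sequence from the DAG (Kahn's algorithm)."""
    indeg = {op: 0 for op in flow["operators"]}
    adj = {op: [] for op in flow["operators"]}
    for src, dst in flow["edges"]:
        adj[src].append(dst)
        indeg[dst] += 1
    # Start from operators with no upstream dependency (e.g. source readers).
    queue = deque(op for op, d in indeg.items() if d == 0)
    order = []
    while queue:
        op = queue.popleft()
        order.append(op)
        for nxt in adj[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order
```

A real flow-execution service would then run each operator in this order, adapting the data passed between them through the configured ports.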
In this embodiment, to avoid service or process interruption caused by the server occupying too much memory during processing, a batch import policy is adopted when reading or writing the data set: the data is pulled from the source data source in batches and then imported into the target data source.
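The batch import policy can be illustrated with a small generator that pulls records in fixed-size batches, so the full source data set never resides in memory at once. This is a sketch only; `read_source` and `write_target` in the usage comment are hypothetical names, not APIs from the patent:

```python
def batched(iterable, batch_size):
    """Yield fixed-size batches so the whole source data set never sits in memory."""
    batch = []
    for record in iterable:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

# Usage sketch: pull from the source in batches, import each batch into the target.
# for chunk in batched(read_source(), 5000):
#     write_target(chunk)
```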
In this embodiment, performing type processing in the operator according to the structure type of the data source and the recorded metadata specifically includes:
when the MySQL structured database is processed, the sql statements are spliced by setting data offset and the number of data read at a time, the JDBC is used for executing the sql statements to obtain a data set, and then the insert statements are spliced in the same way to insert the data set. The setting of the data offset is inefficient because the setting of the offset leads to full table scanning, so that when table data for determining the only main key is imported and exported, data positioning is performed in a pointer-based mode, and the start position of the data to be read at this time is determined through a where clause, so that data records outside a target range do not need to be considered when the data are read.
When unstructured databases are handled: when Elasticsearch data is read with RestHighLevelClient, the from + size paging form is inefficient under deep paging, and since Elasticsearch by default cannot read past the 10000-record limit, a scroll cursor must be set for paged reading until the hits field returned in the result set is empty, indicating the read is complete; when HBase data is read, because the total record count cannot be obtained and from + size reading performs too poorly, a cursor must likewise be set and the data read in batches through a scanner.
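The scroll-style loop — keep pulling pages until the hits field comes back empty — can be sketched in plain Python against a stand-in client. The real implementation would use RestHighLevelClient (Java) or an equivalent Elasticsearch scroll API; the class below is only a simulation of that contract:

```python
class FakeScrollClient:
    """Stand-in for an Elasticsearch scroll cursor: each call returns the next
    page, and an empty 'hits' field signals that reading is complete."""
    def __init__(self, docs, page_size):
        self.docs, self.page_size, self.pos = docs, page_size, 0

    def scroll(self):
        page = self.docs[self.pos:self.pos + self.page_size]
        self.pos += len(page)
        return {"hits": page}

def read_all(client):
    """Drain the cursor page by page, stopping on the first empty hits field."""
    out = []
    while True:
        hits = client.scroll()["hits"]
        if not hits:
            return out
        out.extend(hits)
```

Because each page is bounded, this pattern sidesteps both the deep-paging cost and the default 10000-record window limit described above.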
When import and export operations are processed cooperatively across storage types, the source data set is converted into a structured Dataset and then imported into the target data source. When an operator is configured, the metadata of the data is read according to the selected data-source structure, the metadata types are configured in the interface and converted into common data types; finally the metadata is configured according to the field requirements and field types of the target data source. When the operator executes, the data types of the Dataset are converted according to the changes in the configured metadata; the type conversion and field renaming are performed through the withColumn() and withColumnRenamed() methods provided by Dataset, finally achieving cooperative processing of heterogeneous data sources.
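The metadata-driven conversion can be mimicked outside Spark with a small mapping from source field types to common types. This is only an illustrative stand-in for the per-column `withColumn()`/`withColumnRenamed()` calls described above; the type names and metadata triples are invented for the example:

```python
# Hypothetical mapping of source field types to common Python types, mirroring
# the conversion of configured metadata into common data types.
COMMON_TYPES = {"VARCHAR": str, "TEXT": str, "INT": int, "BIGINT": int,
                "DOUBLE": float, "FLOAT": float}

def convert_row(row, metadata):
    """metadata: list of (source_name, source_type, target_name) triples.
    Applies the type conversion and field renaming per configured column."""
    out = {}
    for src_name, src_type, dst_name in metadata:
        caster = COMMON_TYPES.get(src_type.upper(), str)  # fall back to string
        out[dst_name] = caster(row[src_name])
    return out
```

In Spark the same effect per column would be `df.withColumn(c, df[c].cast(t)).withColumnRenamed(c, new_name)`.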
During cooperative processing, null values cannot be read from Elasticsearch unless the parameter "es.field.read.empty.as.null" is configured as "no" in the calling API; when HBase is read, the data must first be read as a JavaPairRDD and then further converted into a Dataset.
Although embodiments of this invention have been described, it will be apparent to those skilled in the art that various modifications may be made without departing from the principles of the invention, and these are to be considered within the scope of the invention.
Claims (12)
1. A cross-type migration operator-based multi-source big data processing method is characterized by comprising the following steps:
step one, establishing a data operation operator according to data integration methods of different types of data sources, configuring the data sources in the data operation operator, and configuring metadata structure information of input data and output data according to the structures of the data sources to obtain a data operation flow file;
step two, reading the data in the data source with Spark according to the data operation flow file and its data operation operator configuration information, and taking that data as source data; performing type conversion on the read data set according to the source data, finally obtaining the data stream to be operated on, and transmitting it through the output port defined by the operator;
step three, according to the data-processing flow drawn in the data operation flow file, performing data processing with each data operation operator in flow order, transmitting data between operators for adaptation through the ports defined by each operator, finally converting the processed operator data into a data set matching the field structure of the target data source, and writing that data set into the target data source.
2. The cross-type migration operator-based multi-source big data processing method according to claim 1, wherein the input data refers to data read from the selected data source, and the output data refers to data to be stored in the target data source or data to be processed further; and the data operation operators are adaptively connected through ports.
3. The cross-type migration operator-based multi-source big data processing method according to claim 2, wherein the port is used for identifying the application type of the data circulating in the operator, the application types comprising data set (dataset), graphic data (graph), text data (text) and structured data (json).
4. The cross-type migration operator-based multi-source big data processing method according to claim 1, 2 or 3, wherein the configuration for reading the data in the data source with Spark comprises the JDBC driver, database authorization information, database name, table name, data range and metadata information.
5. The cross-type migration operator-based multi-source big data processing method according to claim 1, 2 or 3, wherein the data processing using each data operation operator specifically comprises the following steps:
step 1, configuring a data operation operator according to a data source type and a data field type of a data operation flow file, and determining a metadata structure of selected data;
step 2, connecting the data operation operators to process and refine the data content, determining their execution order and data flow direction, generating the DAG (directed acyclic graph) corresponding to the data import/export flow together with the Json data corresponding to that DAG, and submitting an execution-flow request;
step 3, the flow-execution service module receives the execution-flow request, obtains the DAG data, parses the operators selected in the flow together with their configuration parameters and execution order, executes each data operation operator in turn according to the parse result, performs type handling inside each operator according to the data-source structure type and the recorded metadata, reads or writes the data set, and finally completes the data-processing flow.
6. The cross-type migration operator-based multi-source big data processing method according to claim 5, wherein the data operation operator comprises an operator-configuration html file and a Jar package exported from the execution-code entity.
7. The cross-type migration operator-based multi-source big data processing method according to claim 6, wherein operator packaging is performed on the database read-write operations of the data-source types, covering the relational database MySQL, the non-relational databases MongoDB, Elasticsearch and HBase, HDFS, and the read-write operations of various types of local documents.
8. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein the data content is refined by connecting the data-retrieval statistical operator with bar-chart and line-chart graphical operators.
9. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein a batch import strategy is adopted when the data set is read or written, namely the data is pulled in batches when read from the source data source and then imported into the target data source.
10. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein the type processing is performed in the operator according to the structure type of the data source and the recorded metadata, specifically:
when the MySQL structured database is processed, the sql statements are spliced by setting data offset and the number of data read at a time, the JDBC is used for executing the sql statements to obtain a data set, and then the insert statements are spliced in the same way to insert the data set.
11. The cross-type migration operator-based multi-source big data processing method according to claim 10, wherein, instead of setting the data offset, data positioning is performed in a pointer-based manner, and the start position of the data to be read is determined through a where clause, so that data records outside the target range need not be considered when the data are read.
12. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein the type processing is performed in the operator according to the structure type of the data source and the recorded metadata, specifically:
when the cross-storage type is subjected to import and export operation cooperative processing, converting a source data set into a structural data set by using dataset, and importing the structural data set into a target data source;
when an operator is configured, reading metadata information metadata of data according to a selected data source structure, configuring the type of the metadata in an interface, and converting the metadata into a common data type; and finally configuring metadata according to the field requirement of the target data and the field type of the target data source;
when the operator is executed, the data types of the dataset are converted according to the changes in the configured metadata, the data-type conversion or data-field renaming is performed through the withColumn() and withColumnRenamed() methods provided by dataset, finally achieving the goal of cooperative processing of heterogeneous data sources.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210443573.6A CN115221143A (en) | 2022-04-26 | 2022-04-26 | Cross-type migration operator-based multi-source big data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115221143A (en) | 2022-10-21
Family
ID=83608848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210443573.6A Pending CN115221143A (en) | 2022-04-26 | 2022-04-26 | Cross-type migration operator-based multi-source big data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115221143A (en) |
- 2022-04-26: CN application CN202210443573.6A filed; publication CN115221143A (en), status Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116455678A (en) * | 2023-06-16 | 2023-07-18 | 中国电子科技集团公司第十五研究所 | Network security log tandem method and system |
CN116455678B (en) * | 2023-06-16 | 2023-09-05 | 中国电子科技集团公司第十五研究所 | Network security log tandem method and system |
CN117591497A (en) * | 2024-01-18 | 2024-02-23 | 中核武汉核电运行技术股份有限公司 | Nuclear power historical data cross-system migration method |
CN117591497B (en) * | 2024-01-18 | 2024-05-03 | 中核武汉核电运行技术股份有限公司 | Nuclear power historical data cross-system migration method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||