CN115221143A - Cross-type migration operator-based multi-source big data processing method - Google Patents


Info

Publication number
CN115221143A
Authority
CN
China
Prior art keywords
data
source
operator
type
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210443573.6A
Other languages
Chinese (zh)
Inventor
李常宝
康健
曹龙启
贾贺
薛铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute
Priority to CN202210443573.6A
Publication of CN115221143A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/214 Database migration support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-type migration operator-based multi-source big data processing method, which avoids the complexity of traditional data-integration operations such as data migration and realizes non-programming data operations across data-source types. The method comprises the following steps: step one, creating data operation operators according to the data-integration methods of different types of data sources, configuring the data sources within the operators, and configuring the metadata structure information of the input data and output data according to the structures of those data sources, thereby obtaining a data operation flow file; step two, reading the data in each data source with Spark according to the data operation flow file and its operator configuration information, taking the read data as source data, and performing type conversion on the read data set according to the source data; and step three, according to the data processing flow drawn in the data operation flow file, performing data processing with each data operation operator in flow order and writing the resulting data set into the target data source.

Description

Cross-type migration operator-based multi-source big data processing method
Technical Field
The invention belongs to the technical field of big data processing, and relates to a cross-type migration operator-based multi-source big data processing method.
Background
With the continuous expansion of business requirements, data grows increasingly complex in scale and increasingly diverse in origin. In daily data development and practical applications, data may come from multiple sources, which are often stored in different types of databases depending on the usage scenario. In this case, not only may field types be inconsistent across heterogeneous source databases, but the structural differences between heterogeneous databases must also be handled. This makes using heterogeneous source data and migrating data a vast undertaking: developers must spend considerable time on data-diversity and performance issues, which greatly hinders data analysis and processing.
Traditional data migration or data integration usually requires writing dedicated programs for reading, writing, and processing the data. When multiple data sources are involved, one must consider not only the matching of field types across data sources but also the differences between the storage structures of relational and non-relational databases, so the data-operation code becomes very complicated and rapid development for production use is impossible.
Disclosure of Invention
To address these shortcomings, the invention provides a cross-type migration operator-based multi-source big data processing method, which avoids the complexity of traditional data-integration operations such as data migration and realizes non-programming data operations across data-source types.
The invention is realized by the following technical scheme.
A cross-type migration operator-based multi-source big data processing method comprises the following steps:
step one, creating a data operation operator according to the data-integration methods of different types of data sources, configuring the data sources in the data operation operator, and configuring the metadata structure information of the input data and output data according to the structures of the data sources, to obtain a data operation flow file;
step two, reading the data in the data source with Spark according to the data operation flow file and its data operation operator configuration information, and taking the read data as source data; performing type conversion on the read data set according to the source data, finally obtaining the data stream to be operated on, and transmitting the data stream through an output port defined by the operator;
and step three, according to the data processing flow drawn in the data operation flow file, performing data processing with each data operation operator in flow order, transmitting data between operators through the ports defined by each operator for adaptation, finally converting the processed operator data into a data set that meets the field structure of the target data source, and writing the data set into the target data source.
The invention has the following beneficial effects:
Compared with the prior art, the method encapsulates the actual data processing and its code logic in operators, so that business personnel can obtain data-integration results in a simpler way; the processing can feed back the state of a data operation in a timely manner and let the user inspect results in real time, making the process easy to control. The method also supports data type conversion and data migration between heterogeneous data sources, avoiding the complex type-adaptation work of traditional heterogeneous data migration. In addition, the provided big data processing strategy ensures that operations on large data volumes complete smoothly, finally realizing non-programming data operations across heterogeneous data-source types.
Detailed Description
The present invention will be described in further detail below.
The cross-type migration operator-based multi-source big data processing method comprises the following steps:
Step one, creating data operation operators according to the data-integration methods of different types of data sources, configuring the data sources in the data operation operators, and configuring the metadata structure information of the input data and output data according to the structures of the data sources, to obtain a data operation flow file.
the input data refers to data read from a selected data source, and the output data refers to data to be stored in a target data source or data to be processed; the data operation operators are in adaptive connection through ports, and the ports are used for identifying application types of the data circulated in the operators, and the application types comprise data sets, graphic data graphs, text data texts, structured data json and the like.
Step two, reading the data in the data source with Spark according to the data operation flow file and its data operation operator configuration information, and taking the read data as source data; performing type conversion on the read data set according to the source data, finally obtaining the data stream to be operated on, and transmitting the data stream through an output port defined by the operator.
in this embodiment, the reading of the data in the data source by using spark includes JDBC driving, database authorization information, database name, table name, data range, metadata information, and the like.
Step three, according to the data processing flow drawn in the data operation flow file, performing data processing with each data operation operator in flow order, transmitting data between operators through the ports defined by each operator for adaptation, finally converting the processed operator data into a data set that meets the field structure of the destination data source, and writing the data set into the destination data source.
In this embodiment, besides the input- and output-related operators, the data operation operators include preprocessing operators such as data filtering, deduplication, sorting, format conversion, and missing-value handling; statistical-analysis operators such as data retrieval and condition filtering; machine-learning operators such as classification, clustering, and regression; and other data-service operators.
In this embodiment, performing the data processing with each data operation operator specifically comprises the following steps:
Step 1, configuring a data operation operator according to the data source type and the data field type of the data operation flow file, and determining the metadata structure of the selected data;
In this embodiment, a data operation operator comprises an operator-configuration HTML file and a Jar package exported from the execution-code entity. In a specific implementation, the database read/write operations of each data-source type can be encapsulated as operators, covering the relational database MySQL, the non-relational stores MongoDB, Elasticsearch, HBase, and HDFS, and read/write operations on various local documents.
Step 2, connecting the data operation operators to process and enrich the data content, determining the execution order and the data flow direction of the data operation operators, generating a DAG (directed acyclic graph) corresponding to the data import/export process together with the JSON data corresponding to that DAG, and submitting an execution-flow request;
In this embodiment, the data content is enriched by connecting a data-retrieval statistics operator to graphical operators such as bar charts and line charts.
Step 3, the flow-execution service module receives the execution-flow request, obtains the DAG data, parses the operators selected in the flow together with their configuration parameters and execution order, executes each data operation operator in turn according to the parse result, performs type handling inside each operator according to the structure type of the data source and the recorded metadata, reads or writes the data set, and finally completes the data processing flow.
In this embodiment, when reading or writing a data set, a batch-import strategy is adopted to keep the server from occupying excessive memory during processing, which could interrupt the service or the process: the data set is pulled from the source data source in batches and each batch is then imported into the target data source.
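The batch-import strategy can be sketched independently of Spark: pull a fixed-size batch, import it, and only then pull the next, so at most one batch is resident in memory at a time (a minimal illustration; the function names are invented for the sketch):

```python
from typing import Callable, List

def batched_transfer(read_batch: Callable[[int, int], List[dict]],
                     write_batch: Callable[[List[dict]], None],
                     batch_size: int = 1000) -> int:
    """Pull records from the source in fixed-size batches and import each
    batch into the target before pulling the next. Returns the number of
    records moved."""
    moved, start = 0, 0
    while True:
        batch = read_batch(start, batch_size)
        if not batch:          # source exhausted
            break
        write_batch(batch)     # import this batch before pulling more
        moved += len(batch)
        start += batch_size
    return moved

# Toy source and sink standing in for the real data sources.
source = [{"id": i} for i in range(2500)]
sink: List[dict] = []
n = batched_transfer(lambda s, k: source[s:s + k], sink.extend, batch_size=1000)
```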
In this embodiment, the type handling performed in the operator according to the structure type of the data source and the recorded metadata is specifically as follows:
when the MySQL structured database is processed, the sql statements are spliced by setting data offset and the number of data read at a time, the JDBC is used for executing the sql statements to obtain a data set, and then the insert statements are spliced in the same way to insert the data set. The setting of the data offset is inefficient because the setting of the offset leads to full table scanning, so that when table data for determining the only main key is imported and exported, data positioning is performed in a pointer-based mode, and the start position of the data to be read at this time is determined through a where clause, so that data records outside a target range do not need to be considered when the data are read.
When handling an unstructured database: when reading Elasticsearch data with RestHighLevelClient, the from + size paging style is inefficient under deep paging, and by default Elasticsearch will not read past the 10,000-record limit, so a scroll cursor must be used to page through the data until the hits field returned in the result set is empty, at which point reading is complete. When reading HBase data, because the total record count cannot be obtained and from + size reading performs too poorly, a cursor must likewise be set and the data read in batches with a scanner.
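The scroll-style read loop with its empty-hits termination condition can be sketched abstractly (the two callables stand in for the Elasticsearch scroll API calls; all names here are invented for the illustration):

```python
def scroll_read(open_scroll, next_page):
    """Drain a scroll cursor: open it once, then keep fetching pages with
    the scroll id until a page comes back with an empty hits list."""
    docs = []
    scroll_id, hits = open_scroll()
    while hits:                       # empty hits => reading is complete
        docs.extend(hits)
        scroll_id, hits = next_page(scroll_id)
    return docs

# Toy in-memory pages standing in for Elasticsearch scroll responses.
pages = [[1, 2, 3], [4, 5], []]
state = {"i": 0}

def fake_open():
    state["i"] = 1
    return "sid", pages[0]

def fake_next(scroll_id):
    page = pages[state["i"]]
    state["i"] += 1
    return scroll_id, page

docs = scroll_read(fake_open, fake_next)
```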
When import and export operations are processed cooperatively across storage types, the source data set is converted into a structured Dataset and then imported into the target data source. When an operator is configured, the metadata of the data is read according to the selected data-source structure, the metadata types are configured in the interface, and the metadata is converted into common data types; the metadata is finally configured according to the field requirements of the target data and the field types of the target data source. When the operator executes, the data types of the Dataset are converted according to the changes in the configured metadata; the type conversion or field renaming is carried out with the Dataset methods withColumn() and withColumnRenamed(), finally achieving cooperative processing of heterogeneous data sources.
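The metadata-driven conversion can be illustrated on plain rows: each configured metadata entry maps a source field to a target name and a common type, mimicking what withColumn() with a cast and withColumnRenamed() do on a Spark Dataset (a simplified sketch; the type names and the helper function are assumptions, not the patent's API):

```python
# Casts from assumed "common data type" names to Python constructors.
CASTS = {"int": int, "double": float, "string": str}

def apply_metadata(rows, metadata):
    """Apply configured metadata to a row set. `metadata` maps each source
    field to (target_name, target_type): the rename plays the role of
    withColumnRenamed(), the cast the role of withColumn(col.cast(...))."""
    out = []
    for row in rows:
        converted = {}
        for src, (dst, typ) in metadata.items():
            converted[dst] = CASTS[typ](row[src])
        out.append(converted)
    return out

rows = [{"id": "7", "price": "3.5"}]
meta = {"id": ("order_id", "int"), "price": ("price", "double")}
result = apply_metadata(rows, meta)
```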
During the cooperative processing, a null value cannot be read from Elasticsearch unless the parameter "es.field.read.empty.as.null" is configured as "no" in the calling API; when reading HBase, the data must first be read as a JavaPairRDD and then further converted into a Dataset.
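The Elasticsearch read configuration mentioned above might be carried as an option map like the following; "es.field.read.empty.as.null" is the setting the embodiment names, while the node, port, and index values are placeholders and the other keys are common elasticsearch-hadoop connector options:

```python
# Options an elasticsearch-hadoop read might carry (values are placeholders).
es_read_conf = {
    "es.nodes": "es.example.com",         # hypothetical Elasticsearch host
    "es.port": "9200",
    "es.resource": "orders/_doc",         # hypothetical index/type
    "es.field.read.empty.as.null": "no",  # the setting named in the embodiment
}
```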
Although embodiments of this invention have been described, it will be apparent to those skilled in the art that various modifications may be made without departing from the principles of the invention, and such modifications are to be considered within the scope of the invention.

Claims (12)

1. A cross-type migration operator-based multi-source big data processing method is characterized by comprising the following steps:
step one, establishing a data operation operator according to data integration methods of different types of data sources, configuring the data sources in the data operation operator, and configuring metadata structure information of input data and output data according to the structures of the data sources to obtain a data operation flow file;
secondly, reading data in a data source by using spark according to the data operation flow file and the data operation operator configuration information thereof, and taking the data as source data; performing type conversion on the read data set according to the source data to finally obtain a data stream to be operated, and transmitting the data stream through an output port defined by an operator;
and thirdly, according to the data processing operation flow drawn in the data operation flow file, performing data processing by using each data operation operator according to the flow, transmitting data among the operators through ports defined by each operator for adaptation, finally converting the processed operator data into a data set meeting a field structure of a target data source, and writing the data set into the target data source.
2. The cross-type migration operator-based multi-source big data processing method according to claim 1, wherein the input data refers to data read from a selected data source, and the output data refers to data to be stored in a destination data source or data to be processed further; and the data operation operators are connected through adapted ports.
3. The cross-type migration operator-based multi-source big data processing method according to claim 2, wherein a port is used to identify the application type of the data flowing in the operator, and the application types comprise a data set (dataset), graphic data (graph), text data (text), and structured data (json).
4. The cross-type migration operator-based multi-source big data processing method according to claim 1, 2 or 3, wherein reading the data in the data source with Spark involves the JDBC driver, database authorization information, database name, table name, data range, and metadata information.
5. The cross-type migration operator-based multi-source big data processing method according to claim 1, 2 or 3, wherein the data processing using each data operation operator specifically comprises the following steps:
step 1, configuring a data operation operator according to a data source type and a data field type of a data operation flow file, and determining a metadata structure of selected data;
step 2, connecting the data operation operators to process and enrich the data content, determining the execution order and the data flow direction of the data operation operators, generating a DAG (directed acyclic graph) corresponding to the data import/export process and the JSON data corresponding to the DAG, and submitting an execution-flow request;
and step 3, the flow-execution service module receiving the execution-flow request, obtaining the DAG data, parsing the operators selected in the flow together with their configuration parameters and execution order, executing each data operation operator in turn according to the parse result, performing type handling inside the operators according to the structure type of the data source and the recorded metadata, reading or writing the data set, and finally completing the data processing flow.
6. The cross-type migration operator-based multi-source big data processing method according to claim 5, wherein each data operation operator comprises an operator-configuration HTML file and a Jar package exported from the execution-code entity.
7. The cross-type migration operator-based multi-source big data processing method according to claim 6, wherein the database read/write operations of each data-source type are encapsulated as operators, covering the relational database MySQL, the non-relational stores MongoDB, Elasticsearch, HBase, and HDFS, and read/write operations on various types of local documents.
8. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein the data content is enriched by connecting a data-retrieval statistics operator with graphical operators such as bar charts and line charts.
9. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein a batch-import strategy is adopted when reading or writing the data set, namely, the data set is pulled in batches when reading data from the source data source, and each batch is then imported into the target data source.
10. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein the type handling performed in the operator according to the structure type of the data source and the recorded metadata specifically comprises:
when handling the MySQL structured database, assembling SQL statements by setting a data offset and the number of rows read per fetch, executing the SQL statements through JDBC to obtain a data set, and then assembling INSERT statements in the same way to insert the data set.
11. The cross-type migration operator-based multi-source big data processing method according to claim 10, wherein, instead of setting the data offset, data positioning is performed in a pointer-based manner, and a WHERE clause determines the start position of the data to be read, so that data records outside the target range need not be considered when reading the data.
12. The cross-type migration operator-based multi-source big data processing method according to claim 5, 6 or 7, wherein the type handling performed in the operator according to the structure type of the data source and the recorded metadata specifically comprises:
when import and export operations are processed cooperatively across storage types, converting the source data set into a structured Dataset and importing it into the target data source;
when configuring an operator, reading the metadata of the data according to the selected data-source structure, configuring the metadata types in the interface, and converting the metadata into common data types; and finally configuring the metadata according to the field requirements of the target data and the field types of the target data source;
when executing the operator, converting the data types of the Dataset according to the changes in the configured metadata, carrying out the type conversion or field renaming through the Dataset methods withColumn() and withColumnRenamed(), and finally achieving cooperative processing of the heterogeneous data sources.
CN202210443573.6A 2022-04-26 2022-04-26 Cross-type migration operator-based multi-source big data processing method Pending CN115221143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210443573.6A CN115221143A (en) 2022-04-26 2022-04-26 Cross-type migration operator-based multi-source big data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210443573.6A CN115221143A (en) 2022-04-26 2022-04-26 Cross-type migration operator-based multi-source big data processing method

Publications (1)

Publication Number Publication Date
CN115221143A true CN115221143A (en) 2022-10-21

Family

ID=83608848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210443573.6A Pending CN115221143A (en) 2022-04-26 2022-04-26 Cross-type migration operator-based multi-source big data processing method

Country Status (1)

Country Link
CN (1) CN115221143A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116455678A (en) * 2023-06-16 2023-07-18 中国电子科技集团公司第十五研究所 Network security log tandem method and system
CN116455678B (en) * 2023-06-16 2023-09-05 中国电子科技集团公司第十五研究所 Network security log tandem method and system
CN117591497A (en) * 2024-01-18 2024-02-23 中核武汉核电运行技术股份有限公司 Nuclear power historical data cross-system migration method
CN117591497B (en) * 2024-01-18 2024-05-03 中核武汉核电运行技术股份有限公司 Nuclear power historical data cross-system migration method

Similar Documents

Publication Publication Date Title
CN112199366B (en) Data table processing method, device and equipment
Konda Magellan: Toward building entity matching management systems
US10698916B1 (en) Data preparation context navigation
CN115221143A (en) Cross-type migration operator-based multi-source big data processing method
US8924373B2 (en) Query plans with parameter markers in place of object identifiers
US6122644A (en) System for halloween protection in a database system
US11170022B1 (en) Method and device for processing multi-source heterogeneous data
US20210366055A1 (en) Systems and methods for generating accurate transaction data and manipulation
JP2002538546A (en) ABAP Code Converter Specifications
WO2020155740A1 (en) Information query method and apparatus, and computer device and storage medium
CN112000773B (en) Search engine technology-based data association relation mining method and application
CN110597844B (en) Unified access method for heterogeneous database data and related equipment
CN105989150A (en) Data query method and device based on big data environment
CN102426582A (en) Data operation management device and data operation management method
CN111125116B (en) Method and system for positioning code field in service table and corresponding code table
CN113626464B (en) Query supporting method and system based on ClickHouse database memory data
CN103440265A (en) MapReduce-based CDC (Change Data Capture) method of MYSQL database
WO2019242125A1 (en) Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium
CN113407565A (en) Cross-database data query method, device and equipment
CN110471646B (en) Method for realizing complex program logic through manual configuration
Martin Advanced database techniques
CN110321388B (en) Quick sequencing query method and system based on Greenplus
US8229946B1 (en) Business rules application parallel processing system
CN116955393A (en) Data processing method and device, electronic equipment and storage medium
CN115114297A (en) Data lightweight storage and search method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination