CN117472693A - Buried point data processing method, system, equipment and storage medium based on data lake - Google Patents


Info

Publication number
CN117472693A
CN117472693A
Authority
CN
China
Prior art keywords
data
iceberg
buried
point
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311596764.7A
Other languages
Chinese (zh)
Inventor
方行健
高飞
房英明
赖飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Network Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Network Technology Shanghai Co Ltd filed Critical Ctrip Travel Network Technology Shanghai Co Ltd
Priority to CN202311596764.7A priority Critical patent/CN117472693A/en
Publication of CN117472693A publication Critical patent/CN117472693A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a buried point data processing method, system, equipment and storage medium based on a data lake, wherein the method comprises the following steps: creating a corresponding Hive table and Iceberg table for each buried point according to the buried point private parameter structure configured on an analysis platform by a user; updating or newly creating the Iceberg table corresponding to a buried point based on buried points with newly added parameters, newly added buried point structures and changed buried point field types, and adding the fields corresponding to the newly added parameters; processing the full buried point data by writing Spark tasks; writing the data in the Hive table into the Iceberg table in batches through the interface between the Spark task and the Iceberg table; and using an SQL computing engine to provide query services externally and to query the Iceberg table data. The invention can realize analysis of massive buried point data, can output analysis results in seconds when analyzing hundreds of millions of records, improves the processing speed of buried point data and the efficiency of writing data into the Iceberg data lake, and ensures the availability and timeliness of the buried point analysis function.

Description

Buried point data processing method, system, equipment and storage medium based on data lake
Technical Field
The invention relates to the field of data architecture, and in particular to a buried point data processing method, system, equipment and storage medium based on a data lake.
Background
At present, the mainstream way to analyze buried point data is to synchronize buried point log data from a message queue into a Hive table, import the data into a ClickHouse database, and then query and analyze the buried point data through ClickHouse. However, several problems arise when analyzing with ClickHouse:
1. Data volume. The daily buried point data volume averages around 10 billion records and can exceed 20 billion at peak. Importing these tens of billions of records from the Hive table into ClickHouse every day for analysis consumes substantial cluster resources, which may overload the ClickHouse cluster, degrading cluster performance and even bringing machines down.
2. Data timeliness. Because of the resource limits of the ClickHouse cluster, an overly large data volume makes importing data from the Hive table into ClickHouse slow; the process typically takes more than ten hours, which affects the availability of the buried point analysis function.
3. Data accuracy. The current approach reads Hive table data through the Spark engine and then connects to the ClickHouse database to import the Hive data into ClickHouse. Owing to limitations of the Spark engine, when the data volume is large and cluster resources are insufficient, data may be lost during the import into ClickHouse, making the analysis results inaccurate and affecting users.
4. Private parameter analysis. The private data of a buried point needs to be analyzed, but its structure is complex and may include multiple layers of nested JSON object and JSON array structures. For example, a JSON array may contain multiple JSON objects, some of which in turn contain multiple JSON arrays, and the user wants to obtain all values in the JSON arrays. For such special cases, querying directly with SQL is cumbersome, which raises the difficulty of using buried point data for analysis.
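To make the difficulty concrete, the following Python sketch (a hypothetical illustration; the structure and field names are not taken from the patent) extracts every value from a multi-layer nested JSON structure of the kind described above, a task that is awkward to express directly in SQL:

```python
import json

def extract_values(node):
    """Recursively collect every leaf value from nested JSON objects and arrays."""
    if isinstance(node, (dict, list)):
        values = []
        children = node.values() if isinstance(node, dict) else node
        for child in children:
            values.extend(extract_values(child))
        return values
    return [node]

# Hypothetical private parameter: an array of objects, each containing
# another array -- the case the patent describes as cumbersome for SQL.
raw = '[{"items": [{"id": 1}, {"id": 2}]}, {"items": [{"id": 3}]}]'
print(extract_values(json.loads(raw)))  # [1, 2, 3]
```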
Accordingly, the invention provides a buried point data processing method, system, equipment and storage medium based on a data lake.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a buried point data processing method, system, equipment and storage medium based on a data lake, which overcome the difficulties of the prior art: they can realize analysis of massive buried point data, output analysis results in seconds when analyzing hundreds of millions of records, improve the processing speed of buried point data and the efficiency of writing data into the Iceberg data lake, and ensure the availability and timeliness of the buried point analysis function.
The embodiment of the invention provides a buried point data processing method based on a data lake, comprising the following steps:
s110, creating a corresponding Hive table and an Iceberg table for each buried point according to a buried point private parameter structure configured on an analysis platform by a user;
s120, based on the embedded point, embedded point structure and embedded point field type update of the newly-built parameter or the Iceberg table corresponding to the newly-built embedded point, adding the field corresponding to the newly-built parameter;
s130, processing the whole buried data by writing Spark tasks;
s140, data in the Hive table are written into the Iceberg table in batches through an interface between the Spark task and the Iceberg table; and
s150, using an SQL computing engine to provide query service to the outside, and querying the Iceberg table data through the SQL computing engine.
Preferably, the step S110 includes:
s111, when an Iceberg data table is created, index information is added in a metadata file of the data table, and a corresponding relation between the data file and an index is established; and
and S112, when the processing operation of the target data is executed in the Iceberg, performing index matching of row group level on the target data according to the corresponding relation between the data file and the table index, and determining the target data file.
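A minimal sketch of the file-pruning idea behind S111 and S112, assuming a simple mapping from data files to per-field [min, max] index ranges (the actual Iceberg metadata layout is richer than this; the file and field names are illustrative):

```python
def prune_files(file_index, field, value):
    """Row-group-level index matching (sketch): keep only data files whose
    recorded [min, max] range for `field` can contain `value`, mirroring
    the correspondence between data files and table indexes."""
    matches = []
    for path, ranges in file_index.items():
        lo, hi = ranges[field]
        if lo <= value <= hi:
            matches.append(path)
    return matches

# Hypothetical index: each data file records the value range of `uid`.
index = {
    "f1.parquet": {"uid": (1, 100)},
    "f2.parquet": {"uid": (101, 200)},
}
print(prune_files(index, "uid", 150))  # ['f2.parquet']
```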
Preferably, the step S120 includes:
s121, for the embedded point of the newly added parameter, the embedded point is a field corresponding to the newly added parameter of the Iceberg table;
s122, for the newly added buried points, creating an Iceberg table according to the newly added buried point structure; and
s123, reconstructing the Iceberg table for modifying the embedded point field type.
Preferably, the step S130 includes:
s131, analyzing all values in the Json object and the Json array of the multi-layer nesting for the complex private parameter data structure;
s132, writing each piece of analyzed detail data into a corresponding field in the Hive table according to a private parameter structure of the buried point through an ORC interface; and
s133, splitting data, namely, based on a preset threshold, large buried points with data volume larger than the preset threshold and small buried points smaller than the preset threshold, wherein the large buried points use unique marks Meta ID of buried point data to conduct data partitioning, the small buried points use buried point names to conduct partitioning, and one Spark task is disassembled into a plurality of tasks.
Preferably, the step S140 further includes: using the Arctic component with Iceberg to merge the originally generated files during the import of Hive table data into the Iceberg table.
Preferably, the step S150 includes: using a Trino engine to provide query services externally and querying the Iceberg table data through the Trino engine; the Iceberg table provides an abstract interface at the table level and maintains the metadata information of the table in files.
Preferably, the step S150 further includes: the Iceberg table encapsulates both the metadata management of the table and the organized storage of the table data itself.
The embodiment of the invention also provides a buried point data processing system based on a data lake, used to realize the above buried point data processing method based on a data lake, the system comprising:
the buried point configuration module, used to create a corresponding Hive table and Iceberg table for each buried point according to the buried point private parameter structure configured on the analysis platform by a user;
the buried point adding module, used to update or newly create the Iceberg table corresponding to a buried point based on buried points with newly added parameters, newly added buried point structures and changed buried point field types, and to add the fields corresponding to the newly added parameters;
the buried point processing module, used to process the full buried point data by writing Spark tasks;
the data writing module, used to write the data in the Hive table into the Iceberg table in batches through the interface between the Spark task and the Iceberg table; and
the engine query module, used to provide query services externally with an SQL computing engine and to query the Iceberg table data through the SQL computing engine.
The embodiment of the invention also provides buried point data processing equipment based on a data lake, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the above-described data lake-based buried point data processing method via execution of the executable instructions.
Embodiments of the present invention also provide a computer-readable storage medium storing a program that, when executed, implements the steps of the above-described data lake-based buried point data processing method.
The invention aims to provide a buried point data processing method, system, equipment and storage medium based on a data lake, which can analyze massive buried point data, output analysis results in seconds when analyzing hundreds of millions of records, improve the processing speed of buried point data and the efficiency of writing data into the Iceberg data lake, and ensure the availability and timeliness of the buried point analysis function.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.
FIG. 1 is a flow chart of the data lake-based buried point data processing method of the present invention.
FIG. 2 is a schematic diagram of the structure of the data lake-based buried point data processing system of the present invention.
Fig. 3 is a schematic diagram of the structure of the data lake-based buried point data processing equipment of the present invention.
Fig. 4 is a schematic structural view of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
By describing embodiments of the present application with specific examples, other advantages and effects of the present application will be readily apparent to those skilled in the art from this disclosure. The present application may also be implemented or applied in other specific forms and details, and various modifications and alterations may be made to the details herein from different viewpoints and applications without departing from the spirit of the present application. It should be noted that, where there is no conflict, the embodiments and the features in the embodiments may be combined with each other.
The embodiments of the present application will be described in detail below with reference to the drawings so that those skilled in the art to which the present application pertains can easily implement the same. This application may be embodied in many different forms and is not limited to the embodiments described herein.
In the description of the present application, reference to the terms "one embodiment," "some embodiments," "examples," "particular examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples, and the various embodiments or examples presented herein, and their features, may be combined by those skilled in the art without conflict.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the context of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
For the purpose of clarity of the description of the present application, components that are not related to the description are omitted, and the same or similar components are given the same reference numerals throughout the description.
Throughout the specification, when a device is said to be "connected" to another device, this includes not only the case of "direct connection" but also the case of "indirect connection" with other elements interposed therebetween. In addition, when a certain component is said to be "included" in a certain device, unless otherwise stated, other components are not excluded, but it means that other components may be included.
When a device is said to be "on" another device, this may be directly on the other device, but may also be accompanied by other devices therebetween. When a device is said to be "directly on" another device in contrast, there is no other device in between.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by the terms; the terms are only used to distinguish one element from another, for example, a first interface and a second interface. Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps or operations is in some way inherently mutually exclusive.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the language clearly indicates the contrary. The meaning of "comprising" in the specification is to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.
Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.
FIG. 1 is a flow chart of the data lake-based buried point data processing method of the present invention. As shown in FIG. 1, the buried point data processing method based on a data lake of the present invention comprises:
s110, creating a corresponding Hive table and an Iceberg table for each buried point according to a buried point private parameter structure configured on the analysis platform by a user.
S120, updating or newly creating the Iceberg table corresponding to a buried point based on buried points with newly added parameters, newly added buried point structures and changed buried point field types, and adding the fields corresponding to the newly added parameters.
S130, processing the full buried point data by writing Spark tasks.
S140, writing the data in the Hive table into the Iceberg table in batches through the interface between the Spark task and the Iceberg table; and
S150, using the SQL computing engine to provide query service to the outside, and querying the Iceberg table data through the SQL computing engine.
The data lake in the present invention is a system or repository of data stored in its natural/raw format, typically object blobs or files. A data lake is usually a single data store including raw copies of source system data, sensor data, social data, etc., as well as transformed data used for reporting, visualization, advanced analytics and machine learning tasks. A data lake may include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video), and it may be established locally (in an organization's data center) or in a cloud service.
Iceberg is an open table format for large analytic data sets. It provides a high-performance table format similar to SQL (Structured Query Language) tables that can be used from Spark (a fast general-purpose computing engine designed for large-scale data processing), Trino (a fast distributed SQL query engine for big data analytics), Flink (an open-source stream processing framework), Hive (a data warehouse tool) and other computing engines, and it supports add, delete and change operations on the table.
Hive is a Hadoop-based data warehouse tool used for extracting, transforming and loading data; it is a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. Hive can map a structured data file to a database table, provides SQL query and analysis functions, and converts SQL statements into MapReduce tasks for execution, thereby achieving the purpose of analyzing data.
Spark is a fast, general-purpose computing engine designed for large-scale data processing; it can read data in a distributed manner and, after various transformations and processing of the data, write the processed data to the target medium in a distributed manner.
In a preferred embodiment, step S110 includes:
S111, when the Iceberg data table is created, adding index information to the metadata file of the data table and establishing the correspondence between data files and indexes; and
S112, when a processing operation on target data is executed in Iceberg, performing row-group-level index matching on the target data according to the correspondence between the data files and the table indexes, and determining the target data file, but the method is not limited thereto.
In a preferred embodiment, step S120 includes:
s121, the embedded point of the newly added parameter is a field corresponding to the newly added parameter of the Iceberg table.
S122, for the newly added buried point, creating an Iceberg table according to the newly added buried point structure.
And
S123, reconstructing the Iceberg table for modifying the embedded point field type, but not limited to the above.
In a preferred embodiment, step S130 includes:
s131, analyzing all values in the Json object and the Json array of the multi-layer nesting for the complex private parameter data structure.
S132, writing each piece of analyzed detail data into a corresponding field in the Hive table according to the private parameter structure of the buried point through an ORC interface. And
S133, splitting data, namely, based on a preset threshold, large buried points with the data volume larger than the preset threshold and small buried points with the data volume smaller than the preset threshold, wherein the large buried points are used for data partitioning by using unique identification Meta ID of buried point data, the small buried points are used for partitioning by using buried point names, and one Spark task is split into a plurality of tasks, but the method is not limited to the method. The Json object and Json array, JSON (JavaScript Object Notation, JS object numbered musical notation) are lightweight data exchange formats. It stores and presents data in a text format that is completely independent of the programming language based on a subset of ECMAScript (European Computer Manufacturers Association, js specification by the european computer institute). The compact and clear hierarchical structure makes JSON an ideal data exchange language. Is easy to read and write by people, is easy to analyze and generate by machines, and effectively improves the network transmission efficiency. JSON is a serialized object or array. ORC is an efficient file format storage format designed to overcome the limitations of other Hive file formats, and the use of ORC files can improve Hive's performance in reading, writing and processing data. ORC (Optimized Row Columnar) is an efficient columnar storage format, commonly used in data warehouse and large-scale data analysis scenarios. It was developed by the Hive community and managed by the Apache software foundation. ORC consists mainly of three levels: files, stripes (strips), and Row groups (Row groups). A file is a top-level structure containing a plurality of stripes, each consisting of a plurality of row groups. This multi-level structure can help the ORC optimize query performance and increase data compression rate. 
In ORC, data is stored column by column rather than row by row, meaning that data of the same type is stored in adjacent locations, which facilitates batch processing and compression. In addition, ORC provides extra functions such as skipping unnecessary rows, Bloom filters and dictionary encoding to increase query speed and reduce storage costs. The Meta ID is the decentralized identity (DID) sub-protocol of the next-generation internet Metanet.
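The column-by-column layout and the dictionary-encoding benefit described above can be illustrated with a toy example (conceptual only; real ORC files add stripes, row groups, indexes and compression, and the record values here are hypothetical):

```python
rows = [
    {"point": "home_click", "uid": 1},
    {"point": "home_click", "uid": 2},
    {"point": "pay_click",  "uid": 3},
]

# Column-by-column layout: values of the same field are stored adjacently,
# which enables batch scans and effective compression.
columns = {key: [r[key] for r in rows] for key in rows[0]}
print(columns["point"])  # ['home_click', 'home_click', 'pay_click']

# Dictionary encoding benefits from this adjacency: repeated values
# collapse into a small dictionary plus integer codes.
dictionary = sorted(set(columns["point"]))
codes = [dictionary.index(v) for v in columns["point"]]
print(dictionary, codes)  # ['home_click', 'pay_click'] [0, 0, 1]
```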
In a preferred embodiment, step S140 further comprises: using the Arctic component with Iceberg to merge the originally generated files during the import of Hive table data into the Iceberg table, but not limited thereto. Arctic is a lakehouse management system built on an open architecture: on top of open data lake formats it provides additional optimizations for streaming and update scenarios, as well as a pluggable data self-optimization mechanism and management services. Based on Arctic, various data platforms, tools and products can be quickly built and opened for use, with stream and batch unified in one lakehouse.
In a preferred embodiment, step S150 includes: using the Trino engine to provide query services externally and querying the Iceberg table data through the Trino engine; the Iceberg table provides an abstract interface at the table level and maintains the metadata information of the table in files, but is not limited thereto. Trino is a PB-scale, memory-based distributed SQL computing engine; through its Connector SPI (Service Provider Interface) it decouples the computing layer from the storage layer, so Trino supports access to and operations on a variety of data sources and features cross-source queries.
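As an illustration of the query path, a client might submit SQL such as the following to Trino over an Iceberg catalog; the catalog, schema, table and column names here are hypothetical, not taken from the patent:

```python
def build_query(table, point_name, day):
    """Assemble the kind of SQL a Trino client might send against an
    Iceberg buried point table (all identifiers are illustrative)."""
    return (
        f"SELECT uid, param_value FROM iceberg.tracking.{table} "
        f"WHERE point_name = '{point_name}' AND dt = DATE '{day}'"
    )

sql = build_query("home_click_events", "home_click", "2023-11-27")
print(sql)
```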
In a preferred embodiment, step S150 further comprises: the Iceberg table encapsulates both the metadata management of the table and the organized storage of the table data itself, but is not limited thereto.
The buried point data processing method based on a data lake of the present invention can realize analysis of massive buried point data, can output analysis results in seconds when analyzing hundreds of millions of records, improves the processing speed of buried point data and the efficiency of writing data into the Iceberg data lake, and ensures the availability and timeliness of the buried point analysis function.
The invention realizes an efficient OLAP engine based on Trino and Iceberg, provides second-level query response capability, and supports the interactive buried point analysis requirement.
To realize these functions, the technical scheme of the invention comprises the following steps:
(1) Creating a corresponding Hive table and Iceberg table for each buried point according to the buried point private parameter structure configured on the analysis platform by the user. When the Iceberg data table is created, index information is added to the metadata file of the data table and the correspondence between data files and indexes is established; when a processing operation on target data is executed in Iceberg, row-group-level index matching is performed on the target data according to the correspondence between the data files and the table indexes, and the target data file is determined.
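Step (1) can be sketched as generating paired DDL statements from a buried point's private parameter structure; the database names, column types and DDL details below are illustrative assumptions, not the patent's actual statements:

```python
def create_table_ddl(point_name, params):
    """Build hypothetical Hive and Iceberg CREATE TABLE statements from a
    buried point's private parameter structure (field name -> type)."""
    cols = ", ".join(f"{name} {typ}" for name, typ in params.items())
    hive = f"CREATE TABLE hive_db.{point_name} ({cols}) STORED AS ORC"
    iceberg = f"CREATE TABLE iceberg_db.{point_name} ({cols}) USING iceberg"
    return hive, iceberg

hive_ddl, iceberg_ddl = create_table_ddl(
    "home_click", {"uid": "BIGINT", "page": "STRING"}
)
print(iceberg_ddl)
```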
(2) Updating or newly creating the Iceberg table corresponding to a buried point. Every day some buried points gain new parameters, and new buried points are also added. For a buried point with a newly added parameter, the field corresponding to the newly added parameter is added to its Iceberg table. For a newly added buried point, an Iceberg table is created according to the newly added buried point structure. If a buried point field type needs to be modified, the Iceberg table is rebuilt.
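The three cases of step (2) amount to a small decision procedure over schemas. The following sketch models schemas as plain dicts of field name to type, an assumption for illustration only; real Iceberg schema evolution would go through the table API:

```python
def evolve(existing_schema, point_name, new_schema):
    """Decide the schema action for a buried point, per step (2):
    - unknown point           -> create a new Iceberg table
    - new fields, same types  -> add the missing columns
    - changed field type      -> rebuild the table"""
    if point_name not in existing_schema:
        return ("create", new_schema)
    current = existing_schema[point_name]
    if any(current.get(f) not in (None, t) for f, t in new_schema.items()):
        return ("rebuild", new_schema)
    added = {f: t for f, t in new_schema.items() if f not in current}
    return ("add_columns", added)

schemas = {"home_click": {"uid": "BIGINT"}}
print(evolve(schemas, "home_click", {"uid": "BIGINT", "page": "STRING"}))
print(evolve(schemas, "new_point", {"uid": "BIGINT"}))
print(evolve(schemas, "home_click", {"uid": "STRING"}))
```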
(3) Processing the full buried point data by writing Spark tasks. For the complex private parameter data structure, all values in the multi-layer nested JSON objects and JSON arrays are parsed, and each piece of parsed detail data is then written into the corresponding field in the Hive table through the ORC API according to the private parameter structure of the buried point. Because the data volume is large, to increase processing speed the roughly 10 billion records per day are divided into buried points with large data volumes and buried points with small data volumes: the large buried points are partitioned by the unique identifier (Meta ID) of the buried point data, the small buried points are partitioned by buried point name, and the Spark task is split into multiple tasks, which improves the parallelism of the processing tasks and increases the processing speed.
(4) Data in the Hive table is written to the Iceberg table in batches via the Spark-Iceberg API. Compared with the previous approach of connecting to the ClickHouse database via JDBC and then importing data into ClickHouse tables with the Spark engine, importing data into the Iceberg table via the Spark-Iceberg API does not cause data loss. At the same time, the data import speed is improved: importing billions of records into ClickHouse previously took at least 10 hours, whereas importing them into the Iceberg table now completes in roughly 3 hours. In addition, the Arctic component is introduced to optimize the underlying files of the Iceberg table. With Arctic applied to Iceberg, the many small files that would otherwise be generated are merged during the import of Hive table data into the Iceberg table, which reduces data redundancy and improves file retrieval efficiency.
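The small-file merging contributed by the Arctic component can be pictured as greedy batching of data files up to a target size, so each batch is rewritten as one larger file; the 100 MB target and the greedy strategy below are assumptions for illustration, not Arctic's actual compaction algorithm:

```python
# Conceptual sketch of small-file merging during the Hive -> Iceberg import:
# greedily pack small data files into batches up to a target size, so that
# each batch can be rewritten as one merged file.
def plan_merges(file_sizes_mb, target_mb=100):
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# six small files collapse into three merged files
plans = plan_merges([10, 20, 30, 40, 50, 60], target_mb=100)
```

Fewer, larger files mean fewer file handles to open per query, which is where the retrieval-efficiency gain described above comes from.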
(5) The Trino engine is used to provide query services externally, and the Iceberg table data is queried through the Trino engine. Introducing the Iceberg connector into Trino removes the scalability limitations known from Hive. Iceberg provides a table-level abstract interface and itself maintains the metadata information of tables in files. On this basis, Iceberg encapsulates both the metadata management of a table and how the table data is organized and stored, so a query can be located down to the file level, greatly improving query efficiency.
The technical scheme provided by the embodiment of the invention has the following effects: by using the Spark data processing engine, the Iceberg data lake framework and the Trino SQL computing engine, a general method and function for analyzing massive buried point data is realized. Users only need to call the function in their project, so they do not need to develop code for complex parameter queries when using the buried point analysis function. When hundreds of millions of records are analyzed, analysis results can be output at second-level latency. At the same time, the processing speed of buried point data and the efficiency of writing data into the Iceberg data lake are improved, ensuring the availability and timeliness of the buried point analysis function.
FIG. 2 is a schematic diagram of the structure of the data lake-based buried point data processing system of the present invention. As shown in fig. 2, an embodiment of the present invention further provides a data lake-based buried point data processing system for implementing the above-mentioned data lake-based buried point data processing method, where the data lake-based buried point data processing system 5 includes:
The buried point configuration module 51 creates a corresponding Hive table and Iceberg table for each buried point according to the buried point private parameter structure configured by the user on the analysis platform.
The buried point adding module 52 adds fields corresponding to newly added parameters, creates an Iceberg table for each newly added buried point, and updates the Iceberg table when a buried point structure or buried point field type changes.
The buried point processing module 53 processes the entire amount of buried point data by writing Spark tasks.
The data writing module 54 writes data in the Hive table into the Iceberg table in batch through an interface between the Spark task and the Iceberg table. And
The engine query module 55 provides a query service to the outside using an SQL calculation engine, through which the Iceberg table data is queried.
In a preferred embodiment, the buried point configuration module 51 is configured to add index information to the metadata file of the data table when creating the Iceberg data table, and to establish a correspondence between data files and indexes. When a processing operation on target data is executed in Iceberg, row-group-level index matching is performed on the target data according to the correspondence between the data files and the table indexes, and the target data file is determined, but the method is not limited thereto.
In a preferred embodiment, the buried point adding module 52 is configured to add, for a buried point with a newly added parameter, the field corresponding to the new parameter to the Iceberg table; to create, for a newly added buried point, an Iceberg table according to the new buried point structure; and to rebuild the Iceberg table when a buried point field type is modified, but is not limited thereto.
In a preferred embodiment, the buried point processing module 53 is configured to parse out, for complex private parameter data structures, all values in the multi-layer nested Json objects and Json arrays; to write each piece of parsed detail data into the corresponding field of the Hive table through an ORC interface, according to the buried point's private parameter structure; and to split the data based on a preset threshold into large buried points whose data volume exceeds the threshold and small buried points whose data volume falls below it, where the large buried points are partitioned using the unique identifier MetaID of the buried point data, the small buried points are partitioned using the buried point name, and one Spark task is split into multiple tasks, but is not limited thereto.
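The threshold-based split described for module 53 amounts to choosing a partition key per record; a minimal sketch, in which the threshold value and all names are assumptions:

```python
# Illustrative sketch of the large/small buried point split: points whose
# daily volume exceeds a preset threshold are partitioned by the per-record
# MetaID, while small points share one partition keyed by point name.
def partition_key(point_name, daily_rows, metaid, threshold=100_000_000):
    if daily_rows > threshold:        # large point: spread by unique MetaID
        return f"{point_name}/{metaid}"
    return point_name                 # small point: one partition per point

big = partition_key("home_page_view", 500_000_000, "m-42")
small = partition_key("rare_click", 1_200, "m-43")
```

Partitioning large points by MetaID spreads their records across many Spark tasks, while grouping small points by name avoids creating a flood of tiny partitions.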
In a preferred embodiment, the data writing module 54 is configured to use the Arctic component in Iceberg and to merge the small files originally generated during the import of Hive table data into the Iceberg table, but is not limited thereto.
In a preferred embodiment, the engine query module 55 is configured to provide query services externally using a Trino engine, through which the Iceberg table data is queried; the Iceberg table provides a table-level abstract interface and maintains the metadata information of the table in files, but is not limited thereto.
In a preferred embodiment, the engine query module 55 is further configured such that the Iceberg table encapsulates the metadata management of the table and the organization and storage of the table data itself, but is not limited thereto.
The buried point data processing system based on the data lake can analyze massive buried point data, can output analysis results in seconds when analyzing hundreds of millions of data, improves the processing speed of the buried point data and the efficiency of writing the data into the Iceberg data lake, and ensures the availability and timeliness of a buried point analysis function.
The embodiment of the invention also provides a buried point data processing device based on the data lake, comprising a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to execute the steps of the data lake-based buried point data processing method via execution of the executable instructions.
As shown above, the buried point data processing equipment based on the data lake can analyze massive buried point data, can output analysis results in a second level when analyzing hundreds of millions of data, improves the processing speed of buried point data and the efficiency of writing the data into the Iceberg data lake, and ensures the availability and timeliness of a buried point analysis function.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "platform."
Fig. 3 is a schematic diagram of the structure of the data lake-based buried data processing apparatus of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 3. The electronic device 600 shown in fig. 3 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 3, the electronic device 600 is embodied in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including memory unit 620 and processing unit 610), a display unit 640, etc.
Wherein the storage unit stores program code that is executable by the processing unit 610 such that the processing unit 610 performs the steps according to various exemplary embodiments of the invention described in the above method section of the present specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 6201 and/or cache memory unit 6202, and may further include Read Only Memory (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.
The embodiment of the invention also provides a computer-readable storage medium for storing a program which, when executed, implements the steps of the data lake-based buried point data processing method. In some possible embodiments, the aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the method portions of this specification, when the program product is run on the terminal device.
As shown above, the buried point data processing system based on the data lake can analyze massive buried point data, can output analysis results in a second level when analyzing hundreds of millions of data, improves the processing speed of buried point data and the efficiency of writing the data into the Iceberg data lake, and ensures the availability and timeliness of a buried point analysis function.
Fig. 4 is a schematic structural view of a computer-readable storage medium of the present invention. Referring to fig. 4, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected over the Internet using an Internet service provider).
In summary, the invention aims to provide a buried point data processing method, a system, equipment and a storage medium based on a data lake, which can realize analysis of massive buried point data, can output an analysis result in a second level when analyzing hundreds of millions of data, improve the processing speed of buried point data and the efficiency of writing the data into an Iceberg data lake, and ensure the availability and timeliness of a buried point analysis function.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (10)

1. A buried point data processing method based on a data lake is characterized by comprising the following steps:
s110, creating a corresponding Hive table and an Iceberg table for each buried point according to a buried point private parameter structure configured on an analysis platform by a user;
s120, based on the embedded point, embedded point structure and embedded point field type update of the newly-built parameter or the Iceberg table corresponding to the newly-built embedded point, adding the field corresponding to the newly-built parameter;
s130, processing the whole buried data by writing Spark tasks;
s140, data in the Hive table are written into the Iceberg table in batches through an interface between the Spark task and the Iceberg table; and
s150, using an SQL computing engine to provide query service to the outside, and querying the Iceberg table data through the SQL computing engine.
2. The method for processing buried data based on data lake of claim 1, wherein said step S110 comprises:
s111, when an Iceberg data table is created, index information is added in a metadata file of the data table, and a corresponding relation between the data file and an index is established; and
and S112, when the processing operation of the target data is executed in the Iceberg, performing index matching of row group level on the target data according to the corresponding relation between the data file and the table index, and determining the target data file.
3. The method for processing buried data based on data lake of claim 1, wherein said step S120 comprises:
s121, for the embedded point of the newly added parameter, the embedded point is a field corresponding to the newly added parameter of the Iceberg table;
s122, for the newly added buried points, creating an Iceberg table according to the newly added buried point structure; and
s123, reconstructing the Iceberg table for modifying the embedded point field type.
4. The method for processing buried data based on data lake of claim 1, wherein said step S130 comprises:
s131, analyzing all values in the Json object and the Json array of the multi-layer nesting for the complex private parameter data structure;
s132, writing each piece of analyzed detail data into a corresponding field in the Hive table according to a private parameter structure of the buried point through an ORC interface; and
s133, splitting data, namely, based on a preset threshold, large buried points with data volume larger than the preset threshold and small buried points smaller than the preset threshold, wherein the large buried points use unique identification MetaID of buried point data to conduct data partitioning, the small buried points use buried point names to conduct partitioning, and one Spark task is disassembled into a plurality of tasks.
5. The method for processing buried data based on data lake of claim 1, wherein said step S140 further comprises: using the Arctic component in Iceberg, the originally generated files are merged during the importing of Hive table data into the Iceberg table.
6. The method for processing buried data based on data lake of claim 1, wherein said step S150 comprises: query services are provided to the outside using a Trino engine, through which the Iceberg table data is queried, which provides an abstract interface at the table level, maintaining metadata information of the table in the file.
7. The method for processing buried data based on data lake of claim 6, wherein said step S150 further comprises: the Iceberg table encapsulates the metadata management of the table and the organizational storage of the table data itself.
8. A data lake-based buried point data processing system for implementing the data lake-based buried point data processing method of claim 1, comprising:
the embedded point configuration module is used for creating a corresponding Hive table and an Iceberg table for each embedded point according to an embedded point private parameter structure configured on the analysis platform by a user;
the buried point adding module is used for adding fields corresponding to newly added parameters, creating Iceberg tables for newly added buried points, and updating the Iceberg tables when buried point structures or buried point field types change;
the embedded point processing module is used for processing the whole amount of embedded point data by writing Spark tasks;
the data writing module is used for writing data in the Hive table into the Iceberg table in batches through an interface between the Spark task and the Iceberg table; and
and the engine query module is used for providing query service for the outside by using an SQL computing engine, and querying the Iceberg table data by using the SQL computing engine.
9. A buried point data processing device based on a data lake, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the data lake-based method of buried data processing of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer-readable storage medium storing a program, wherein the program when executed by a processor implements the steps of the data lake-based buried data processing method of any one of claims 1 to 7.
CN202311596764.7A 2023-11-27 2023-11-27 Buried point data processing method, system, equipment and storage medium based on data lake Pending CN117472693A (en)


Publications (1)

Publication Number Publication Date
CN117472693A true CN117472693A (en) 2024-01-30

Family

ID=89634816



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination