CN116361314A - Quasi-real-time query method and device for kafka data, electronic equipment and medium - Google Patents

Publication number
CN116361314A
Authority
CN
China
Prior art keywords
kafka data
data
kafka
field
real
Prior art date
Legal status
Pending
Application number
CN202211713863.4A
Other languages
Chinese (zh)
Inventor
冯中原
王军博
李成其
姜霄
苗泽
张小雪
Current Assignee
China Petroleum and Chemical Corp
Petro CyberWorks Information Technology Co Ltd
Original Assignee
China Petroleum and Chemical Corp
Petro CyberWorks Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp, Petro CyberWorks Information Technology Co Ltd
Priority to CN202211713863.4A
Publication of CN116361314A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to big data technology and discloses a quasi-real-time query method for kafka data, comprising the following steps: adding a preset message header field to the kafka data, taking the message body of the kafka data as an independent field, and recombining the message header field and the independent field to obtain updated kafka data; importing the updated kafka data into a distributed file system according to the storage structure of a hive table to obtain stored kafka data; mapping the stored kafka data into a pre-constructed hive external table to obtain mapped kafka data, and checking the continuity of the mapped kafka data by using the offset field in the message header field to obtain continuous kafka data; and performing real-time queries on the continuous kafka data. The invention also provides a quasi-real-time query device for kafka data, an electronic device and a storage medium. The invention can reduce resource consumption when kafka data is queried.

Description

Quasi-real-time query method and device for kafka data, electronic equipment and medium
Technical Field
The invention relates to the technical field of big data, in particular to a kafka data quasi-real-time query method, a device, electronic equipment and a computer readable storage medium.
Background
Kafka is widely used as a high-throughput distributed message subscription system in big data and related fields, and there is often a need to query kafka data, for example to find whether a given piece of data exists in kafka, that is, whether kafka received it. However, kafka is message middleware and its storage structure is not suited to query analysis, so to query the data in kafka, the data must first be dumped to storage that supports quasi-real-time query.
Existing kafka data query techniques rely on kafka management tools, which typically consume the data in real time and process it in memory, and therefore support only small amounts of data. Production environments hold massive amounts of data, so a system that only handles small data volumes offers little flexibility, cannot support complex queries, and consumes considerable resources when kafka data is queried.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a kafka data quasi-real-time query method, apparatus, electronic device, and computer-readable storage medium.
In a first aspect, an embodiment of the present invention provides a kafka data near real-time query method, including:
adding a preset message header field to the kafka data, taking a message body of the kafka data as an independent field, and carrying out field recombination on the message header field and the independent field to obtain updated kafka data;
importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data;
mapping the stored kafka data into a pre-constructed hive external table to obtain mapped kafka data, and checking the continuity of the mapped kafka data by using the offset field in the message header field to obtain continuous kafka data;
and carrying out real-time query on the continuous kafka data.
According to an embodiment of the present invention, the field reorganizing the header field and the independent field to obtain updated kafka data includes:
taking the message header field as a message header of the updated kafka data;
taking the independent field as a message body of the updated kafka data;
and reorganizing the message format of the updated kafka data according to the message header and the message body to obtain the updated kafka data.
According to an embodiment of the present invention, the importing the updated kafka data into a preset distributed file system according to the storage structure of the preset hive table to obtain the stored kafka data includes:
performing field rewriting on the KafkaSource class in the Flume to obtain a rewritten KafkaSource class;
carrying out event serialization rewriting on the HeaderAndBodyTextEventSerializer class in the Flume according to the storage structure of the hive table to obtain a rewritten HeaderAndBodyTextEventSerializer class;
configuring a source component of the Flume according to the rewritten KafkaSource class and the updated kafka data to obtain configuration source information, configuring a channel component of the Flume according to a preset buffer attribute to obtain configuration channel information, and configuring a sink component of the Flume according to the distributed file system and the rewritten HeaderAndBodyTextEventSerializer class to obtain configuration sink information;
generating a transmission file from the configuration source information, the configuration channel information and the configuration sink information, and operating the Flume according to the transmission file to obtain the stored kafka data.
According to an embodiment of the present invention, before the mapping the stored kafka data to the pre-constructed hive external table to obtain the mapped kafka data, the method further includes:
Creating a storage path and a separation field of the kafka data;
generating field attributes of the hive external table according to the field attributes of the stored kafka data;
and creating the hive external table according to the storage path, the separation field and the field attribute.
According to an embodiment of the present invention, the verifying the continuity of the mapped kafka data by using the offset field in the header field to obtain continuous kafka data includes:
performing continuity check on the digital records in the offset field corresponding to the mapped kafka data to obtain digital records of the offset field;
when the digital record is continuous, the mapped kafka data is taken as the continuous kafka data.
According to an embodiment of the present invention, the real-time query of the continuous kafka data includes:
establishing connection between a preset Hive visual client tool and a Hive data warehouse by using preset connection parameters to obtain a visual interface of a Hive database;
and carrying out real-time query on the continuous kafka data in the visual interface according to a preset SQL statement.
According to an embodiment of the present invention, the real-time query of the continuous kafka data includes:
And carrying out real-time query on the continuous kafka data by using a preset compatible hive query engine.
In a second aspect, an embodiment of the present invention provides a kafka data near real-time query apparatus, which is characterized in that the apparatus includes:
a field reorganization module, configured to add a preset message header field to kafka data, take a message body of the kafka data as an independent field, and perform field reorganization on the message header field and the independent field to obtain updated kafka data;
the kafka data importing module is used for importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data;
a data continuity checking module, configured to map the stored kafka data to a hive external table constructed in advance to obtain mapped kafka data, and check continuity of the mapped kafka data by using the offset field in the header field to obtain continuous kafka data;
and the kafka data real-time query module is used for carrying out real-time query on the continuous kafka data.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement a kafka data quasi-real time querying method as described in the previous first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a kafka data quasi-real time query method as described in the previous first aspect.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the embodiment of the invention uses Flume to import the Kafka data into the HDFS; during the import, the Kafka message body is stored as a separate field, and the necessary Kafka message header fields are added for data checking and analysis; the data written to the HDFS conforms to the storage structure of the Hive table; a Hive external table is created that points to the data already written to the HDFS; data integrity is checked by judging the continuity of the Kafka offsets; a Hive query client is used to analyze and query the data flexibly; and historical data is compressed and merged periodically to reduce resource usage. Flexible SQL-based query analysis over the full information of each Kafka message is thereby realized. The delay before data becomes queryable can be kept on the order of minutes (3 minutes in an actual production environment; in theory it could be shorter). Storage space and query performance can be scaled horizontally as required, and the amount of historical data that can be stored and queried is almost unlimited. Therefore, the kafka data quasi-real-time query method, the device, the electronic equipment and the computer readable storage medium can solve the problem of high resource consumption when the kafka data is queried.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a workflow diagram of a kafka data near real-time query method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of the importing of kafka data into a distributed file system according to the first embodiment of the present invention;
FIG. 3 is a flow chart of constructing a hive external table according to a first embodiment of the present invention;
FIG. 4 is a functional block diagram of a kafka data quasi-real-time query device according to a third embodiment of the present invention;
fig. 5 shows a diagram of the composition and structure of an electronic device implementing the kafka data quasi-real-time query method according to the fifth embodiment of the present invention.
Detailed Description
The disclosure is further described below with reference to the embodiments shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
The invention provides a kafka data quasi-real-time query method based on big data technology. The method builds on the distributed storage of HDFS and Hive, combines it with a kafka offset continuity check, and maps kafka data into a Hive table so as to realize quasi-real-time query of the kafka data. Compared with the traditional method, this quasi-real-time query technology supports queries more flexibly and consumes fewer system resources, and has great potential and application prospects in scenarios where kafka data must be queried and traced.
Example 1
As shown in fig. 1, the invention provides a kafka data quasi-real-time query method, which comprises the following steps:
s1, adding a preset message header field to kafka data, taking a message body of the kafka data as an independent field, and carrying out field recombination on the message header field and the independent field to obtain updated kafka data;
in the embodiment of the invention, the message header field comprises information such as the partition, offset, timestamp, topic and key of the message. The necessary kafka message header fields are added for data checking and analysis. A partition is a physical grouping of a topic: a topic can be divided into a plurality of partitions, and each partition is an ordered queue. An offset is a position: the offset of each message within a partition is fixed, whereas the offset of a consumer reading that partition advances as consumption progresses and can never exceed the offset of the latest message in the partition. The timestamp records when the message was sent. A topic is a logical concept representing a class of messages and is typically used to distinguish actual service messages. The key is the message key, which is used when partitioning a message, that is, when deciding in which partition of a topic the message is stored.
In detail, the message body corresponding to the kafka data is stored as a separate field, and the necessary kafka message header fields are added for data checking and analysis, wherein the message body stores the actual message data.
In the embodiment of the invention, the updated kafka data is that a message body corresponding to the original kafka data is used as an independent field, and a necessary message header field is added to form a complete format of the message for later importing the kafka data into an HDFS for storage. The HDFS is a distributed file system, can provide high-throughput data access, is very suitable for application on a large-scale data set, and comprises a NameNode and a plurality of DataNodes, wherein the NameNode is used as a main server for managing the naming space of the file system and the access operation of a client to files; the DataNode in the cluster manages the stored data.
In the embodiment of the present invention, the field reorganizing the message header field and the independent field to obtain updated kafka data includes:
taking the message header field as a message header of the updated kafka data;
taking the independent field as a message body of the updated kafka data;
And reorganizing the message format of the updated kafka data according to the message header and the message body to obtain the updated kafka data.
In detail, information such as partition, offset, timestamp, topic, key and the like in a message header field is taken as a message header of the updated kafka data, a kafka message body is taken as an independent field and is taken as a message body of the updated kafka data, and the message header and the message body are recombined into a message format to form the updated kafka data. Wherein the message header field may record the sending time of the message, the partition of the message, the subject of the message, etc.
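The recombination described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class name UpdatedKafkaRecord, the function reorganize and the sample values are hypothetical, and the timestamp is assumed to be in epoch milliseconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpdatedKafkaRecord:
    """Header fields named in the text, plus the message body as one field."""
    topic: str
    partition: int
    offset: int
    timestamp: int          # assumed to be epoch milliseconds
    key: Optional[str]
    body: str               # original message body, kept as a single field


def reorganize(header: dict, body: str) -> UpdatedKafkaRecord:
    """Combine the preset header fields and the body into one record."""
    return UpdatedKafkaRecord(
        topic=header["topic"],
        partition=int(header["partition"]),
        offset=int(header["offset"]),
        timestamp=int(header["timestamp"]),
        key=header.get("key"),
        body=body,
    )


# Hypothetical sample message.
record = reorganize(
    {"topic": "orders", "partition": 0, "offset": 42,
     "timestamp": 1700000000000, "key": "order-1"},
    '{"amount": 10}',
)
```

Keeping the body as one opaque field means the header columns can be checked and filtered without parsing the payload.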
Further, the updated kafka data is imported into the HDFS for storage. Because the data written to the HDFS during the import must conform to the storage structure of the Hive table, that storage structure is analyzed first so that the updated kafka data can be written to the HDFS accordingly.
S2, importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data;
in the embodiment of the present invention, the storage structure of the hive table defines how the table's data is laid out in files, that is, it sets the row separator, the column separator and the file format to be read, so that kafka data can be imported into the HDFS by using Flume according to the storage structure of the hive table. Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive logs, provided by Cloudera. Hive is a data warehouse tool based on Hadoop that is used for extracting, converting and loading data, and provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. The Hive data warehouse tool can map structured data files into a database table and provide SQL query functions.
In an embodiment of the present invention, referring to fig. 2, the importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data includes:
s21, performing field rewriting on the KafkaSource class in the Flume to obtain a rewritten KafkaSource class;
s22, carrying out event serialization rewriting on HeaderAndBodyTextEventSerializer class in the Flume according to the storage structure of the hive table to obtain rewritten HeaderAndBodyTextEventSerializer class;
s23, configuring a source component of the Flume according to the rewritten KafkaSource class and the updated kafka data to obtain configuration source information, configuring a channel component of the Flume according to a preset buffer attribute to obtain configuration channel information, and configuring a sink component of the Flume according to the distributed file system and the rewritten HeaderAndBodyTextEventSerializer class to obtain configuration sink information;
s24, generating a transmission file from the configuration source information, the configuration channel information and the configuration sink information, and operating the Flume according to the transmission file to obtain the stored kafka data.
In detail, two Flume classes, KafkaSource and HeaderAndBodyTextEventSerializer, are rewritten so that the partition, offset, Kafka timestamp, topic, key and other information is acquired and written to the HDFS according to the Hive table data structure (using row and column separators that comply with Hive). The KafkaSource class reads messages from a kafka topic; the HeaderAndBodyTextEventSerializer class is a header-and-text event serializer that writes the text of each event to the output stream and appends a line feed after each event.
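The row layout that such a rewritten serializer might produce can be sketched in Python as follows. This is not the Flume class itself: the function names, the tab and line-feed separators, the column order and the escaping of separators inside the body are all assumptions made for illustration.

```python
COLUMN_SEP = "\t"   # assumed Hive column separator
ROW_SEP = "\n"      # assumed Hive row separator

# Assumed column order for the header fields.
HEADER_ORDER = ("topic", "partition", "offset", "timestamp", "key")


def escape(value: str) -> str:
    """Replace characters that would otherwise split a row or a column."""
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))


def serialize_event(headers: dict, body: str) -> str:
    """Write header fields in a fixed order, then the body, then a row separator."""
    columns = [escape(str(headers.get(name, ""))) for name in HEADER_ORDER]
    columns.append(escape(body))
    return COLUMN_SEP.join(columns) + ROW_SEP


# A body containing a line feed and a tab still yields exactly one row.
line = serialize_event(
    {"topic": "orders", "partition": 0, "offset": 42,
     "timestamp": 1700000000000, "key": "k1"},
    "first line\nsecond\tcolumn",
)
```

Escaping is one possible way to keep a message body from breaking the row and column structure that the hive table expects; the source does not say how the rewritten class handles this case.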
Specifically, the configuration source information sets the type, source, topic and so on of the kafka data; the configuration channel information sets the type of the buffer, the capacity of the buffer and so on; the configuration sink information is set according to the HDFS sink, for example agent.sinks.hdfsSink.type=hdfs causes the kafka data to be imported into the HDFS. The configuration source information, the configuration channel information and the configuration sink information together form the transmission file, and running Flume with this file imports the kafka data into the HDFS to obtain the stored kafka data.
Here, Source is the component responsible for receiving data into the Flume Agent, and it can process log data of various types and formats; Sink constantly polls the events in the Channel, removes them in batches, and writes them in batches to a storage or indexing system or sends them to another Flume Agent; Channel is a buffer located between Source and Sink, which allows Source and Sink to operate at different rates. Therefore, the Source receives the Kafka data, the data is passed to a Channel, and finally the Sink writes it to the HDFS for storage. In other words, the Kafka data is imported into the HDFS by using Flume; during the import, the Kafka message body is stored as a separate field, and the necessary Kafka message header fields are added for data checking and analysis; and the data written to the HDFS conforms to the storage structure of the Hive table.
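As a rough illustration of how the three pieces of configuration information could be combined into one transmission file, the following Python sketch renders a Flume properties file as text. The property keys shown (kafka.bootstrap.servers, kafka.topics, capacity, transactionCapacity, hdfs.path, hdfs.fileType, hdfs.useLocalTimeStamp, serializer) are standard Flume settings; the two com.example class names, the broker address, the topic and the HDFS path are hypothetical placeholders for the rewritten classes and the real environment.

```python
def render_flume_config(agent: str, source: dict, channel: dict, sink: dict) -> str:
    """Render source, channel and sink settings as Flume properties text."""
    lines = [
        f"{agent}.sources = src",
        f"{agent}.channels = ch",
        f"{agent}.sinks = hdfsSink",
    ]
    for component, name, settings in (
        ("sources", "src", source),
        ("channels", "ch", channel),
        ("sinks", "hdfsSink", sink),
    ):
        for key, value in settings.items():
            lines.append(f"{agent}.{component}.{name}.{key} = {value}")
    # Bind the source and the sink to the channel.
    lines.append(f"{agent}.sources.src.channels = ch")
    lines.append(f"{agent}.sinks.hdfsSink.channel = ch")
    return "\n".join(lines) + "\n"


config_text = render_flume_config(
    "agent",
    source={
        # Hypothetical class name standing in for the rewritten KafkaSource.
        "type": "com.example.flume.HeaderKafkaSource",
        "kafka.bootstrap.servers": "broker1:9092",
        "kafka.topics": "orders",
    },
    channel={"type": "memory", "capacity": 10000, "transactionCapacity": 1000},
    sink={
        "type": "hdfs",
        "hdfs.path": "/data/kafka/orders/%Y%m%d",
        "hdfs.fileType": "DataStream",
        "hdfs.useLocalTimeStamp": "true",
        # Hypothetical builder name standing in for the rewritten serializer.
        "serializer": "com.example.flume.HiveRowSerializer$Builder",
    },
)
```

The resulting text could be saved to a file and passed to the Flume agent at start-up; the buffer sizes here are arbitrary sample values.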
Further, in order to realize the query on the kafka data, the data stored in the HDFS is transferred to Hive, and therefore, the kafka data is mapped into the Hive external table for data query.
S3, mapping the stored kafka data into a pre-constructed hive external table to obtain mapped kafka data, and checking the continuity of the mapped kafka data by using the offset field in the message header field to obtain continuous kafka data;
in the embodiment of the invention, the hive external table refers to data that already exists in the HDFS. When the table is created and data is loaded, the data is not moved into the data warehouse directory; only a link to the external data is established. When an external table is deleted, only the link is deleted and the data is retained.
In an embodiment of the present invention, as shown in fig. 3, before mapping the stored kafka data to a pre-constructed hive external table to obtain mapped kafka data, the method further includes:
s31, creating a storage path and a separation field of the kafka data;
s32, generating field attributes of the hive external table according to the field attributes of the stored kafka data;
s33, creating the hive external table according to the storage path, the separation field and the field attribute.
In detail, the storage path is a location that can be specified in the HDFS and is expressed with the location clause, that is, it specifies where the table is stored on the HDFS; the separation field covers row separation and column separation, and the column separator is specified with fields terminated by.
Specifically, the storage of the kafka data in the HDFS is performed according to the storage structure of the hive table, that is, the field attribute of the external table is generated according to the field attribute corresponding to the kafka data in the HDFS, wherein the field attribute includes the column name in the external table and the data type corresponding to the column name.
Illustratively, the hive external table is created with the external keyword. For example, to create an external table named etable whose field attributes are an id, a name and an age: create external table etable (id int, name string, age int) row format delimited fields terminated by '\t' map keys terminated by ':' location '/home/external'; wherein row format delimited fields terminated by '\t' defines the column separator as '\t'; map keys terminated by ':' defines the separator between a key and its value in a MAP; and location '/home/external' defines the HDFS data storage path, '/home/external' being a directory created in the HDFS.
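For the kafka data itself, a statement of the same shape could be generated from the storage path, the separator and the field attributes. The Python sketch below only builds the statement text; the table name kafka_orders, the column names and the path are hypothetical, and the column names are prefixed and back-quoted because words such as partition, offset and timestamp are reserved in HiveQL.

```python
def build_external_table_ddl(table: str, columns: list, location: str,
                             field_sep: str = "\\t") -> str:
    """Return a CREATE EXTERNAL TABLE statement for delimited text files."""
    column_sql = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{field_sep}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table layout: header fields first, message body last.
ddl = build_external_table_ddl(
    "kafka_orders",
    [
        ("kafka_topic", "STRING"),
        ("kafka_partition", "INT"),
        ("kafka_offset", "BIGINT"),
        ("kafka_timestamp", "BIGINT"),
        ("msg_key", "STRING"),
        ("body", "STRING"),
    ],
    "/data/kafka/orders",
)
```

Because the table is external and its location points at the directory written by Flume, files that arrive there later become visible to queries without a separate load step.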
Further, hive is a data warehouse tool based on Hadoop, which can map structured data files into a table and provide SQL-like query functions. Therefore, after the external table is created, the data in the HDFS needs to be mapped into the hive external table, that is, loaded into the table. For example, if the data is saved in an external.txt file, it can be imported into the external table from the command line with hive (default)> load data local inpath '/opt/module/data/external.txt' into table etable; the data is thereby made available through the hive external table.
In the embodiment of the invention, after the data stored in the HDFS is mapped to the hive external table, the data received on the previous day is checked in the early morning to verify whether its offsets are continuous: continuous offsets indicate that the data is complete, and discontinuous offsets indicate that data is missing. This ensures the data integrity of the query system.
In the embodiment of the present invention, the verifying the continuity of the mapped kafka data by using the offset field in the header field to obtain continuous kafka data includes:
Performing continuity check on the digital records in the offset field corresponding to the mapped kafka data to obtain digital records of the offset field;
when the digital record is continuous, the mapped kafka data is taken as the continuous kafka data.
In detail, kafka reads and writes sequentially: each time a message is produced, it is appended to the file corresponding to its partition, and the read position is maintained by the consumer. The offsets within each partition therefore generally increase continuously, and messages that have been read are not deleted, so every write is an append and every read is sequential. After a consumer consumes a message, it produces a record to the topic in the broker that is dedicated to maintaining the offset of each consumer, recording the offset of the message it has just read plus 1 as the new offset.
Specifically, each time a consumer consumes a message, kafka records the consumed offset (consumer_offsets-1, consumer_offsets-2, consumer_offsets-3 through consumer_offsets-50). Based on the offsets recorded for the consumer under the topic, if there are 50 messages in total and the corresponding offsets run from 1 to 50 with no number missing in between, the data is continuous; if any number between 1 and 50 is missing from the corresponding offsets, the data is discontinuous. The data is thus verified for continuity according to the offsets, which ensures the data integrity of the query system.
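The continuity judgment can be sketched in a few lines of Python. This is a minimal illustration, assuming the partition and offset columns have already been read from the hive table as (partition, offset) pairs; the function names and sample data are hypothetical, and the check is done per partition because offsets are only ordered within a partition.

```python
from collections import defaultdict


def find_offset_gaps(rows):
    """Return {partition: [missing offsets]} for partitions that have gaps.

    rows is an iterable of (partition, offset) pairs read from the table.
    """
    seen = defaultdict(set)
    for partition, offset in rows:
        seen[partition].add(offset)

    gaps = {}
    for partition, offsets in seen.items():
        # Every offset between the smallest and largest one is expected.
        expected = set(range(min(offsets), max(offsets) + 1))
        missing = sorted(expected - offsets)
        if missing:
            gaps[partition] = missing
    return gaps


def is_continuous(rows) -> bool:
    """True when no partition has a missing offset."""
    return not find_offset_gaps(rows)


# 50 messages with offsets 1..50, as in the example in the text.
complete = [(0, o) for o in range(1, 51)]
# The same data with offset 17 lost, plus a second, complete partition.
with_hole = [(0, o) for o in range(1, 51) if o != 17] + [(1, 0), (1, 1)]
```

With the sample data, the first list is judged continuous and the second reports offset 17 of partition 0 as missing. Building the full expected range in memory is only reasonable for modest offset ranges; for a whole day of production data, comparing the count of distinct offsets with the difference between the largest and smallest offset would avoid materializing the range.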
Further, once the continuity of the kafka data mapped to hive has been judged, query analysis over the full information of the kafka messages can be performed on the continuous kafka data. The query delay can be kept at the minute level, which greatly improves the efficiency of query analysis on kafka data.
S4, carrying out real-time query on the continuous kafka data.
In the embodiment of the invention, the hive is used as a data warehouse, has a series of functions such as a database, a table, a function and the like, and can be operated by a visual tool in use, so that the operation of the hive is more visual and convenient than the operation of a command line.
In an embodiment of the present invention, the performing real-time query on the continuous kafka data includes:
establishing connection between a preset Hive visual client tool and a Hive data warehouse by using preset connection parameters to obtain a visual interface of a Hive database;
and carrying out real-time query on the continuous kafka data in the visual interface according to a preset SQL statement.
In detail, the connection parameters include a host IP, a database, a password and so on; the Hive visualization client tools include, but are not limited to, DBeaver, which can connect to both Hive and MySQL. First, a connection is established between Hive and the visual client tool: a new database connection is created, Apache Hive is selected, the connection parameters are configured, and a driver is added. Once the connection succeeds, data query operations can be performed on Hive in the visual interface.
Specifically, the SQL query statement is used to query the data in hive as required. For example, to query the names of all persons in the table, the statement select name from etable; can be used, and after the query succeeds the result can be inspected in the visual interface. The data in hive can therefore be queried in real time according to different requirements by using select statements.
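The use case named in the background, finding whether kafka received a given message, reduces to a single select over the mapped table. The Python sketch below uses an in-memory SQLite database purely as a stand-in for the Hive connection so that it can run on its own; the table name kafka_orders, its columns and the sample rows are hypothetical, and a real deployment would send an equivalent statement through a Hive client instead.

```python
import sqlite3

# SQLite stands in for Hive here; only the shape of the query matters.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kafka_orders ("
    "kafka_topic TEXT, kafka_partition INTEGER, kafka_offset INTEGER, "
    "kafka_timestamp INTEGER, msg_key TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO kafka_orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("orders", 0, 1, 1700000000000, "order-1", '{"amount": 10}'),
        ("orders", 0, 2, 1700000001000, "order-2", '{"amount": 25}'),
    ],
)


def message_received(connection, topic: str, key: str) -> bool:
    """Answer 'did kafka receive a message with this key?' with one SELECT."""
    row = connection.execute(
        "SELECT COUNT(*) FROM kafka_orders WHERE kafka_topic = ? AND msg_key = ?",
        (topic, key),
    ).fetchone()
    return row[0] > 0
```

Because the header fields are stored as ordinary columns, the same pattern extends to filtering by partition, offset range or timestamp window.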
Further, kafka historical data transferred to hive (for example, data older than half a year) is periodically compressed into Gzip format to reduce storage space, and small files are merged at the same time. This reduces resource usage and the number of small files, thereby improving HDFS performance.
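As a non-limiting illustration, merging small files and compressing the result into Gzip format can be sketched in Python as follows. The sketch works on local files as a stand-in for files in the distributed file system; the function name `merge_and_gzip` is hypothetical, and each input file is assumed to end with a newline so that records do not run together.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable


def merge_and_gzip(small_files: Iterable[Path], target: Path) -> int:
    """Concatenate small text files into one gzip file; return how many were merged."""
    count = 0
    with gzip.open(target, "wb") as out:
        # Sort by path so the merged order is deterministic.
        for path in sorted(small_files):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            count += 1
    return count
```

In a scheduled job, a call such as `merge_and_gzip(directory.glob("part-*.txt"), directory / "merged.gz")` would replace many small files with one compressed file; removing the originals afterwards is left out of the sketch.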
Example two
In order to understand the present invention more clearly, the second embodiment further explains how real-time query is performed on the continuous kafka data when query efficiency needs to be improved.
In the embodiment of the invention, in order to improve query efficiency, other query engines compatible with hive, such as Presto and Impala, can be used for data query.
In an embodiment of the present invention, the performing real-time query on the continuous kafka data includes:
And carrying out real-time query on the continuous kafka data by using a preset compatible hive query engine.
In detail, the hive-compatible query engines include, but are not limited to, Presto and Impala. Presto is a distributed SQL query engine that separates the compute and storage layers; it does not store data itself but accesses various data sources (Storage) through the Connector SPI. It is designed for high-speed, real-time data analysis, including complex queries, aggregations, joins and window functions. Impala builds on the underlying Hadoop system and adds SQL support and high-performance multi-user support. It is a computing engine implemented in C++ and Java that supports multiple file formats and embeds the computing process into the nodes of the Hadoop infrastructure in order to minimize network transmission bandwidth during computation. Impala comprises two components: the Frontend, which receives queries and generates the distributed execution plan, and the Backend, which actually executes the plan and makes extensive use of codegen to accelerate highly repetitive computations, thereby improving execution efficiency. Impala and Hive are often used together: Hive performs the data transformation, and Impala then performs rapid data analysis (batch + real time) on the resulting data set.
Specifically, the hive-compatible query engine can analyze data rapidly to improve query efficiency and supports flexible analysis and query, thereby realizing flexible, SQL-based query analysis on the full information of the kafka messages.
Example three
As shown in fig. 4, this embodiment also provides a functional block diagram of the kafka data quasi-real-time query device.
The kafka data quasi-real-time query device 100 described in the present embodiment may be installed in an electronic apparatus. The kafka data quasi-real-time query device 100 may include a field reorganization module 101, a kafka data import module 102, a data continuity check module 103, and a kafka data real-time query module 104 according to the implemented functions. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the field reorganization module 101 is configured to add a preset message header field to kafka data, take a message body of the kafka data as an independent field, and perform field reorganization on the message header field and the independent field to obtain updated kafka data;
The kafka data importing module 102 is configured to import the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table, so as to obtain stored kafka data;
the data continuity checking module 103 is configured to map the stored kafka data to a pre-constructed hive external table to obtain mapped kafka data, and check continuity of the mapped kafka data by using an offset field in the message header field to obtain continuous kafka data;
the kafka data real-time query module 104 is configured to query the continuous kafka data in real time.
In detail, each module in the kafka data quasi-real-time query device 100 in the embodiment of the present invention adopts the same technical means as the kafka data quasi-real-time query method in the first embodiment and the second embodiment, and can produce the same technical effects, which are not described herein.
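As a non-limiting illustration of the field reorganization performed by module 101, the following Python sketch places message header fields in front of the message body, which is kept as one independent field. Only the offset field is named in this document; the other header fields (`topic`, `partition`, `timestamp`), the tab separator and the names `KafkaRecord` and `reorganize` are assumptions made for the example.

```python
from dataclasses import dataclass

FIELD_SEP = "\t"  # assumed separator; it must match the hive table definition


@dataclass(frozen=True)
class KafkaRecord:
    topic: str
    partition: int
    offset: int
    timestamp: int
    body: str


def reorganize(record: KafkaRecord) -> str:
    """Build one delimited line: header fields first, message body last."""
    # Escape characters that would otherwise break the one-line, tab-delimited layout.
    safe_body = (
        record.body.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
    )
    header = [
        record.topic,
        str(record.partition),
        str(record.offset),
        str(record.timestamp),
    ]
    return FIELD_SEP.join(header + [safe_body])
```

Keeping the body as the last column means that each stored line has a fixed number of header columns regardless of the payload, so the hive table can expose both the header fields and the full message body to SQL.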
Example four
As shown in fig. 5, the present embodiment further provides a computer electronic device, which may include a processor 10, a memory 11, a communication bus 12, and a communication interface 13, and may further include a computer program, such as a kafka data quasi-real-time query program, stored in the memory 11 and executable on the processor 10.
The processor 10 may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing Unit, CPU), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and so on. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, runs or executes programs or modules stored in the memory 11 (for example, executes a kafka data quasi-real-time query program, etc.), and invokes data stored in the memory 11 to perform various functions of the electronic device and process data.
The memory 11 includes at least one type of readable storage medium including flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory 11 may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only for storing application software installed in an electronic device and various types of data, such as codes of a kafka data quasi-real-time query program, etc., but also for temporarily storing data that has been output or is to be output.
The communication bus 12 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
The communication interface 13 is used for communication between the electronic device and other devices, including a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., Wi-Fi interface, Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), or alternatively a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device and for displaying a visual user interface.
The figure shows only an electronic device having certain components. It will be understood by those skilled in the art that the structure shown in the figure does not limit the electronic device, which may include fewer or more components than shown, may combine certain components, or may use a different arrangement of components.
For example, although not shown, the electronic device may further include a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described herein.
It should be understood that the described embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The kafka data quasi-real-time query program stored in the memory 11 of the electronic device is a combination of instructions that, when executed in the processor 10, can implement:
Adding a preset message header field to the kafka data, taking a message body of the kafka data as an independent field, and carrying out field recombination on the message header field and the independent field to obtain updated kafka data;
importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data;
mapping the stored kafka data into a pre-constructed hive external table to obtain mapped kafka data, and checking the continuity of the mapped kafka data by using an offset field in the message header field to obtain continuous kafka data;
and carrying out real-time query on the continuous kafka data.
In particular, the specific implementation method of the above instructions by the processor 10 may refer to the description of the relevant steps in the corresponding embodiment of the drawings, which is not repeated herein.
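As a non-limiting illustration of the mapping step, a hive external table can be declared over the storage path of the stored kafka data, with a field separator and field attributes that match the stored lines. The Python sketch below only assembles such a statement as text; the function name `build_external_table_ddl`, the table name, the column list and the path used in the usage note are hypothetical, and no validation of identifiers is attempted.

```python
from typing import Sequence, Tuple


def build_external_table_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    delimiter: str,
    location: str,
) -> str:
    """Assemble a CREATE EXTERNAL TABLE statement for delimited text files."""
    # Backticks keep column names such as `offset` from clashing with SQL keywords.
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )
```

For example, `build_external_table_ddl("kafka_msgs", [("topic", "STRING"), ("offset", "BIGINT"), ("body", "STRING")], "\\t", "/data/kafka/orders")` yields a statement whose columns follow the header-then-body layout of the stored lines; because the table is external, dropping it in hive would leave the files in the storage path untouched.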
Further, the integrated modules/units of the electronic device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
Example five
The present embodiment provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the kafka data quasi-real-time querying method as described above.
The program code may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
Storage media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media may include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
It is noted that the terms used herein are used merely to describe particular embodiments and are not intended to limit exemplary embodiments in accordance with the present application and when the terms "comprises" and/or "comprising" are used in this specification they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by means of software or hardware by means of one unit or means. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A method for quasi-real-time querying of kafka data, the method comprising:
adding a preset message header field to the kafka data, taking a message body of the kafka data as an independent field, and carrying out field recombination on the message header field and the independent field to obtain updated kafka data;
importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data;
mapping the stored kafka data into a pre-constructed hive external table to obtain mapped kafka data, and checking the continuity of the mapped kafka data by using an offset field in the message header field to obtain continuous kafka data;
and carrying out real-time query on the continuous kafka data.
2. The kafka data quasi-real-time query method of claim 1, wherein said performing field reorganization on said message header field and said independent field to obtain updated kafka data comprises:
taking the message header field as a message header of the updated kafka data;
taking the independent field as a message body of the updated kafka data;
and reorganizing the message format of the updated kafka data according to the message header and the message body to obtain the updated kafka data.
3. The kafka data quasi-real-time query method of claim 1, wherein importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data comprises:
performing field rewriting on the KafkaSource class in the Flume to obtain a rewritten KafkaSource class;
Carrying out event serialization rewriting on the HeaderAndBodyTextEventSerializer class in the Flume according to the storage structure of the hive table to obtain a rewritten HeaderAndBodyTextEventSerializer class;
configuring a source component of the Flume according to the rewritten KafkaSource class and the updated kafka data to obtain configuration source information, configuring a channel component of the Flume according to a preset buffer attribute to obtain configuration channel information, and configuring a sink component of the Flume according to the distributed file system and the rewritten HeaderAndBodyTextEventSerializer class to obtain configuration sink information;
generating a transmission file from the configuration source information, the configuration channel information and the configuration sink information, and running the Flume according to the transmission file to obtain the stored kafka data.
4. The kafka data quasi-real-time query method of claim 1, further comprising, prior to said mapping said stored kafka data into a pre-constructed hive external table to obtain mapped kafka data:
creating a storage path and a separation field of the kafka data;
generating field attributes of the hive external table according to the field attributes of the stored kafka data;
And creating the hive external table according to the storage path, the separation field and the field attribute.
5. The kafka data quasi-real-time query method of claim 1, wherein said verifying continuity of said mapped kafka data using an offset field in said message header field to obtain continuous kafka data comprises:
performing continuity check on the digital records in the offset field corresponding to the mapped kafka data to obtain digital records of the offset field;
when the digital record is continuous, the mapped kafka data is taken as the continuous kafka data.
6. The kafka data quasi-real time querying method according to claim 1, wherein said real time querying of said continuous kafka data comprises:
establishing connection between a preset Hive visual client tool and a Hive data warehouse by using preset connection parameters to obtain a visual interface of a Hive database;
and carrying out real-time query on the continuous kafka data in the visual interface according to a preset SQL statement.
7. The kafka data quasi-real time querying method according to claim 1, wherein said real time querying of said continuous kafka data comprises:
And carrying out real-time query on the continuous kafka data by using a preset compatible hive query engine.
8. A kafka data quasi-real-time query apparatus, the apparatus comprising:
a field reorganization module, configured to add a preset message header field to kafka data, take a message body of the kafka data as an independent field, and perform field reorganization on the message header field and the independent field to obtain updated kafka data;
the kafka data importing module is used for importing the updated kafka data into a preset distributed file system according to a storage structure of a preset hive table to obtain stored kafka data;
a data continuity checking module, configured to map the stored kafka data to a pre-constructed hive external table to obtain mapped kafka data, and check continuity of the mapped kafka data by using an offset field in the message header field to obtain continuous kafka data;
and the kafka data real-time query module is used for carrying out real-time query on the continuous kafka data.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the kafka data quasi-real time querying method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the kafka data quasi-real-time querying method according to any one of claims 1 to 7.
CN202211713863.4A 2022-12-29 2022-12-29 Quasi-real-time query method and device for kafka data, electronic equipment and medium Pending CN116361314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211713863.4A CN116361314A (en) 2022-12-29 2022-12-29 Quasi-real-time query method and device for kafka data, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN116361314A (en) 2023-06-30

Family

ID=86939701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211713863.4A Pending CN116361314A (en) 2022-12-29 2022-12-29 Quasi-real-time query method and device for kafka data, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116361314A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination