CN115934882A - HBase-based trillion-level real-time data association method, retrieval method and retrieval device

Info

Publication number: CN115934882A
Application number: CN202211730188.6A
Authority: CN
Inventors: 陈海龙
Original assignee: Qizhidao Network Technology Co Ltd
Current assignee: Qizhidao Network Technology Co Ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-04-07

Abstract

The application discloses an HBase-based trillion-level real-time data association method, retrieval method and retrieval device, and relates to the technical field of big data processing and retrieval. The HBase-based trillion-level real-time data association method comprises the following steps: acquiring a plurality of discrete data sources; performing data processing and summarization on the plurality of discrete data sources to obtain a summary table, wherein the summary table comprises a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values; and constructing an index table for each summary table field, taking the summary table field value as the key of the index table and the summary table primary key ID as the value of the index table. The method and the device realize real-time association and storage of mass data.

Description

HBase-based trillion-level real-time data association method, retrieval method and retrieval device

Technical Field

The application relates to the technical field of big data processing and retrieval, in particular to a trillion-level real-time data association method, a retrieval method and a device based on HBase.

Background

With the rapid arrival of the mobile internet, the online business of enterprises has grown vigorously and data volume has increased exponentially, so that traditional relational databases such as MySQL can no longer meet the growing business requirements. Database queries take longer and longer; in addition, the performance and the data organization and management capability of the database system are greatly weakened, and the system may even crash.

In the field of big data, data is often stored discretely, and it forms complete, usable information only after data association processing. Traditional small batches of data can be associated directly through code or within a database, but there is no good solution for real-time association and storage of mass data.

Disclosure of Invention

The application provides an HBase-based trillion-level real-time data association method, an HBase-based trillion-level real-time data retrieval method and corresponding HBase-based devices, which can realize real-time association and storage of mass data.

In a first aspect, the HBase-based trillion-level real-time data association method provided by the application adopts the following technical scheme:

a trillion-level real-time data association method based on HBase comprises the following steps:

acquiring a plurality of discrete data sources;

performing data processing and summarization on the plurality of discrete data sources to obtain a summary table, wherein the summary table comprises a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values;

and constructing an index table according to each summary table field, taking the summary table field value as the key of the index table, and taking the summary table primary key ID as the value of the index table.

By adopting the technical scheme, data processing and summarization are performed on the acquired discrete data sources to obtain a summary table comprising a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values. An index table is then constructed for each summary table field, with the summary table field value as the key of the index table and the summary table primary key ID as its value. In this way the plurality of discrete data sources are stored in the summary table and associated with one another, so that real-time association and storage of mass data are realized.

Optionally, the acquiring a plurality of discrete data sources specifically includes:

abstracting a data source inlet and a data source outlet;

and acquiring a plurality of discrete data sources by adopting a preset transmission protocol according to the data source inlet and the data source outlet.

By adopting the technical scheme, the data source inlet and the data source outlet are abstracted, and the expansion of the data source can be well supported, so that the data source inlet and the data source outlet can be adapted to various types of data sources.

Optionally, Kafka is employed to carry and store the plurality of discrete data sources.

By adopting the technical scheme, Kafka carries and stores the discrete data sources, so that their messages are stored persistently and data loss is avoided; meanwhile, Kafka can support a million-level data rate on a single node, so that ultra-high-frequency data access can be handled.

Optionally, the performing data processing and summarizing on the multiple discrete data sources to obtain a summary table specifically includes:

analyzing the plurality of discrete data sources to obtain the keywords and the data content of each discrete data source;

classifying the keywords to obtain associated keywords;

and constructing the summary table according to the associated keywords, and cleaning the data content corresponding to the associated keywords into the summary table.

By adopting the technical scheme, the plurality of discrete data sources are analyzed, the keywords and the data content of each discrete data source can be obtained, the obtained keywords are classified, the associated keywords can be obtained, then a summary table can be constructed according to the associated keywords, and the data content corresponding to the associated keywords is cleaned into the summary table; therefore, the aim of automatically classifying a plurality of discrete data sources can be fulfilled.

In a second aspect, the present application provides an HBase-based trillion-level real-time data retrieval method, which adopts the following technical scheme:

a trillion-level real-time data retrieval method based on HBase comprises the following steps:

acquiring a value of a summary table field;

acquiring a summary table primary key ID from the corresponding index table according to the summary table field value;

and retrieving the corresponding data from the corresponding summary table according to the summary table primary key ID.

By adopting the technical scheme, the summary table primary key ID is obtained from the index table according to the summary table field value, and the corresponding data is then retrieved from the summary table according to the summary table primary key ID; therefore, the purpose of retrieving the associated data can be achieved.

Optionally, the retrieving of the corresponding data from the summary table according to the summary table primary key ID specifically includes:

matching the summary table according to the summary table primary key ID, and retrieving the corresponding data in the summary table by using the K-V retrieval mode of HBase.

By adopting the technical scheme, the K-V retrieval mode of HBase ensures that the retrieval speed does not decrease as the data volume increases, so that ultra-large-scale retrieval of multi-field associated data can be realized.

In a third aspect, the present application provides an HBase-based trillion-level real-time data association apparatus, which adopts the following technical scheme:

an HBase-based trillion-level real-time data association device comprises:

a first acquisition module, used for acquiring a plurality of discrete data sources;

a first generation module, used for performing data processing and summarization on the plurality of discrete data sources to obtain a summary table, wherein the summary table comprises a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values;

and a second generation module, used for constructing an index table for each summary table field, wherein the summary table field value is used as the key of the index table and the summary table primary key ID is used as the value of the index table.

By adopting the technical scheme, the first acquisition module acquires the plurality of discrete data sources; the first generation module then performs data processing and summarization on them to obtain a summary table comprising a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values; and the second generation module constructs an index table for each summary table field, with the summary table field value as the key of the index table and the summary table primary key ID as its value. The plurality of discrete data sources can thus be stored in the summary table and associated with one another, thereby realizing real-time association and storage of mass data.

In a fourth aspect, the present application provides an HBase-based trillion-level real-time data retrieval device, which adopts the following technical scheme:

a second acquisition module, used for acquiring a summary table field value;

a third acquisition module, used for acquiring a summary table primary key ID from the corresponding index table according to the summary table field value;

and a data retrieval module, used for retrieving the corresponding data from the corresponding summary table according to the summary table primary key ID.

By adopting the technical scheme, the second acquisition module acquires the summary table field value, the third acquisition module then acquires the summary table primary key ID from the corresponding index table according to the summary table field value, and the data retrieval module retrieves the corresponding data from the corresponding summary table according to the summary table primary key ID; therefore, the purpose of retrieving the associated data can be realized.

In a fifth aspect, the present application provides a terminal, which adopts the following technical solution:

a terminal comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor, when loading and executing the computer program, performs the method of the first aspect and/or the second aspect.

By adopting the above technical solution, the method of the first aspect and/or the second aspect is implemented as a computer program and stored in the memory, to be loaded and executed by the processor, so that a user can establish a connection with the device of the third aspect and/or the fourth aspect through the terminal and query the content processed by the device.

In a sixth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:

a computer-readable storage medium, in which a computer program is stored which, when loaded by a processor, performs the method of the first and/or second aspect.

By adopting the above technical solution, the method of the first aspect and/or the second aspect is implemented as a computer program and stored in a computer-readable storage medium; after the computer-readable storage medium is loaded into any computer, that computer can execute the method of the first aspect and/or the second aspect.

Drawings

FIG. 1 is a flowchart of a method of steps S100-S300 in an embodiment of the present application;

FIG. 2 is a flowchart of a method of steps S110-S120 in an embodiment of the present application;

FIG. 3 is a flow chart of multiple data sources storing via Kafka in an embodiment of the present application;

FIG. 4 is a flowchart of a method of steps S210-S230 in an embodiment of the present application;

FIG. 5 is a diagram illustrating an example of creating a summary table in an embodiment of the present application;

FIG. 6 is a diagram of an example of a summary table and an index table in an embodiment of the present application;

FIG. 7 is a flowchart of a method of steps S400-S600 in an embodiment of the present application;

FIG. 8 is a schematic diagram of data association and data retrieval performed simultaneously in an embodiment of the present application.

Detailed Description

The present application is described in further detail below with reference to figures 1-8.

The following are explanations of terms appearing in the present application:

HBase is a distributed, column-oriented open source database whose technology is derived from the Google paper "Bigtable: A Distributed Storage System for Structured Data" written by Fay Chang et al. Just as Bigtable takes advantage of the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase began as a sub-project of the Apache Hadoop project. Unlike a general relational database, HBase is a database suitable for unstructured data storage. Another difference is that HBase uses a column-based rather than a row-based model.

K-mapping is a special type of mapping. Let X and Y be two topological spaces and f: X → Y a continuous mapping; if, for any compact set K of Y, f^{-1}(K) is a compact set in X, then f is called a K-mapping. A K-mapping is a compact-covering mapping; both full mappings and K-projections are K-mappings, and a K-projection to a non-K1' z-space is an example of a K-mapping that is not a full mapping.

K-V (Key-Value) means that one key corresponds to one value; keys cannot be repeated, while values can. A mapping is a container that stores keys and values, where K denotes the key and V denotes the value. A key and its value can be handled together as a key-value pair, so that a mapping consists of a plurality of key-value pairs; the key-value pair is abstracted as an Entry class, and each specific key-value pair is an Entry object.
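As a minimal illustration of the key-value notion explained above, the following sketch uses Python's built-in dict rather than HBase itself; the sample keys and values are invented for the example.

```python
# A mapping stores key-value pairs; each pair is one entry.
mapping = {}
mapping["k1"] = "v"   # first entry
mapping["k2"] = "v"   # the same value may appear under different keys
mapping["k1"] = "w"   # a key cannot repeat: writing it again overwrites the value

# Each (key, value) tuple returned by items() plays the role of an Entry object.
entries = list(mapping.items())
print(entries)
```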

Kafka is an open source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data of consumers on a website. Such activity (web browsing, searching and other user actions) is a key factor in many social functions on the modern web. Because of throughput requirements, this data is usually handled by processing logs and log aggregation. For systems such as Hadoop that handle log data and offline analysis but also require real-time processing, this is a viable solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through a cluster.

Flink is an open source stream processing framework developed by the Apache Software Foundation, the core of which is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and Flink's pipelined runtime system can execute both batch and stream processing programs. In addition, the Flink runtime itself supports the execution of iterative algorithms.

The embodiment of the application discloses an HBase-based trillion-level real-time data association method, and with reference to FIG. 1, the method comprises the following steps:

S100: acquiring a plurality of discrete data sources.

In an embodiment of the present application, referring to fig. 2, step S100 may specifically include the following steps:

S110: abstracting the data source inlet and the data source outlet.

In particular, in this embodiment, because the data sources are numerous, a single data acquisition mode cannot adapt to them well; therefore, the data source inlet and the data source outlet are abstracted by means of Flink, which supports the extension of data sources well. Meanwhile, collecting the discrete data sources with Flink can achieve millisecond-level data latency.

Specifically, in the present embodiment, the adapters are packaged at the data source inlet and the data source outlet, and the packaged adapters can be adapted to various data sources.
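The inlet/outlet abstraction and the packaged adapters described above can be sketched as follows. This is plain Python rather than Flink, and every class and function name here (SourceInlet, SourceOutlet, KeyValueLineInlet, DictInlet, ListOutlet, collect) is a hypothetical name chosen for illustration, not something taken from the application.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


class SourceInlet(ABC):
    """Abstract inlet: any data source is exposed as a stream of records."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class SourceOutlet(ABC):
    """Abstract outlet: any destination accepts one record at a time."""

    @abstractmethod
    def write(self, record: Record) -> None:
        ...


class KeyValueLineInlet(SourceInlet):
    """Hypothetical adapter for text lines such as 'ID=1;mail=a@example.com'."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._lines = lines

    def read(self) -> Iterator[Record]:
        for line in self._lines:
            pairs = (item.split("=", 1) for item in line.split(";") if item)
            yield {key: value for key, value in pairs}


class DictInlet(SourceInlet):
    """Hypothetical adapter for sources that already deliver dictionaries."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows = rows

    def read(self) -> Iterator[Record]:
        for row in self._rows:
            yield dict(row)


class ListOutlet(SourceOutlet):
    """Outlet that simply buffers records in memory."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def write(self, record: Record) -> None:
        self.records.append(record)


def collect(inlets: Iterable[SourceInlet], outlet: SourceOutlet) -> None:
    """Drain every inlet into the outlet through the shared interfaces."""
    for inlet in inlets:
        for record in inlet.read():
            outlet.write(record)


outlet = ListOutlet()
collect(
    [
        KeyValueLineInlet(["ID=1;mail=a@example.com"]),
        DictInlet([{"ID": "1", "WeChat": "wx_a"}]),
    ],
    outlet,
)
print(outlet.records)
```

Supporting a new kind of data source then only requires one more SourceInlet subclass; the collect function stays unchanged.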

Specifically, in this embodiment, the discrete data sources include, but are not limited to, web page media text data, internet data captured by a web crawler, Hadoop data, server running log data, data accessible from other business systems, and the like.

S120: acquiring a plurality of discrete data sources by adopting a preset transmission protocol according to the data source inlet and the data source outlet.

Specifically, in the present embodiment, the preset transmission protocol includes, but is not limited to, TCP/IP, NetBEUI, DHCP, FTP, and the like.

In an embodiment of the present application, a summary table is constructed based on HBase, and the summary table is mainly used for storing associated data, so that all related information is stored in the same table through association.

In an embodiment of the application, Kafka can be used to carry and store the discrete data sources, so that messages are stored persistently and data loss is avoided; meanwhile, Kafka can support a million-level data rate on a single node and can handle ultra-high-frequency data access.

For example, referring to fig. 3, data from multiple data sources is collected in real time and transmitted to Kafka for storage, so that data loss from multiple data sources can be avoided.
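The role the buffer plays here, namely that collected messages are persisted and can be re-read by a consumer that restarts, can be illustrated with a file-backed append-only log. This is only a stand-in written for this explanation: AppendOnlyLog and its methods are hypothetical and are not the Kafka client API, and the sample records are invented.

```python
import json
import os
import tempfile


class AppendOnlyLog:
    """File-backed stand-in for a persistent message log (not the Kafka API)."""

    def __init__(self, path: str) -> None:
        self.path = path
        # Create the file if it does not exist yet; keep existing content.
        with open(self.path, "a", encoding="utf-8"):
            pass

    def append(self, message: dict) -> None:
        """Persist one message as a JSON line at the end of the log."""
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(message) + "\n")

    def read_from(self, offset: int) -> list:
        """Return all messages at position >= offset, in arrival order."""
        with open(self.path, encoding="utf-8") as handle:
            return [json.loads(line) for line in handle][offset:]


log_path = os.path.join(tempfile.mkdtemp(), "events.log")
log = AppendOnlyLog(log_path)
log.append({"ID": "1", "mail": "a@example.com"})
log.append({"ID": "1", "WeChat": "wx_a"})

# A consumer that restarts reopens the same log and resumes from its offset.
reopened = AppendOnlyLog(log_path)
replayed = reopened.read_from(0)  # everything collected so far
pending = reopened.read_from(1)   # only what follows the first message
print(replayed, pending)
```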

S200: performing data processing and summarization on the plurality of discrete data sources to obtain a summary table, wherein the summary table comprises a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values.

The summary table primary key ID in this embodiment is short for the primary keyword ID of the summary table; an example is given below.

In an embodiment of the present application, referring to fig. 4, step S200 specifically includes the following steps:

S210: analyzing the plurality of discrete data sources to obtain the keywords and the data content of each discrete data source.

S220: classifying the keywords to obtain associated keywords.

S230: constructing a summary table according to the associated keywords, and cleaning the data content corresponding to the associated keywords into the summary table.

In this way, the discrete data can be gathered into the same summary table through data processing and summarization; data association is performed automatically based on the field-overwriting characteristic of HBase.

For example, referring to FIG. 5, the discrete data sources may include game protocol data, mail data and WeChat data, each comprising an ID and data content. The ID is the associated keyword and represents the identity number of a registrant, so a summary table can be constructed according to the ID, and the game protocol data content, the mail data content and the WeChat data content are all cleaned into the constructed summary table, thereby realizing associated storage of the data.
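Steps S210 to S230 can be sketched in memory as follows. Two points are assumptions made for this sketch rather than statements from the application: the associated keyword is taken to be the keyword shared by every source, and HBase's field overwriting is imitated with a dictionary update. The function names and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


def parse(record: Record) -> Tuple[Set[str], Record]:
    """S210: split a record into its keywords (field names) and data content."""
    return set(record), dict(record)


def associated_keywords(records: List[Record]) -> Set[str]:
    """S220 (assumed rule): keep the keywords that every source shares."""
    keyword_sets = [parse(record)[0] for record in records]
    return set.intersection(*keyword_sets)


def build_summary_table(records: List[Record], key: str) -> Dict[str, Record]:
    """S230: clean all content into one row per value of the associated keyword."""
    table: Dict[str, Record] = defaultdict(dict)
    for record in records:
        _, content = parse(record)
        row_id = content.pop(key)
        # Writing a field that already exists overwrites it; new fields are
        # added to the same row, which is what associates the sources.
        table[row_id].update(content)
    return dict(table)


records = [
    {"ID": "1001", "game": "player_a"},
    {"ID": "1001", "mail": "a@example.com"},
    {"ID": "1001", "WeChat": "wx_a"},
    {"ID": "1002", "mail": "b@example.com"},
]
keys = associated_keywords(records)
summary = build_summary_table(records, "ID")
print(keys, summary)
```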

S300: constructing an index table for each summary table field, taking the summary table field value as the key of the index table, and taking the summary table primary key ID as the value of the index table.

In an embodiment of the present application, an index table is constructed based on HBase, and the index table is mainly used for indexing data, and each field of a summary table generates an index table for indexing data of the field.

For example, referring to FIG. 6, a summary table is constructed from ID, ID card, name, phone, license plate, QQ and WeChat, and the ID is associated with each of ID card, name, phone, license plate, QQ and WeChat. The ID card, name, phone, license plate, QQ and WeChat are all fields of the summary table, and an index table is constructed for each field, namely an ID card index table, a name index table, a phone index table, a license plate index table, a QQ index table and a WeChat index table. The data content contained in each index table (i.e. the summary table field value) is used as the key of the index table, and the ID (i.e. the summary table primary key ID) is used as the value of the index table.
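The per-field index tables of this example can be sketched with nested dictionaries, one inner dictionary per summary table field. The function name build_index_tables and the sample rows are hypothetical.

```python
from typing import Dict

SummaryTable = Dict[str, Dict[str, str]]
IndexTables = Dict[str, Dict[str, str]]


def build_index_tables(summary: SummaryTable) -> IndexTables:
    """Build one index table per field: field value -> summary primary key ID."""
    index_tables: IndexTables = {}
    for row_id, fields in summary.items():
        for field, value in fields.items():
            index_tables.setdefault(field, {})[value] = row_id
    return index_tables


summary = {
    "1": {"name": "Zhang San", "phone": "13800000000", "QQ": "10001"},
    "2": {"name": "Li Si", "phone": "13900000000"},
}
index_tables = build_index_tables(summary)
print(sorted(index_tables))
print(index_tables["phone"])
```

In this sketch two rows that share the same field value would overwrite each other's index entry; the application does not state how such collisions are handled, so that case is left open here.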

An implementation principle of an HBase-based trillion-level real-time data association method in the embodiment of the present application is as follows: when the discrete data source comes, the discrete data source is subjected to data processing and summarization to obtain a summary table; meanwhile, an index table is constructed according to each summary table field, and data content corresponding to the summary table field is written into the index table corresponding to the field; therefore, the purpose of the associated storage of a plurality of data sources is realized.

The embodiment of the application also discloses an HBase-based trillion-level real-time data retrieval method which, with reference to FIG. 7, comprises the following steps:

S400: acquiring a summary table field value.

In one embodiment of the application, the summary table field value is collected from the constructed index table; specifically, the summary table field value described in this embodiment is the data content corresponding to the summary table field.

S500: acquiring the summary table primary key ID from the corresponding index table according to the summary table field value.

In one embodiment of the present application, when the summary table is constructed, the summary table primary key ID, the summary table fields and the summary table field values in the summary table are all associated, and an index table is constructed for each summary table field; therefore, the summary table primary key ID can be obtained from the index table according to the input summary table field value.

S600: retrieving the corresponding data from the corresponding summary table according to the summary table primary key ID.

In an embodiment of the present application, step S600 specifically includes:

matching the summary table according to the summary table primary key ID, and retrieving the corresponding data in the summary table by using the K-V retrieval mode of HBase.

Since K-V retrieval in HBase is very fast (O(1) complexity), the overall complexity of this retrieval mode is O(1), and the retrieval speed does not decrease as the data volume increases; therefore, this retrieval mode is suitable for ultra-large-scale retrieval of multi-field associated data.
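The two lookups of steps S500 and S600 can be sketched as two successive key-value reads. Python dictionaries stand in for the HBase tables, so the constant average lookup cost shown here is a property of the dictionary and only an analogy for the K-V retrieval described above; the function name retrieve and the sample data are hypothetical.

```python
from typing import Dict, Optional

SummaryTable = Dict[str, Dict[str, str]]
IndexTables = Dict[str, Dict[str, str]]


def retrieve(
    field: str,
    value: str,
    index_tables: IndexTables,
    summary: SummaryTable,
) -> Optional[Dict[str, str]]:
    """Look up a summary row from one field value using two key-value reads."""
    # S500: field value -> summary table primary key ID, via that field's index.
    row_id = index_tables.get(field, {}).get(value)
    if row_id is None:
        return None
    # S600: summary table primary key ID -> full associated row.
    return summary.get(row_id)


summary = {"1": {"name": "Zhang San", "phone": "13800000000", "QQ": "10001"}}
index_tables = {
    "name": {"Zhang San": "1"},
    "phone": {"13800000000": "1"},
    "QQ": {"10001": "1"},
}

found = retrieve("phone", "13800000000", index_tables, summary)
missing = retrieve("phone", "00000000000", index_tables, summary)
print(found, missing)
```

Both steps are single-key reads, so the number of reads per query stays at two however many rows the tables hold.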

The implementation principle of the HBase-based trillion-level real-time data retrieval method in the embodiment of the application is as follows: a summary table field value is acquired, the summary table primary key ID is acquired from the corresponding index table according to the summary table field value, and the corresponding data is retrieved from the corresponding summary table according to the summary table primary key ID, so that the retrieval speed does not decrease as the data volume increases.

In an embodiment of the present application, the data association method and the data retrieval method may be executed separately or simultaneously.

Specifically, referring to fig. 8, a schematic diagram of data association and data retrieval performed simultaneously in the embodiment of the present application is shown; the method comprises the steps of collecting a plurality of data sources in real time, storing the collected data sources to Kafka, carrying out data association on the stored data sources to construct a summary table, constructing an index table according to each summary table field, storing the summary table and the index table to HBase, and finally, executing data retrieval.
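The flow of FIG. 8, in which association keeps running while retrieval is served, can be sketched as one small in-memory pipeline. All names here are hypothetical; a deque stands in for Kafka and dictionaries stand in for the HBase summary and index tables.

```python
from collections import deque
from typing import Dict, Optional


class Pipeline:
    """In-memory stand-in for collect -> buffer -> associate -> retrieve."""

    def __init__(self, key: str = "ID") -> None:
        self.key = key
        self.buffer: deque = deque()                  # stands in for Kafka
        self.summary: Dict[str, Dict[str, str]] = {}  # summary table
        self.indexes: Dict[str, Dict[str, str]] = {}  # one index per field

    def collect(self, record: Dict[str, str]) -> None:
        """Accept a record from any data source into the buffer."""
        self.buffer.append(record)

    def associate(self) -> None:
        """Drain the buffer, updating the summary and index tables together."""
        while self.buffer:
            record = dict(self.buffer.popleft())
            row_id = record.pop(self.key)
            self.summary.setdefault(row_id, {}).update(record)
            for field, value in record.items():
                self.indexes.setdefault(field, {})[value] = row_id

    def search(self, field: str, value: str) -> Optional[Dict[str, str]]:
        """Resolve a field value to its row through the index table."""
        row_id = self.indexes.get(field, {}).get(value)
        return None if row_id is None else self.summary[row_id]


pipeline = Pipeline()
pipeline.collect({"ID": "7", "phone": "13800000000"})
pipeline.associate()
# Copy the row as seen after the first source has been associated.
snapshot = dict(pipeline.search("phone", "13800000000"))

pipeline.collect({"ID": "7", "WeChat": "wx_7"})
pipeline.associate()
# The same row is now reachable through either field and holds both values.
merged = pipeline.search("WeChat", "wx_7")
print(snapshot, merged)
```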

The embodiment of the application discloses an HBase-based trillion-level real-time data association device, which comprises a first acquisition module, a first generation module and a second generation module. The first acquisition module is used for acquiring a plurality of discrete data sources; the first generation module is used for performing data processing and summarization on the plurality of discrete data sources to obtain a summary table, wherein the summary table comprises a summary table primary key ID, a plurality of summary table fields and a plurality of summary table field values; the second generation module is used for constructing an index table for each summary table field, taking the summary table field value as the key of the index table and the summary table primary key ID as the value of the index table.

When the HBase-based trillion-level real-time data association apparatus of this embodiment is specifically applied, the HBase-based trillion-level real-time data association method of the above embodiment is adopted, and therefore, the application of the apparatus is not described herein again.

The embodiment of the application discloses an HBase-based trillion-level real-time data retrieval device, which comprises a second acquisition module, a third acquisition module and a data retrieval module. The second acquisition module is used for acquiring a summary table field value; the third acquisition module is used for acquiring the summary table primary key ID from the corresponding index table according to the summary table field value; and the data retrieval module is used for retrieving the corresponding data from the corresponding summary table according to the summary table primary key ID.

When the HBase-based trillion-level real-time data retrieval device of this embodiment is specifically applied, the HBase-based trillion-level real-time data retrieval method of the above embodiment is adopted, and therefore, the application of the device is not described herein again.

In one embodiment of the present application, the data association apparatus and the data retrieval apparatus can be applied separately or simultaneously; when the two are independently applied, the corresponding methods of the device are respectively executed; when both are applied simultaneously, the data association method and the data retrieval method in the above embodiments are executed simultaneously.

The embodiment of the application discloses a terminal, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor executes the computer program by adopting the HBase-based trillion-level real-time data association method and/or the HBase-based trillion-level real-time data retrieval method of the embodiment.

In an embodiment of the present application, the terminal may employ a computer device such as a desktop computer, a notebook computer, or a cloud server, and the terminal includes but is not limited to a processor and a memory, for example, the terminal may further include an input/output device, a network access device, a bus, and the like.

In an embodiment of the present application, the processor may be a central processing unit (CPU); of course, depending on actual use, other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components and the like may also be used, and the general-purpose processor may be a microprocessor or any conventional processor, which is not limited in this application.

In an embodiment of the present application, the memory may be an internal storage unit of the terminal device, for example a hard disk or internal memory of the terminal device; or an external storage device of the terminal device, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card (FC) fitted to the terminal device; or a combination of the internal storage unit and the external storage device. The memory is used for storing the computer program and other programs and data required by the terminal device, and is also used for temporarily storing data that has been output or is to be output, which is not limited in this application.

The HBase-based trillion-level real-time data association method and/or the HBase-based trillion-level real-time data retrieval method in the embodiments described above are stored in a memory of the terminal through the terminal, and are loaded and executed on a processor of the server, so that a user can establish contact with the device through the terminal and inquire various contents processed by the device.

The embodiment of the application discloses a computer-readable storage medium, and the computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the HBase-based trillion-level real-time data association method and/or the HBase-based trillion-level real-time data retrieval method of the above embodiments are/is adopted.

In an embodiment of the present application, the computer program may be stored in a computer-readable medium. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium includes any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like; the computer-readable medium includes but is not limited to the above.

The HBase-based trillion-level real-time data association method and/or the HBase-based trillion-level real-time data retrieval method of the embodiments are stored in a computer-readable storage medium, and are loaded and executed on a processor, and after the computer-readable storage medium is loaded into any computer, any computer can execute the HBase-based trillion-level real-time data association method and/or the HBase-based trillion-level real-time data retrieval method of the embodiments.

The foregoing is a preferred embodiment of the present application and is not intended to limit the scope of the application in any way, and any features disclosed in this specification (including the abstract and drawings) may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.

Claims

1. A trillion-level real-time data association method based on HBase is characterized by comprising the following steps:

acquiring a plurality of discrete data sources;

2. The HBase-based trillion-level real-time data association method according to claim 1, wherein the obtaining a plurality of discrete data sources specifically comprises:

abstracting a data source inlet and a data source outlet;

3. The HBase-based trillion-level real-time data association method according to claim 1, wherein Kafka is employed to carry and store a plurality of said discrete data sources.

4. The HBase-based trillion-level real-time data association method according to any one of claims 1-3, wherein the data processing and summarization of the plurality of discrete data sources to obtain a summary table specifically comprises:

classifying the keywords to obtain associated keywords;

5. A trillion-level real-time data retrieval method based on HBase is characterized by comprising the following steps:

acquiring a summary table field value;

and retrieving response data from the summary table according to the summary table primary key ID.

6. The HBase-based trillion-level real-time data retrieval method according to claim 5, wherein said retrieving response data from a responsive summary table according to said summary table primary key ID specifically comprises:

7. An HBase-based trillion-level real-time data association device, comprising:

8. A trillion-level real-time data retrieval device based on HBase is characterized by comprising:

9. A terminal comprising a memory, a processor and a computer program stored in the memory and being executable on the processor, wherein the processor, when loaded with the computer program, performs the method of any of claims 1-4 and/or 5-6.

10. A computer-readable storage medium, in which a computer program is stored which, when being loaded by a processor, is adapted to carry out the method of any one of claims 1-4 and/or 5-6.