CN116089545B - Method for collecting storage medium change data into data warehouse

Method for collecting storage medium change data into data warehouse

Info

Publication number
CN116089545B
Authority
CN
China
Prior art keywords
data
storage medium
index
deleted
event
Prior art date
Legal status
Active
Application number
CN202310364245.1A
Other languages
Chinese (zh)
Other versions
CN116089545A (en)
Inventor
韩雷
陶赵文
Current Assignee
Yunzhu Information Technology Chengdu Co., Ltd.
Original Assignee
Yunzhu Information Technology Chengdu Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Yunzhu Information Technology Chengdu Co., Ltd.
Priority to CN202310364245.1A
Publication of CN116089545A
Application granted
Publication of CN116089545B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/134 Distributed indices
    • G06F 16/172 Caching, prefetching or hoarding of files
    • G06F 16/182 Distributed file systems
    • G06F 16/2228 Indexing structures
    • G06F 16/252 Integrating or interfacing systems involving database management systems, between a database management system and a front-end application
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of big data, in particular to a method for collecting storage medium change data into a data warehouse, comprising the following steps: expanding the storage medium cluster, installing a data acquisition plug-in on each server of the cluster, and restarting each storage medium instance in the cluster in sequence; calling an interface of the storage medium and configuring the relevant parameters of the data acquisition plug-in; creating an index, inserting it into the storage medium, synchronizing it with the relevant parameters of the data acquisition plug-in, capturing the change data of the storage medium based on the index, and sending the change data to Kafka; consuming the change data in Kafka through a stream processing module and writing the consumed data into a distributed file system; and mapping the data in the distributed file system to a data warehouse and adding the corresponding date partitions in the data warehouse to complete the writing of the storage medium change data.

Description

Method for collecting storage medium change data into data warehouse
Technical Field
The invention relates to the field of big data, in particular to a method for collecting storage medium change data into a data warehouse.
Background
When building the data warehouse of a big data platform, the platform needs to collect data from many different data sources into the ODS tables of the warehouse. Data collection raises the problem of how to keep the collection program imperceptible to the business system and how to reduce the pressure the collection program places on the business data source. For MySQL collection, binlog-based parsing schemes already exist, but for document-type storage media such as Elasticsearch there is no mature collection scheme, and data is still obtained by batch polling, which causes the following problems: as the volume of the storage medium grows, polling places a heavy performance burden on it and can easily crash the service; moreover, polling fetches data in batches at a fixed time interval, a suitable interval is hard to determine, and the fixed interval easily introduces large delays before data reaches the warehouse, so the data cannot be queried and used in time. In view of these problems, we designed a method for collecting storage medium change data into a data warehouse.
Disclosure of Invention
The aim of the invention is to provide a method for collecting storage medium change data into a data warehouse. By designing a method that automatically collects Elasticsearch data, change data in Elasticsearch can be captured effectively, which both lowers the difficulty of collecting from document-type storage media such as Elasticsearch and improves the timeliness of that collection.
The embodiment of the invention is realized by the following technical scheme:
a method of collecting storage medium change data into a data warehouse, the method comprising the steps of:
expanding the capacity of the storage medium cluster, installing a data acquisition plug-in at each server of the storage medium cluster, and then restarting each storage medium in the storage medium cluster in sequence;
calling an interface of a storage medium and configuring related parameters of a data acquisition plug-in;
creating an index, inserting the index into a storage medium, synchronizing the index with related parameters of a data acquisition plug-in, grabbing variable data of the storage medium based on the index, and sending the variable data into kafka;
consuming variable data in the kafka through a stream processing module, and writing the consumed variable data into a distributed file system;
and mapping variable data in the distributed file system to a data warehouse, and adding a corresponding date partition in the data warehouse to complete writing of storage medium change data.
Optionally, the storage medium is specifically Elasticsearch.
Optionally, the consumed change data is written into the distributed file system partitioned by day.
Optionally, the data in the distributed file system is mapped into the data warehouse as follows:
creating an ODS table in the data warehouse, mapping the data in the distributed file system into the ODS table, and adding the corresponding day partition in the ODS table to complete the writing of the storage medium change data.
Optionally, the change data of the storage medium is captured based on the index and sent to Kafka, where the change data includes newly added or updated data, and deleted data.
Optionally, newly added or updated data in the change data of the storage medium is captured based on the index as follows:
creating an index, inserting it into the storage medium, and synchronizing it with the relevant parameters of the data acquisition plug-in, the synchronized index forming a listener interface of the storage medium;
the listener interface monitors newly added or updated events in the storage medium, parses the type of the event, and obtains the newly added data of a new event or the updated data of an update event in the index;
and the newly added or updated data is parsed into a character string, tagged for identification, and sent to Kafka.
Optionally, deleted data in the change data of the storage medium is captured based on the index as follows:
creating an index, inserting it into the storage medium, and synchronizing it with the relevant parameters of the data acquisition plug-in, the synchronized index forming a listener interface of the storage medium;
the listener interface monitors deletion events in the storage medium, obtains the data to be deleted according to the deletion event, and stores that data in a ConcurrentHashMap;
and after the deletion is confirmed, the deleted data is retrieved from the ConcurrentHashMap by its ID, converted into a tagged character string, and sent to Kafka.
The technical scheme of the embodiments of the invention has at least the following advantages and beneficial effects:
by designing a method that automatically collects Elasticsearch data, the embodiments of the invention can effectively capture change data in Elasticsearch, which lowers the difficulty of collecting from document-type storage media such as Elasticsearch and improves the timeliness of that collection.
Drawings
Fig. 1 is a schematic overall flow chart of a method for collecting storage medium change data into a data warehouse according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Referring to fig. 1, fig. 1 is a schematic overall flow chart of a method for collecting storage medium change data into a data warehouse according to the present invention.
In some embodiments, a method of collecting storage medium change data into a data warehouse comprises the following steps:
expanding the storage medium cluster, installing a data acquisition plug-in on each server of the cluster, and then restarting each storage medium instance in the cluster in sequence;
calling an interface of the storage medium and configuring the relevant parameters of the data acquisition plug-in;
creating an index, inserting it into the storage medium, synchronizing it with the relevant parameters of the data acquisition plug-in, capturing the change data of the storage medium based on the index, and sending the change data to Kafka;
consuming the change data in Kafka through a stream processing module, and writing the consumed data into a distributed file system;
and mapping the data in the distributed file system to the data warehouse and adding the corresponding date partitions in the data warehouse to complete the writing of the storage medium change data.
More specifically, the storage medium is Elasticsearch.
In the implementation, the first step is to install a custom data acquisition plug-in on each server in the Elasticsearch (storage medium) cluster and then restart the Elasticsearch instances in turn. The second step is to call the _cluster/settings API provided by Elasticsearch and configure the global parameters needed by the acquisition plug-in, chiefly the Kafka-related configuration such as the Kafka cluster address and the acks parameter for message delivery. The third step is to create an index and set, in the index settings, the parameters needed to collect the current index's data, such as whether data collection is enabled and which target topic the current index's data is sent to. The fourth step is to write a Flink stream processing program that consumes the data in the topic and writes it into HDFS (the distributed file system) partitioned by day. The fifth step is to create an ODS external table in Hive (the data warehouse), map the files in the corresponding HDFS directory into the Hive table, and add the corresponding date partition to the table. The change data in Elasticsearch is thereby written into the Hive data warehouse.
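For illustration only, a minimal Java sketch of how the plug-in might build the Kafka producer it publishes change data with. The factory class and the way the parameters are passed in are assumptions; only the Kafka cluster address and the acks setting correspond to the values configured through _cluster/settings above.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public final class CollectorProducerFactory {

        // Builds the producer the plug-in uses to publish change data.
        // bootstrapServers and acks would come from the cluster-level
        // parameters configured via the _cluster/settings API.
        public static KafkaProducer<String, String> create(String bootstrapServers, String acks) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers); // Kafka cluster address
            props.put(ProducerConfig.ACKS_CONFIG, acks);                          // e.g. "1" or "all"
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            return new KafkaProducer<>(props);
        }
    }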
More specifically, the consumed change data is written into the distributed file system partitioned by day.
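A non-limiting sketch of the fourth step follows: a Flink job that consumes the topic and writes day-partitioned files to HDFS. The topic name, group id, and paths are assumptions, and the connector classes are those of Flink 1.x (FlinkKafkaConsumer, StreamingFileSink), which the patent does not name explicitly.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class ChangeDataToHdfsJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka1:9092");      // assumed address
            props.setProperty("group.id", "es-change-data-collector");  // assumed group id

            // Consume the change-data topic written by the collection plug-in.
            FlinkKafkaConsumer<String> source =
                    new FlinkKafkaConsumer<>("es_change_data", new SimpleStringSchema(), props);

            // Bucket files by day so each day's changes land in their own
            // directory, which the ODS table's date partitions can point at.
            StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path("hdfs:///warehouse/ods/es_change_data"),
                            new SimpleStringEncoder<String>("UTF-8"))
                    .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd"))
                    .build();

            env.addSource(source).addSink(sink);
            env.execute("es-change-data-to-hdfs");
        }
    }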
More specifically, the data in the distributed file system is mapped into the data warehouse as follows:
an ODS table is created in the data warehouse, the data in the distributed file system is mapped into the ODS table, and the corresponding day partition is added to the ODS table to complete the writing of the storage medium change data.
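For illustration, the fifth step's mapping might be expressed as the following HiveQL, issued here over Hive's JDBC driver; the server address, table name, single-column layout, and HDFS locations are assumptions, not details given in the patent.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class OdsTableMapper {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/ods");
                 Statement stmt = conn.createStatement()) {

                // External table over the directory the Flink job writes to;
                // each day partition maps to one dated subdirectory.
                stmt.execute(
                    "CREATE EXTERNAL TABLE IF NOT EXISTS ods_es_change_data (payload STRING) " +
                    "PARTITIONED BY (dt STRING) " +
                    "STORED AS TEXTFILE " +
                    "LOCATION 'hdfs:///warehouse/ods/es_change_data'");

                // Register the day's partition so the new files become queryable.
                stmt.execute(
                    "ALTER TABLE ods_es_change_data ADD IF NOT EXISTS PARTITION (dt='2023-04-07') " +
                    "LOCATION 'hdfs:///warehouse/ods/es_change_data/2023-04-07'");
            }
        }
    }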
In some embodiments, the change data of the storage medium is captured based on the index and sent to Kafka, where the change data includes newly added or updated data, and deleted data.
More specifically, newly added or updated data in the change data of the storage medium is captured based on the index as follows:
an index is created, inserted into the storage medium, and synchronized with the relevant parameters of the data acquisition plug-in, the synchronized index forming a listener interface of the storage medium;
the listener interface monitors newly added or updated events in the storage medium, parses the type of the event, and obtains the newly added data of a new event or the updated data of an update event in the index;
and the newly added or updated data is parsed into a character string, tagged for identification, and sent to Kafka.
In the implementation described above, the first step is to implement Elasticsearch's IndexingOperationListener interface, which provides the postIndex method; this lets the plug-in listen for insert events on the index. The second step is to parse the type from Engine.Index inside the postIndex method and obtain the newly added data of the document in the index. Once the newly added document data has been obtained, it is parsed into a JsonNode, and a key named operator with the value 1 is added to the JsonNode to mark the data as newly added. The third step is to serialize the JsonNode into a string and call the Kafka client's producer method to send the data to the topic given in the configuration.
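A minimal sketch of such a listener is given below for illustration. The Engine.Index accessors vary across Elasticsearch versions, so the 7.x-style parsedDoc()/source() calls and the surrounding class are assumptions; only the operator=1 tag, the JsonNode handling, and the Kafka send follow the description above.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.elasticsearch.index.engine.Engine;
    import org.elasticsearch.index.shard.IndexingOperationListener;
    import org.elasticsearch.index.shard.ShardId;

    public class ChangeCaptureListener implements IndexingOperationListener {

        private static final ObjectMapper MAPPER = new ObjectMapper();
        private final KafkaProducer<String, String> producer;
        private final String topic;

        public ChangeCaptureListener(KafkaProducer<String, String> producer, String topic) {
            this.producer = producer;
            this.topic = topic;
        }

        @Override
        public void postIndex(ShardId shardId, Engine.Index operation, Engine.IndexResult result) {
            try {
                // Accessor names differ between Elasticsearch versions; this
                // assumes 7.x, where the parsed document exposes its source.
                String source = operation.parsedDoc().source().utf8ToString();
                ObjectNode node = (ObjectNode) MAPPER.readTree(source);
                node.put("operator", 1); // 1 marks newly added (or updated) data
                producer.send(new ProducerRecord<>(topic, operation.id(), MAPPER.writeValueAsString(node)));
            } catch (Exception e) {
                // Swallowing here keeps collection failures from breaking indexing itself.
            }
        }
    }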
More specifically, deleted data in the change data of the storage medium is captured based on the index as follows:
an index is created, inserted into the storage medium, and synchronized with the relevant parameters of the data acquisition plug-in, the synchronized index forming a listener interface of the storage medium;
the listener interface monitors deletion events in the storage medium, obtains the data to be deleted according to the deletion event, and stores that data in a ConcurrentHashMap;
and after the deletion is confirmed, the deleted data is retrieved from the ConcurrentHashMap by its ID, converted into a tagged character string, and sent to Kafka.
In the implementation described above, the first step is to implement Elasticsearch's IndexingOperationListener interface, which provides the postDelete method; this lets the plug-in listen for index deletion events both before and after deletion. The second step is to implement the preDelete method: the content of the document to be deleted is obtained through the document's Id and stored in a thread-safe ConcurrentHashMap, where the key is the document Id and the value is the document object. The third step is to implement the postDelete method: the Id of the deleted document is obtained from the Delete object exposed by the method, the confirmed-deleted document is retrieved from the ConcurrentHashMap populated by preDelete using that Id, the document object is converted into a JsonNode, and a key named operator with the value 2 is added to mark the document as deleted. The JsonNode is then serialized into a string, the Kafka client's producer method is called to send the data to the topic given in the configuration, and the entry for that Id is removed from the ConcurrentHashMap.
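A corresponding sketch for the deletion path, again for illustration only: the preDelete/postDelete signatures follow Elasticsearch 7.x, fetchSourceById is a hypothetical helper standing in for the Id-based lookup described above, and only the ConcurrentHashMap staging, the operator=2 tag, the Kafka send, and the removal of the staged entry come from the description.

    import java.util.concurrent.ConcurrentHashMap;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.elasticsearch.index.engine.Engine;
    import org.elasticsearch.index.shard.IndexingOperationListener;
    import org.elasticsearch.index.shard.ShardId;

    public class DeleteCaptureListener implements IndexingOperationListener {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Thread-safe staging area: document Id -> document source captured before deletion.
        private final ConcurrentHashMap<String, String> pendingDeletes = new ConcurrentHashMap<>();
        private final KafkaProducer<String, String> producer;
        private final String topic;

        public DeleteCaptureListener(KafkaProducer<String, String> producer, String topic) {
            this.producer = producer;
            this.topic = topic;
        }

        @Override
        public Engine.Delete preDelete(ShardId shardId, Engine.Delete delete) {
            // Stage the document content before the delete executes.
            String source = fetchSourceById(shardId, delete.id());
            if (source != null) {
                pendingDeletes.put(delete.id(), source);
            }
            return delete;
        }

        @Override
        public void postDelete(ShardId shardId, Engine.Delete delete, Engine.DeleteResult result) {
            String source = pendingDeletes.remove(delete.id()); // also clears the staged entry
            if (source == null) {
                return;
            }
            try {
                ObjectNode node = (ObjectNode) MAPPER.readTree(source);
                node.put("operator", 2); // 2 marks deleted data
                producer.send(new ProducerRecord<>(topic, delete.id(), MAPPER.writeValueAsString(node)));
            } catch (Exception e) {
                // Collection failures must not break the delete path.
            }
        }

        private String fetchSourceById(ShardId shardId, String id) {
            // Placeholder: a real plug-in would read the document through the
            // shard's get/engine API before deletion; omitted here as an assumption.
            return null;
        }
    }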
In summary, the embodiments of the invention completely avoid the polling pressure on Elasticsearch and can support collecting data from Elasticsearch clusters at very large scale; only the custom CDC data acquisition plug-in needs to be installed on the corresponding service. Because the approach is based on event listening, a data change is sensed the moment an insert or delete event occurs in Elasticsearch, which improves the real-time availability of the data and provides technical support for subsequent real-time analysis. The Kafka-based message queue can carry highly concurrent data transmission, and since the relevant Kafka parameters are fully configurable, the method is highly flexible: reasonable parameters can be chosen according to the actual data and concurrency volumes.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A method of collecting storage medium change data into a data warehouse, the method comprising the following steps:
expanding the storage medium cluster, installing a data acquisition plug-in on each server of the cluster, and restarting each storage medium instance in the cluster in sequence;
calling an interface of the storage medium and configuring the relevant parameters of the data acquisition plug-in;
creating an index, inserting it into the storage medium, synchronizing it with the relevant parameters of the data acquisition plug-in, capturing the change data of the storage medium based on the index, and sending the change data to Kafka;
consuming the change data in Kafka through a stream processing module, and writing the consumed data into a distributed file system;
mapping the data in the distributed file system to the data warehouse and adding the corresponding date partition in the data warehouse to complete the writing of the storage medium change data;
wherein the change data of the storage medium is captured based on the index and sent to Kafka, the change data comprising newly added or updated data, and deleted data;
the newly added or updated data being captured as follows:
creating an index, inserting it into the storage medium, and synchronizing it with the relevant parameters of the data acquisition plug-in, the synchronized index forming a listener interface of the storage medium;
the listener interface monitoring newly added or updated events in the storage medium, parsing the type of the event, and obtaining the newly added data of a new event or the updated data of an update event in the index;
parsing the newly added or updated data into a character string, tagging it for identification, and sending it to Kafka;
the deleted data being captured as follows:
creating an index, inserting it into the storage medium, and synchronizing it with the relevant parameters of the data acquisition plug-in, the synchronized index forming a listener interface of the storage medium;
the listener interface monitoring deletion events in the storage medium, obtaining the data to be deleted according to the deletion event, and storing that data in a ConcurrentHashMap;
after the deletion is confirmed, retrieving the deleted data from the ConcurrentHashMap by its ID, converting it into a tagged character string, and sending the string to Kafka;
inheriting Elasticsearch's IndexingOperationListener interface, which provides the postIndex method for listening to insert events on the index; parsing the type from Engine.Index in the postIndex method and obtaining the newly added data of the document in the index; after obtaining the newly added document data, parsing it into a JsonNode, adding a key named operator with the value 1 to the JsonNode to identify the data as newly added, serializing the JsonNode into a string, and calling the Kafka client's producer method to send the data to the topic given in the configuration;
inheriting Elasticsearch's IndexingOperationListener interface, which provides the postDelete and preDelete methods for listening to index deletion events before and after deletion; implementing the preDelete method to obtain the content of the document to be deleted through the document's Id and store it in a thread-safe ConcurrentHashMap, where the key is the document Id and the value is the document object; implementing the postDelete method to obtain the Id of the deleted document from the Delete object exposed by the method, retrieve the confirmed-deleted document from the ConcurrentHashMap populated by the preDelete method using that Id, convert the document object into a JsonNode, and add a key named operator with the value 2 to indicate that the document has been deleted; then serializing the JsonNode into a string, calling the Kafka client's producer method to send the data to the topic given in the configuration, and removing the entry for that Id from the ConcurrentHashMap;
the storage medium being Elasticsearch.
2. The method of claim 1, wherein the consumed data is written into the distributed file system partitioned by day.
3. The method of collecting storage medium change data into a data warehouse according to claim 2, wherein the data in the distributed file system is mapped into the data warehouse as follows:
creating an ODS table in the data warehouse, mapping the data in the distributed file system into the ODS table, and adding the corresponding day partition in the ODS table to complete the writing of the storage medium change data.
CN202310364245.1A 2023-04-07 2023-04-07 Method for collecting storage medium change data into data warehouse Active CN116089545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310364245.1A CN116089545B (en) 2023-04-07 2023-04-07 Method for collecting storage medium change data into data warehouse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310364245.1A CN116089545B (en) 2023-04-07 2023-04-07 Method for collecting storage medium change data into data warehouse

Publications (2)

Publication Number Publication Date
CN116089545A (en) 2023-05-09
CN116089545B (en) 2023-08-22

Family

ID=86204845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310364245.1A Active CN116089545B (en) 2023-04-07 2023-04-07 Method for collecting storage medium change data into data warehouse

Country Status (1)

Country Link
CN (1) CN116089545B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116431654B (en) * 2023-06-08 2023-09-08 中新宽维传媒科技有限公司 Data storage method, device, medium and computing equipment based on integration of lake and warehouse


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286663A1 (en) * 2014-04-07 2015-10-08 VeDISCOVERY LLC Remote processing of memory and files residing on endpoint computing devices from a centralized device
US11263650B2 (en) * 2016-04-25 2022-03-01 [24]7.ai, Inc. Process and system to categorize, evaluate and optimize a customer experience
US10873533B1 (en) * 2019-09-04 2020-12-22 Cisco Technology, Inc. Traffic class-specific congestion signatures for improving traffic shaping and other network operations

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339455A (en) * 2016-08-26 2017-01-18 电子科技大学 Webpage text extracting method based on text tag feature mining
CN108667929A (en) * 2018-05-08 2018-10-16 浪潮软件集团有限公司 Method for synchronizing data to elastic search based on HBase coprocessor
CN109800222A (en) * 2018-12-11 2019-05-24 中国科学院信息工程研究所 A kind of HBase secondary index adaptive optimization method and system
CN111666490A (en) * 2020-04-28 2020-09-15 中国平安财产保险股份有限公司 Information pushing method, device, equipment and storage medium based on kafka
CN111460023A (en) * 2020-04-29 2020-07-28 上海东普信息科技有限公司 Service data processing method, device, equipment and storage medium based on elastic search
CN111597270A (en) * 2020-05-22 2020-08-28 深圳前海微众银行股份有限公司 Data synchronization method, device, equipment and computer storage medium
CN113282618A (en) * 2021-06-18 2021-08-20 福建天晴数码有限公司 Optimization scheme and system for retrieval of active clusters of Elasticissearch
CN113282611A (en) * 2021-06-29 2021-08-20 深圳平安智汇企业信息管理有限公司 Method and device for synchronizing stream data, computer equipment and storage medium
CN113742313A (en) * 2021-08-05 2021-12-03 紫金诚征信有限公司 Data warehouse construction method and device, computer equipment and storage medium
CN114254016A (en) * 2021-12-17 2022-03-29 北京金堤科技有限公司 Data synchronization method, device and equipment based on elastic search and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of Efficient Retrieval Algorithms in a Big Data Environment; Ruan Shijie; China Master's Theses Full-text Database, Information Science and Technology (No. 03); I138-2334 *

Also Published As

Publication number Publication date
CN116089545A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110321387B (en) Data synchronization method, equipment and terminal equipment
CN107506451B (en) Abnormal information monitoring method and device for data interaction
CN109063196B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN101277272B (en) Method for implementing magnanimity broadcast data warehouse-in
US20160224570A1 (en) Archiving indexed data
US9305016B2 (en) Efficient data extraction by a remote application
CN111125260A (en) Data synchronization method and system based on SQL Server
CN111339103B (en) Data exchange method and system based on full-quantity fragmentation and incremental log analysis
CN116089545B (en) Method for collecting storage medium change data into data warehouse
CN113485962B (en) Log file storage method, device, equipment and storage medium
CN111859132A (en) Data processing method and device, intelligent equipment and storage medium
CN105138679A (en) Data processing system and method based on distributed caching
CN104794190A (en) Method and device for effectively storing big data
CN104750855A (en) Method and device for optimizing big data storage
CN114254016A (en) Data synchronization method, device and equipment based on elastic search and storage medium
CN112579695A (en) Data synchronization method and device
US8600990B2 (en) Interacting methods of data extraction
CN112988916A (en) Full and incremental synchronization method, device and storage medium for Clickhouse
CN109947730A (en) Metadata restoration methods, device, distributed file system and readable storage medium storing program for executing
CN113886485A (en) Data processing method, device, electronic equipment, system and storage medium
CN115391457B (en) Cross-database data synchronization method, device and storage medium
CN116303427A (en) Data processing method and device, electronic equipment and storage medium
CN113760950B (en) Index data query method, device, electronic equipment and storage medium
CN111563123B (en) Real-time synchronization method for hive warehouse metadata
CN111259082B (en) Method for realizing full data synchronization in big data environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant