CN112100160B - Active-active real-time data warehouse construction method based on Elasticsearch - Google Patents


Info

Publication number
CN112100160B
CN112100160B (application CN202011224108.0A)
Authority
CN
China
Prior art keywords
data
elastic search
file
time
cluster
Prior art date
Legal status
Active
Application number
CN202011224108.0A
Other languages
Chinese (zh)
Other versions
CN112100160A
Inventor
谭巍
陈卫
田浩兵
张奎
李烨
Current Assignee
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd filed Critical Sichuan XW Bank Co Ltd
Priority claimed from application CN202011224108.0A
Publication of application CN112100160A
Application granted
Publication of granted patent CN112100160B
Status: Active
Anticipated expiration

Classifications

    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/2365: Ensuring data consistency and integrity
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F9/546: Interprogram communication using message passing systems or structures, e.g. queues
    • G06F2209/548: Indexing scheme relating to interprogram communication; queue

Abstract

The invention discloses an active-active real-time data warehouse construction method based on Elasticsearch, which relates to the technical field of real-time big data computation and solves the problem that prior-art real-time data warehouse construction cannot guarantee data consistency. The scheme comprises the following steps: acquiring the index primary shards on each node in Elasticsearch cluster A, and reading the write-ahead log records of each primary shard under that node; judging the read write-ahead log records, and writing the read data into Elasticsearch cluster B in a synchronous blocking mode; rewriting data whose write failed, periodically detecting data that was persisted on disk because of write failure, sending abnormal runtime error messages to the Kafka cluster, and connecting to monitoring for real-time alarms. The method ensures that the data in the two clusters are completely consistent, and is mainly applied to the field of big data analysis.

Description

Active-active real-time data warehouse construction method based on Elasticsearch
Technical Field
The invention relates to the technical field of real-time big data computation, and in particular to an active-active real-time data warehouse construction method based on Elasticsearch.
Background
In the current era of big data there are many data warehouses for storing massive data, and the distributed search engine Elasticsearch (ES) is one of them. Elasticsearch is an open-source, distributed, RESTful search server built on Lucene and commonly used in cloud computing. It conveniently gives large amounts of data the capability of being searched, analyzed and explored. Fully exploiting the horizontal elasticity of Elasticsearch makes data more valuable in a production environment.
As big data technology is applied ever more widely in the financial field, the timeliness requirements on data grow correspondingly higher, for example in real-time precision marketing and real-time risk-control anti-fraud. To meet these business scenarios a real-time data warehouse is normally established. However, financial industries such as banking show pronounced peak and off-peak fluctuations, which places higher demands on the real-time data warehouse: its high availability must be guaranteed, and traffic sharing must be considered at business peak times to keep the user experience smooth. An Elasticsearch cluster comprises a plurality of nodes, each node holds one or more indices, each index is divided into one or more shards, and a shard set comprises either only a primary shard or a primary shard together with one or more replicas.
In the prior art, the following two methods are mainly used to construct a real-time data warehouse:
Application-layer double write: data is written into the 2 clusters by the application-layer code, by deploying 2 sets of service code, and the application layer is responsible for keeping the data in the 2 clusters consistent. This method is the simplest, but later management and maintenance are troublesome: for example, every online rollback must be coded twice and deployed twice, and the data consistency problem remains.
Message queue pull:
the data to be written is placed into a message queue such as Kafka, and the 2 clusters then pull the data independently from the same message queue.
Both methods share the same problem: data consistency cannot be guaranteed, since a write may succeed in one cluster while failing in the other. The root cause is that the two writes are independent operations; in addition, the later management and maintenance cost of these methods is high.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an active-active real-time data warehouse construction method based on Elasticsearch, which aims to guarantee complete data consistency between the two clusters.
the technical scheme adopted by the invention is as follows:
a double-activity real-time data warehouse construction method based on Elastic Search comprises an Elastic Search cluster A and an Elastic Search cluster B, and comprises the following steps:
a, acquiring an index main fragment on each node in an Elastic Search cluster A, wherein each main fragment stores the IP address of the node where the main fragment is located;
b, reading the pre-written log record of each main sub-slice under each node under the data disc directory on each node;
and C: judging the read pre-written log records, and writing the pre-written log records meeting the requirements into a circular buffer queue;
step D, reading data in the annular buffer queue in a multithread mode, and writing the read data into an Elastic Search cluster B in a synchronous blocking mode;
step E: judging whether all data are successfully written, if the data are unsuccessfully written, rewriting, and if the data are rewritten for more than a specified number of times, preferably 3 times, writing the data into a disk for persistence;
step F: detecting data which exist on the disk and are persisted due to write-in failure at regular time, preferably detecting the data once in five minutes, if the data exist, obtaining the persisted data in the disk and writing the data into an Elastic Search cluster B, and clearing the successfully written data content on the disk after the data are successfully written;
and G, sending the running abnormal error message to the kafka cluster, and accessing the monitoring real-time alarm.
The Elastic Search cluster B writes data into the Elastic Search cluster A through the steps, and the two clusters achieve real-time synchronization through mutual writing.
Further, step A specifically comprises:
Step A1: accessing Elasticsearch cluster A over HTTP to obtain the hash string corresponding to each index under all nodes, and storing the obtained hash strings into a Map object belonging to the current node, where the English name of the index is the key of the Map object and the hash string corresponding to the index is the value;
Step A2: passing the key values of the Map object to all nodes of Elasticsearch cluster A to obtain the primary shards of each index via HTTP batch requests, comparing whether the IP address of the current node of Elasticsearch cluster A is consistent with the IP address of the node where each obtained primary shard is located, and storing the consistent entries into a primary-shard Map object maintained by the current node.
The invention stores the IP address to make it convenient to determine the primary shards on a node: the result returned by a query against the Elasticsearch cluster contains the primary shards of all nodes, so the IP address is used to separate out, and thereby identify, the primary shards that belong to the current node.
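Steps A1 and A2 can be sketched as follows. The patent implies a Java implementation; this is a minimal Python illustration that assumes the shard metadata has already been fetched, for example from Elasticsearch's real `GET /_cat/shards?format=json` endpoint (whose rows carry `index`, `shard`, `prirep` and `ip` fields); the sample values are invented.

```python
def primary_shards_on_node(cat_shards_rows, node_ip):
    # Keep only primary shards ("p") whose hosting IP matches this node,
    # keyed as "index/shard" -- analogous to the primary-shard Map object.
    result = {}
    for row in cat_shards_rows:
        if row.get("prirep") == "p" and row.get("ip") == node_ip:
            result["%s/%s" % (row["index"], row["shard"])] = row["ip"]
    return result

# Rows shaped like a GET /_cat/shards?format=json response (values illustrative).
rows = [
    {"index": "flinkxw123", "shard": "4", "prirep": "p", "ip": "10.0.0.1"},
    {"index": "flinkxw123", "shard": "4", "prirep": "r", "ip": "10.0.0.2"},
    {"index": "orders", "shard": "0", "prirep": "p", "ip": "10.0.0.2"},
]
print(primary_shards_on_node(rows, "10.0.0.1"))  # {'flinkxw123/4': '10.0.0.1'}
```

The IP comparison is what drops the replica on 10.0.0.2 and the primaries hosted elsewhere, leaving only the shards this node is responsible for replicating.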
Further, step B specifically comprises:
Step B1: traversing the data disk directory of each node to obtain the key values of the Map object, passing them to the current node, obtaining the path of each index's primary shard on the disk, obtaining the set of write-ahead log record files under the primary shard directory at that path, and sorting the obtained write-ahead log records by update time, with the time closest to the current time placed first;
Step B2: traversing the obtained write-ahead log file set, reading the content of each file via Java NIO to obtain each file's offset, number of written records and file generation, and judging the value of each file's offset; if the offset is smaller than or equal to a specified value, skipping the file, and if the offset is larger than the specified value, finding the corresponding log record file according to the file generation, reading the number of bytes given by the specified offset, and at the same time adding or updating the offset into a Map object maintained by the current node. The specified value is preferably 55: this threshold is chosen because beyond an offset of 55 the data content can be read completely and successfully.
In the method, the set of file names ending in .ckp under the path of a primary shard is obtained; these are the write-ahead log records. Elasticsearch splits data files using 65 MB as the standard, and the .ckp files record metadata about the split files. The files are sorted by update time with the latest, closest to the current time, ranked first, so real-time updating can be achieved and the latest update is obtained each time. When a data file is smaller than 65 MB it is read in one pass; after the data file is updated, the already-read data is skipped via the offset recorded in the Map object, and reading resumes from the updated position.
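The checkpoint handling in step B2 can be sketched as follows. This is a Python illustration under an assumed, simplified binary layout (big-endian offset as int64, numOps as int32, generation as int64); a real Elasticsearch `.ckp` file additionally carries a codec header and further fields, and the patent implies the actual reader uses Java NIO.

```python
import struct

# Assumed simplified checkpoint layout (NOT the exact on-disk format):
# big-endian offset (int64), numOps (int32), generation (int64).
CKP_FORMAT = ">qiq"
OFFSET_THRESHOLD = 55  # per the method: offsets <= 55 are skipped

def read_checkpoint(raw):
    offset, num_ops, generation = struct.unpack_from(CKP_FORMAT, raw)
    return {"offset": offset, "numOps": num_ops, "generation": generation}

def should_read(ckp, last_offset=0):
    # Read only when the checkpoint offset moved past both the fixed
    # threshold and what was already consumed (tracked in the Map object).
    return ckp["offset"] > max(OFFSET_THRESHOLD, last_offset)

raw = struct.pack(CKP_FORMAT, 120, 3, 7)
print(read_checkpoint(raw), should_read(read_checkpoint(raw)))
```

The `generation` value is what the method uses to locate the matching translog file, and `last_offset` is how already-read bytes are skipped after each update.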
Further, step C specifically comprises:
filtering and screening the data content of the read file: if the read data content contains the "transclog" keyword, the content is filtered out; if it does not, the content is selected, the "transclog" keyword is added to it, and it is written into the ring buffer queue.
Adding the "transclog" keyword prevents a message from being synchronized back and forth between the two Elasticsearch clusters.
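A minimal sketch of this step-C filter, in Python (the patent implies a Java implementation; the record shape as a flat dict is an assumption for illustration):

```python
MARKER = "transclog"  # marker keyword from the method

def filter_and_mark(record):
    # Drop records that already carry the marker (they arrived from the
    # peer cluster); otherwise tag them before queueing for replication.
    if MARKER in record:
        return None            # already synchronized once; skip to break the loop
    marked = dict(record)
    marked[MARKER] = True      # the peer cluster will see this and not echo it back
    return marked

print(filter_and_mark({"user": "a"}))                      # gets the marker added
print(filter_and_mark({"user": "a", "transclog": True}))   # None: filtered out
```

Whatever survives the filter is what gets written into the ring buffer queue for step D.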
Further, step E specifically comprises:
Step E1: checking the written data. After submission to cluster B a JSON result string is returned, and the value of its error field is obtained: if the value is true, the submission had errors; if the value is false, the submission had no errors and succeeded. On success, the number of data items in the returned JSON string is obtained and compared with the data volume before the write; if they are equal, the submission succeeded, and the next read from the ring buffer queue proceeds. If the submission had errors, or the returned write count is inconsistent, the data is rewritten;
Step E2: outputting the data that was not successfully submitted to disk. After rewriting a specified number of times, preferably 3, if the data still has not been submitted successfully, it is stored persistently to the disk, and the next read from the ring buffer queue then proceeds.
Rewriting the data guards against a submission failure caused by an error while reading the data; but if the data still fails after 3 rewrites, it is persisted to the disk.
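Steps E1 and E2 can be sketched as follows, in Python (the patent implies Java). The response shape — an `errors` flag plus an `items` list whose length is compared with the submitted count — is an assumption modeled loosely on Elasticsearch bulk responses; `persist` stands in for the disk spill.

```python
def submit_with_retry(submit, batch, persist, max_attempts=3):
    # Success requires no error flag AND a returned item count equal to
    # the submitted count; otherwise retry, then spill to disk via 'persist'.
    for _ in range(max_attempts):
        resp = submit(batch)
        if not resp.get("errors") and len(resp.get("items", [])) == len(batch):
            return True
    persist(batch)
    return False

attempts = {"n": 0}
def flaky(batch):  # fails twice, then succeeds on the third attempt
    attempts["n"] += 1
    ok = attempts["n"] >= 3
    return {"errors": not ok, "items": [{}] * len(batch) if ok else []}

print(submit_with_retry(flaky, [1, 2], persist=lambda b: None))  # True
```

The count comparison is what catches partial writes that return without an explicit error flag.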
Further, step F specifically comprises: periodically detecting whether data persisted because of write failure exists on the disk; if so, reading all persisted data, submitting it to Elasticsearch cluster B, and clearing the data content from the file after successful submission.
Further, step G specifically comprises: during operation, an abnormal error sends its error content to the topic specified in the Kafka cluster; monitoring is then connected, the monitor sends an alarm message to the receiver, and the receiver handles the error content in time.
Errors during operation include, for example, the peer cluster being unavailable, network connection timeouts, and data loss in the written cluster. Through the connected monitoring, the alarm message is sent to the receiver for timely handling.
In summary, thanks to the adopted technical scheme, the invention has the following beneficial effects. By parsing the write-ahead log records of the underlying primary shards, the details of mutual synchronization between internal data are shielded from the outside. First, the difficulty at the application layer is reduced: the application layer no longer needs to double-write or to guarantee data consistency, so the management and maintenance problems of double writing in later stages disappear. Second, the double-write inconsistency problem is solved: the write operation is guaranteed to be atomic, the data in the two clusters is guaranteed to be completely consistent, and data successfully written in one cluster is bound to appear in the other. Finally, the scheme has good wide applicability: it supports mutual real-time synchronization of Elasticsearch cluster data across different versions.
Drawings
The invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow diagram of one embodiment of the present invention;
FIG. 2 is a schematic flow chart of writing the data persisted on the disk in FIG. 1 into the cluster.
Detailed Description
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
The present invention will be described in detail with reference to FIGS. 1 and 2, in which "Y" represents "YES" and "N" represents "NO" in the flowcharts.
The invention relates to an active-active real-time data warehouse construction method based on Elasticsearch, comprising an Elasticsearch cluster B and an Elasticsearch cluster A, in which data is mutually synchronized between cluster B and cluster A by parsing the underlying write-ahead log records.
The method comprises the following specific steps:
Step A: acquiring the index primary shards on each node in Elasticsearch cluster A, wherein each primary shard stores the IP address of the node where it is located;
Step A1: accessing Elasticsearch cluster A over HTTP to obtain the hash string corresponding to each index under all nodes, and storing the obtained hash strings into a Map object belonging to the current node, where the English name of the index is the key of the Map object and the hash string corresponding to the index is the value;
Step A2: passing the key values of the Map object to all nodes of Elasticsearch cluster A to obtain the primary shards of each index via HTTP batch requests, comparing whether the IP address of the current node of Elasticsearch cluster A is consistent with the IP address of the node where each obtained primary shard is located, and storing the consistent entries into a primary-shard Map object maintained by the current node.
This Map object maintains all primary shards on the current node. The IP address is stored to make it convenient to determine the primary shards on the node: because the result returned by the Elasticsearch cluster covers all nodes, the IP address is used to separate the current node's primary shards from all the others.
Design of the Map object: the hash string of the index is spliced with the shard number into a new string, which is paired with the English name of the index. For example:
(TnRucew-ThawtzMZcfMXpw/4,flinkxw123).
Here TnRucew-ThawtzMZcfMXpw is the hash string of the index whose English name is flinkxw123, and /4 denotes primary shard number 4.
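The pair construction above can be sketched in one line of Python (the hash and index name are the document's own example values; the patent implies a Java implementation):

```python
def shard_map_entry(index_hash, shard_no, index_name):
    # Pair "<hash>/<shard number>" with the index's English name,
    # mirroring the example entry above.
    return ("%s/%d" % (index_hash, shard_no), index_name)

print(shard_map_entry("TnRucew-ThawtzMZcfMXpw", 4, "flinkxw123"))
# ('TnRucew-ThawtzMZcfMXpw/4', 'flinkxw123')
```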
Step B: reading the write-ahead log records of each primary shard under the data disk directory on each node;
Step B1: traversing the data disk directory of each node to obtain the key values of the Map object, passing them to the current node to obtain the path of each index's primary shard on the disk, obtaining the set of file names ending in .ckp under the primary shard directory at that path, and sorting the obtained file set by update time, with the update time closest to the current time placed first;
Step B2: traversing the obtained write-ahead log file set, reading the content of each file via Java NIO to obtain each file's offset (offset), number of written records (numOps) and file generation (generation), and judging the value of each file's offset; if the offset is less than or equal to 55, skipping the file, and if the offset is greater than 55, finding the corresponding log record file according to the file generation, reading the number of bytes given by the specified offset, and adding or updating the offset into a Map object maintained by the current node. The offset threshold of 55 is chosen because beyond 55 the data content can be read completely and successfully.
In the method, the set of file names ending in .ckp under the path of a primary shard is obtained; these are the write-ahead log records. Elasticsearch splits data files using 65 MB as the standard, and the .ckp files record metadata about the split files. The files are sorted by update time with the latest, closest to the current time, ranked first, so real-time updating can be achieved and the latest update is obtained each time. When a data file is smaller than 65 MB it is read in one pass; after the data file is updated, the already-read data is skipped via the offset recorded in the Map object, and reading resumes from the updated position.
Step C: filtering and screening the data content of the read file: if the read data content contains the "transclog" keyword, the content is filtered out; if it does not, the content is selected, the "transclog" keyword is added to it, and it is written into the ring buffer queue.
Adding the "transclog" keyword prevents a piece of content from being synchronized back and forth between the two Elasticsearch clusters: if the parsed content does not contain the "transclog" keyword, it is a piece of newly written external data; if it does contain the keyword, the record has already been parsed and synchronized, and no further synchronization is required.
Step D, reading data in the annular buffer queue in a multithread mode, and writing the read data into an Elastic Search cluster B in a synchronous blocking mode;
step E: judging whether all data are successfully written, if the data are unsuccessfully written, rewriting, and if the data are rewritten more than a specified number of times, writing the data into a disk for persistence;
step E1, checking the written data, obtaining a returned result JSON character string after being submitted to the cluster B, obtaining the value of an alarm field in the JSON character string, if the value is true, the fact that the submission has errors is shown, if the value is false, the fact that the submission has no errors is shown, the submission is successful, on the premise that the value is true, obtaining the number of data volumes in the returned JSON character string, comparing the number with the data volumes before the writing, if the data volumes are equal, the fact that the submission is successful, then reading the data from the annular buffer queue for the next time, and if the submission has errors or the returned writing number is inconsistent, rewriting;
step E2: and outputting the data which is not successfully submitted to the disk, and after 3 times of rewriting, if the data is still not successfully submitted, persistently storing the data to the disk. And then the next reading of data from the ring buffer queue is performed.
The data can be rewritten to prevent the failure of submission caused by the error of reading the data, but the data is persisted to the disk if the data is still failed after being rewritten three times.
Step F: detecting data which exist on the disk and are persisted due to write-in failure in a timed manner, if the data exist, obtaining the persisted data in the disk and writing the data into an Elastic Search cluster B, and clearing the successfully written data content on the disk after the data are successfully written;
detecting whether data which is persisted due to write failure exists on the disk at regular intervals, if so, locking the file, reading all the persisted data, then submitting the data to an Elastic Search cluster B, clearing the content of the data in the file after successful submission, and then releasing the lock to avoid data exception caused by simultaneous modification of multiple threads.
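The locked retry of step F can be sketched as follows. This Python sketch uses an in-process lock and a JSON-lines spill file as stand-ins; the patent's implementation (implied to be Java) locks the actual file, and the file layout here is an assumption.

```python
import json
import os
import tempfile
import threading

_lock = threading.Lock()  # stand-in for the per-file lock in the method

def retry_persisted(path, submit):
    # Under the lock: read every spilled record, resubmit, and truncate
    # the file only after every submission succeeded.
    with _lock:
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            return 0
        with open(path) as f:
            records = [json.loads(line) for line in f if line.strip()]
        for rec in records:
            if not submit(rec):
                return 0  # leave the file intact; retry on the next tick
        open(path, "w").close()  # all succeeded: clear the persisted data
        return len(records)

path = os.path.join(tempfile.mkdtemp(), "failed.jsonl")
with open(path, "w") as f:
    f.write('{"id": 1}\n{"id": 2}\n')
print(retry_persisted(path, lambda rec: True), os.path.getsize(path))  # 2 0
```

Clearing the file only after all submissions succeed, while holding the lock, is what prevents the periodic task and a concurrent writer from corrupting each other's view of the spill file.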
Step G: during operation, errors send their error content to the topic specified in the Kafka cluster; the error content includes, for example, the peer cluster being unavailable, network connection timeouts, and data loss in the written cluster. Through the connected monitoring, the alarm message is then sent to the receiver for timely handling.
Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of consumers in a web site. The message sent to the Kafka cluster is a JSON string, whose format is defined as follows:
Index name, JSON key: indexName
Node IP, JSON key: nodeIP
Cause of error, JSON key: error
Time of occurrence, JSON key: time
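Composing such an alert message can be sketched as follows (Python illustration; the timestamp format and the example topic name are assumptions, since the patent defines only the four JSON keys):

```python
import json
import time

def build_alert(index_name, node_ip, error):
    # Four keys as defined in the format table above; the timestamp
    # format is an assumption, the patent does not specify one.
    return json.dumps({
        "indexName": index_name,
        "nodeIP": node_ip,
        "error": error,
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
    })

msg = build_alert("flinkxw123", "10.0.0.1", "peer cluster unavailable")
print(msg)
# A real producer would then send this string to the configured topic,
# e.g. with kafka-python: producer.send("es-sync-alerts", msg.encode())
# (topic name hypothetical).
```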
In general, the method builds an active-active cluster by achieving real-time synchronization of data at the Elasticsearch bottom layer, which not only reduces the difficulty of later maintenance and management compared with double writing, but also fully guarantees the timely consistency of the data in the two clusters.
The active-active behavior of Elasticsearch is obtained by parsing the underlying write-ahead log (WAL). The method must distinguish which records were written externally and which were produced by parsing the WAL; this avoids a "dead loop" of data synchronization. A marking scheme is used for the distinction: if a record in the WAL does not contain the marker, such as the keyword mentioned in step C of the method, it is regarded as externally written data and added to the queue; otherwise it is skipped.
By parsing the underlying write-ahead log (WAL), the details of mutual synchronization between internal data are shielded from the outside. First, the difficulty at the application layer is reduced: the application layer no longer needs to double-write or to guarantee data consistency, so the management and maintenance problems of double writing in later stages disappear. Second, the double-write inconsistency problem is solved: the write operation is guaranteed to be atomic, the data in the two clusters is guaranteed to be completely consistent, and data successfully written in one cluster must appear in the other. Finally, the scheme has good wide applicability: it supports building active-active Elasticsearch clusters across different versions, and is also suitable for mutual real-time synchronization among 3 or more Elasticsearch clusters.
The above-mentioned embodiments only express specific embodiments of the present application, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, and all of these fall within the protection scope of the present application.

Claims (5)

1. A double-activity real-time data warehouse construction method based on Elastic Search comprises an Elastic Search cluster A and an Elastic Search cluster B, and is characterized in that: the method comprises the following steps:
a, acquiring an index main fragment on each node in an Elastic Search cluster A, wherein each main fragment stores the IP address of the node where the main fragment is located;
step A1: accessing the Elastic Search cluster A in an http mode to obtain a hash character corresponding to each index under all nodes, storing the obtained hash character into a Map object belonging to the current node, wherein the English name of the index is used as a key of the Map object, and the hash character corresponding to the index is used as a value of the Map object;
step A2: transmitting the key values of the Map objects into all nodes of an Elastic Search cluster A to obtain main fragments of the index in an http batch request mode, comparing whether the IP address of the current node of the Elastic Search cluster A is consistent with the IP address of the node where the main fragment of the obtained index is located, and storing the consistent IP addresses into one main fragment Map object maintained by the current node;
b, reading the pre-written log record of each main sub-slice under each node under the data disc directory on each node;
step B1: traversing the data disk directory of each node to acquire the key values of the Map object and passing them to the current node, acquiring the on-disk path of the primary shard of each index, acquiring the set of write-ahead log files under the primary shard directory at that path, and sorting the acquired write-ahead log files by update time so that the file updated closest to the current time is ranked first;
step B2: traversing the acquired write-ahead log file set and reading the content of each file via Java NIO to obtain each file's offset, written-record count and file generation, then checking each file's offset: if the offset is less than or equal to a specified value, the file is skipped; if the offset is greater than the specified value, the corresponding log record file is located by its file generation, the byte content at the specified offset is read, and the offset is added to or updated in the Map object maintained by the current node;
step C: checking the read write-ahead log records and writing the records that meet the requirements into a ring buffer queue;
step D: reading data from the ring buffer queue with multiple threads and writing the read data into Elastic Search cluster B in a synchronous blocking manner;
step E: checking whether all data were written successfully; if a write fails, the data are rewritten, and if the data have been rewritten more than a specified number of times, they are persisted to disk;
step F: periodically detecting whether data persisted to disk after write failures exist; if so, reading the persisted data from the disk, writing them into Elastic Search cluster B, and clearing the successfully written content from the disk;
step G: sending error messages raised by runtime exceptions to the kafka cluster and connecting them to real-time alarm monitoring.
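The core of steps B2–D can be sketched in Java. This is an illustrative sketch, not the patented implementation: the log-file layout is simplified to raw bytes after a known offset, the `TranslogPipeline` name, the 1024-slot queue size and the `"<EOF>"` sentinel are hypothetical choices, and a real consumer would bulk-write to Elastic Search cluster B rather than append to an in-memory sink.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class TranslogPipeline {
    // steps C/D: bounded queue standing in for the ring buffer queue
    final BlockingQueue<String> ring = new ArrayBlockingQueue<>(1024);

    // step B2: positional read of the log file content after `offset` via Java NIO
    static String readFromOffset(Path file, long offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) (ch.size() - offset));
            ch.read(buf, offset);   // positional read; channel position is unchanged
            buf.flip();
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }

    // step D: a consumer thread drains the queue until the "<EOF>" sentinel,
    // handing each record to a synchronous writer (here: an in-memory sink)
    void consume(List<String> sink) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            try {
                String rec;
                while (!(rec = ring.take()).equals("<EOF>")) {
                    sink.add(rec);  // stands in for a blocking bulk write to cluster B
                }
            } catch (InterruptedException ignored) { }
        });
        consumer.start();
        consumer.join();
    }
}
```

`ArrayBlockingQueue.take()` already blocks when the queue is empty, which is why the patent's "synchronous blocking mode" needs no extra locking here.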
2. The Elastic Search based double-activity real-time data warehouse construction method according to claim 1, characterized in that the step C specifically comprises:
filtering and screening the data content of each read file: if the read data content contains the 'transclog' keyword, the record is filtered out; if it does not, the record is selected, the 'transclog' keyword is added to it, and the record is written into the ring buffer queue.
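A minimal sketch of this claim-2 filter, assuming JSON-object records and treating 'transclog' as a literal marker string (the class and method names are hypothetical). Dropping records that already carry the marker plausibly prevents the two clusters' synchronizers from replaying each other's writes in a loop, since each side tags what it forwards.

```java
// Hypothetical filter for step C: drop already-tagged records, tag fresh ones.
class TransclogFilter {
    static final String MARK = "transclog";

    // Returns null when the record must be skipped, otherwise the tagged record.
    static String tagOrDrop(String jsonRecord) {
        if (jsonRecord.contains(MARK)) {
            return null;                 // already passed through a synchronizer once
        }
        // append the marker so the other cluster's synchronizer skips it in turn
        int end = jsonRecord.lastIndexOf('}');
        return jsonRecord.substring(0, end) + ",\"" + MARK + "\":true}";
    }
}
```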
3. The Elastic Search based double-activity real-time data warehouse construction method according to claim 1, characterized in that the step E specifically comprises:
step E1: verifying the written data: after a batch is submitted to cluster B, the returned JSON result string is obtained and the value of its error field is read, where true indicates the submission contained errors and false indicates it succeeded without errors; when the value is false, the number of items in the returned JSON string is compared with the number of records submitted, and if the two are equal the submission succeeded and the next batch is read from the ring buffer queue; if the submission contained errors or the returned count does not match, the batch is rewritten;
step E2: outputting unsuccessfully submitted data to the disk: after being rewritten a specified number of times, data that still cannot be submitted are persisted to the disk, and the next batch is then read from the ring buffer queue.
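The claim-3 retry-then-persist policy can be sketched as follows. The `Predicate` stands in for "submit the batch and verify the returned JSON" (error flag false, item count matching), the in-memory `persisted` list stands in for the on-disk file, and the class name and retry bound are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of steps E1/E2: bounded rewrites, then persistence for later redrive.
class RetryingWriter {
    final int maxRetries;
    final List<String> persisted = new ArrayList<>(); // stands in for the persistence file

    RetryingWriter(int maxRetries) { this.maxRetries = maxRetries; }

    // Returns true when the batch was verified as written; false when it was
    // persisted after the initial attempt plus maxRetries rewrites all failed.
    boolean write(String batch, Predicate<String> submitAndVerify) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (submitAndVerify.test(batch)) {
                return true;          // error flag false and counts matched
            }
        }
        persisted.add(batch);         // hand over to the periodic redrive of step F
        return false;
    }
}
```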
4. The Elastic Search based double-activity real-time data warehouse construction method according to claim 1, characterized in that the step F specifically comprises:
periodically detecting whether data persisted after write failures exist on the disk; if so, reading all the persisted data and submitting them to Elastic Search cluster B, and clearing the corresponding data content from the file after a successful submission.
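Claim 4's periodic redrive can be sketched as a directory scan. In a real deployment the scan would run on a timer (e.g. a `ScheduledExecutorService`) and the submitter would bulk-write to Elastic Search cluster B; here the submitter is abstracted as a `Predicate` so the delete-only-after-success logic is visible on its own. Class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Sketch of step F: resubmit persisted batches, removing each file only after
// the submitter confirms success, so failed batches survive for the next scan.
class FailedWriteRedriver {
    // Returns the number of files successfully redriven and removed.
    static int redriveOnce(Path dir, Predicate<String> submitter) throws IOException {
        int redriven = 0;
        try (Stream<Path> files = Files.list(dir)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                String batch = new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
                if (submitter.test(batch)) { // resubmit persisted data to cluster B
                    Files.delete(f);         // clear only after a confirmed success
                    redriven++;
                }
            }
        }
        return redriven;
    }
}
```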
5. The Elastic Search based double-activity real-time data warehouse construction method according to claim 1, characterized in that the step G specifically comprises: during operation, runtime exceptions send their error content to the designated topic of the kafka cluster; the monitoring system then consumes that topic and sends an alarm message to the receiver, who handles the error content promptly.
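A sketch of the claim-5 alarm path. The topic name, message fields and `ErrorAlarm` class are assumptions, and the `BiConsumer` stands in for a real Kafka producer (e.g. `KafkaProducer#send` from the `kafka-clients` library) so the example stays self-contained.

```java
import java.time.Instant;
import java.util.function.BiConsumer;

// Sketch of step G: build a structured alarm message and hand it to a
// producer of (topic, message) pairs; monitoring consumes the topic downstream.
class ErrorAlarm {
    static final String TOPIC = "es-sync-alarms";  // assumed topic name

    static String toMessage(String component, Exception e) {
        return "{\"component\":\"" + component + "\","
             + "\"error\":\"" + e.getMessage() + "\","
             + "\"ts\":\"" + Instant.now() + "\"}";
    }

    static void report(String component, Exception e, BiConsumer<String, String> producer) {
        producer.accept(TOPIC, toMessage(component, e));
    }
}
```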
CN202011224108.0A 2020-11-05 2020-11-05 Elastic Search based double-activity real-time data warehouse construction method Active CN112100160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011224108.0A CN112100160B (en) 2020-11-05 2020-11-05 Elastic Search based double-activity real-time data warehouse construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011224108.0A CN112100160B (en) 2020-11-05 2020-11-05 Elastic Search based double-activity real-time data warehouse construction method

Publications (2)

Publication Number Publication Date
CN112100160A CN112100160A (en) 2020-12-18
CN112100160B true CN112100160B (en) 2021-09-07

Family

ID=73784581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011224108.0A Active CN112100160B (en) 2020-11-05 2020-11-05 Elastic Search based double-activity real-time data warehouse construction method

Country Status (1)

Country Link
CN (1) CN112100160B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988690B (en) * 2021-03-16 2023-02-17 挂号网(杭州)科技有限公司 Dictionary file synchronization method, device, server and storage medium
CN114579668A (en) * 2022-05-06 2022-06-03 中建电子商务有限责任公司 Database data synchronization method
CN114579596B (en) * 2022-05-06 2022-09-06 达而观数据(成都)有限公司 Method and system for updating index data of search engine in real time

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005055519A1 (en) * 2003-12-01 2005-06-16 International Business Machines Corporation Node clustering based on user interests, application requirements and network characteristics
CN102779185A (en) * 2012-06-29 2012-11-14 浙江大学 High-availability distribution type full-text index method
CN103294731A (en) * 2012-03-05 2013-09-11 阿里巴巴集团控股有限公司 Real-time index creating and real-time searching method and device
CN103793290A (en) * 2012-10-31 2014-05-14 腾讯科技(深圳)有限公司 Disaster tolerant system and data reading method thereof
CN104239417A (en) * 2014-08-19 2014-12-24 天津南大通用数据技术股份有限公司 Dynamic adjustment method and dynamic adjustment device after data fragmentation in distributed database
CN105095762A (en) * 2015-07-31 2015-11-25 中国人民解放军信息工程大学 Global offset table protection method based on address randomness and segment isolation
CN109408289A (en) * 2018-10-16 2019-03-01 国网山东省电力公司信息通信公司 A kind of cloud disaster tolerance data processing method
CN110825816A (en) * 2020-01-09 2020-02-21 四川新网银行股份有限公司 System and method for data acquisition of partitioned database
CN111752962A (en) * 2020-07-01 2020-10-09 浪潮云信息技术股份公司 System and method for ensuring high availability and consistency of MHA cluster

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10656866B2 (en) * 2014-12-31 2020-05-19 Pure Storage, Inc. Unidirectional vault synchronization to support tiering
US10560544B2 (en) * 2015-08-25 2020-02-11 Box, Inc. Data caching in a collaborative file sharing system
CN107959695B (en) * 2016-10-14 2021-01-29 北京国双科技有限公司 Data transmission method and device
US10394670B2 (en) * 2017-06-02 2019-08-27 Verizon Patent And Licensing Inc. High availability and disaster recovery system architecture
CN108418859B (en) * 2018-01-24 2020-11-06 华为技术有限公司 Method and device for writing data
CN108737184B (en) * 2018-05-22 2021-08-20 华为技术有限公司 Management method and device of disaster recovery system
CN111352766A (en) * 2018-12-21 2020-06-30 中国移动通信集团山东有限公司 Database double-activity implementation method and device
CN110990366B (en) * 2019-12-04 2024-02-23 中国农业银行股份有限公司 Index allocation method and device for improving performance of ES-based log system
CN111026621B (en) * 2019-12-23 2023-04-07 杭州安恒信息技术股份有限公司 Monitoring alarm method, device, equipment and medium for Elasticissearch cluster


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
How ElasticSearch guarantees data consistency and real-time behavior; SHAN (pseudonymous author); https://www.jianshu.com/61dd9fb7d785; 2018-08-21; pp. 1-6 *
The Network Architecture Design of Distributed Dual Live Data Center; Nan Shuping et al.; 2019 IEEE International Conference on Power, Intelligent; 2019-12-26; pp. 638-642 *
Research on synchronization strategies and efficiency analysis of disaster-recovery backup systems; Yang Peng; China Masters' Theses Full-text Database, Information Science and Technology; 2009-10-15 (No. 10); I138-41 *

Also Published As

Publication number Publication date
CN112100160A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112100160B (en) Elastic Search based double-activity real-time data warehouse construction method
US11657053B2 (en) Temporal optimization of data operations using distributed search and server management
US11496545B2 (en) Temporal optimization of data operations using distributed search and server management
CN111723160B (en) Multi-source heterogeneous incremental data synchronization method and system
US20220237166A1 (en) Table partitioning within distributed database systems
US10891297B2 (en) Method and system for implementing collection-wise processing in a log analytics system
US7702640B1 (en) Stratified unbalanced trees for indexing of data items within a computer system
EP3791284A1 (en) Conflict resolution for multi-master distributed databases
AU2022200375A1 (en) Temporal optimization of data operations using distributed search and server management
US8396840B1 (en) System and method for targeted consistency improvement in a distributed storage system
US20170242761A1 (en) Fault tolerant listener registration in the presence of node crashes in a data grid
US20120197958A1 (en) Parallel Serialization of Request Processing
JP2023546249A (en) Transaction processing methods, devices, computer equipment and computer programs
US10936559B1 (en) Strongly-consistent secondary index for a distributed data set
US8468134B1 (en) System and method for measuring consistency within a distributed storage system
US11676066B2 (en) Parallel model deployment for artificial intelligence using a primary storage system
CN112527783B (en) Hadoop-based data quality exploration system
CN113360456B (en) Data archiving method, device, equipment and storage medium
CN111639114A (en) Distributed data fusion management system based on Internet of things platform
CN113111129A (en) Data synchronization method, device, equipment and storage medium
US11663192B2 (en) Identifying and resolving differences between datastores
US20200250188A1 (en) Systems, methods and data structures for efficient indexing and retrieval of temporal data, including temporal data representing a computing infrastructure
US11397750B1 (en) Automated conflict resolution and synchronization of objects
CN109947730B (en) Metadata recovery method, device, distributed file system and readable storage medium
US11188228B1 (en) Graphing transaction operations for transaction compliance analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant