CN112905308B - High-availability deployment method for double computer rooms of es cluster - Google Patents

Info

Publication number
CN112905308B
Authority
CN
China
Prior art keywords
node
nodes
machine room
cluster
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110495513.4A
Other languages
Chinese (zh)
Other versions
CN112905308A (en)
Inventor
秦威伟
曾令华
龚建
胡沛勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongbang Bank Co Ltd
Original Assignee
Wuhan Zhongbang Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhongbang Bank Co Ltd filed Critical Wuhan Zhongbang Bank Co Ltd
Priority to CN202110495513.4A priority Critical patent/CN112905308B/en
Publication of CN112905308A publication Critical patent/CN112905308A/en
Application granted granted Critical
Publication of CN112905308B publication Critical patent/CN112905308B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/4557Distribution of virtual machine instances; Migration and load balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45575Starting, stopping, suspending or resuming virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45587Isolation or security of virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a high-availability deployment method for es cluster dual machine rooms, belongs to the field of computer business application research and development, and solves the problem that cluster deployments in the prior art cannot guarantee cluster data security, application availability and uninterrupted application service. The method installs and deploys Es clusters in a main machine room A and a standby machine room B; after installation and deployment, parameters are set so that copies of the same shard are distributed across the main machine room A and the standby machine room B and a shard and its copies are located in different zones; Nginx clusters are installed and deployed in the main machine room A and the standby machine room B and routing is configured; high-availability logic calls are then realized based on the configured main machine room A and standby machine room B. The invention is used to realize es cluster dual-machine-room deployment.

Description

High-availability deployment method for double computer rooms of es cluster
Technical Field
A high-availability deployment method for es cluster double machine rooms is used for achieving es cluster double machine room deployment and belongs to the field of computer business application research and development.
Background
In daily payment business, driven by factors such as regulation, risk and cost, the business data and log data of a bank system must be stored for a long time; business rules change frequently, which makes statistical analysis of the data harder, yet the business needs to analyse historical data to develop innovative services, and system operation and maintenance personnel need to analyse system log data to judge the operating condition of the system. Such analysis is mostly implemented with an es cluster.
However, most existing es clusters are deployed in a single machine room, which satisfies ordinary requirements for high availability and data security, but this traditional deployment mode cannot cope well with the strict security requirements of the financial industry and the need for cross-machine-room, off-site disaster recovery. Owing to the characteristics of the es cluster, deploying it across machine rooms reduces the efficiency of writing and querying data and cannot meet retrieval requirements of high concurrency and low latency. In other words, if the data is kept in a single machine room and that machine room loses its data (for example, in a natural disaster), the data cannot be recovered and the service is interrupted; but if the cluster is simply deployed across two sites, write and query efficiency drops.
CN202010099024.2 discloses a disaster recovery method and device for dual machine rooms, but the following technical problems exist:
data backup integrity is not addressed: if all data in machine room A is lost, it is unclear whether the data can be recovered; that is, cluster data security, application availability and uninterrupted application service are not guaranteed when one machine room is completely destroyed;
when a failure of a node server is detected, if the corresponding master node cannot rejoin the cluster normally, high availability is in doubt, and high availability of the application cannot be guaranteed.
Disclosure of Invention
Aiming at the above problems, the invention aims to provide a high-availability deployment method for dual machine rooms of an es cluster, solving the problem that cluster deployments in the prior art cannot guarantee cluster data security, application availability and uninterrupted application service.
In order to achieve the purpose, the invention adopts the following technical scheme:
a high-availability deployment method for double computer rooms of an es cluster comprises the following steps:
step 1: install and deploy Es clusters in a main machine room A and a standby machine room B; after installation and deployment are complete, set parameters so that copies of the same shard are distributed across the main machine room A and the standby machine room B, and the shard and its copies are located in different zones;
step 2: install and deploy Nginx clusters in the main machine room A and the standby machine room B, and perform routing configuration;
step 3: realize high-availability logic calls based on the configured main machine room A and the standby machine room B.
Further, the specific steps of step 1 are:
step 1.1: install and deploy 6 es nodes in the main machine room A, named ANode1-ANode6;
step 1.2: install and deploy 3 es nodes in the standby machine room B, named BNode1-BNode3;
step 1.3: set master nodes and data nodes: modify the configuration of nodes ANode1, ANode4 and BNode1, namely set the parameter node.master in ANode1, ANode4 and BNode1 to true, so that after modification ANode1, ANode4 and BNode1 are master nodes; modify the configuration of nodes ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3, namely set the parameter node.data in these nodes to true, so that after modification ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3 are data nodes;
step 1.4: set the Es cluster zone-aware allocation parameters, namely set cluster.routing.allocation.awareness.attributes, the parameter that determines whether the cluster allocates shards by zone, to zone, and set cluster.routing.allocation.awareness.force.zone.values, the parameter that forces zone-aware allocation, to z1,z2,z3, so that only one copy of the same shard can be stored in any one zone, the shard and its copies are located in different zones, and copies are guaranteed not to be allocated across zones; here a shard is the unit used to store data in the Es cluster, each shard has two copies, and allocation refers to assigning and migrating shards between nodes;
step 1.5: after the zone-aware allocation parameters are set, assign a zone number to every node: namely set the node.attr.zone parameter in each node, setting node.attr.zone in nodes ANode1-ANode3 to z1, in nodes ANode4-ANode6 to z2, and in nodes BNode1-BNode3 to z3, where node.attr.zone is the zone-number parameter; nodes given the same zone number are assigned to the same zone;
step 1.6: after the zone numbers are assigned, configure the Es cluster so that it provides service to the outside only when at least two master nodes are available, and adjust the timeout the Es cluster uses when discovering other nodes;
step 1.7: if there are firewalls between the main machine room A and the standby machine room B, or a network policy interrupts idle tcp connections after a certain time, set the parameter network.tcp.keep_alive to true and adjust transport.ping_schedule so that connections between nodes are kept alive.
Further, the specific steps of step 2 are:
step 2.1: arranging an Nginx cluster at each of a main machine room A and a standby machine room B;
step 2.2: after deployment, configure the routing weights of the Nginx clusters in the main machine room A and the standby machine room B as 2:1.
Further, the step 3 specifically comprises:
based on the configured main machine room A and the standby machine room B:
when 1 data node in the main machine room A is down: pick one of the data nodes in the main machine room A and simulate the downtime, namely shut down the selected node or the virtual machine hosting it; the application query service still runs normally;
when one node in each of two zones of the main machine room A is down: pick two nodes located in two different zones of the main machine room A and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when both data nodes in one zone of the main machine room A are down: pick any two machines in one zone of the main machine room A and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when one data node in each of the three zones is down: pick one data node in each of the three zones and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when all the data nodes in the zone corresponding to the standby machine room B are down: pick the two data nodes of the standby machine room B and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when the master node in zone z1 is down: simulate downtime of node ANode1 in the main machine room A, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when the master node in zone z2 is down: simulate downtime of node ANode4 in the main machine room A, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when the master node in zone z3 is down: shut down node BNode1 in the standby machine room B, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when all the nodes in the standby machine room B are down: shut down nodes BNode1-BNode3 in the standby machine room B, namely shut down the nodes or the virtual machines hosting them; the application query service still runs normally;
when all the nodes in the main machine room A are down: if service must still be provided, temporarily turn one data node in the standby machine room into a master node, namely set the node.master parameter of that data node to true; otherwise the cluster cannot provide service and errors are reported.
Compared with the prior art, the invention has the beneficial effects that:
1. in the invention, copies of the same shard are distributed across the main machine room A and the standby machine room B, and a shard and its copies are located in different zones, which guarantees the integrity and safety of the data; the method keeps the data recoverable and complete even when all nodes of the main machine room A or of the standby machine room B are down, or when data is lost because of an emergency such as a natural disaster in machine room A or B; that is, query efficiency is preserved while the data is also stored off-site, which improves data safety;
2. the weight with which Nginx forwards requests to the service is set to main machine room A : standby machine room B = 2:1; with this weight roughly 2/3 of the query requests are processed inside the main machine room A, which preserves the query efficiency of the cluster under normal conditions and matches the 2:1 ratio in which the cluster is deployed across the main and standby machine rooms;
3. based on the deployment and parameter settings of the invention, the index parameter _routing.required of each index in the Es cluster is set to true, so queries can follow a routing mechanism: a query is directed to specific shards according to the stored document id instead of traversing all shards; concretely, each query is routed through the doc_id of the record, so not every shard has to be searched, and with this simple configuration every query can be routed to one shard or one of its copy nodes, which avoids every query necessarily crossing machine rooms;
4. based on the deployment and settings of the invention, a filter can be used during queries, which avoids scoring and saves time (the desired data is screened out without computing relevance scores), improving the response time of every query request;
5. the invention configures the forced zone-aware allocation parameter cluster.routing.allocation.awareness.force.zone.values = z1,z2,z3; this parameter ensures that, across the three zones of the main machine room A and the standby machine room B, a shard (each shard has two copies) and its copies never land on virtual machine nodes in the same zone, so performance is not affected by allocation (shard migration) when some of the nodes are down; at the same time, to keep a briefly downed node from immediately triggering migration of its shards to other nodes in the same zone, a migration waiting time of "8h" is configured, namely after a node of the Es cluster goes down, the unassigned shards are reallocated only after an 8h delay, so taking nodes offline for maintenance does not affect node recovery time (a configuration sketch for this delay follows this list);
6. the invention introduces Es cluster zone partitioning, query routing and a score-free filter mechanism, and sets routing weights in cooperation with Nginx, so that the advantages of the cluster are exploited to the greatest extent to meet the requirements of high availability, high concurrency and low latency; data retrieval efficiency and data security are greatly improved, and database usage cost is reduced.
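As a hedged illustration of the 8h migration delay in point 5 above, the following minimal sketch applies the standard Elasticsearch delayed-allocation setting index.unassigned.node_left.delayed_timeout to an index; the index name pay_log and the host ANode1:9200 are assumptions, not values taken from the disclosure:

    # Delay reallocation of shards whose node has left the cluster by 8 hours,
    # so routine node maintenance does not trigger cross-node shard migration.
    curl -X PUT "http://ANode1:9200/pay_log/_settings" \
      -H 'Content-Type: application/json' \
      -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": "8h" } }'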
Drawings
Fig. 1 is a schematic diagram of Es cluster cross-room deployment in the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments.
The es cluster double-computer room high-availability deployment method comprises the following steps:
step 1: install and deploy Es clusters in a main machine room A and a standby machine room B; after installation and deployment are complete, set parameters so that copies of the same shard are distributed across the main machine room A and the standby machine room B, and the shard and its copies are located in different zones;
the method comprises the following specific steps:
step 1.1: install and deploy 6 es nodes in the main machine room A, named ANode1-ANode6;
step 1.2: install and deploy 3 es nodes in the standby machine room B, named BNode1-BNode3;
step 1.3: set master nodes and data nodes: modify the configuration of nodes ANode1, ANode4 and BNode1, namely set the parameter node.master in ANode1, ANode4 and BNode1 to true, so that after modification ANode1, ANode4 and BNode1 are master nodes; modify the configuration of nodes ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3, namely set the parameter node.data in these nodes to true, so that after modification ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3 are data nodes;
step 1.4: set the Es cluster zone-aware allocation parameters, namely set cluster.routing.allocation.awareness.attributes, the parameter that determines whether the cluster allocates shards by zone, to zone, and set cluster.routing.allocation.awareness.force.zone.values, the parameter that forces zone-aware allocation, to z1,z2,z3, so that only one copy of the same shard can be stored in any one zone, the shard and its copies are located in different zones, and copies are guaranteed not to be allocated across zones; here a shard is the unit used to store data in the Es cluster, each shard has two copies, and allocation refers to assigning and migrating shards between nodes;
step 1.5: after the zone-aware allocation parameters are set, assign a zone number to every node: namely set the node.attr.zone parameter in each node, setting node.attr.zone in nodes ANode1-ANode3 to z1, in nodes ANode4-ANode6 to z2, and in nodes BNode1-BNode3 to z3, where node.attr.zone is the zone-number parameter; nodes given the same zone number are assigned to the same zone;
step 1.6: after the zone numbers are assigned, configure the Es cluster so that it provides service to the outside only when at least two master nodes are available, and adjust the timeout the Es cluster uses when discovering other nodes;
step 1.7: if there are firewalls between the main machine room A and the standby machine room B, or a network policy interrupts idle tcp connections after a certain time, set the parameter network.tcp.keep_alive to true and adjust transport.ping_schedule so that connections between nodes are kept alive.
Step 2: installing and deploying the Nginx cluster in a main machine room A and a standby machine room B, and performing routing configuration;
the method comprises the following specific steps:
step 2.1: arranging an Nginx cluster at each of a main machine room A and a standby machine room B;
step 2.2: after deployment, configure the routing weights of the Nginx clusters in the main machine room A and the standby machine room B as 2:1.
Step 3: realize high-availability logic calls based on the configured main machine room A and the standby machine room B.
The method specifically comprises the following steps:
based on the configured main machine room A and the standby machine room B:
when 1 data node in the main machine room A is down: pick one of the data nodes in the main machine room A and simulate the downtime, namely shut down the selected node or the virtual machine hosting it; the application query service still runs normally;
when one node in each of two zones of the main machine room A is down: pick two nodes located in two different zones of the main machine room A and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when both data nodes in one zone of the main machine room A are down: pick any two machines in one zone of the main machine room A and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when one data node in each of the three zones is down: pick one data node in each of the three zones and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when all the data nodes in the zone corresponding to the standby machine room B are down: pick the two data nodes of the standby machine room B and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when the master node in zone z1 is down: simulate downtime of node ANode1 in the main machine room A, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when the master node in zone z2 is down: simulate downtime of node ANode4 in the main machine room A, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when the master node in zone z3 is down: shut down node BNode1 in the standby machine room B, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when all the nodes in the standby machine room B are down: shut down nodes BNode1-BNode3 in the standby machine room B, namely shut down the nodes or the virtual machines hosting them; the application query service still runs normally;
when all the nodes in the main machine room A are down: if service must still be provided, temporarily turn one data node in the standby machine room into a master node, namely set the node.master parameter of that data node to true; otherwise the cluster cannot provide service and errors are reported (a sketch of how these downtime drills can be verified follows this list).
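As a hedged illustration only (not part of the original disclosure): after stopping a node in one of the drills above, the cluster state can be checked with the standard Elasticsearch health and cat APIs; the host names below are assumptions:

    # Overall cluster status (typically yellow while shards recover, green afterwards)
    curl -s "http://ANode2:9200/_cluster/health?pretty"
    # Shard placement: confirm that no two copies of the same shard share a zone
    curl -s "http://ANode2:9200/_cat/shards?v"
    # A sample query through the Nginx entry point to confirm the application query service
    curl -s "http://nginx-a.example.com/pay_log/_search?q=*:*&size=1"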
Examples
Step 1: installing and deploying Es clusters in a main machine room A and a standby machine room B, and setting parameters after the Es clusters are installed and deployed;
Step 1.1: select six 16C+32G servers in machine room A, install es on each of them (the es nodes), modify the es configuration file elasticsearch.yml, and set node.name to ANode1-ANode6 in turn;
Step 1.2: select three 16C+32G servers in machine room B, install es on each of them (the es nodes), modify the es configuration file elasticsearch.yml, and set node.name to BNode1-BNode3 in turn;
Step 1.3: set node.master to true in the configuration of nodes ANode1, ANode4 and BNode1; in the configuration of nodes ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3 set node.master to false and node.data to true;
Step 1.4: set cluster.routing.allocation.awareness.attributes to zone in the configuration of nodes ANode1-ANode6 and BNode1-BNode3, and set cluster.routing.allocation.awareness.force.zone.values to z1,z2,z3;
Step 1.5: set node.attr.zone to z1 in nodes ANode1-ANode3, to z2 in nodes ANode4-ANode6, and to z3 in nodes BNode1-BNode3;
Step 1.6: set discovery.zen.minimum_master_nodes to 2 in nodes ANode1-ANode6 and BNode1-BNode3, meaning the Es cluster needs at least two master nodes to provide service normally;
Step 1.7: set network.tcp.keep_alive to true and transport.ping_schedule to 300s in nodes ANode1-ANode6 and BNode1-BNode3 (an elasticsearch.yml sketch covering these settings follows).
Step 2: and installing and deploying Nginx clusters in the main machine room A and the standby machine room B.
Step 2.1: select one 2C+8G server in the main machine room A and one in the standby machine room B, and install Nginx on each.
Step 2.2: modify the Nginx configuration, configure the application request routing addresses of the main machine room A and the standby machine room B in the upstream block, and set the routing-address weight of the main machine room A to weight=2 and that of the standby machine room B to weight=1 (an Nginx configuration sketch follows).
And step 3: and realizing high-availability logic calling based on the configured main machine room A and the standby machine room B.
Select one of the four data nodes ANode2, ANode3, ANode5 and ANode6 in the main machine room A, stop the es service on that data node, and verify that the application query service is normal.
Stop the es services of data nodes ANode2 and ANode5 in the main machine room A and verify that the application query service is normal.
Stop the es services of data nodes ANode5 and ANode6 in the main machine room A and verify that the application query service is normal.
Stop the es services of data nodes ANode2 and ANode6 in the main machine room A and of data node BNode2 in the standby machine room B, and verify that the query service is normal.
Stop the es services of data nodes BNode2 and BNode3 in the standby machine room B and verify that the application query service is normal.
Stop the es service of master node ANode1 in the main machine room A and verify that the application query service is normal.
Stop the es service of master node ANode4 in the main machine room A and verify that the application query service is normal.
Stop the es service of master node BNode1 in the standby machine room B and verify that the application query service is normal.
Stop the es services of BNode1, BNode2 and BNode3 in the standby machine room B and verify that the application query service is normal.
Stop the es services of all nodes ANode1-ANode6 in the main machine room A, set node.master to true on one data node in the standby machine room B, and verify that the application query service is normal (a sketch of this promotion follows).
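A hedged sketch of this emergency promotion (which data node is chosen and the service name elasticsearch are assumptions):

    # On one data node in standby machine room B, edit elasticsearch.yml:
    #   node.master: true
    # then restart the es service so the node becomes master-eligible:
    sudo systemctl restart elasticsearch
    # Verify that a master has been elected and the cluster serves queries again:
    curl -s "http://BNode2:9200/_cat/master?v"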
Based on the above implementation logic, verify how the cluster works under network jitter: because network jitter between different machine rooms has to be considered in a cross-machine-room deployment, the working condition of the cluster under network jitter needs to be verified.
On data node ANode2 in the main machine room A, set: tc qdisc add dev eth0 root netem delay 50ms 20ms 50% (meaning the transmission delay of the eth0 network card is set to 50ms, with 50% of the packets randomly delayed between 30ms (50-20) and 70ms (50+20)); on data node ANode4, set: tc qdisc add dev eth0 root netem loss 1% (meaning the eth0 network card randomly drops 1% of the packets); on data node BNode2 in the standby machine room B, set: tc qdisc add dev eth0 root netem delay 50ms 20ms 50%; then verify that the application query service is normal.
The application address weight of the Nginx route to the main machine room A and the standby machine room B is set to be 2: 1;
Set the index mapping parameter _routing.required of each index in the Es cluster to true; comparing the tps of the application query service before and after this setting, the tps of the application query service is significantly improved;
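A hedged sketch of this routing setup (the index name pay_log, field names, shard/replica counts and the host are assumptions; the _routing mapping option and the routing request parameter are standard Elasticsearch features, shown here in 6.x single-type form):

    # Create an index that requires a routing value on every write;
    # 3 shards with 2 replicas each matches the layout described above.
    curl -X PUT "http://ANode1:9200/pay_log" -H 'Content-Type: application/json' -d '
    {
      "settings": { "number_of_shards": 3, "number_of_replicas": 2 },
      "mappings": {
        "_doc": {
          "_routing": { "required": true },
          "properties": { "order_id": { "type": "keyword" }, "msg": { "type": "text" } }
        }
      }
    }'
    # Index a document with an explicit routing value (here the document id itself)
    curl -X PUT "http://ANode1:9200/pay_log/_doc/1001?routing=1001" \
      -H 'Content-Type: application/json' -d '{ "order_id": "1001", "msg": "pay ok" }'
    # Query with the same routing value: only one shard (or one of its copies) is searched
    curl -X GET "http://ANode1:9200/pay_log/_search?routing=1001" \
      -H 'Content-Type: application/json' -d '{ "query": { "term": { "order_id": "1001" } } }'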
Use a filter when querying indices in the Es cluster; comparing the tps of the application query service before and after this change, the tps of the application query service is clearly improved after the filter is used, and under the same conditions the tps with the filter is higher than the tps without it;
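A hedged sketch of such a filter-context query (field names, values and the host are assumptions; a bool/filter clause returns matching documents without computing relevance scores and can be cached):

    # Filter context: no scoring, results can be cached by the node
    curl -X GET "http://ANode1:9200/pay_log/_search?routing=1001" \
      -H 'Content-Type: application/json' -d '
    {
      "query": {
        "bool": {
          "filter": [
            { "term":  { "order_id": "1001" } },
            { "range": { "amount": { "gte": 100 } } }
          ]
        }
      }
    }'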
the above are merely representative examples of the many specific applications of the present invention, and do not limit the scope of the invention in any way. All the technical solutions formed by the transformation or the equivalent substitution fall within the protection scope of the present invention.

Claims (2)

1. A high-availability deployment method for double computer rooms of an es cluster is characterized by comprising the following steps:
step 1: install and deploy Es clusters in a main machine room A and a standby machine room B; after installation and deployment are complete, set parameters so that copies of the same shard are distributed across the main machine room A and the standby machine room B, and the shard and its copies are located in different zones;
step 2: install and deploy Nginx clusters in the main machine room A and the standby machine room B, and perform routing configuration;
step 3: realize high-availability logic calls based on the configured main machine room A and the standby machine room B;
the specific steps of the step 2 are as follows:
step 2.1: arranging an Nginx cluster at each of a main machine room A and a standby machine room B;
step 2.2: after deployment, configure the routing weights of the Nginx clusters in the main machine room A and the standby machine room B as 2:1;
the specific steps of the step 1 are as follows:
step 1.1: install and deploy 6 es nodes in the main machine room A, named ANode1-ANode6;
step 1.2: install and deploy 3 es nodes in the standby machine room B, named BNode1-BNode3;
step 1.3: set master nodes and data nodes: modify the configuration of nodes ANode1, ANode4 and BNode1, namely set the parameter node.master in ANode1, ANode4 and BNode1 to true, so that after modification ANode1, ANode4 and BNode1 are master nodes; modify the configuration of nodes ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3, namely set the parameter node.data in these nodes to true, so that after modification ANode2, ANode3, ANode5, ANode6, BNode2 and BNode3 are data nodes;
step 1.4: set the Es cluster zone-aware allocation parameters, namely set cluster.routing.allocation.awareness.attributes, the parameter that determines whether the cluster allocates shards by zone, to zone, and set cluster.routing.allocation.awareness.force.zone.values, the parameter that forces zone-aware allocation, to z1,z2,z3, so that only one copy of the same shard can be stored in any one zone, the shard and its copies are located in different zones, and copies are guaranteed not to be allocated across zones; here a shard is the unit used to store data in the Es cluster, each shard has two copies, and allocation refers to assigning and migrating shards between nodes;
step 1.5: after the zone-aware allocation parameters are set, assign a zone number to every node: namely set the node.attr.zone parameter in each node, setting node.attr.zone in nodes ANode1-ANode3 to z1, in nodes ANode4-ANode6 to z2, and in nodes BNode1-BNode3 to z3, where node.attr.zone is the zone-number parameter; nodes given the same zone number are assigned to the same zone;
step 1.6: after the zone numbers are assigned, configure the Es cluster so that it provides service to the outside only when at least two master nodes are available, and adjust the timeout the Es cluster uses when discovering other nodes;
step 1.7: if there are firewalls between the main machine room A and the standby machine room B, or a network policy interrupts idle tcp connections after a certain time, set the parameter network.tcp.keep_alive to true and adjust transport.ping_schedule so that connections between nodes are kept alive.
2. The es cluster dual-room high availability deployment method according to claim 1, characterized in that: the step 3 is specifically as follows:
based on the configured main machine room A and the standby machine room B:
when 1 data node in the main machine room A is down: pick one of the data nodes in the main machine room A and simulate the downtime, namely shut down the selected node or the virtual machine hosting it; the application query service still runs normally;
when one node in each of two zones of the main machine room A is down: pick two nodes located in two different zones of the main machine room A and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when both data nodes in one zone of the main machine room A are down: pick any two machines in one zone of the main machine room A and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when one data node in each of the three zones is down: pick one data node in each of the three zones and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when all the data nodes in the zone corresponding to the standby machine room B are down: pick the two data nodes of the standby machine room B and simulate the downtime, namely shut down the selected nodes or the virtual machines hosting them; the application query service still runs normally;
when the master node in zone z1 is down: simulate downtime of node ANode1 in the main machine room A, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when the master node in zone z2 is down: simulate downtime of node ANode4 in the main machine room A, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when the master node in zone z3 is down: shut down node BNode1 in the standby machine room B, namely shut down the node or the virtual machine hosting it; the application query service still runs normally;
when all the nodes in the standby machine room B are down: shut down nodes BNode1-BNode3 in the standby machine room B, namely shut down the nodes or the virtual machines hosting them; the application query service still runs normally;
when all the nodes in the main machine room A are down: if service must still be provided, temporarily turn one data node in the standby machine room into a master node, namely set the node.master parameter of that data node to true; otherwise the cluster cannot provide service and errors are reported.
CN202110495513.4A 2021-05-07 2021-05-07 High-availability deployment method for double computer rooms of es cluster Active CN112905308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110495513.4A CN112905308B (en) 2021-05-07 2021-05-07 High-availability deployment method for double computer rooms of es cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110495513.4A CN112905308B (en) 2021-05-07 2021-05-07 High-availability deployment method for double computer rooms of es cluster

Publications (2)

Publication Number Publication Date
CN112905308A CN112905308A (en) 2021-06-04
CN112905308B true CN112905308B (en) 2021-07-30

Family

ID=76108986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110495513.4A Active CN112905308B (en) 2021-05-07 2021-05-07 High-availability deployment method for double computer rooms of es cluster

Country Status (1)

Country Link
CN (1) CN112905308B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102857363B (en) * 2012-05-04 2016-04-20 运软网络科技(上海)有限公司 A kind of autonomous management system and method for virtual network
CN106285079A (en) * 2016-07-29 2017-01-04 安徽华斯源新能源科技有限公司 A kind of integrated tank-type self-contained Design of Machine Room method
US10313295B2 (en) * 2016-12-16 2019-06-04 Dreamworks Animation L.L.C. Scalable messaging system
CN109558270B (en) * 2017-09-25 2021-02-05 北京国双科技有限公司 Data backup method and device and data restoration method and device
CN109471755A (en) * 2018-11-14 2019-03-15 江苏鸿信系统集成有限公司 A kind of method and its system that the same city strange land calamity based on cloud computing is standby
CN109726046B (en) * 2018-11-23 2021-01-08 网联清算有限公司 Machine room switching method and device

Also Published As

Publication number Publication date
CN112905308A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN105138615B (en) A kind of method and system constructing big data distributed information log
CN112313916B (en) Method and system for pseudo-storage of anti-tampering logs by fusing block chain technology
US8640218B2 (en) System, method and program for managing firewalls
US9773015B2 (en) Dynamically varying the number of database replicas
CN103677967B (en) A kind of remote date transmission system of data base and method for scheduling task
CN102932444B (en) Load balancing module in finance real-time transaction system
US7441024B2 (en) Method and apparatus for applying policies
CN102053982B (en) A kind of database information management method and equipment
US9450700B1 (en) Efficient network fleet monitoring
WO2023142054A1 (en) Container microservice-oriented performance monitoring and alarm method and alarm system
CN113515499A (en) Database service method and system
WO2012145963A1 (en) Data management system and method
CN102902615A (en) Failure alarm method and system for Lustre parallel file system
CN102938705A (en) Method for managing and switching high availability multi-machine backup routing table
CN102968457B (en) Database method for switching between and system
CN108092936A (en) A kind of Host Supervision System based on plug-in architecture
CN112131305A (en) Account processing system
CN103327116A (en) Dynamic copy storage method for network file
CN113127199A (en) Load balancing configuration method, device, equipment and storage medium
CN112905308B (en) High-availability deployment method for double computer rooms of es cluster
CN101668028B (en) Extendable system structure-based distributed system and application method thereof
CN112714022A (en) Control processing method and device for multiple clusters and computer equipment
CN106657390A (en) Cluster file system directory isolation method, cluster file system directory isolation device and cluster file system directory isolation system
CN104503871A (en) Implementation method based on full-redundancy model of small computer system
CN110515766B (en) Cloud-based electric power disaster recovery data security protection system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant