CN108509437B

CN108509437B - ElasticSearch query acceleration method

Info

Publication number: CN108509437B
Application number: CN201710102541.9A
Authority: CN
Inventors: 王磊; 王胤然; 徐寅; 穆宁
Original assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Current assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date: 2017-02-24
Filing date: 2017-02-24
Publication date: 2021-09-17
Anticipated expiration: 2037-02-24
Also published as: CN108509437A

Abstract

The invention discloses an ElasticSearch query acceleration method, which belongs to the technical field of computer big data indexing. A Payload field is added to each field, and filtering is then performed through the Payload field on the result of each single sub-query condition. This solves the problem that, in a native ES data query, computing intersections and unions takes a large amount of time when the result sets are large, and thereby improves indexing efficiency.

Description

ElasticSearch query acceleration method

Technical Field

The invention belongs to the technical field of computer big data indexing.

Background

An era of large-scale data production, sharing and application has begun. Data volumes are expanding rapidly as society moves into the Internet era; social networks, e-commerce and mobile communications in particular generate massive amounts of structured and unstructured data. This data is large, highly variable and complex to process. How to analyze and process mass data and expose it through simple, convenient services is a problem that many IT enterprises and institutions must face.

Mass data is divided into structured data and unstructured data. Structured data refers to data such as enterprise financial accounts, production data, student score data and statistical reports; unstructured data refers to data such as text, images and sound. Unstructured data accounts for about 80% of mass data. Structured data can be processed by traditional relational databases and by the later-developed distributed NoSQL databases, while unstructured data can be queried through full-text retrieval technology.

Among current full-text retrieval technologies, Lucene is the simplest and most convenient. It is a full-text information retrieval toolkit that uses an inverted file index structure. It is not a complete search application; rather, it provides full-text indexing and search functionality that can be conveniently embedded into various applications. At present, the cluster technologies based on Lucene are mainly Solr and Elasticsearch (hereinafter abbreviated as ES). Elasticsearch is a Lucene-based search server: a distributed, multi-user full-text search engine that supports RESTful web and Java interfaces, supports real-time search, and is stable, reliable, fast, and easy to install and use.

A native ES data query splits the combined condition into sub-conditions, issues a query for each sub-condition, and then performs intersection or union operations on the result sets. If the result sets are large, these intersection and union operations take a lot of time.

Disclosure of Invention

The invention aims to provide an ElasticSearch query acceleration method that solves the problem that, in a native ES data query, computing intersections and unions takes a large amount of time when the result sets are large, and that improves indexing efficiency.

In order to achieve the purpose, the invention adopts the following technical scheme:

An ElasticSearch query acceleration method comprises the following steps:

step 1: establishing a full-text index system, wherein the full-text index system comprises a Hadoop storage server cluster, a WEB interface server, a data import server and a data acquisition terminal; the data acquisition terminal is connected with the data import server through the Internet, and the WEB interface server and the data import server are both connected with the Hadoop storage server cluster through the Internet;

step 2: establishing a full-text retrieval platform in the Hadoop storage server cluster through the Lucene full-text information retrieval tool, and deploying an ES cluster in the Hadoop storage server cluster through the Lucene full-text information retrieval tool;

step 3: the data acquisition terminal inputs stream data or text data into the data import server, and the data import server ingests the stream data or text data and sends it to the Hadoop storage server cluster for storage;

step 4: the ES cluster establishes, through the Lucene full-text information retrieval tool, an index data table with an inverted file index structure for the data stored by the Hadoop storage server cluster, and provides a storage field area for the index data table; the storage field area comprises a plurality of document number storage field areas;

step 5: according to the bottom-layer storage structure provided by the Lucene full-text information retrieval tool, the ES cluster adds a plurality of Payload fields to the linked list of the inverted list, wherein all the Payload fields are placed after the document number storage field area;

step 6: a user inputs query conditions through the WEB interface server, and the WEB interface server transmits the query conditions to the ES cluster; the query conditions comprise an exact query condition, a range query condition, a prefix query condition and a Payload range query condition;

step 7: the ES cluster first performs retrieval through the Lucene full-text information retrieval tool according to the exact query condition, the range query condition and the prefix query condition, obtaining an exact query result, a range query result and a prefix query result respectively;

step 8: the ES cluster filters the exact query result, the range query result and the prefix query result separately according to the Payload range query condition, obtaining an exact query result set, a range query result set and a prefix query result set;

step 9: the ES cluster performs intersection calculation on the exact query result set, the range query result set and the prefix query result set to obtain the final retrieval result.

The ES cluster is an Elasticsearch server cluster.

The Payload field is a storage area that stores range query fields, which include a time field.
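As an illustration of the Payload field described above, the following minimal Python sketch models an inverted list entry as a document number followed by a payload value, and filters the result of one sub-query by a payload range. The names Posting and filter_by_payload, and the use of epoch seconds as the time payload, are assumptions made for illustration only; they are not Lucene or Elasticsearch APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Posting:
    """One entry of an inverted list (hypothetical model)."""
    doc_id: int   # document number
    payload: int  # payload stored after the document number (assumed: epoch seconds)


def filter_by_payload(postings: List[Posting], low: int, high: int) -> List[int]:
    """Return the document numbers whose payload lies in the inclusive range [low, high]."""
    return [p.doc_id for p in postings if low <= p.payload <= high]


# Hypothetical inverted list for one term; the values are made up for the example.
postings_for_term = [
    Posting(doc_id=1, payload=1_487_900_000),
    Posting(doc_id=4, payload=1_487_903_600),
    Posting(doc_id=9, payload=1_487_990_000),
]

# Keep only the documents whose time payload falls inside the queried window.
in_window = filter_by_payload(postings_for_term, 1_487_900_000, 1_487_910_000)
print(in_window)
```

Because the payload sits next to the document number, the range check can be applied while the inverted list is read, without a separate range query and a later intersection.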

In step 4, the ES cluster provides the storage field area for the index data table according to the following steps:

step S1: setting a shard as the basic storage unit of each index data table, wherein each index data table comprises a plurality of shards, and the ES cluster distributes and stores the index data table, shard by shard, on different storage media in the ES cluster;

step S2: setting an index form as an index data table in the ES cluster, wherein a shard is a fragment of the index form; the index form comprises a plurality of shards; setting a shard threshold;

step S3: the ES cluster establishes an extended index form for the index form, reads the largest shard in the index form, and judges whether that shard reaches the shard threshold: if yes, executing step S4; otherwise, executing step S5;

The specific steps by which the ES cluster establishes an extended index form for the index form are as follows:

step A: the ES cluster acquires the index form, traverses each shard in the index form, and judges as follows: if the shard exceeds the shard threshold, executing step C; if the shard does not exceed the shard threshold, executing step B;

step B: querying whether the shards of the extended index form under the shard exceed the shard threshold: if yes, executing step C; otherwise, executing step S4;

step C: the ES cluster calculates the number of shards exceeding the shard threshold according to the size of the shard threshold, and checks whether the extended index form exists and whether the shards of the extended index form are full: if the extended index form does not exist or its shards are full, a new extended index form is created whose number of shards is twice the number of existing shards, and the information of the newly added extended index form is updated into the routing table; if shards exceed the shard threshold, all shards exceeding the shard threshold are listed, sorted in descending order, and added to the task queue of the Zookeeper; the task queue of the Zookeeper generates a plurality of job tasks according to the shard list;

step S4: splitting the shard according to the following steps:

step D: after the ES cluster acquires a job task from the task queue of the Zookeeper, the ES cluster informs the Ares warehousing program to stop warehousing the list, and judges whether the Ares warehousing program returns a message: if yes, executing step E; if not, waiting for the response of the Ares warehousing program;

step E: the ES cluster starts to split the shards according to the following rules:

step E1: the ES cluster acquires the storage size of the shard;

step E2: dividing the storage size by 2 to obtain a shard calculation result, and comparing the result with the shard threshold: if the result is greater than the shard threshold, recording the number N of times the storage size has been divided by 2, and executing step E2 again on the result; if the result is smaller than the shard threshold, recording 2 × N as the number of splits;

step E3: acquiring the total data amount total of the shard, and calculating the data amount K of each split: K = total ÷ (2 × N);

step E4: giving a time T, in seconds, through the ES cluster query interface; the amount of data acquired in T seconds is recorded as m, and the coefficient s is set to s = K/m; the ES cluster splits the shard according to the number of splits, the data amount K and the coefficient s;

step F: the ES cluster numbers the new shards after the shard is split, the new shards being numbered starting from shard[0];

step G: deleting the data in the shard, replacing it with the data in shard[0], and adding the information of shard[0] into the index form; writing the new shards other than shard[0] into an NFS shared directory, expanding the shards of the index form, recovering the shards of the index form from the shards in the NFS shared directory, and adding the shards exceeding the shard threshold into the Zookeeper task queue in descending order according to the method in step C;

step H: recording the trace of the split operation and updating it into the routing table; the routing table generates a new routing rule from the new trace, and the ES cluster warehouses or queries data according to the new routing rule;

step S5: ending the shard expansion, and repeating steps S1 to S4 until the ES cluster has provided storage field areas for all index forms.
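The threshold check and queueing of steps A to C can be sketched as follows. This is a minimal illustration, not the patented implementation: a standard-library queue.Queue stands in for the Zookeeper task queue, and the function name enqueue_oversized_shards, the shard names and the sizes are all hypothetical.

```python
from queue import Queue
from typing import Dict, List, Tuple


def enqueue_oversized_shards(
    shard_sizes: Dict[str, int],
    shard_threshold: int,
    task_queue: Queue,
) -> List[Tuple[str, int]]:
    """List shards above the threshold, sort them by size in descending order,
    and put one job per shard on the task queue (stand-in for the Zookeeper queue)."""
    oversized = [(name, size) for name, size in shard_sizes.items() if size > shard_threshold]
    oversized.sort(key=lambda item: item[1], reverse=True)  # largest shard first
    for job in oversized:
        task_queue.put(job)
    return oversized


# Hypothetical shard sizes (arbitrary units) for one index form.
sizes = {"shard-0": 40, "shard-1": 120, "shard-2": 75, "shard-3": 10}
jobs: Queue = Queue()
queued = enqueue_oversized_shards(sizes, shard_threshold=50, task_queue=jobs)
print(queued)
```

Sorting in descending order means the largest shards are split first, which matches the ordering the text prescribes before jobs are generated.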

The ElasticSearch query acceleration method solves the problem that, in a native ES data query, computing intersections and unions takes a large amount of time when the result sets are large, and improves indexing efficiency; the invention performs filtering efficiently on the result of each single sub-condition and improves concurrent query efficiency.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a flow chart of step 4 of the present invention;

FIG. 3 is a flowchart of step S3 of the present invention;

FIG. 4 is a flowchart of step S4 of the present invention.

Detailed Description

As shown in fig. 1 to 4, an ElasticSearch query acceleration method includes the following steps:

step 5: according to the bottom-layer storage structure provided by the Lucene full-text information retrieval tool, the ES cluster adds a plurality of Payload fields to the linked list of the inverted list, wherein all the Payload fields are placed after the document number storage field area;

The ES cluster is an Elasticsearch server cluster.

ES shard expansion adopts a Master-Slave structure. Relying on ZooKeeper, it generates a plurality of jobs from the shard list of a data table (Index); each shard-splitting module schedules and executes the jobs, completing the split operation on each shard (Shard).

step S2: setting an index form as an index data table in the ES cluster, wherein a shard is a fragment of the index form; the index form comprises a plurality of shards; setting a shard threshold; the ES cluster establishes an extended index form for the index form on the premise that the index form has an alias; the extended index form is given the same alias, and its number of shards is the same as that of the index form;

step C: the ES cluster calculates the number of shards exceeding the shard threshold according to the size of the shard threshold, and checks whether the extended index form exists and whether the shards of the extended index form are full: if the extended index form does not exist or its shards are full, a new extended index form is created whose number of shards is twice the number of existing shards, and the information of the newly added extended index form is updated into the routing table; if shards exceed the shard threshold, all shards exceeding the shard threshold are listed, sorted in descending order, and added to the task queue of the Zookeeper; the task queue of the Zookeeper generates a plurality of job tasks according to the shard list; ZooKeeper is an open-source coordination service for distributed applications, an open-source implementation of Google's Chubby, and an important component of Hadoop and HBase.

Step S4: splitting the shard according to the following steps:

step E1: the ES cluster acquires the storage size of the shard;
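The arithmetic of steps E1 to E4 in the disclosure can be illustrated with a short Python sketch. It follows the text literally: the storage size is halved until the result no longer exceeds the shard threshold, N counts the halvings, the number of splits is 2 × N, K = total ÷ (2 × N), and s = K/m. The names SplitPlan and plan_split are hypothetical, and treating a result equal to the threshold as the stopping case is an assumption, since the text only covers "greater than" and "smaller than".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitPlan:
    halvings: int          # N: number of times the storage size was divided by 2
    split_count: int       # 2 x N, as stated in step E2
    docs_per_split: float  # K = total / (2 x N), step E3
    coefficient: float     # s = K / m, step E4


def plan_split(storage_size: float, shard_threshold: float, total: int, m: int) -> SplitPlan:
    """Compute the split parameters for one shard.

    storage_size and shard_threshold share one unit; total is the shard's total
    data amount; m is the data amount acquired in T seconds.
    """
    if storage_size <= 0 or shard_threshold <= 0 or m <= 0:
        raise ValueError("storage_size, shard_threshold and m must be positive")
    size = float(storage_size)
    n = 0
    while True:
        size /= 2          # step E2: divide by 2, then compare with the threshold
        n += 1
        if size <= shard_threshold:  # assumption: equality also stops the loop
            break
    split_count = 2 * n
    k = total / split_count
    return SplitPlan(halvings=n, split_count=split_count, docs_per_split=k, coefficient=k / m)


# Made-up numbers: a shard of size 100 with threshold 30 holding 1000 records,
# of which 50 were fetched in T seconds.
plan = plan_split(storage_size=100, shard_threshold=30, total=1000, m=50)
print(plan)
```

With these example inputs the size is halved twice (100, 50, 25), so N = 2 and the shard is planned as 4 splits of 250 records each, with s = 5.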

When the method is used, as shown in fig. 1, a Payload field is added to each field and the data query mode is changed accordingly. For example, given a query condition A AND C AND D AND B, where A is an exact query condition, C is a range query condition, D is a prefix query condition, and B is a Payload range query condition: according to the method provided by the invention, the sub-conditions A, C and D are issued and their respective query results are found; each result is then filtered by the Payload condition B, which reduces the data volume of each sub-result set; finally, the intersection of the three filtered results is taken to obtain the final result.
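The A AND C AND D AND B example can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the ES implementation: each sub-result is a set of document numbers, a plain dictionary stands in for the payload values that the method reads from the inverted list, and the function name payload_filtered_and and all data values are hypothetical.

```python
from typing import Dict, Iterable, Set


def payload_filtered_and(
    sub_results: Iterable[Set[int]],
    payloads: Dict[int, int],
    low: int,
    high: int,
) -> Set[int]:
    """Filter each sub-result by the payload range [low, high], then intersect."""
    filtered = [
        {doc for doc in result if low <= payloads[doc] <= high}
        for result in sub_results
    ]
    if not filtered:
        return set()
    return set.intersection(*filtered)


# Hypothetical payload value per document number (stand-in for the Payload field).
payloads = {1: 10, 2: 20, 3: 30, 4: 60, 5: 50}

A = {1, 2, 3, 4}  # result of the exact query condition
C = {2, 3, 4, 5}  # result of the range query condition
D = {1, 3, 4, 5}  # result of the prefix query condition

# B: payload range condition 25..45, applied to each sub-result before intersecting.
final = payload_filtered_and([A, C, D], payloads, low=25, high=45)
print(final)
```

In this example each filtered set shrinks to a single document before the intersection runs, which is the effect the method relies on: the intersection operates on smaller sets while returning the same documents as filtering after intersecting.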

A series of interfaces supporting Payload field queries is added to the Lucene full-text information retrieval tool (Lucene for short) and to the ES cluster, so that a user can call the Payload interface of the ES directly, in the same way as any other Elasticsearch interface, without needing to be aware of the Payload field storage structure or the interfaces of the underlying Lucene, and the Payload field is used effectively. Payload encapsulation is provided for five kinds of query condition: "single-condition equality + range", "prefix condition + range", "fuzzy condition + range", "IN condition + range" and "range + range".
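The five encapsulated combinations can be pictured as a small request model. The sketch below is purely hypothetical: the source does not give the interface signatures, so ConditionKind, PayloadQuery and describe are invented names for illustration and do not correspond to any real Elasticsearch or Lucene API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Tuple


class ConditionKind(Enum):
    """The five primary conditions that the text pairs with a Payload range."""
    EQUALS = "single-condition equality"
    PREFIX = "prefix condition"
    FUZZY = "fuzzy condition"
    IN = "IN condition"
    RANGE = "range"


@dataclass(frozen=True)
class PayloadQuery:
    """A primary condition plus a Payload range (hypothetical request shape)."""
    field: str
    kind: ConditionKind
    value: Any
    payload_range: Tuple[int, int]  # inclusive (low, high) bounds on the payload

    def describe(self) -> str:
        low, high = self.payload_range
        return (
            f"{self.kind.value} + range: "
            f"{self.field}={self.value!r}, payload in [{low}, {high}]"
        )


# Example: a prefix condition on a made-up field, restricted by a payload range.
query = PayloadQuery("host", ConditionKind.PREFIX, "web-", (100, 200))
print(query.describe())
```

Modelling the Payload range as part of every request keeps the caller unaware of how the payload is stored, which is the stated goal of the encapsulation.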

The data acquisition terminal is a 10-gigabit switch, which can acquire a large number of data sources from the Internet in the form of data files and streaming data;

the ES cluster provides data storage, query analysis, and management and monitoring interfaces for the Hadoop storage server cluster; the storage medium is a local disk, and the ES cluster supports various Spark components; the WEB interface server connects to the ES cluster through Zues-client and Loki;

Zues-client is the encapsulated ES interface for upper layers to call; Loki is the unified-index query middleware, responsible for receiving upper-layer users' query requests for structured, unstructured and mixed data, parsing and splitting the requests and forwarding them to the ES, and fetching data from the structured and unstructured data systems according to the returned data ids.

Claims

1. An ElasticSearch query acceleration method, characterized by comprising the following steps:

2. The method of claim 1, wherein the ES cluster is an Elasticsearch server cluster.

3. The method of claim 1, wherein the Payload field is a storage area that stores range query fields, which include a time field.

4. The method of claim 1, wherein in step 4, the ES cluster provides the storage field area for the index data table according to the following steps:

step S4: splitting the shard according to the following steps:

step E1: the ES cluster acquires the storage size of the shard;