CN108509437A

CN108509437A - An ElasticSearch query acceleration method

Info

Publication number: CN108509437A
Application number: CN201710102541.9A
Authority: CN
Inventors: 王磊; 王胤然; 徐寅; 穆宁
Original assignee: NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Current assignee: NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Priority date: 2017-02-24
Filing date: 2017-02-24
Publication date: 2018-09-07
Anticipated expiration: 2037-02-24
Also published as: CN108509437B

Abstract

The invention discloses an ElasticSearch query acceleration method in the technical field of computer big data indexing. A Payload field is first added for each field, and each single sub-query result is then filtered using the Payload field. This addresses the problem that, in an original ES query, computing intersections and unions takes a great deal of time when each result set is large, and it improves index efficiency.

Description

An ElasticSearch query acceleration method

Technical field

The invention belongs to the technical field of computer big data indexing.

Background technology

Today is an era in which data is produced, shared and used on a massive scale; data is expanding rapidly, and humanity has entered the Internet age. Social networks, e-commerce and mobile communication in particular have brought humanity into a new era of massive structured and unstructured data. The sheer volume makes this data highly complex and constantly changing, and therefore extremely difficult to process. How to analyze and process massive data and expose simple, convenient services has become a problem that many IT enterprises and institutions must face.

Massive data is divided into structured data and unstructured data. Structured data includes, for example, corporate financial accounts and production data, student grade data and statistical report data; unstructured data includes text data and multimedia data such as video and audio. Unstructured data accounts for about 80% of massive data. Structured data can be handled by traditional relational databases and by the distributed NoSQL databases developed later, while unstructured data can be served for queries through full-text retrieval technology.

In current full-text retrieval, Lucene is the simplest and most convenient option. Lucene is a full-text information retrieval toolkit that uses an inverted-file index structure. It is not a complete search application; rather, it provides indexing and search functions for an application and can easily be embedded in various applications to implement full-text indexing and search. Currently, the clustering technologies based on Lucene are mainly Solr and Elasticsearch (hereinafter ES). ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user full-text search engine with a RESTful web interface and a Java interface, supports real-time search, and is stable, reliable, fast, and easy to install and use.

An original ES query subdivides a combined condition into individual sub-conditions, issues a query for each, and then performs operations such as intersection or union on the result sets. If each result set is large, these intersection and union operations take a great deal of time.

Summary of the invention

The object of the present invention is to provide an ElasticSearch query acceleration method that addresses the problem that, in an original ES query, computing intersections and unions takes a great deal of time when each result set is large, and that improves index efficiency.

To achieve the above object, the present invention adopts the following technical solution:

An ElasticSearch query acceleration method, comprising the following steps:

Step 1: A full-text indexing system is established, comprising a Hadoop storage server cluster, a WEB interface server, a data import server and a data collection terminal; the data collection terminal is connected to the data import server via the Internet, and the WEB interface server and the data import server are each connected to the Hadoop storage server cluster via the Internet;

Step 2: A full-text retrieval platform is established in the Hadoop storage server cluster using the Lucene full-text information retrieval toolkit, and an ES cluster is deployed in the Hadoop storage server cluster using the Lucene full-text information retrieval toolkit;

Step 3: The data collection terminal inputs stream data or text data to the data import server, and the data import server sends the stream data or text data to the Hadoop storage server cluster for storage;

Step 4: Using the Lucene full-text information retrieval toolkit, the ES cluster builds index data tables with an inverted-file index structure for the data stored in the Hadoop storage server cluster, and the ES cluster provides stored field areas for the index data tables; each stored field area contains multiple document-number storage field areas;

Step 5: Based on the underlying storage structure provided by the Lucene full-text information retrieval toolkit, the ES cluster adds a Payload field to the inverted-list linked list, and every Payload field is placed after the document-number storage field area;

Step 6: The user enters a query condition through the WEB interface server, and the WEB interface server sends the query condition to the ES cluster; the query condition comprises an exact query condition, a range query condition, a prefix query condition and a Payload range query condition;

Step 7: Using the Lucene full-text information retrieval toolkit, the ES cluster first retrieves according to the exact query condition, the range query condition and the prefix query condition, obtaining an exact query result, a range query result and a prefix query result respectively;

Step 8: The ES cluster filters the exact query result, the range query result and the prefix query result respectively according to the Payload range query condition, obtaining an exact query result set, a range query result set and a prefix query result set;

Step 9: The ES cluster computes the intersection of the exact query result set, the range query result set and the prefix query result set to obtain the final retrieval result.

The ES cluster is an Elasticsearch server cluster.

The Payload field is a storage area that stores range-query fields, and the range-query fields include a time field.
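The storage layout described in Steps 4 and 5 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the `Posting` class, the 8-byte big-endian timestamp encoding and the helper names are hypothetical and are not the Lucene or Elasticsearch payload API.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class Posting:
    """One inverted-list entry: a document number followed by its Payload."""
    doc_id: int
    payload: bytes  # assumed encoding of the range-query field (a time value)


def encode_time(timestamp: int) -> bytes:
    # Assumption: the time field is stored as a signed 64-bit big-endian integer.
    return struct.pack(">q", timestamp)


def decode_time(payload: bytes) -> int:
    return struct.unpack(">q", payload)[0]


def filter_by_time(postings: List[Posting], start: int, end: int) -> List[int]:
    """Keep only documents whose Payload time lies in [start, end]."""
    return [p.doc_id for p in postings if start <= decode_time(p.payload) <= end]


postings = [
    Posting(1, encode_time(100)),
    Posting(2, encode_time(200)),
    Posting(3, encode_time(300)),
]
kept = filter_by_time(postings, 150, 300)
```

Because the time value travels with each posting, the range check can be applied while walking a single sub-query's postings, before any set operation between result sets.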

In Step 4, the ES cluster provides the stored field areas for the index data tables according to the following steps:

Step S1: A shard is set as the basic storage unit of each index data table; each index data table comprises several shards, and the ES cluster distributes the index data table across different storage media in the ES cluster according to its shards;

Step S2: An index table is defined as one index data table in the ES cluster, and a shard is one fragment of the index table; an index table contains multiple shards; a shard threshold is set;

Step S3: The ES cluster creates an extension index table for the index table, reads the largest shard in the index table, and determines whether that shard has reached the shard threshold: if yes, step S4 is executed; if no, step S5 is executed;

The ES cluster creates an extension index table for the index table as follows:

Step A: The ES cluster obtains the index table, traverses each shard in the index table, and makes the following determination: if a shard exceeds the shard threshold, step C is executed; if the shard does not exceed the shard threshold, step B is executed;

Step B: Check whether any shard of the extension table under the shard exceeds the shard threshold: if yes, step C is executed; if no, step S4 is executed;

Step C: Based on the size of the shard threshold, the ES cluster calculates the number of pieces into which each shard exceeding the shard threshold should be cut, and checks whether an extension index table exists and whether the shards of the extension index table are full: if it does not exist or its shards are full, a new extension index table is created whose number of shards is twice the number of existing shards, and the information of the new extension index table is updated in the routing table; if it exists, the shards exceeding the shard threshold are listed, sorted in descending order, and added to the Zookeeper task queue; the Zookeeper task queue generates multiple job tasks from the shard list;

Step S4: The shard is split according to the following steps:

Step D: After a job task is obtained from the Zookeeper task queue, the Ares ingestion module is notified to stop ingestion operations on the table in the ES cluster, and it is determined whether the Ares ingestion module has returned a message: if yes, step E is executed; if no, the process waits for the Ares ingestion response;

Step E: The ES cluster starts splitting the shard according to the following rules:

Step E1: The ES cluster obtains the storage size of the shard;

Step E2: The storage size is divided by 2 to obtain a shard calculation result, which is compared with the shard threshold: if it is greater than the shard threshold, N is recorded and step E2 is executed again with the storage size divided by 2 × N; if it is less than the shard threshold, 2 × N is recorded as the number of shards to split into;

Step E3: The total data volume total of the shard is obtained, and the data volume K of each shard after splitting is K = total ÷ (2 × N);

Step E4: A time T, in seconds, is given through the ES cluster query interface; the amount of data obtained within T seconds is denoted m, and the coefficient s equals K ÷ m; the ES cluster splits the shard according to the number of splits, the data volume K and the coefficient s;

Step F: The ES cluster numbers the new shards produced by splitting the shard, with one new shard designated shard[0];

Step G: The data in the shard is deleted and replaced with the data in shard[0], and the information of shard[0] is added to the index table; at the same time, the new shards other than shard[0] are written to the NFS shared directory, the shards of the index table are extended, and a recovery is performed on the shards of the index table from the shards in the NFS shared directory; afterwards, following the method of step C, the shards exceeding the shard threshold are listed again, sorted in descending order, and added to the Zookeeper task queue;

Step H: The flow trace of the split operation is recorded and updated in the routing table; the routing table generates new routing rules from the new flow trace, and the ES cluster ingests or queries data according to the new routing rules;

Step S5: Shard extension ends; steps S1 to S4 are repeated until the ES cluster has provided stored field areas for all index tables.
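The arithmetic of steps E2 to E4 can be illustrated with a short Python sketch. The text does not fully define how N evolves, so the sketch assumes that N starts at 1 and is incremented until the storage size divided by 2 × N no longer exceeds the shard threshold; the names `SplitPlan` and `plan_split` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SplitPlan:
    parts: int          # number of shards to split into, 2 * N (step E2)
    part_size: float    # K = total / (2 * N) (step E3)
    coefficient: float  # s = K / m (step E4)


def plan_split(storage_size: float, total: float,
               threshold: float, m: float) -> SplitPlan:
    """Sketch of steps E2-E4 under the assumption that N starts at 1
    and grows by 1 while storage_size / (2 * N) exceeds the threshold."""
    if storage_size <= 0 or total <= 0 or threshold <= 0 or m <= 0:
        raise ValueError("all inputs must be positive")
    n = 1
    while storage_size / (2 * n) > threshold:
        n += 1
    parts = 2 * n
    k = total / parts
    return SplitPlan(parts=parts, part_size=k, coefficient=k / m)


# A shard of size 100 with threshold 30: 100/2 = 50 is too large,
# 100/4 = 25 fits, so the shard is split into 4 parts.
plan = plan_split(storage_size=100, total=1000, threshold=30, m=50)
```

Under this reading, the coefficient s expresses the size of one new shard as a multiple of the data volume m that can be read in one T-second window.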

The ElasticSearch query acceleration method of the present invention addresses the problem that, in an original ES query, computing intersections and unions takes a great deal of time when each result set is large, and it improves index efficiency; the present invention filters efficiently on the basis of each single sub-condition and improves concurrent query efficiency.

Description of the drawings

Fig. 1 is the overall flowchart of the present invention;

Fig. 2 is the flowchart of Step 4 of the present invention;

Fig. 3 is the flowchart of step S3 of the present invention;

Fig. 4 is the flowchart of step S4 of the present invention.

Detailed description

An ElasticSearch query acceleration method, as shown in Fig. 1 to Fig. 4, comprises the following steps:

Step 5: Based on the underlying storage structure provided by the Lucene full-text information retrieval toolkit, the ES cluster adds a Payload field to the inverted-list linked list, and every Payload field is placed after the document-number storage field area;

The ES cluster is an Elasticsearch server cluster.

ES shard extension uses a Master-Slave structure and relies on Zookeeper: multiple jobs are generated from the shard list of a data table (Index), and each split module schedules and executes these jobs to complete the shard (Shard) splitting operation.

Step S2: An index table is defined as one index data table in the ES cluster, and a shard is one fragment of the index table; an index table contains multiple shards; a shard threshold is set; the precondition for the ES cluster to create an extension index table for an index table is that the index table has an alias; the extension index table that is created has the same alias, and its number of shards is the same as that of the index table;

Step C: Based on the size of the shard threshold, the ES cluster calculates the number of pieces into which each shard exceeding the shard threshold should be cut, and checks whether an extension index table exists and whether the shards of the extension index table are full: if it does not exist or its shards are full, a new extension index table is created whose number of shards is twice the number of existing shards, and the information of the new extension index table is updated in the routing table; if it exists, the shards exceeding the shard threshold are listed, sorted in descending order, and added to the Zookeeper task queue; the Zookeeper task queue generates multiple job tasks from the shard list. ZooKeeper is a distributed, open-source coordination service for distributed applications; it is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase.
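The listing and queuing part of step C can be sketched in Python as follows. This is an illustration only: a plain `deque` stands in for the Zookeeper task queue, no ZooKeeper client is used, and the function names and job dictionary layout are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def oversized_shards(shard_sizes: Dict[str, int], threshold: int) -> List[str]:
    """List shards whose size exceeds the threshold, largest first."""
    over = [(name, size) for name, size in shard_sizes.items() if size > threshold]
    over.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in over]


def enqueue_split_jobs(queue: Deque[dict], index: str, shards: List[str]) -> None:
    """Create one job task per oversized shard (stand-in for the task queue)."""
    for shard in shards:
        queue.append({"index": index, "shard": shard, "action": "split"})


def extension_shard_count(existing_shards: int) -> int:
    """A new extension index table gets twice the existing shard count."""
    return 2 * existing_shards


sizes = {"shard0": 40, "shard1": 90, "shard2": 65}
task_queue: Deque[dict] = deque()
to_split = oversized_shards(sizes, threshold=50)
enqueue_split_jobs(task_queue, "logs", to_split)
```

Sorting in descending order means the largest shard is split first, which is one plausible reason for the ordering the text prescribes.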

Step S4: The shard is split according to the following steps:

Step E1: The ES cluster obtains the storage size of the shard;

In use, as shown in Fig. 1, a Payload field is added for each field, and the data query mode changes accordingly. For example, consider the query condition A and C and D and B, where A is an exact query condition, C is a range query condition, D is a prefix query condition and B is a Payload range query condition. According to the method provided by the invention, the sub-conditions A, C and D are issued and their respective query results are found; each result is then filtered by the Payload condition B, which reduces the data volume of each sub-result set; finally, the intersection of the three filtered result sets is taken to obtain the final result.
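The query flow for A and C and D and B can be sketched in Python over a toy in-memory data set. The documents, field names and helper functions are invented for illustration; the sketch only mirrors the order of operations (sub-query, Payload filter, intersection) and does not call Lucene or Elasticsearch.

```python
from typing import Callable, Dict, List, Set

Doc = Dict[str, object]

# Toy documents; "time" plays the role of the Payload value.
DOCS: Dict[int, Doc] = {
    1: {"level": "error", "bytes": 300, "host": "web-01", "time": 100},
    2: {"level": "error", "bytes": 250, "host": "web-02", "time": 250},
    3: {"level": "error", "bytes": 900, "host": "db-01", "time": 260},
    4: {"level": "info", "bytes": 280, "host": "web-03", "time": 270},
    5: {"level": "error", "bytes": 350, "host": "web-04", "time": 280},
}


def sub_query(docs: Dict[int, Doc], predicate: Callable[[Doc], bool]) -> Dict[int, int]:
    """Run one sub-condition; each hit carries its Payload time."""
    return {doc_id: doc["time"] for doc_id, doc in docs.items() if predicate(doc)}


def payload_filter(hits: Dict[int, int], start: int, end: int) -> Set[int]:
    """Apply Payload range condition B to a single sub-result."""
    return {doc_id for doc_id, t in hits.items() if start <= t <= end}


def accelerated_query(docs: Dict[int, Doc],
                      sub_conditions: List[Callable[[Doc], bool]],
                      start: int, end: int) -> Set[int]:
    """Filter every sub-result by the Payload range, then intersect."""
    filtered = [payload_filter(sub_query(docs, p), start, end) for p in sub_conditions]
    return set.intersection(*filtered) if filtered else set()


cond_a = lambda d: d["level"] == "error"          # A: exact condition
cond_c = lambda d: 200 <= d["bytes"] <= 400       # C: range condition
cond_d = lambda d: d["host"].startswith("web")    # D: prefix condition
result = accelerated_query(DOCS, [cond_a, cond_c, cond_d], start=200, end=300)
```

In this toy data, each sub-query returns four documents but only three survive the Payload filter, so the final intersection operates on smaller sets than it would without filtering.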

A series of interfaces supporting Payload field queries are added to the Lucene full-text information retrieval toolkit (Lucene for short) and to the ES cluster, so that users can call the ES Payload interfaces directly, just as they call other Elasticsearch interfaces, without needing to be aware of the underlying Lucene Payload field storage structure and interfaces; this makes efficient use of the Payload field. Payload encapsulation is provided for five kinds of query: "single-condition equality + range", "prefix condition + range", "fuzzy condition + range", "IN condition + range" and "range + range".
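One way to picture such an encapsulation is a small request object that pairs a sub-condition with a Payload range. The following Python sketch is purely hypothetical: the class names, the enum values and the request layout are assumptions made for illustration and do not describe the patent's actual interfaces or any Elasticsearch API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Tuple


class SubQueryKind(Enum):
    """The five sub-condition kinds that are combined with a Payload range."""
    EQUALS = "equals"
    PREFIX = "prefix"
    FUZZY = "fuzzy"
    IN = "in"
    RANGE = "range"


@dataclass(frozen=True)
class PayloadQuery:
    kind: SubQueryKind
    field: str
    value: Any
    payload_range: Tuple[int, int]  # inclusive (low, high) bounds

    def to_request(self) -> Dict[str, Any]:
        """Build a hypothetical request body for the wrapped query."""
        low, high = self.payload_range
        if low > high:
            raise ValueError("payload_range lower bound exceeds upper bound")
        return {
            "sub_query": {"kind": self.kind.value, "field": self.field,
                          "value": self.value},
            "payload_range": {"gte": low, "lte": high},
        }


request = PayloadQuery(SubQueryKind.PREFIX, "host", "web", (200, 300)).to_request()
```

The caller only names the sub-condition and the range; how the range is evaluated against the stored Payload stays hidden behind the wrapper, which is the point the paragraph above makes.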

The data collection terminal is a 10-gigabit switch, which can obtain a large number of data sources from the Internet; the data sources take the form of data files and streaming data;

The ES cluster provides data ingestion, query analysis, and management and monitoring interfaces to the Hadoop storage server cluster; the storage medium is local disk, and the ES cluster supports various Spark components; the WEB interface server connects to the ES cluster through Zues-client and Loki; Zues-client is an encapsulated ES interface for upper layers to call; Loki is a unified index query middleware that receives structured, unstructured and mixed data query requests from upper-layer users, analyzes and splits the requests, forwards them to ES, and then, according to the returned data ids, fetches the data from the structured data system and the unstructured data system.

Claims

1. An ElasticSearch query acceleration method, characterized by comprising the following steps:

2. The ElasticSearch query acceleration method according to claim 1, characterized in that: the ES cluster is an Elasticsearch server cluster.

3. The ElasticSearch query acceleration method according to claim 1, characterized in that: the Payload field is a storage area that stores range-query fields, and the range-query fields include a time field.

4. The ElasticSearch query acceleration method according to claim 1, characterized in that: in Step 4, the ES cluster provides the stored field areas for the index data tables according to the following steps:

Step S4: The shard is split according to the following steps:

Step E1: The ES cluster obtains the storage size of the shard;

Step E3: The total data volume total of the shard is obtained, and the data volume K of each shard after splitting is K = total ÷ (2 × N);

Step G: The data in the shard is deleted and replaced with the data in shard[0], and the information of shard[0] is added to the index table; at the same time, the new shards other than shard[0] are written to the NFS shared directory, the shards of the index table are extended, and a recovery is performed on the shards of the index table from the shards in the NFS shared directory; afterwards, following the method of step C, the shards exceeding the shard threshold are listed again, sorted in descending order, and added to the Zookeeper task queue;