CN116680300A

CN116680300A - Inverted index query optimization method and device based on Es

Info

Publication number: CN116680300A
Application number: CN202310730101.3A
Authority: CN
Inventors: 牛煜超
Original assignee: Ping An Bank Co Ltd
Current assignee: Ping An Bank Co Ltd
Priority date: 2023-06-16
Filing date: 2023-06-16
Publication date: 2023-09-01

Abstract

The embodiments of the application provide a query optimization method and device based on the Es inverted index, relating to the technical field of the Internet. The method includes storing business data in a LongObjectHashMap data structure, and acquiring the business data to be retrieved through a get method. The method builds the index on the LongObjectHashMap data structure, which improves query efficiency and solves the problem of low query efficiency in existing methods.

Description

Inverted index query optimization method and device based on Es

Technical Field

The application relates to the technical field of Internet, in particular to a query optimization method and device based on Es inverted index.

Background

At present, the search engine of the intelligent management system is mainly Elasticsearch. A characteristic of the business is that queries are made by organization-structure codes, so query statements such as terms are used frequently. A terms query is normally very efficient and has no performance bottleneck. However, as data dimensions are subdivided and organization teams expand, a single query may carry tens of thousands of terms, which causes significant performance and resource problems and reduces query efficiency.

Disclosure of Invention

The embodiments of the application aim to provide a query optimization method and device based on the Es inverted index, which build the index on a LongObjectHashMap data structure, improve query efficiency, and solve the problem of low query efficiency in existing methods.

The embodiment of the application provides a query optimization method based on Es inverted index, which comprises the following steps:

storing the business data into a LongObjectHashMap data structure;

and acquiring the business data to be extracted through a get method.

In the implementation process, once the data has been stored in the LongObjectHashMap data structure, it can be obtained directly through the get method that the structure provides. Compared with methods such as terms, the required data is obtained directly, without a traversal search. This trades space for time: even though the space occupation is slightly larger, query efficiency is improved and the problem of low query efficiency in existing methods is solved.

Further, the storing the business data in the LongObjectHashMap data structure includes:

storing the employee ID to a Key array;

and completely storing employee information corresponding to the employee ID into a Value array, wherein the subscript of the employee information in the Value array is the same as the subscript of the employee ID in the Key array.

In the implementation process, the data is stored in full in the Value array, rather than in an abbreviated form that stores only the part that differs from the adjacent data. This allows direct lookup without a traversal query, which improves query efficiency.

Further, the obtaining the business data to be fetched through the get method includes:

when employee information corresponding to a current employee ID is acquired, calculating a hash code of the current employee ID;

determining a subscript of the current employee ID based on the hash code;

and acquiring corresponding employee information in the Value array according to the subscript.

In the implementation process, only the hash code needs to be calculated to obtain the subscript of the current employee ID in the Key array. Because the subscripts of the two arrays are consistent, this subscript is also the subscript of the information to be queried, so the information can be found quickly and query efficiency is high.

Further, before the step of storing the business data in the LongObjectHashMap data structure, the method further includes:

rewriting PostingsFormat based on the LongObjectHashMap data structure.

In the implementation process, the query function can be realized by rewriting PostingsFormat at the bottom layer, that is, by replacing the original PostingsFormat structure with the LongObjectHashMap data structure. This has no impact on upstream systems, has a low technical threshold, and is easy to implement.

The embodiment of the application also provides a query optimizing device based on the Es inverted index, which comprises:

the data storage module is used for storing the business data into a LongObjectHashMap data structure;

and the data acquisition module is used for acquiring the business data to be extracted through the get method.

In the implementation process, once the data has been stored in the LongObjectHashMap data structure, it can be obtained directly through the get method that the structure provides, without a traversal search. This trades space for time: even though the space occupation is slightly larger, query efficiency is improved and the problem of low query efficiency in existing methods is solved.

Further, the data storage module includes:

the first storage module is used for storing the employee ID to the Key array;

and the second storage module is used for completely storing the employee information corresponding to the employee ID into a Value array, and the subscript of the employee information in the Value array is the same as the subscript of the employee ID in the Key array.

In the implementation process, the data is integrally stored in the Value array instead of only storing the difference part between the data and the adjacent data, so that the inquiry is facilitated.

Further, the data acquisition module includes:

the hash code calculation module is used for calculating the hash code of the current employee ID when the employee information corresponding to the current employee ID is acquired;

a subscript obtaining module, configured to determine a subscript of the current employee ID based on the hash code;

and the employee information acquisition module is used for acquiring corresponding employee information in the Value array according to the subscript.

In the implementation process, the index of the information to be queried is obtained only by calculating the hash code, so that the query information corresponding to the index can be obtained quickly, and the query efficiency is high.

Further, the apparatus further comprises:

and the rewriting module is used for rewriting PostingsFormat based on the LongObjectHashMap data structure.

In the implementation process, the query function can be realized by rewriting PostingsFormat at the bottom layer, and the upstream system is not influenced.

The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic equipment to execute the Es-based inverted index query optimization method.

The embodiment of the application also provides a readable storage medium, wherein the readable storage medium stores computer program instructions, and when the computer program instructions are read and run by a processor, the method for optimizing query based on Es inverted index according to any one of the above is executed.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a query optimization method based on Es inverted indexes provided by an embodiment of the application;

FIG. 2 is a flowchart showing a specific process of storing business data in a LongObjectHashMap data structure according to an embodiment of the present application;

FIG. 3 is a schematic diagram of the Key array and the Value array according to the embodiment of the present application;

FIG. 4 is a flowchart of business data acquisition provided by an embodiment of the present application;

FIG. 5 is a block diagram of a query optimization device based on Es inverted index according to an embodiment of the present application;

FIG. 6 is a block diagram of another query optimization device based on Es inverted index according to an embodiment of the present application.

Reference numerals:

100-a data storage module; 101-a first storage module; 102-a second storage module; 200-a data acquisition module; 201-a hash code calculation module; 202-a subscript acquisition module; 203-an employee information acquisition module; 300-a rewriting module.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.

Example 1

Referring to fig. 1, fig. 1 is a flowchart of a query optimization method based on Es inverted index according to an embodiment of the present application.

Elasticsearch (Es) is implemented using a data structure called an inverted index. An indexing program scans each word in a document and builds an index entry for it, recording how many times and at which positions the word appears. When a user submits a query, the program searches the pre-built index and returns the results to the user. An index built in this way is called an inverted index.

However, the application scenario of the application is an inverted index query established based on the intelligent management system.

The existing inverted index query is usually implemented with terms, and specifically includes the following steps:

acquiring, from the Term Index in memory, the position on disk of the Block where the Term is located;

reading the Term Dictionary of the Block from the disk into memory;

decoding the inverted chain storage format to generate an inverted chain that is available for merging.

Each of the steps described above is time-consuming. Locating the target Term uses a binary search, with a time complexity of O(log N), where N is the number of terms searched. The third step, decoding, involves the FST, the underlying structure that Es uses in PostingsFormat, whose time complexity is O(len(term)); for M terms the cost is O(M × len(term)).

However, the FST data structure is not well suited to the business requirements of the intelligent management system in the present application:

the organization-structure codes in the intelligent management system are of type long;

only exact matching is required, and range search is not needed;

there is no prefix matching or fuzzy search requirement, so the characteristics of a prefix tree are not needed.

Therefore, the application uses LongObjectHashMap, a compact, high-performance data structure with low time complexity.

Among K-V data structures with low time complexity, HashMap maintains linked lists that take up a large amount of space. The application therefore uses LongObjectHashMap, a high-performance compact data structure from Netty (an open-source Java network application framework that provides asynchronous, event-driven tools for rapidly developing high-performance, high-reliability network servers and client programs).

Unlike HashMap, which maintains linked lists to keep track of the relationships between elements (and those lists also require space), the LongObjectHashMap data structure uses open addressing and has better performance.
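The open-addressing idea described above can be sketched as follows. This is a simplified illustration, not Netty's actual LongObjectHashMap: the class name `OpenAddressingLongMap`, the linear-probing strategy, the 0.5 load-factor threshold, and the use of `Long.hashCode` are all assumptions made for the example. Colliding keys are placed in the next free slot of the same arrays, so no linked-list nodes are allocated.

```java
/** Simplified long-keyed map using open addressing (illustrative only). */
final class OpenAddressingLongMap<V> {
    private long[] keys;      // Key array: slot i holds a key
    private Object[] values;  // Value array: slot i holds the value for keys[i]; null marks an empty slot
    private int size;

    OpenAddressingLongMap(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        keys = new long[capacity];
        values = new Object[capacity];
    }

    private static int indexFor(long key, int capacity) {
        return Long.hashCode(key) & (capacity - 1);
    }

    /** Probes from the hashed slot until the key or an empty slot is found. */
    private static boolean insert(long[] ks, Object[] vs, long key, Object value) {
        int i = indexFor(key, ks.length);
        while (vs[i] != null && ks[i] != key) {
            i = (i + 1) & (ks.length - 1); // linear probing: try the next slot
        }
        boolean isNew = vs[i] == null;
        ks[i] = key;
        vs[i] = value;
        return isNew;
    }

    void put(long key, V value) {
        if (value == null) {
            throw new IllegalArgumentException("null values are not supported in this sketch");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the table at most half full so probing always terminates
        }
        if (insert(keys, values, key, value)) {
            size++;
        }
    }

    private void grow() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new Object[oldValues.length * 2];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldValues[i] != null) {
                insert(keys, values, oldKeys[i], oldValues[i]);
            }
        }
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int i = indexFor(key, keys.length);
        while (values[i] != null) {
            if (keys[i] == key) {
                return (V) values[i];
            }
            i = (i + 1) & (keys.length - 1);
        }
        return null; // reached an empty slot: the key is absent
    }

    int size() {
        return size;
    }
}
```

In this sketch the only per-entry storage is one slot in each of the two arrays, which is the space saving over a chained HashMap that the description refers to.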

The method specifically comprises the following steps:

step S100: storing the business data into a LongObjectHashMap data structure;

step S200: and acquiring the business data to be extracted through a get method.

Once the data has been stored in the LongObjectHashMap data structure, it can be obtained directly through the get method that the structure provides. Compared with methods such as terms, the required data is obtained directly, without a traversal search. This trades space for time: even though the space occupation is slightly larger, query efficiency is improved and the problem of low query efficiency in existing methods is solved.

PostingsFormat has many built-in methods. For example, the method for storing data is changed to store into the LongObjectHashMap data structure, and the method for retrieving data is changed to obtain the required data directly through the get method provided by the LongObjectHashMap data structure, without a traversal search.

For example, under the conventional storage scheme, if the data stored for employee 1 is ABC and the data to be stored for employee 2 is ABCD, only D is stored for employee 2 in order to save space. A terms query therefore has to search by traversing entries one by one: if the data to be queried is the fifth entry in the array, five successive traversal steps are needed before its position is reached. These repeated traversals reduce search efficiency.

With the LongObjectHashMap data structure, the complete data ABCD of employee 2 is stored rather than the abbreviated form used by the conventional scheme, so the query can be performed directly and the query speed is improved.
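The contrast between the two storage modes can be illustrated with a minimal sketch. The class `StorageModes`, the `DeltaEntry` type, and the shared-prefix encoding are hypothetical names and choices made for this example; they are not the on-disk format of Es or Lucene. The point is only that reading the i-th difference-encoded record requires walking the records before it, whereas a fully stored record is read in one step.

```java
import java.util.ArrayList;
import java.util.List;

final class StorageModes {
    /** Difference-encoded record: length of the prefix shared with the previous record, plus the new suffix. */
    static final class DeltaEntry {
        final int sharedPrefixLength;
        final String suffix;

        DeltaEntry(int sharedPrefixLength, String suffix) {
            this.sharedPrefixLength = sharedPrefixLength;
            this.suffix = suffix;
        }
    }

    /** Stores only the part of each record that differs from the previous one. */
    static List<DeltaEntry> encodeDelta(List<String> records) {
        List<DeltaEntry> out = new ArrayList<>();
        String prev = "";
        for (String r : records) {
            int shared = 0;
            int max = Math.min(prev.length(), r.length());
            while (shared < max && prev.charAt(shared) == r.charAt(shared)) {
                shared++;
            }
            out.add(new DeltaEntry(shared, r.substring(shared)));
            prev = r;
        }
        return out;
    }

    /** Reading record `index` has to rebuild every record before it. */
    static String readDelta(List<DeltaEntry> entries, int index) {
        String current = "";
        for (int i = 0; i <= index; i++) {
            DeltaEntry e = entries.get(i);
            current = current.substring(0, e.sharedPrefixLength) + e.suffix;
        }
        return current;
    }

    /** With full storage, the record is read directly by its subscript. */
    static String readFull(String[] records, int index) {
        return records[index];
    }
}
```

For the records ABC and ABCD, the second difference-encoded entry holds only the suffix D, matching the example in the text, while the full-storage array holds ABCD itself.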

Because the LongObjectHashMap storage layout is used, the space occupation is slightly larger, but the required value can be fetched directly with the get method, with a time complexity of only O(1). This can be understood as trading space for time: even though more storage space is used, search efficiency is improved.
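As a rough summary of the complexity figures given in this description, the cost of resolving M terms under the two approaches can be written as below. The symbols T_FST and T_hash are introduced here only for illustration, and the O(1) figure for a hash lookup is an expected-case bound that assumes well-dispersed hash codes.

```latex
T_{\mathrm{FST}}(M) = O\bigl(M \cdot \mathrm{len}(\mathit{term})\bigr),
\qquad
T_{\mathrm{hash}}(M) = M \cdot O(1) = O(M)
```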

It should be noted that the business data may be any data in the intelligent business system, for example management data such as employee information, or financial data in the fintech field. For example, when querying the account data of a certain user, the employee ID in the present application may instead be a customer ID, a customer account ID, or the like, or information such as the ID and attribution corresponding to the data to be queried in the fintech field; no limitation is imposed here.

FIG. 2 shows a specific flowchart of storing business data in the LongObjectHashMap data structure.

In step S100, the storing of the business data in the LongObjectHashMap data structure may specifically include the following steps:

step S101: storing the employee ID to a Key array;

step S102: and completely storing employee information corresponding to the employee ID into a Value array, wherein the subscript of the employee information in the Value array is the same as the subscript of the employee ID in the Key array.

The employee information is stored in full in the Value array, rather than in an abbreviated form that stores only the part that differs from the adjacent data. This allows direct lookup without a traversal query, which improves query efficiency.

As shown in FIG. 3, the Key array is a hash table, and the subscripts in the Value array correspond one-to-one to the subscripts in the Key array. Once the subscript of the target data has been computed, the target data can be obtained from that subscript, so this one-to-one correspondence improves query efficiency.

FIG. 4 shows the flow of acquiring the business data. In step S200, acquiring the business data to be retrieved through the get method may specifically include the following steps:

step S201: when employee information corresponding to a current employee ID is acquired, calculating a hash code of the current employee ID;

step S202: determining a subscript of the current employee ID based on the hash code;

step S203: and acquiring corresponding employee information in the Value array according to the subscript.

Only the hash code needs to be calculated to obtain the subscript of the current employee ID in the Key array. Because the subscripts of the two arrays are consistent, this subscript is also the subscript of the information to be queried in the Value array, so the information can be found quickly and query efficiency is high.

For example, to obtain the information (in the Value array) corresponding to 11 (in the Key array), the hash code of 11 is calculated first. If the result is 6, the subscript of the target data in the Value array is taken to be 6, and the corresponding information is read from the Value array at subscript 6.
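Steps S201 to S203 can be sketched as below, under stated assumptions. The class `SubscriptLookup`, the use of `Long.hashCode`, and the power-of-two mask are choices made for this illustration; the description does not fix a particular hash function, so the subscript computed here for key 11 will generally differ from the value 6 used in the example above. Collision handling is omitted to keep the three steps visible.

```java
final class SubscriptLookup {
    /** S201 and S202: compute the hash code and reduce it to a subscript (capacity must be a power of two). */
    static int subscriptFor(long employeeId, int capacity) {
        int hashCode = Long.hashCode(employeeId);
        return hashCode & (capacity - 1);
    }

    /** S203: read the Value array at the same subscript as the Key array. */
    static String lookup(long[] keyArray, String[] valueArray, long employeeId) {
        int subscript = subscriptFor(employeeId, keyArray.length);
        boolean occupied = valueArray[subscript] != null;
        // The key comparison guards against a different ID hashing to the same slot.
        return (occupied && keyArray[subscript] == employeeId) ? valueArray[subscript] : null;
    }
}
```

A short usage example: with arrays of length 8, `subscriptFor(11L, 8)` selects one slot; storing 11 in the Key array and the employee record in the Value array at that slot lets `lookup` return the record without scanning either array.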

It should be noted that the method for calculating the hash code belongs to the prior art and is not described in detail here; it suffices that the calculated hash code values are dispersed as evenly as possible.

Before the step of storing the business data in the LongObjectHashMap data structure, the method further comprises:

rewriting PostingsFormat based on the LongObjectHashMap data structure.

The query function can be realized by rewriting PostingsFormat at the bottom layer, and the query function has no influence on an upstream system.

The bottom layer of Es is implemented with Lucene (an open-source, pure-Java full-text indexing and retrieval toolkit with strong extensibility), which has many built-in index types; PostingsFormat is the one that handles inverted indexes. The query function can be realized simply by rewriting PostingsFormat so that the FST structure is replaced with the LongObjectHashMap data structure. This has no impact on upstream systems, has a low technical threshold, and is easy to implement, so the method is widely applicable.

Because the PostingsFormat structure is rewritten at the bottom layer, the change is not perceived by upstream systems and therefore does not affect them.
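Why a bottom-layer replacement is invisible to upstream callers can be illustrated with a small sketch. Everything here is hypothetical: the `PostingsLookup` interface only stands in for the role that PostingsFormat plays and is not Lucene's real API, and `TreeMap` and `HashMap` are stand-ins for an ordered structure and a hash-based structure respectively. The upstream method is written once against the interface and returns the same results whichever implementation is plugged in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical lookup contract; not Lucene's PostingsFormat API. */
interface PostingsLookup {
    String find(long id);
}

/** Stand-in for the original ordered structure. */
final class OrderedLookup implements PostingsLookup {
    private final TreeMap<Long, String> data = new TreeMap<>();

    OrderedLookup(Map<Long, String> source) {
        data.putAll(source);
    }

    @Override
    public String find(long id) {
        return data.get(id);
    }
}

/** Stand-in for the hash-based replacement. */
final class HashedLookup implements PostingsLookup {
    private final HashMap<Long, String> data = new HashMap<>();

    HashedLookup(Map<Long, String> source) {
        data.putAll(source);
    }

    @Override
    public String find(long id) {
        return data.get(id);
    }
}

/** Upstream code depends only on the interface, so it does not change. */
final class UpstreamQuery {
    static List<String> termsQuery(PostingsLookup lookup, long[] ids) {
        List<String> hits = new ArrayList<>();
        for (long id : ids) {
            String value = lookup.find(id);
            if (value != null) {
                hits.add(value);
            }
        }
        return hits;
    }
}
```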

The method replaces the original FST structure with the LongObjectHashMap data structure, trading space for time, which greatly improves query speed and greatly reduces CPU utilization.

Example 2

The embodiment of the application provides an Es-based inverted index query optimization device, which is applied to the Es-based inverted index query optimization method in embodiment 1.

Fig. 5 is a block diagram of a query optimizing apparatus based on Es inverted index.

Elasticsearch (Es) is implemented using a data structure called an inverted index. The existing method is usually implemented with terms, and the inverted index specifically comprises the following steps when performing a terms search:

reading the Term Dictionary of the Block from the disk into memory;

However, the FST structure is not well suited to the business requirements of the intelligent management system in the present application:

only exact matching is required, and range search is not needed;

Therefore, the application uses LongObjectHashMap, a compact, high-performance data structure with low time complexity. Among K-V data structures with low time complexity, HashMap maintains linked lists that take up a large amount of space. The application therefore uses LongObjectHashMap, a high-performance compact data structure from Netty (an open-source Java network application framework that provides asynchronous, event-driven tools for rapidly developing high-performance, high-reliability network servers and client programs).

The device includes, but is not limited to:

a data storage module 100, configured to store the business data into a LongObjectHashMap data structure;

the data obtaining module 200 is configured to obtain the business data to be fetched by the get method.

For example, PostingsFormat has many built-in methods: the method for storing data is changed to put into the LongObjectHashMap data structure, and the method for retrieving data is changed to obtain the required data directly through the get method provided by the LongObjectHashMap data structure, without a traversal search.

Under the conventional storage scheme, if the data stored for employee 1 is ABC and the data to be stored for employee 2 is ABCD, only D is stored for employee 2 in order to save space. A terms query therefore has to search by traversing entries one by one: if the data to be queried is the fifth entry in the array, five successive traversal steps are needed before its position is reached. These repeated traversals reduce search efficiency.

FIG. 6 is a block diagram of another Es-based inverted index query optimization device. As shown in FIG. 6, the data storage module 100 may further include, but is not limited to:

a first storage module 101, configured to store an employee ID into a Key array;

and the second storage module 102 is configured to store the employee information corresponding to the employee ID to a Value array completely, where the subscript of the employee information in the Value array is the same as the subscript of the employee ID in the Key array.

Particularly, the subscripts in the Value array and the subscripts in the Key array are in one-to-one correspondence, so that the target data to be searched can be obtained according to the subscripts as long as the subscripts of the target data are obtained through calculation, and the one-to-one correspondence setting mode of the subscripts can improve the query efficiency.

The data acquisition module 200 includes, but is not limited to:

a hash code calculation module 201, configured to calculate a hash code of a current employee ID when employee information corresponding to the current employee ID is obtained;

a subscript obtaining module 202, configured to determine a subscript of the current employee ID based on the hash code;

and the employee information acquiring module 203 is configured to acquire corresponding employee information in the Value array according to the subscript.

The apparatus further comprises:

and a rewriting module 300, configured to rewrite the PostingsFormat based on the LongObjectHashMap data structure.

The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic equipment to execute the query optimization method based on the Es inverted index according to the embodiment 1.

The embodiment of the application also provides a readable storage medium, in which computer program instructions are stored, and when the computer program instructions are read and executed by a processor, the Es-based inverted index query optimization method described in embodiment 1 is executed.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. An Es-based inverted index query optimization method, comprising:

storing the business data into a LongObjectHashMap data structure;

and acquiring the business data to be extracted through a get method.

2. The Es-based inverted index query optimization method of claim 1, wherein storing the business data into a LongObjectHashMap data structure comprises:

storing the employee ID to a Key array;

3. The Es-based inverted index query optimization method according to claim 2, wherein the obtaining the business data to be fetched by the get method comprises:

determining a subscript of the current employee ID based on the hash code;

4. The Es-based inverted index query optimization method of claim 2, wherein prior to the step of storing the business data into the LongObjectHashMap data structure, the method further comprises:

rewriting PostingsFormat based on the LongObjectHashMap data structure.

5. An Es-based inverted index query optimization apparatus, the apparatus comprising:

6. The Es-based inverted index query optimization device of claim 5, wherein the data storage module comprises:

the first storage module is used for storing the employee ID to the Key array;

7. The Es-based inverted index query optimization device of claim 6, wherein the data acquisition module comprises:

8. The Es-based inverted index query optimization device of claim 7, further comprising:

9. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the Es-based inverted index query optimization method according to any one of claims 1 to 4.

10. A readable storage medium having stored therein computer program instructions which, when read and executed by a processor, perform the Es-based inverted index query optimization method of any one of claims 1 to 4.