CN110716942A - Large-index rapid splitting method based on Lucene

Info

Publication number: CN110716942A
Application number: CN201911026343.4A
Authority: CN
Inventors: 王帅
Original assignee: Nanjing Letter Recording Software Technology Co Ltd
Current assignee: Nanjing Letter Recording Software Technology Co Ltd
Priority date: 2019-10-26
Filing date: 2019-10-26
Publication date: 2020-01-21

Abstract

The invention discloses a Lucene-based method for rapidly splitting a large index, comprising the following steps: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard directory so that it records the storage location of the original index file it points to; using Lucene's "delete from" (conditional deletion) feature to delete a designated half of the index data on the newly created index shard and the opposite half on the other index shard, thereby splitting one index file into two; and, after the split is complete, determining the storage directory for incoming index data from the deletion condition used in the delete operation, and resuming storage of subsequent data. The method requires no additional copying during the split and deletes the designated index data efficiently, which speeds up index splitting. After the split, the rule for locating index data during subsequent storage follows from the deletion condition, so no additional algorithm is needed and the method is simple and fast.

Description

Large-index rapid splitting method based on Lucene

Technical Field

The invention relates to the technical field of file indexes, in particular to a large-index rapid splitting method based on Lucene.

Background

With the advent of the big data era, the amount of data has grown explosively. Building an index when data is stored greatly improves retrieval performance. However, indexing a table is costly. First, an index occupies physical space, and as the data grows the index files grow with it. Second, creating and maintaining the index takes time, and that time increases with the data volume. When data in the table is added, deleted or modified, the index must be maintained dynamically; the more data there is, the larger the index file and the lower the efficiency of data maintenance.

If only one index file is created initially, or the index is created with too few shards, then once the data reaches a certain scale, rebuilding the index tree on each write or update becomes very slow, which makes data storage very difficult. At that point the number of shards must be increased by rebuilding the index.

The traditional scheme adds a new index shard to the current index, copies the current index data to the new shard, and rearranges the data with some algorithm in order to split the index. However, when the amount of data is very large the process is time-consuming, and if the original data is modified during the split, data may be lost, so additional measures are needed to keep the data safe and up to date.

The prior art has the following defects:

1. Existing index splitting techniques copy index data directly onto a new shard. When the data volume is large, this extra copy incurs substantial and unnecessary overhead.

2. After a new number of shards is set, all data may need to be rearranged, which is time-consuming when the data volume is large.

3. Existing index splitting techniques rely on some algorithm to locate data across shards. The number of shards is an input to that algorithm, so changing the number of shards is very expensive.

4. Project data growth is unpredictable, so it is difficult to set the right number of shards in advance.

5. If the split is performed while the index is in use, the process takes a long time, and modifications made to the original data during the split may be lost.

6. If the shard is locked against modification before the split, it cannot be modified again until the split completes. Because the splitting process takes so long, the calling services generate a large number of failed requests.
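To illustrate why changing the number of shards is expensive when the shard count is part of the locating algorithm, the following Python sketch assumes a simple hash-modulo placement rule. The source does not name the algorithm used by the prior art, so the hash function, the document ids and the helper names here are illustrative assumptions only.

```python
import zlib

def shard_of(doc_id: str, num_shards: int) -> int:
    """Locate a document by hashing its id modulo the shard count (assumed rule)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def moved_fraction(doc_ids, old_count: int, new_count: int) -> float:
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for d in doc_ids
        if shard_of(d, old_count) != shard_of(d, new_count)
    )
    return moved / len(doc_ids)

# Hypothetical document ids, used only for this demonstration.
doc_ids = [f"doc-{i}" for i in range(10_000)]
fraction = moved_fraction(doc_ids, old_count=2, new_count=3)
print(f"{fraction:.0%} of documents change shard when going from 2 to 3 shards")
```

Under this assumed rule, most documents land on a different shard after the count changes, which corresponds to the rearrangement cost described in defects 2 and 3.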

Disclosure of Invention

The invention aims to provide a large-index rapid splitting method based on Lucene, so as to solve the problems in the background technology.

To achieve this purpose, the invention provides the following technical scheme: a Lucene-based method for rapidly splitting a large index, comprising the following steps:

S1: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard directory so that it records the storage location of the original index file it points to;

S2: using Lucene's "delete from" (conditional deletion) feature, deleting a designated half of the index data on the newly created index shard and deleting the opposite half on the other index shard, thereby splitting one index file into two;

S3: after the split is complete, determining the storage directory for incoming index data from the deletion condition used in the delete operation, and resuming storage of subsequent data.
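As a minimal sketch of the marking in step S1, the following Python snippet creates the new shard directory as a symbolic link to the original index directory, so that no index file is copied. The directory names, the placeholder file segments_1 and the helper mark_new_shard are hypothetical; the patent does not specify them, and no Lucene call is involved here.

```python
import os
import tempfile

def mark_new_shard(original_index_dir: str, new_shard_dir: str) -> str:
    """Create new_shard_dir as a symbolic link to original_index_dir.

    No index data is copied; the link only records where the original
    index is stored. Returns the resolved location the link points to.
    """
    os.symlink(original_index_dir, new_shard_dir, target_is_directory=True)
    return os.path.realpath(new_shard_dir)

# Demo on a throwaway directory tree (hypothetical layout).
root = tempfile.mkdtemp()
original = os.path.join(root, "index_shard_0")
os.mkdir(original)
with open(os.path.join(original, "segments_1"), "w") as f:
    f.write("placeholder for index data")

new_shard = os.path.join(root, "index_shard_1")
resolved = mark_new_shard(original, new_shard)

print(os.path.islink(new_shard))   # whether the new shard is only a link
print(os.listdir(new_shard))       # files reachable through the link
```

Because the new shard directory is only a link, creating it takes the same time regardless of how large the original index is.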

Preferably, the original index data can be accessed on the new index shard through the mark made in S1.

Preferably, the index split is triggered manually and performed during idle time.

Compared with the prior art, the invention has the following beneficial effects:

1. the splitting process needs no additional copying, which reduces the IO pressure on the system when the data volume is very large;

2. the data does not need to be rearranged and the designated index data is deleted efficiently, which, together with the previous point, greatly speeds up index splitting;

3. after the split is complete, the rule for locating index data during subsequent storage follows from the deletion condition used in the delete operation, so no additional algorithm is needed and the method is simple and fast;

4. the index split is triggered manually, so unpredictable data growth in the project no longer poses a problem;

5. the index is not split while it is in use, and operating during idle time cannot cause data loss, which guarantees data security;

6. the index split is extremely fast, takes little time, and does not affect subsequent requests from calling services.

Drawings

FIG. 1 is a schematic flow chart of deleting designated index data with Lucene according to the present invention;

FIG. 2 is a schematic diagram of an index marking process according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "vertical", "upper", "lower", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.

In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly and may, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

Referring to FIGS. 1-2, the present invention provides a technical solution: a Lucene-based method for rapidly splitting a large index, comprising steps S1 to S3 set out in the Disclosure of Invention above.

Further, the original index data can be accessed on the new index shard through the mark made in S1.

Furthermore, the index split is triggered manually and performed during idle time.

The working principle is as follows. Using the soft-link (symbolic link) mechanism for files under Linux, the newly created index shard directory is marked so that it records the storage location of the original index file it points to. This avoids copy overhead and, at the same time, allows the original index data to be accessed on the new index shard. Using Lucene's "delete from" (conditional deletion) feature, a designated half of the index data is deleted on the newly created index shard and the opposite half is deleted on the other index shard. Because of the soft-link principle, deletion of the link file's source data proceeds unaffected, so the split of one index file into two can be completed. The split needs no additional copying and the designated index data is deleted efficiently, so index splitting is extremely fast and does not take long even when the data volume is large.

After the split is complete, the storage directory for incoming index data is determined from the deletion condition used in the delete operation, and storage of subsequent data resumes.
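The relationship between the deletion condition and the later locating rule can be sketched with a toy model in Python. This is not Lucene code and does not reproduce the patent's implementation: the condition belongs_to_new_shard, the class ShardView and the in-memory storage dictionary are all assumptions made for illustration. Each shard is modeled as a view over shared storage plus its own set of deletion marks.

```python
import zlib

def belongs_to_new_shard(doc_id: str) -> bool:
    """Hypothetical deletion condition dividing documents into two halves."""
    return zlib.crc32(doc_id.encode("utf-8")) % 2 == 1

class ShardView:
    """A shard that sees shared storage minus the documents it has deleted."""

    def __init__(self, storage: dict, keeps_new_half: bool):
        self.storage = storage            # shared between shards, never copied
        self.keeps_new_half = keeps_new_half
        self.deleted = set()              # per-shard deletion marks

    def delete_opposite_half(self) -> None:
        """Model of S2: delete every document that belongs to the other shard."""
        for doc_id in self.storage:
            if belongs_to_new_shard(doc_id) != self.keeps_new_half:
                self.deleted.add(doc_id)

    def live_ids(self) -> set:
        """Ids of the documents still visible on this shard."""
        return set(self.storage) - self.deleted

def route(doc_id: str) -> str:
    """Model of S3: the deletion condition decides where later data is stored."""
    return "new" if belongs_to_new_shard(doc_id) else "original"

storage = {f"doc-{i}": f"body {i}" for i in range(1000)}
original = ShardView(storage, keeps_new_half=False)
new = ShardView(storage, keeps_new_half=True)
original.delete_opposite_half()
new.delete_opposite_half()

print(len(original.live_ids()), len(new.live_ids()))
print(route("doc-1000"))
```

In this model the two shards end up with disjoint halves that together cover the original data, and route reuses the same condition, so locating later documents requires no separate algorithm. Writing new documents after the split is not modeled.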

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

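1. A Lucene-based method for rapidly splitting a large index, characterized in that the method comprises the following steps:

S1: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard directory so that it records the storage location of the original index file it points to;

S2: using Lucene's "delete from" (conditional deletion) feature, deleting a designated half of the index data on the newly created index shard and deleting the opposite half on the other index shard, thereby splitting one index file into two;

S3: after the split is complete, determining the storage directory for incoming index data from the deletion condition used in the delete operation, and resuming storage of subsequent data.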

2. The Lucene-based method for rapidly splitting a large index as claimed in claim 1, wherein the original index data can be accessed on the new index shard through the mark made in S1.

3. The Lucene-based method for rapidly splitting a large index as claimed in claim 1, wherein the index split is triggered manually and performed during idle time.