CN110716942A - Large-index rapid splitting method based on Lucene - Google Patents

Publication number
CN110716942A
Authority
CN
China
Prior art keywords
index
splitting
data
deleting
lucene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911026343.4A
Other languages
Chinese (zh)
Inventor
王帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Letter Recording Software Technology Co Ltd
Original Assignee
Nanjing Letter Recording Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Letter Recording Software Technology Co Ltd
Priority to CN201911026343.4A
Publication of CN110716942A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G06F16/2272: Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Lucene-based method for rapidly splitting a large index, comprising the following steps: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard (fragment) directory so that the mark points to the storage location of the original index files; using Lucene's delete-from feature, deleting a designated half of the index data on the newly created index shard and the opposite half on the other index shard, thereby splitting the index from one into two; and, after the split is complete, determining the storage directory for each piece of index data from the delete condition used in the deletion, and storing subsequent data accordingly. The method requires no additional copy overhead during the split and deletes the designated index data efficiently, which speeds up index splitting. After the split, the rule for locating index data when storing subsequent data follows from the delete condition, so no additional algorithm is needed, and the method is simple, convenient and fast.

Description

Large-index rapid splitting method based on Lucene
Technical Field
The invention relates to the technical field of file indexes, in particular to a large-index rapid splitting method based on Lucene.
Background
With the advent of the big data era, data volumes have grown explosively. Building an index when data is ingested greatly improves retrieval performance. Unfortunately, indexing is costly. First, an index occupies physical space, and index files grow as data accumulates. Second, creating and maintaining an index takes time, and that time increases with the data volume. When data in a table is added, deleted or modified, the index must be maintained dynamically as well; the more data there is, the larger the index files and the lower the maintenance efficiency.
If only one index file is created initially, or the index is created with too few shards, then once the index grows beyond a certain size, rebuilding the index tree on each write or update becomes very slow, which makes storing data difficult. At that point, the number of shards must be increased by rebuilding the index.
The traditional scheme adds a new index shard to the current index, copies the current index data onto the newly built shard, and then rearranges the data with some algorithm in order to split the index. However, when the data volume is very large, this process is time-consuming, and if the original data is modified during the split, data may be lost, so additional measures are needed to keep the data safe and up to date.
The prior art has the following defects:
1. Existing index splitting techniques copy the index data directly onto a new shard; when the data volume is large, the extra copy incurs a large and unnecessary overhead.
2. After a new shard count is set, all data may need to be rearranged, which is time-consuming when the data volume is large.
3. Existing index splitting techniques rely on an algorithm to locate data after a split; the shard count is an input to that algorithm, so changing the shard count is very expensive.
4. Project data growth is unpredictable, so it is difficult to choose the right shard count in advance.
5. If the split is performed while the index is in use, it takes a long time, and modifications made to the original data during the split may be lost.
6. If the shard is locked against modification before the split, it cannot be modified again until the split completes; because the split takes so long, the calling services generate a large number of failed requests.
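Defect 3 can be illustrated with a small sketch. It assumes, purely for illustration, that the prior-art locating algorithm is "hash of the document key modulo the shard count"; the text does not name a specific algorithm, and `shard_for`, `moved_fraction` and the `doc-N` keys are hypothetical names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Assumed prior-art routing rule: hash the key, take it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys that land on a different shard after the count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


keys = [f"doc-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=2, new_count=3)
print(f"{fraction:.1%} of documents change shard when going from 2 to 3 shards")
```

Under this assumed rule, most documents have to be relocated whenever the shard count changes, which is the rearrangement cost that defects 2 and 3 describe.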
Disclosure of Invention
The invention aims to provide a large-index rapid splitting method based on Lucene, so as to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a large-index rapid splitting method based on Lucene comprises the following steps:
S1: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard directory so that the mark points to the storage location of the original index files;
S2: using Lucene's delete-from feature, deleting a designated half of the index data on the newly created index shard and deleting the opposite half on the other index shard, thereby splitting the index from one into two;
S3: after the split is complete, determining the storage directory for each piece of index data from the delete condition used in the deletion, and storing subsequent data accordingly.
Preferably, the mark created in S1 allows the original index data to be accessed from the new index shard.
Preferably, the fast index split is triggered manually during idle time.
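The marking in S1 can be sketched with ordinary symbolic links. This is a minimal sketch under stated assumptions: the text does not say whether the link is made per directory or per file, so linking each file individually is an assumption here, and `link_shard`, the directory names and `segment_a.dat` are hypothetical stand-ins rather than real Lucene file names.

```python
import os
import tempfile
from pathlib import Path


def link_shard(original_dir: Path, new_shard_dir: Path) -> list:
    """Create new_shard_dir and fill it with symlinks to each file in original_dir.

    No index bytes are copied; the new directory only holds links.
    """
    new_shard_dir.mkdir(parents=True, exist_ok=False)
    linked = []
    for entry in sorted(original_dir.iterdir()):
        if entry.is_file():
            os.symlink(entry.resolve(), new_shard_dir / entry.name)
            linked.append(entry.name)
    return linked


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "shard0"
    original.mkdir()
    # Stand-in for an index file; the name is hypothetical.
    (original / "segment_a.dat").write_bytes(b"x" * 1024)

    names = link_shard(original, Path(tmp) / "shard1")
    link = Path(tmp) / "shard1" / "segment_a.dat"
    print(names, link.is_symlink(), link.read_bytes() == b"x" * 1024)

    # Removing the link leaves the original file in place.
    link.unlink()
    print((original / "segment_a.dat").exists())
```

The sketch shows only the file-system side: the original data is readable through the new directory, and removing a link does not remove its target.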
Compared with the prior art, the invention has the following beneficial effects:
1. the splitting process requires no additional copy overhead, which reduces system I/O pressure when the data volume is very large;
2. the data does not need to be rearranged and the designated index data is deleted efficiently, which, combined with the previous point, greatly speeds up index splitting;
3. after the split is complete, the rule for locating index data when storing subsequent data follows from the delete condition used in the deletion, so no additional algorithm is needed, and the method is simple, convenient and fast;
4. the index split is triggered manually, which removes the dependence on predicting data growth in the project;
5. the index is not split while it is in use; operating during idle time avoids data loss and keeps the data safe;
6. the index split is extremely fast, takes little time, and does not affect subsequent requests from calling services.
Drawings
FIG. 1 is a schematic flow chart of deletion of index data based on Lucene designation according to the present invention;
FIG. 2 is a schematic diagram of an index marking process according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "vertical", "upper", "lower", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly and may, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Referring to FIGS. 1 and 2, the present invention provides the following technical solution: a Lucene-based method for rapidly splitting a large index, comprising the following steps:
S1: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard directory so that the mark points to the storage location of the original index files;
S2: using Lucene's delete-from feature, deleting a designated half of the index data on the newly created index shard and deleting the opposite half on the other index shard, thereby splitting the index from one into two;
S3: after the split is complete, determining the storage directory for each piece of index data from the delete condition used in the deletion, and storing subsequent data accordingly.
Further, the mark created in S1 allows the original index data to be accessed from the new index shard.
Furthermore, the fast index split is triggered manually during idle time.
The working principle is as follows. Using the soft-link (symbolic link) mechanism for files under Linux, the newly created index shard directory is marked so that it points to the storage location of the original index files. This avoids copy overhead while allowing the original index data to be accessed from the new index shard. Using Lucene's delete-from feature, a designated half of the index data is then deleted on the newly created index shard, and the opposite half is deleted on the other index shard, which splits the index from one into two. Because, under the soft-link mechanism, these deletions leave the source file data behind the links unaffected, the one-into-two split can be completed. The split requires no additional copy overhead and the designated index data is deleted efficiently, so the split is extremely fast and does not take long even when the data volume is large.
After the split is complete, the storage directory for each piece of index data is determined from the delete condition used in the deletion, and subsequent data is stored accordingly.
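The split and the later routing (S2 and S3) can be modelled in a few lines. This is a toy model, not Lucene code: it assumes each shard records its own deletion marks over shared, unmodified data, and it assumes a hash-based delete condition, which the text does not specify. `in_lower_half`, `ShardView`, `split` and `store` are hypothetical names. In Lucene itself, deletion by condition is available through `IndexWriter.deleteDocuments`, which this sketch does not call.

```python
import hashlib


def in_lower_half(doc_id: str) -> bool:
    """Hypothetical delete condition: an even hash byte means 'lower half'."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return digest[0] % 2 == 0


class ShardView:
    """Toy shard: shared, unmodified documents plus this shard's own deletion marks."""

    def __init__(self, shared_docs: dict):
        self.shared_docs = shared_docs  # stands in for the linked index files
        self.deleted = set()            # per-shard deletion marks
        self.added = {}                 # documents stored after the split

    def delete_where(self, condition) -> int:
        """Mark every live shared document matching `condition` as deleted."""
        hits = {
            d for d in self.shared_docs
            if d not in self.deleted and condition(d)
        }
        self.deleted |= hits
        return len(hits)

    def live_ids(self) -> set:
        return (set(self.shared_docs) - self.deleted) | set(self.added)


def split(shared_docs: dict):
    """S2: both shards see the same data; each deletes the opposite half."""
    original, new = ShardView(shared_docs), ShardView(shared_docs)
    new.delete_where(in_lower_half)                        # new shard keeps the upper half
    original.delete_where(lambda d: not in_lower_half(d))  # original keeps the lower half
    return original, new


def store(doc_id: str, body: str, original: ShardView, new: ShardView) -> ShardView:
    """S3: the delete condition doubles as the routing rule for later writes."""
    target = original if in_lower_half(doc_id) else new
    target.added[doc_id] = body
    return target


docs = {f"doc-{i}": f"body {i}" for i in range(1000)}
original, new = split(docs)
print(len(original.live_ids()), len(new.live_ids()))
```

Because the same condition drives both the deletion and the routing, a document stored after the split lands on the shard that kept its half, with no separate locating algorithm.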
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (3)

1. A Lucene-based method for rapidly splitting a large index, characterized in that the method comprises the following steps:
S1: using the soft-link (symbolic link) mechanism for files under Linux, marking the newly created index shard directory so that the mark points to the storage location of the original index files;
S2: using Lucene's delete-from feature, deleting a designated half of the index data on the newly created index shard and deleting the opposite half on the other index shard, thereby splitting the index from one into two;
S3: after the split is complete, determining the storage directory for each piece of index data from the delete condition used in the deletion, and storing subsequent data accordingly.
2. The Lucene-based method for rapidly splitting a large index as claimed in claim 1, wherein the mark created in S1 allows the original index data to be accessed from the new index shard.
3. The Lucene-based method for rapidly splitting a large index as claimed in claim 1, wherein the fast index split is triggered manually during idle time.
CN201911026343.4A 2019-10-26 2019-10-26 Large-index rapid splitting method based on Lucene Pending CN110716942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911026343.4A CN110716942A (en) 2019-10-26 2019-10-26 Large-index rapid splitting method based on Lucene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911026343.4A CN110716942A (en) 2019-10-26 2019-10-26 Large-index rapid splitting method based on Lucene

Publications (1)

Publication Number Publication Date
CN110716942A (en) 2020-01-21

Family

ID=69213259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911026343.4A Pending CN110716942A (en) 2019-10-26 2019-10-26 Large-index rapid splitting method based on Lucene

Country Status (1)

Country Link
CN (1) CN110716942A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133867A (en) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 DOT in-fragment secondary index method and DOT in-fragment secondary index system
CN108509438A (en) * 2017-02-24 2018-09-07 南京烽火星空通信发展有限公司 A kind of ElasticSearch fragments extended method
CN110032549A (en) * 2019-01-28 2019-07-19 阿里巴巴集团控股有限公司 Subregion splitting method, device, electronic equipment and readable storage medium storing program for executing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200121