CN106649461A - Method for automatically cleaning and maintaining ElasticSearch log index files

Info

Publication number: CN106649461A
Application number: CN201610849348.7A
Authority: CN
Inventors: 金洪殿; 赵仁明; 亓开元
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: IEIT Systems Co Ltd
Priority date: 2016-09-26
Filing date: 2016-09-26
Publication date: 2017-05-10

Abstract

The invention relates to a method for automatically cleaning and maintaining ElasticSearch log index files. Index files are stored separately along the time dimension, and a log index deletion policy is defined according to business requirements and turned into a scheduled task that a scheduling framework runs. When historical index data must be removed, every index that matches the policy is deleted as a whole, which avoids the efficiency problem of deleting by DeleteByquery. The method deletes index files quickly and efficiently without degrading the performance of current indexing and queries, and it solves the low efficiency of ElasticSearch when DeleteByquery is used to delete large volumes of indexed data.

Description

A method for automatically cleaning and maintaining ElasticSearch log index files

Technical Field

The invention relates to the technical field of big data, and in particular to a method for automatically cleaning and maintaining ElasticSearch log index files.

Background Art

In information technology, big data refers to large, complex data sets, made up of huge volumes of data with complex structures and many types, whose content cannot be captured, managed, stored, searched, shared, analyzed, and visualized within an acceptable time using conventional software tools (such as existing database management tools or data processing applications). Big data has four characteristics: high volume (Volume), rapidity (Velocity), diversity (Variety), and low value density (Value). The challenge of big data lies in processing it in real time, and because the data itself has shifted from structured to unstructured, processing big data with relational databases is very difficult.

Big data is often used to describe the large volumes of unstructured and semi-structured data that a company creates, which would take too much time and money to load into a relational database for analysis. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires frameworks such as MapReduce and HBase to distribute work across tens, hundreds, or even thousands of computers. Compared with traditional data warehouse applications, big data analysis involves larger data volumes and more complex queries and analysis. Big data requires special techniques to process large volumes of data efficiently within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

ElasticSearch is a Lucene-based search server. It provides a distributed, multi-user capable full-text search engine with a RESTful web interface. ElasticSearch is developed in Java, is easy to integrate with enterprise applications, and is a popular enterprise search engine that meets requirements for real-time search, stability, reliability, and speed.

However, because of how ElasticSearch is implemented at the lower level, when the index files grow too large and a large amount of indexed data has to be deleted, many low-level operations on the index files are required. The process therefore takes a long time and often has a significant impact on applications.

In current IT operations and maintenance, log analysis and monitoring tools based on the ELK (ElasticSearch+Logstash+Kibana) platform are used by more and more operations staff. Because of the nature of such a system and the scale of the systems it monitors, a large number of log files is usually generated, and the logs must be available promptly. When the data volume is large and a lot of incremental data keeps arriving, the index files become very large, which degrades indexing and query performance and puts pressure on storage space. Log queries generally concern only recent data, so historical data can be deleted. How to delete historical index data automatically and quickly is therefore key to implementing this architecture. On this basis, the present invention proposes a method for automatically cleaning and maintaining ElasticSearch log index files.

Summary of the Invention

To remedy the defects of the prior art, the present invention provides a simple and efficient method for automatically cleaning and maintaining ElasticSearch log index files.

The present invention is achieved through the following technical solution:

A method for automatically cleaning and maintaining ElasticSearch log index files, characterized in that: index files are stored separately along the time dimension; a log index deletion policy is defined according to business needs and turned into a scheduled task; and a scheduling framework is used to schedule the log deletion task. When historical index data needs to be deleted, every index that matches the log index deletion policy is simply deleted as a whole, which solves the efficiency problem of deleting by DeleteByquery.
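As an illustration of storing indexes along the time dimension, the following minimal sketch assumes one index per day, named with a Logstash-style date suffix. The prefix `logs-`, the date format, and the helper names are hypothetical and do not come from the patent; the point is only that, once each index covers a known day, the expired ones can be picked out by name and dropped whole.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List, Optional

INDEX_PREFIX = "logs-"    # hypothetical prefix, not specified by the patent
DATE_FORMAT = "%Y.%m.%d"  # assumed daily suffix, e.g. logs-2016.09.26


def index_name_for(day: date) -> str:
    """Name of the index that holds the log entries for one day."""
    return INDEX_PREFIX + day.strftime(DATE_FORMAT)


def index_day(name: str) -> Optional[date]:
    """Parse the day out of an index name; None if the name does not match."""
    if not name.startswith(INDEX_PREFIX):
        return None
    try:
        return datetime.strptime(name[len(INDEX_PREFIX):], DATE_FORMAT).date()
    except ValueError:
        return None


def expired_indices(names: Iterable[str], today: date, keep_days: int) -> List[str]:
    """Indexes whose day is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        day = index_day(name)
        if day is not None and day < cutoff:
            expired.append(name)
    return sorted(expired)
```

Indexes that do not follow the naming scheme are ignored by this sketch, so unrelated indexes in the same cluster would never be selected for deletion.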

The log index deletion policy is defined according to business needs and specifies either the maximum time for which indexes are retained or the maximum storage space that retained indexes may occupy.
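A policy of this kind could be represented as a small value object. The sketch below is an assumption made for illustration: the class name `DeletionPolicy`, its field names, and the rule that exactly one of the two limits is set are not stated in the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeletionPolicy:
    """Hypothetical policy object: retain by age or by total size."""

    max_age_days: Optional[int] = None     # maximum retention time
    max_total_bytes: Optional[int] = None  # maximum storage space

    def __post_init__(self) -> None:
        # Assumption of this sketch: a policy carries exactly one limit.
        if (self.max_age_days is None) == (self.max_total_bytes is None):
            raise ValueError("set exactly one of max_age_days or max_total_bytes")
        limit = self.max_age_days if self.is_time_based else self.max_total_bytes
        if limit < 0:
            raise ValueError("the limit must not be negative")

    @property
    def is_time_based(self) -> bool:
        """True for a retention-time policy, False for a storage policy."""
        return self.max_age_days is not None
```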

The method of the present invention for automatically cleaning and maintaining ElasticSearch log index files comprises the following steps:

(1) Create a log index deletion policy, and create a scheduled task according to that policy;

(2) Start the scheduled task, and execute the corresponding background task to clean up the logs according to the log index deletion policy;

(3) Determine whether the task is scheduled according to the time policy. If it is, traverse the indexes and delete those that match the time policy; if it is not, delete indexes according to the storage space requirement. After deleting the indexes, return to step (2).
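The branch in step (3) can be sketched as a single cleanup pass. This is a sketch under stated assumptions, not the patented implementation: the function name `cleanup_pass`, the `(day, size)` description of each index, and the injected `delete_index` callable are hypothetical, and deleting oldest-first under the storage policy is an assumption, since the text only says that indexes are deleted according to the storage space requirement.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Optional, Tuple

IndexInfo = Tuple[date, int]  # (day the index covers, size in bytes)


def cleanup_pass(
    indices: Dict[str, IndexInfo],
    delete_index: Callable[[str], None],
    today: date,
    max_age_days: Optional[int] = None,
    max_total_bytes: Optional[int] = None,
) -> List[str]:
    """Run one pass of step (3) and return the names of the deleted indexes."""
    if max_age_days is not None:
        # Time policy: traverse the indexes and pick those older than the cutoff.
        cutoff = today - timedelta(days=max_age_days)
        victims = [name for name, (day, _) in indices.items() if day < cutoff]
    elif max_total_bytes is not None:
        # Storage policy (assumed oldest-first): drop whole indexes until the
        # remaining total fits within the limit.
        total = sum(size for _, size in indices.values())
        victims = []
        for name, (_, size) in sorted(indices.items(), key=lambda kv: kv[1][0]):
            if total <= max_total_bytes:
                break
            victims.append(name)
            total -= size
    else:
        raise ValueError("either max_age_days or max_total_bytes is required")

    for name in victims:
        delete_index(name)  # each call removes one whole index
    return victims
```

Passing `delete_index` as a parameter keeps the selection logic independent of any particular client library, so the same pass can be exercised against a plain list in tests and against a real cluster in production.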

The beneficial effects of the present invention are as follows: the method for automatically cleaning and maintaining ElasticSearch log index files deletes index files quickly and efficiently without degrading the performance of current indexing and queries, and it solves the low efficiency of ElasticSearch when DeleteByquery is used to delete large volumes of indexed data.
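To make the contrast concrete, the sketch below builds, without sending them, the two kinds of HTTP request involved. It assumes a cluster at the hypothetical address `http://localhost:9200`, a timestamp field named `@timestamp`, and the `_delete_by_query` endpoint found in recent Elasticsearch releases; older releases exposed delete-by-query differently. Dropping an index is a single `DELETE` on the index name, while delete-by-query has to find and remove each matching document.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # hypothetical cluster address


def drop_index_request(index_name: str) -> urllib.request.Request:
    """Request that deletes one whole index."""
    return urllib.request.Request(f"{BASE_URL}/{index_name}", method="DELETE")


def delete_by_query_request(index_name: str, before_iso_date: str) -> urllib.request.Request:
    """Request that deletes only the documents older than a given date.

    The field name "@timestamp" is an assumption of this sketch.
    """
    body = {"query": {"range": {"@timestamp": {"lt": before_iso_date}}}}
    return urllib.request.Request(
        f"{BASE_URL}/{index_name}/_delete_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Either request object could be passed to `urllib.request.urlopen` to execute it against a running cluster.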

Brief Description of the Drawings

Figure 1 is a schematic diagram of the method for automatically cleaning and maintaining ElasticSearch log index files according to the present invention.

Detailed Description

To make the technical problem to be solved, the technical solution, and the beneficial effects of the present invention clearer, the invention is described in detail below with reference to the accompanying drawing and an embodiment. The specific embodiment described here serves only to explain the present invention and does not limit it.

In this method for automatically cleaning and maintaining ElasticSearch log index files, index files are stored separately along the time dimension, a log index deletion policy is defined according to business needs and turned into a scheduled task, and a scheduling framework such as Quartz schedules the log deletion task. When historical index data needs to be deleted, every index that matches the log index deletion policy is simply deleted as a whole, which solves the efficiency problem of deleting by DeleteByquery.
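The scheduling side can be sketched as follows. The patent names Quartz, a Java framework; this sketch substitutes Python's standard-library `sched` module purely for illustration, and the helper `schedule_repeating` and its `max_runs` parameter are hypothetical. The job re-registers itself after each run, which mirrors the return from step (3) to step (2).

```python
import sched
from typing import Callable, Optional


def schedule_repeating(
    scheduler: sched.scheduler,
    interval_seconds: float,
    job: Callable[[], None],
    max_runs: Optional[int] = None,
) -> None:
    """Run job every interval_seconds; stop after max_runs if it is given."""
    runs = {"count": 0}

    def tick() -> None:
        job()  # one cleanup pass
        runs["count"] += 1
        if max_runs is None or runs["count"] < max_runs:
            # Register the next pass, relative to the scheduler's current time.
            scheduler.enter(interval_seconds, 1, tick)

    scheduler.enter(interval_seconds, 1, tick)
```

With a real clock this would be used as `s = sched.scheduler(time.monotonic, time.sleep)`, followed by `schedule_repeating(s, 24 * 3600, cleanup_job)` and `s.run()`, where `cleanup_job` is whatever function performs the deletion pass.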

Claims

1. A method for automatically cleaning and maintaining ElasticSearch log index files, characterized in that: index files are stored separately along the time dimension; a log index deletion policy is defined according to business needs and turned into a scheduled task; and a scheduling framework is used to schedule the log deletion task. When historical index data needs to be deleted, every index that matches the log index deletion policy is simply deleted as a whole, which solves the efficiency problem of deleting by DeleteByquery.

2. The method for automatically cleaning and maintaining ElasticSearch log index files according to claim 1, characterized in that: the log index deletion policy is defined according to business needs and specifies either the maximum time for which indexes are retained or the maximum storage space that retained indexes may occupy.

3. The method for automatically cleaning and maintaining ElasticSearch log index files according to claim 1, characterized in that it comprises the following steps:

(1) Create a log index deletion policy, and create a scheduled task according to that policy;

(2) Start the scheduled task, and execute the corresponding background task to clean up the logs according to the log index deletion policy;

(3) Determine whether the task is scheduled according to the time policy. If it is, traverse the indexes and delete those that match the time policy; if it is not, delete indexes according to the storage space requirement. After deleting the indexes, return to step (2).