CN110737667A

CN110737667A - indexing method based on Spark

Info

Publication number: CN110737667A
Application number: CN201911026342.XA
Authority: CN
Inventors: 王帅
Original assignee: Nanjing Letter Recording Software Technology Co Ltd
Current assignee: Nanjing Letter Recording Software Technology Co Ltd
Priority date: 2019-10-26
Filing date: 2019-10-26
Publication date: 2020-01-31

Abstract

The invention discloses a Spark-based indexing method comprising the following steps: S1, consuming real-time data through a custom consumption interface in a Spark process, and establishing an index for the data through an index interface; S2, modifying the native index interface of Lucene; S3, combining the whole indexing program, the query interface and the Spark service; S4, processing the data and then querying it. The method mainly adds an index layer to the original Spark retrieval, thereby accelerating retrieval. The index layer uses native Lucene; Spark or an offline program can index data onto HDFS through the custom interface, and subsequent statistical analysis can use Spark index queries to return quickly. Spark SQL is chosen as the query language, so the invention integrates with the Spark SQL engine, and a query language based on structured data makes querying simpler and reduces the learning cost.

Description

An indexing method based on Spark

Technical Field

The invention relates to the technical field of big data query processing, and in particular to a Spark-based indexing method.

Background

In recent years, as technologies such as the Internet of Things, social networks and cloud computing have become part of everyday life, and as computing power, storage capacity and network bandwidth have developed rapidly, the data accumulated in fields such as the internet, communications, finance, commerce and medical care has kept growing.

For example, the original Hadoop can store and analyze TB-level data, and Hive, HBase, Pig and other tools later built around the Hadoop ecosystem have made the choice of processing technologies simpler and more diverse.

In recent years, big data technology has made great breakthroughs and many excellent projects have emerged, but many difficulties in processing and query analysis remain to be overcome. The current disadvantages are summarized as follows:

1. Data queries have poor real-time performance. For example, Hive runs on MapReduce underneath, and Hadoop MapReduce is known to perform IO too frequently, so its efficiency is poor;

2. To guarantee query timeliness, compromises have to be made on the data storage format: the currently popular HBase, Cassandra, MongoDB and the like are all unstructured;

3. Using tools such as ElasticSearch or Solr requires learning a complete new set of APIs, which is costly;

4. The storage format offers no better compression ratio;

5. Development is based on the original index components without optimization, so indexing libraries such as Lucene do not deliver their full benefit;

6. The original components can only analyze data after reading it and lack an index function for data retrieval; even where an index function is provided, retrieval usability is compromised.

Disclosure of Invention

The present invention aims to provide a Spark-based indexing method to solve the problems described in the background above.

To achieve this purpose, the invention provides a Spark-based indexing method comprising the following steps:

S1: consuming real-time data through a custom consumption interface in a Spark process, and establishing an index for the data through an index interface;

S2: modifying the native index interface of Lucene;

S3: combining the whole indexing program, the query interface and the Spark service;

S4: processing the data, and then querying the data.

Preferably, in step S1, after the index is established, the merging, deleting and updating of the index are maintained in the Spark process.

Preferably, in step S2, after the native index interface is modified, the data is indexed onto HDFS to support storage and querying of TB-level data.

Preferably, in step S3, after the indexing program, the query interface and Spark are combined, the data is queried in Spark using SQL statements.
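The index maintenance named for step S1 (merging, deleting and updating) can be illustrated with a toy, in-memory model of a segment-based inverted index. This is only a sketch under stated assumptions: it uses neither Spark nor Lucene, and the names `Segment`, `ToyIndex`, `add_batch` and `merge` are hypothetical and do not come from the patent. It assumes that each consumed batch becomes its own segment, that deletes are recorded as tombstones, and that an update is a delete followed by an add, which is the general approach segment-based indexes such as Lucene take.

```python
from collections import defaultdict


class Segment:
    """An immutable batch of documents plus its inverted index (term -> doc ids)."""

    def __init__(self, docs):
        self.docs = dict(docs)            # doc_id -> text
        self.deleted = set()              # tombstones: ids deleted after indexing
        self.postings = defaultdict(set)  # term -> set of doc ids
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                self.postings[term].add(doc_id)

    def search(self, term):
        # Hits are the posting list minus any tombstoned documents.
        return self.postings.get(term.lower(), set()) - self.deleted


class ToyIndex:
    """Hypothetical stand-in for the index maintained inside the Spark process."""

    def __init__(self):
        self.segments = []

    def add_batch(self, docs):
        """Index one consumed batch as a new segment."""
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        """Mark a document as deleted in every segment that holds it."""
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id, text):
        """An update is modeled as a delete followed by an add."""
        self.delete(doc_id)
        self.add_batch({doc_id: text})

    def merge(self):
        """Rewrite all live documents into one segment, dropping tombstones."""
        live = {}
        for seg in self.segments:
            for doc_id, text in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = text
        self.segments = [Segment(live)]

    def search(self, term):
        hits = set()
        for seg in self.segments:
            hits |= seg.search(term)
        return hits


index = ToyIndex()
index.add_batch({1: "spark index", 2: "lucene index"})
index.add_batch({3: "spark sql"})
index.update(2, "lucene on hdfs")
index.delete(3)
index.merge()
print(sorted(index.search("index")))  # [1]
print(sorted(index.search("spark")))  # [1]
```

Merging keeps the number of segments small, so a query touches fewer posting lists; that is one plausible reason for maintaining merges inside the long-running process rather than at query time.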

Compared with the prior art, the invention has the beneficial effects that:

1. Structured or unstructured data can be stored on demand;

2. Heterogeneous storage is supported: data can be placed on SATA disks or SSDs as required, which speeds up data queries;

3. An index layer is added when data is stored, so the index speeds up queries and eliminates the T+1 day delay in data queries;

4. Queries are written in SQL; the method is compatible with standard SQL and supports features such as multidimensional statistics and user-defined functions, which greatly reduces the learning cost;

5. By improving Lucene, data can be stored on HDFS, which meets the storage requirements of large-scale data and supports storing trillions of records;

6. Indexing with Lucene speeds up data queries, so that queries over billions of records can return within seconds.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of Spark maintaining HDFS data index storage;

FIG. 3 is a schematic diagram of Spark using SQL to retrieve the index and return data.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them.

Referring to FIGS. 1-3, the present invention provides a technical solution, namely a Spark-based indexing method, comprising the following steps:

S1: consuming real-time data through a custom consumption interface in a Spark process, and establishing an index for the data through an index interface;

S2: modifying the native index interface of Lucene;

S3: combining the whole indexing program, the query interface and the Spark service;

S4: processing the data, and then querying the data.

In step S1, after the index is established, the merging, deleting and updating of the index are maintained in the Spark process.

In step S2, after the native index interface is modified, the data is indexed onto HDFS to support storage and querying of TB-level data.

In step S3, after the indexing program, the query interface and Spark are combined, the data is queried in Spark using SQL statements.

The working principle is as follows. Real-time data is consumed through a custom consumption interface in a Spark process, and an index is established for the data through an index interface; the merging, deletion and updating of the index are maintained in Spark to keep the index valid. The native index interface of Lucene is modified so that data is indexed onto HDFS, which supports storage and querying of TB-level data. The whole indexing program, the query interface and the Spark service are combined so that data can be queried through Spark SQL, which improves usability. After this combination, the timeliness of retrievable data is greatly improved: data can be queried and aggregated within 1-2 minutes after it is processed.
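The 1-2 minute figure above is consistent with an index that only makes data searchable at periodic commits. The following toy sketch illustrates that idea only; the patent does not state how visibility is controlled, so the commit-interval mechanism, the 60-second value and the class name `NearRealTimeIndex` are all assumptions made for illustration. Time is passed in explicitly so that the example is deterministic.

```python
class NearRealTimeIndex:
    """Toy model: records become searchable only after a periodic commit."""

    def __init__(self, commit_interval_s):
        self.commit_interval_s = commit_interval_s  # assumed flush period
        self.pending = []       # consumed but not yet searchable
        self.searchable = []    # visible to queries
        self.last_commit = 0.0

    def consume(self, record, now):
        """Buffer a record; commit if the interval has elapsed."""
        self.pending.append(record)
        if now - self.last_commit >= self.commit_interval_s:
            self.commit(now)

    def commit(self, now):
        """Publish all buffered records to the searchable set."""
        self.searchable.extend(self.pending)
        self.pending.clear()
        self.last_commit = now

    def query(self, predicate):
        """Return searchable records matching the predicate."""
        return [r for r in self.searchable if predicate(r)]


nrt = NearRealTimeIndex(commit_interval_s=60)
nrt.consume({"id": 1, "level": "error"}, now=10)
print(len(nrt.query(lambda r: r["level"] == "error")))  # 0, not yet committed
nrt.consume({"id": 2, "level": "error"}, now=70)
print(len(nrt.query(lambda r: r["level"] == "error")))  # 2, visible after commit
```

Under this assumption, the worst-case delay before a record can be queried is roughly one commit interval plus the time taken to write the batch.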

The overall idea of the invention is to add an index layer to the original Spark retrieval, thereby accelerating retrieval. The index layer uses native Lucene; Spark or an offline program can index data onto HDFS through the custom interface, and subsequent statistical analysis can use Spark index queries to return quickly. Spark SQL is chosen as the query language, so the invention integrates with the Spark SQL engine, and a query language based on structured data makes querying simpler and reduces the learning cost.

As shown in FIG. 2, when Spark maintains HDFS data index storage, Spark maintains a distributed process. Offline or real-time data is indexed through a custom interface on the Spark analyzer, and the Spark process maintains the index, providing distributed control over index import, export, merging and the like.
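One way to picture the distributed index maintenance of FIG. 2 is hash partitioning: each record is routed to a shard, and each shard builds and stores its own index. The sketch below is plain Python, not Spark or HDFS code. The function names, the use of CRC32 for routing and the `shard-00002` directory layout are hypothetical choices made for illustration; the patent does not specify a partitioning scheme or a path layout.

```python
import zlib
from collections import defaultdict


def shard_of(doc_id, num_shards):
    """Stable hash partitioning (CRC32, so routing is the same on every run)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shard_indexes(records, num_shards):
    """Route each (doc_id, text) record to a shard and index it there."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in records:
        postings = shards[shard_of(doc_id, num_shards)]
        for term in text.lower().split():
            postings[term].add(doc_id)
    return shards


def search_all(shards, term):
    """Query every shard and union the hits."""
    hits = set()
    for postings in shards:
        hits |= postings.get(term.lower(), set())
    return sorted(hits)


def shard_path(base_dir, shard_id):
    """Hypothetical layout: one index directory per shard under a base directory."""
    return f"{base_dir}/shard-{shard_id:05d}"


records = [("a", "spark index"), ("b", "lucene index"), ("c", "spark sql")]
shards = build_shard_indexes(records, num_shards=3)
print(search_all(shards, "index"))       # ['a', 'b']
print(shard_path("/indexes/events", 2))  # /indexes/events/shard-00002
```

Because every document lives in exactly one shard, per-shard operations such as merging or exporting an index directory can run independently on different workers.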

As shown in FIG. 3, when Spark uses SQL to retrieve the index and return data, the Spark SQL query interface distributes the SQL and aggregates the results. Combining it with the custom Lucene query interface achieves efficient, fast retrieval while preserving the usability of Spark.
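The distribute-and-aggregate flow of FIG. 3 can be sketched as a scatter-gather query: each shard uses its index to find matching rows and computes a partial result, and the partial results are then merged. The code below is a toy model in plain Python, not Spark SQL or Lucene; `Shard`, `partial_count` and `run_query` are hypothetical names, and the query shape (an equality filter with a grouped count) is an assumption chosen to keep the example short.

```python
from collections import Counter, defaultdict


class Shard:
    """One partition: rows plus an index from (field, value) to row positions."""

    def __init__(self, rows):
        self.rows = rows
        self.index = defaultdict(list)
        for pos, row in enumerate(rows):
            for field, value in row.items():
                self.index[(field, value)].append(pos)

    def partial_count(self, where_field, where_value, group_field):
        """Use the index to touch only matching rows, then count per group."""
        counts = Counter()
        for pos in self.index.get((where_field, where_value), []):
            counts[self.rows[pos][group_field]] += 1
        return counts


def run_query(shards, where_field, where_value, group_field):
    """Scatter the query to every shard and gather the partial counts."""
    total = Counter()
    for shard in shards:
        total.update(shard.partial_count(where_field, where_value, group_field))
    return dict(total)


shards = [
    Shard([{"level": "error", "city": "Nanjing"}, {"level": "info", "city": "Nanjing"}]),
    Shard([{"level": "error", "city": "Nanjing"}, {"level": "error", "city": "Suzhou"}]),
]
# Equivalent in spirit to:
#   SELECT city, COUNT(*) FROM events WHERE level = 'error' GROUP BY city
print(run_query(shards, "level", "error", "city"))  # {'Nanjing': 2, 'Suzhou': 1}
```

The index lookup replaces a full scan of each shard, and only the small per-group counts travel back to the aggregation step.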

Similar technologies in the industry, such as ElasticSearch and Solr, also build their indexes on Lucene and can likewise search massive data sets, but the scheme set forth in the invention has clear performance advantages.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

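1. A Spark-based indexing method, comprising the following steps:

S1: consuming real-time data through a custom consumption interface in a Spark process, and establishing an index for the data through an index interface;

S2: modifying the native index interface of Lucene;

S3: combining the whole indexing program, the query interface and the Spark service;

S4: processing the data, and then querying the data.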

2. The Spark-based indexing method according to claim 1, wherein in step S1, after the index is established, the merging, deleting and updating of the index are maintained in the Spark process.

3. The Spark-based indexing method according to claim 1, wherein in step S2, after the native index interface is modified, the data is indexed onto HDFS to support storage and querying of TB-level data.

4. The Spark-based indexing method according to claim 1, wherein in step S3, after the indexing program, the query interface and Spark are combined, the data is queried in Spark using SQL statements.