CN110737667A - indexing method based on Spark - Google Patents


Info

Publication number
CN110737667A
Authority
CN
China
Prior art keywords
spark
index
data
interface
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911026342.XA
Other languages
Chinese (zh)
Inventor
王帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Letter Recording Software Technology Co Ltd
Original Assignee
Nanjing Letter Recording Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Letter Recording Software Technology Co Ltd
Priority to CN201911026342.XA
Publication of CN110737667A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an indexing method based on Spark, comprising the following steps: S1, consuming real-time data through a custom consumption interface in a Spark process, and building an index for the data through an index interface; S2, modifying the native index interface of Lucene; S3, combining the whole indexing program, the query interface and the Spark service; S4, processing the data and then querying it. The invention mainly adds an index layer to the original Spark retrieval, which accelerates retrieval. The index layer uses native Lucene; Spark or an offline program can index data onto HDFS through the custom interface, and subsequent statistical analysis can use Spark index queries to return quickly. Spark SQL is chosen as the query language: the invention integrates closely with the Spark SQL engine, and a query language based on structured data simplifies querying and reduces the learning cost.

Description

Indexing method based on Spark
Technical Field
The invention relates to the technical field of big data query processing, and in particular to a Spark-based indexing method.
Background
In recent years, as technologies such as the Internet of Things, social networks and cloud computing have become part of daily life, and as computing power, storage space and network bandwidth have grown rapidly, the data accumulated in fields such as the internet, communications, finance, commerce and medical care has kept increasing.
For example, the original Hadoop can store and analyze TB-level data, and Hive, HBase, Pig and other projects later built around the Hadoop ecosystem have made the choice of processing technology simpler and more diverse.
In recent years, big data technology has made great breakthroughs and many excellent projects have emerged, but many difficulties in processing and query analysis remain to be overcome. The current shortcomings are summarized as follows:
1. data queries have poor real-time performance; for example, Hive runs MapReduce at its bottom layer, and Hadoop MapReduce is known to perform IO too frequently, so efficiency is poor;
2. to guarantee the timeliness of queries, compromises have to be made on the data storage format; the currently popular HBase, Cassandra, MongoDB and the like are unstructured;
3. using tools such as Elasticsearch or Solr requires learning a complete new API set, which is costly;
4. storage form , no better compression ratio;
5. development builds on the original index components without optimization, so index functions such as Lucene are not used to full effect;
6. the original components can analyze data only after reading it in, and data retrieval lacks an index function; even where an index function is provided, retrieval usability is compromised.
Disclosure of Invention
The present invention aims to provide a Spark-based indexing method to solve the problems described in the background above.
To achieve the above purpose, the invention provides a Spark-based indexing method, comprising the following steps:
S1: consuming real-time data through a custom consumption interface in a Spark process, and building an index for the data through an index interface;
S2: modifying the native index interface of Lucene;
S3: combining the whole indexing program, the query interface and the Spark service;
S4: processing the data and then querying it.
Preferably, in step S1, after the index is built, the merging, deletion and updating of the index are maintained in the Spark process.
Preferably, in step S2, after the native index interface is modified, the data is indexed onto HDFS to support storage and querying of TB-level data.
Preferably, in step S3, after the indexing program, the query interface and the Spark service are combined, the data is queried in Spark using SQL statements.
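The flow of steps S1 and S4 can be illustrated with a minimal sketch. It is not the patented implementation: a plain Python inverted index stands in for Lucene, an in-memory list stands in for the real-time feed, and the names `InvertedIndex`, `consume` and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class InvertedIndex:
    """Toy stand-in for a Lucene index: maps each term to the ids of records containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = defaultdict(set)
        self.records: Dict[int, dict] = {}

    def add(self, record_id: int, record: dict) -> None:
        # Store the record and index every whitespace-separated token of every field.
        self.records[record_id] = record
        for value in record.values():
            for token in str(value).lower().split():
                self.postings[token].add(record_id)

    def search(self, term: str) -> List[dict]:
        # Read the posting list for the term instead of scanning all records.
        ids = sorted(self.postings.get(term.lower(), set()))
        return [self.records[i] for i in ids]


def consume(stream: Iterable[dict], index: InvertedIndex) -> int:
    """S1 (assumed shape): consume records one by one and index each as it arrives."""
    count = 0
    for record_id, record in enumerate(stream):
        index.add(record_id, record)
        count += 1
    return count


# Hypothetical sample events standing in for a real-time feed.
events = [
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "logout"},
    {"user": "alice", "action": "purchase"},
]
index = InvertedIndex()
consumed = consume(events, index)
hits = index.search("alice")  # S4: query after the data has been processed
```

The point of the sketch is only that a query touches the posting list for one term rather than every record, which is the kind of speed-up an index layer is meant to provide.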
Compared with the prior art, the invention has the beneficial effects that:
1. structured or unstructured data can be stored on demand;
2. heterogeneous storage of data is supported, that is, data can be stored selectively on a SATA disk or an SSD as required, which accelerates data queries;
3. an index layer is added at data storage time, so the index accelerates queries and eliminates the T+1 day delay in data queries;
4. queries are written in SQL; the method is compatible with standard SQL, supports features such as multidimensional statistics and user-defined functions, and greatly reduces the learning cost;
5. by improving Lucene, data storage on HDFS is supported, which meets the storage requirements of large-scale data and supports storage of trillion-scale data;
6. indexing with Lucene accelerates data queries, so that queries over billions of records can return within seconds.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of Spark maintaining HDFS data index storage;
FIG. 3 is a schematic diagram of Spark using SQL to search the index and return data.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings. The described embodiments are obviously only some of the embodiments of the present invention, rather than all of them.
Referring to FIGS. 1-3, the present invention provides a technical solution, namely a Spark-based indexing method, including the following steps:
S1: consuming real-time data through a custom consumption interface in a Spark process, and building an index for the data through an index interface;
S2: modifying the native index interface of Lucene;
S3: combining the whole indexing program, the query interface and the Spark service;
S4: processing the data and then querying it.
Further, in step S1, after the index is built, the merging, deletion and updating of the index are maintained in the Spark process.
Further, in step S2, after the native index interface is modified, the data is indexed onto HDFS to support storage and querying of TB-level data.
Further, in step S3, after the indexing program, the query interface and the Spark service are combined, the data is queried in Spark using SQL statements.
The working principle is as follows. Real-time data is consumed through a custom consumption interface in a Spark process, an index is built for the data through an index interface, and the merging, deletion and updating of the index are maintained in Spark to keep the index valid. The native index interface of Lucene is modified so that data is indexed onto HDFS, which supports storage and querying of TB-level data. The whole indexing program, the query interface and the Spark service are combined so that data can be queried with Spark SQL, which improves usability. After this combination the timeliness of retrievable data is greatly improved: data can be queried and aggregated within 1-2 minutes after it is processed.
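The merge, delete and update maintenance mentioned above can be sketched as follows. The text does not describe how the index is maintained internally, so this is an assumption: the sketch borrows the general segment idea (immutable batches, tombstones for deletes, update as delete plus add, merge to compact), and the classes `Segment` and `SegmentedIndex` are hypothetical.

```python
from typing import Dict, List, Set


class Segment:
    """An immutable batch of indexed documents, loosely modelled on an index segment."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = dict(docs)          # doc_id -> text
        self.deleted: Set[str] = set()  # tombstones for deleted doc ids

    def live_docs(self) -> Dict[str, str]:
        return {d: t for d, t in self.docs.items() if d not in self.deleted}


class SegmentedIndex:
    def __init__(self) -> None:
        self.segments: List[Segment] = []

    def add_batch(self, docs: Dict[str, str]) -> None:
        # Each incoming batch becomes a new segment; existing segments are not rewritten.
        self.segments.append(Segment(docs))

    def delete(self, doc_id: str) -> None:
        # A delete only marks a tombstone; the data is dropped at merge time.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id: str, text: str) -> None:
        # An update is a delete of the old version followed by an add of the new one.
        self.delete(doc_id)
        self.add_batch({doc_id: text})

    def merge(self) -> None:
        # Compact all segments into one, discarding tombstoned documents.
        merged: Dict[str, str] = {}
        for seg in self.segments:
            merged.update(seg.live_docs())
        self.segments = [Segment(merged)]

    def search(self, term: str) -> List[str]:
        term = term.lower()
        return sorted(
            doc_id
            for seg in self.segments
            for doc_id, text in seg.live_docs().items()
            if term in text.lower().split()
        )


idx = SegmentedIndex()
idx.add_batch({"d1": "spark index", "d2": "lucene index"})
idx.add_batch({"d3": "hdfs storage"})
idx.delete("d2")
idx.update("d1", "spark sql query")
segments_before = len(idx.segments)
idx.merge()
segments_after = len(idx.segments)
```

Keeping writes append-only and deferring cleanup to a merge is one common way to keep an index valid while new data keeps arriving; whether the invention does exactly this is not stated.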
The overall idea of the invention is to add an index layer to the original Spark retrieval, thereby accelerating retrieval. The index layer uses native Lucene; Spark or an offline program can index data onto HDFS through the custom interface, and subsequent statistical analysis can use Spark index queries to return quickly. Spark SQL is chosen as the query language: the invention integrates closely with the Spark SQL engine, and a query language based on structured data simplifies querying and reduces the learning cost.
As shown in FIG. 2, in the process of Spark maintaining HDFS data index storage, Spark maintains a distributed process program; offline or real-time data is handled through a custom interface on Spark's analyzer, and the Spark process maintains the index, providing distributed control of index import, export, merging and so on.
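A rough sketch of distributed index export, import and merging follows. Everything in it is an assumption made for illustration: a temporary local directory stands in for HDFS, JSON files stand in for index files, documents are routed to shards by a CRC32 hash of their id, and the function names are hypothetical.

```python
import json
import tempfile
import zlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def shard_for(key: str, num_shards: int) -> int:
    # crc32 is stable across processes, unlike the built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def build_shards(records: Dict[str, str], num_shards: int) -> List[Dict[str, List[str]]]:
    """Route each record to a shard and build a term -> doc ids index per shard."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id, text in sorted(records.items()):
        postings = shards[shard_for(doc_id, num_shards)]
        for token in sorted(set(text.lower().split())):
            postings[token].append(doc_id)
    return [dict(s) for s in shards]


def export_shards(shards: List[Dict[str, List[str]]], root: Path) -> List[Path]:
    """Export: write one index file per shard under the storage root."""
    paths = []
    for i, postings in enumerate(shards):
        path = root / f"shard-{i}.json"
        path.write_text(json.dumps(postings, sort_keys=True), encoding="utf-8")
        paths.append(path)
    return paths


def import_and_merge(paths: List[Path]) -> Dict[str, List[str]]:
    """Import every shard file and merge the posting lists term by term."""
    merged = defaultdict(list)
    for path in paths:
        for term, ids in json.loads(path.read_text(encoding="utf-8")).items():
            merged[term].extend(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


records = {"d1": "spark index", "d2": "lucene index", "d3": "hdfs storage"}
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = export_shards(build_shards(records, num_shards=2), Path(tmp))
    merged_index = import_and_merge(shard_paths)
```

Because each shard is built and written independently, the per-shard work could run in parallel on separate workers, with merging done afterwards.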
As shown in FIG. 3, when Spark uses SQL to search the index and return data, the Spark SQL query interface distributes the SQL and aggregates the data, and its combination with the custom Lucene query interface achieves efficient and fast retrieval while preserving the usability of Spark.
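The distribute-then-aggregate pattern can be sketched with the Python standard library. This is an illustration under stated assumptions, not the invention's query path: each in-memory SQLite database stands in for one indexed data partition, a SQLite index on the filter column stands in for the Lucene index, and the table, column and function names are hypothetical.

```python
import sqlite3
from typing import Dict, List, Sequence, Tuple


def make_partition(rows: List[Tuple[str, int]]) -> sqlite3.Connection:
    """Create one partition with an index on the column used for filtering."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_name TEXT, amount INTEGER)")
    conn.execute("CREATE INDEX idx_events_user ON events (user_name)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def scatter_gather(
    partitions: List[sqlite3.Connection], sql: str, params: Sequence[str] = ()
) -> Dict[str, int]:
    """Send the same SQL to every partition, then sum the partial results per key."""
    totals: Dict[str, int] = {}
    for conn in partitions:
        for user_name, partial in conn.execute(sql, params):
            totals[user_name] = totals.get(user_name, 0) + partial
    return totals


partitions = [
    make_partition([("alice", 10), ("bob", 5)]),
    make_partition([("alice", 7), ("carol", 3)]),
]
sql = (
    "SELECT user_name, SUM(amount) FROM events "
    "WHERE user_name IN (?, ?) GROUP BY user_name"
)
totals = scatter_gather(partitions, sql, ("alice", "bob"))
```

Summation is used here because partial sums from each partition can be added to give the global sum; aggregates such as averages would need the partitions to return both a sum and a count.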
Similar technologies in the industry, such as Elasticsearch and Solr, also build their indexes on top of Lucene and can also search massive data, but the scheme set forth by the invention has quite obvious advantages in performance.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A Spark-based indexing method, comprising the following steps:
S1: consuming real-time data through a custom consumption interface in a Spark process, and building an index for the data through an index interface;
S2: modifying the native index interface of Lucene;
S3: combining the whole indexing program, the query interface and the Spark service;
S4: processing the data and then querying it.
2. The Spark-based indexing method according to claim 1, wherein in step S1, after the index is built, the merging, deletion and updating of the index are maintained in the Spark process.
3. The Spark-based indexing method according to claim 1, wherein in step S2, after the native index interface is modified, the data is indexed onto HDFS to support storage and querying of TB-level data.
4. The Spark-based indexing method according to claim 1, wherein in step S3, after the indexing program, the query interface and the Spark service are combined, SQL statements are used to query data in Spark.
CN201911026342.XA 2019-10-26 2019-10-26 indexing method based on Spark Pending CN110737667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911026342.XA CN110737667A (en) 2019-10-26 2019-10-26 indexing method based on Spark

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911026342.XA CN110737667A (en) 2019-10-26 2019-10-26 indexing method based on Spark

Publications (1)

Publication Number Publication Date
CN110737667A (en) 2020-01-31

Family

ID=69271592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911026342.XA Pending CN110737667A (en) 2019-10-26 2019-10-26 indexing method based on Spark

Country Status (1)

Country Link
CN (1) CN110737667A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122443A (en) * 2017-04-24 2017-09-01 中国科学院软件研究所 A kind of distributed full-text search system and method based on Spark SQL
CN108009270A (en) * 2017-12-18 2018-05-08 江苏润和软件股份有限公司 A kind of text searching method calculated based on distributed memory

Similar Documents

Publication Publication Date Title
US11093466B2 (en) Incremental out-of-place updates for index structures
CN103246749B (en) The matrix database system and its querying method that Based on Distributed calculates
CN107291807B (en) SPARQL query optimization method based on graph traversal
CN102521406B (en) Distributed query method and system for complex task of querying massive structured data
CN102521405B (en) Massive structured data storage and query methods and systems supporting high-speed loading
CN103366015B (en) A kind of OLAP data based on Hadoop stores and querying method
CN103631870B (en) System and method used for large-scale distributed data processing
CN107040422A (en) A kind of network big data method for visualizing cached based on materialization
CN103488704A (en) Method and device for storing data
US9229961B2 (en) Database management delete efficiency
CN105574054B (en) A kind of distributed caching range query method, apparatus and system
CN104850640A (en) HBase based storage and query method and system for power equipment status monitoring data
US20170068675A1 (en) Method and system for adapting a database kernel using machine learning
US11294816B2 (en) Evaluating SQL expressions on dictionary encoded vectors
CN103744913A (en) Database retrieval method based on search engine technology
Sarlis et al. Datix: A system for scalable network analytics
El Alami et al. Supply of a key value database redis in-memory by data from a relational database
Patel et al. Raw data processing framework for IoT
Brezany et al. An elastic OLAP cloud platform
CN108319604B (en) Optimization method for association of large and small tables in hive
Sawyer et al. Understanding query performance in Accumulo
Colmenares et al. A single-node datastore for high-velocity multidimensional sensor data
Mo et al. Asynchronous index strategy for high performance real-time big data stream storage
CN110737667A (en) indexing method based on Spark
CN110990368A (en) Full-link data management system and management method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200131