CN110737667A - indexing method based on Spark - Google Patents
- Publication number
- CN110737667A
- Authority
- CN
- China
- Prior art keywords
- spark
- index
- data
- interface
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Spark-based indexing method comprising the following steps: S1, consuming real-time data through a custom consumption interface in a Spark process and building an index over the data through an index interface; S2, modifying Lucene's native index interface; S3, combining the whole indexing program, the query interface, and the Spark service; S4, processing the data and then querying it. The method mainly adds a layer of index functionality to native Spark retrieval, which speeds up retrieval. The index layer uses native Lucene; Spark or an offline program can index data onto HDFS through the custom interface, and subsequent statistical analysis can use Spark index queries to return results quickly. Spark SQL is chosen as the query language: the invention is combined with the Spark SQL engine, and a query language based on structured data simplifies querying and reduces the learning cost.
Description
Technical Field
The invention relates to the technical field of big data query processing, and in particular to a Spark-based indexing method.
Background
In recent years, as technologies such as the Internet of Things, social networks, and cloud computing have become part of daily life, and as computing power, storage space, and network bandwidth have developed rapidly, the data accumulated in fields such as the internet, communications, finance, commerce, and medical treatment has kept growing.
For example, the original Hadoop can store and analyze TB-level data, and Hive, HBase, Pig, and other projects later built around the Hadoop ecosystem have made the choice of processing technology simpler and more diverse.
In recent years, big data technology has made great breakthroughs and many excellent projects have emerged, but processing and query analysis still face many difficulties. The shortcomings are summarized as follows:
1. data queries have poor real-time performance; for example, Hive runs on MapReduce at the bottom layer, and, as is well known, Hadoop's MapReduce performs IO too frequently and is inefficient;
2. to guarantee the timeliness of queries, compromises have to be made on the data storage format; the currently popular HBase, Cassandra, MongoDB, and the like are unstructured;
3. the learning cost is high; for example, using Elasticsearch, Solr, etc. requires learning a complete new set of APIs;
4. the storage form offers no better compression ratio;
5. systems are developed on top of the original index components without optimization, so index functions such as Lucene do not perform as well as they could;
6. the original components can only analyze data after reading it, and data retrieval lacks an index function; even where an index function is provided, retrieval usability is compromised.
Disclosure of Invention
The present invention aims to provide a Spark-based indexing method to solve the problems described in the background art.
To achieve the above purpose, the invention provides a Spark-based indexing method comprising the following steps:
S1: consuming real-time data through a custom consumption interface in a Spark process, and building an index over the data through an index interface;
S2: modifying Lucene's native index interface;
S3: combining the whole indexing program, the query interface, and the Spark service;
S4: processing the data, and then querying the data.
Preferably, in step S1, after the index is built, the merging, deleting, and updating of the index are maintained in the Spark process.
Preferably, in step S2, after the native index interface is modified, the data is indexed onto HDFS to satisfy the storage and querying of TB-level data.
Preferably, in step S3, after the indexing program, the query interface, and Spark are combined, the data is queried in Spark using SQL statements.
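The index maintenance named in step S1 (merging, deleting, and updating) can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: `Segment` and `ToyIndex` are hypothetical names, the code runs in plain Python, and it is neither the invention's implementation nor Lucene's API. It borrows the general segment idea found in Lucene-style indexes, where a delete is first recorded as a marker and a merge later rewrites the segments without the deleted documents.

```python
from collections import defaultdict


class Segment:
    """One batch of documents with its own inverted index (hypothetical)."""

    def __init__(self, docs):
        self.docs = dict(docs)            # doc_id -> text
        self.deleted = set()              # delete markers, dropped on merge
        self.postings = defaultdict(set)  # term -> doc_ids
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                self.postings[term].add(doc_id)

    def live_docs(self):
        """Return the documents that have not been marked as deleted."""
        return {d: t for d, t in self.docs.items() if d not in self.deleted}


class ToyIndex:
    """A list of segments supporting add, delete, update, and merge."""

    def __init__(self):
        self.segments = []

    def add_batch(self, docs):
        """Index one batch of (doc_id, text) pairs as a new segment."""
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        """Mark a document as deleted in every segment that holds it."""
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id, text):
        """An update is a delete of the old version plus a fresh add."""
        self.delete(doc_id)
        self.add_batch([(doc_id, text)])

    def merge(self):
        """Rewrite all segments into one, dropping deleted documents."""
        live = {}
        for seg in self.segments:
            live.update(seg.live_docs())
        self.segments = [Segment(live.items())]

    def search(self, term):
        """Return the sorted ids of live documents containing the term."""
        hits = set()
        for seg in self.segments:
            hits |= seg.postings.get(term.lower(), set()) - seg.deleted
        return sorted(hits)


index = ToyIndex()
index.add_batch([(1, "spark index"), (2, "lucene index")])
index.update(2, "lucene segment")  # old version of doc 2 is marked deleted
index.delete(1)
index.merge()                      # three segments collapse into one
```

After the merge, only the updated version of document 2 remains searchable, which is the "validity" that maintaining the index inside the Spark process is meant to protect.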
Compared with the prior art, the invention has the beneficial effects that:
1. structured or unstructured data can be stored on demand;
2. heterogeneous storage of data is supported, namely the data can be selectively stored on an SATA disk or an SSD disk according to requirements, and the data query is accelerated;
3. an index layer is added at data storage time, so the index speeds up queries and eliminates the T+1-day delay in data querying;
4. queries are written in SQL, which is compatible with standard SQL and supports features such as multidimensional statistics and user-defined functions, greatly reducing the learning cost;
5. by improving Lucene, data storage on HDFS is supported, meeting the storage requirements of large-scale data and supporting the storage of trillion-scale data;
6. the index is built with Lucene, which speeds up data queries, so that second-level queries over billions of records can be achieved.
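Benefit 4 (standard SQL, multidimensional statistics, and user-defined functions) is easy to see in a small query. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in SQL engine so that it runs without a cluster; this substitution is an assumption, not the Spark SQL engine the invention uses, and the `events` table, its columns, and the `latency_bucket` function are hypothetical.

```python
import sqlite3

# In-memory database standing in for the SQL query layer (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, device TEXT, latency_ms REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("east", "ssd", 12.0),
        ("east", "sata", 48.0),
        ("west", "ssd", 15.0),
        ("west", "ssd", 9.0),
    ],
)


def latency_bucket(ms):
    """Hypothetical user-defined function: classify a latency value."""
    return "fast" if ms < 20 else "slow"


# Register the Python function so it can be called from SQL.
conn.create_function("latency_bucket", 1, latency_bucket)

# Two grouping dimensions (region and bucket) plus the custom function.
rows = conn.execute(
    """
    SELECT region, latency_bucket(latency_ms) AS bucket, COUNT(*) AS n
    FROM events
    GROUP BY region, bucket
    ORDER BY region, bucket
    """
).fetchall()
```

A user who already knows SQL needs nothing beyond the `SELECT ... GROUP BY` form here, which is the learning-cost argument the benefit makes.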
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of Spark maintaining the storage of the HDFS data index;
FIG. 3 is a schematic diagram of Spark using SQL to search the index and return data.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them.
Referring to FIGS. 1-3, the present invention provides a technical solution, namely a Spark-based indexing method, comprising the following steps:
S1: consuming real-time data through a custom consumption interface in a Spark process, and building an index over the data through an index interface;
S2: modifying Lucene's native index interface;
S3: combining the whole indexing program, the query interface, and the Spark service;
S4: processing the data, and then querying the data.
Further, in step S1, after the index is built, the merging, deleting, and updating of the index are maintained in the Spark process.
Further, in step S2, after the native index interface is modified, the data is indexed onto HDFS to satisfy the storage and querying of TB-level data.
Further, in step S3, after the indexing program, the query interface, and Spark are combined, the data is queried in Spark using SQL statements.
The working principle is as follows. Real-time data is consumed through a custom consumption interface in a Spark process, an index is built over the data through an index interface, and the merging, deletion, and updating of the index are maintained in Spark to keep the index valid. Lucene's native index interface is modified so that the data is indexed onto HDFS, which satisfies the storage and querying of TB-level data. The whole indexing program, the query interface, and the Spark service are combined so that data can be queried with Spark SQL, which improves ease of use. After this combination, the timeliness of searchable data is greatly improved: data can be queried and counted within 1-2 minutes after it is processed.
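The consume-then-index flow in the working principle can be sketched as a small batch loop. This is a minimal illustration under stated assumptions: a standard-library `queue.Queue` stands in for the real-time source, a dictionary of sets stands in for the index, and `consume_batch`, `index_batch`, and `run_pipeline` are hypothetical names rather than the patent's custom interfaces.

```python
import queue
from collections import defaultdict


def consume_batch(source, max_records):
    """Drain up to max_records from the source without blocking."""
    batch = []
    while len(batch) < max_records:
        try:
            batch.append(source.get_nowait())
        except queue.Empty:
            break
    return batch


def index_batch(postings, batch):
    """Add every term of every record to the inverted index."""
    for record_id, text in batch:
        for term in text.lower().split():
            postings[term].add(record_id)


def run_pipeline(source, postings, batch_size=2):
    """Consume and index until the source is empty; return the batch count."""
    batches = 0
    while True:
        batch = consume_batch(source, batch_size)
        if not batch:
            return batches
        index_batch(postings, batch)
        batches += 1


# Hypothetical records arriving from a real-time source.
source = queue.Queue()
for record in [(1, "disk error on node7"), (2, "login ok"), (3, "disk replaced")]:
    source.put(record)

postings = defaultdict(set)
batches = run_pipeline(source, postings)
disk_hits = sorted(postings["disk"])  # records are searchable once indexed
```

Because each batch becomes searchable as soon as `index_batch` returns, the delay between arrival and queryability is bounded by the batch interval, which is the intuition behind the 1-2 minute figure quoted above.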
The overall idea of the invention is to add a layer of index functionality to native Spark retrieval, which speeds up retrieval. The index layer uses native Lucene; Spark or an offline program can index data onto HDFS through the custom interface, and subsequent statistical analysis can use Spark index queries to return results quickly. Spark SQL is chosen as the query language: the invention is combined with the Spark SQL engine, and a query language based on structured data simplifies querying and reduces the learning cost.
As shown in FIG. 2, when Spark maintains the storage of the HDFS data index, Spark maintains a distributed process. Offline or real-time data is fed through a custom interface into Spark's analyzer, and the Spark process maintains the index, providing distributed control over index import, export, merging, and so on.
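The import, export, and merge operations on a distributed index can be illustrated with index shards written to files. In this sketch a temporary local directory stands in for HDFS and JSON files stand in for index files; both are assumptions chosen so the example runs anywhere, and the function names (`shard_for`, `build_shards`, `export_shards`, `import_shards`, `merge_shards`) are hypothetical.

```python
import json
import tempfile
import zlib
from pathlib import Path


def shard_for(doc_id, num_shards):
    """Route a document to a shard with a stable hash of its id."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


def build_shards(docs, num_shards):
    """Build one inverted index (term -> sorted doc ids) per shard."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs:
        postings = shards[shard_for(doc_id, num_shards)]
        for term in text.lower().split():
            postings.setdefault(term, set()).add(doc_id)
    return [{t: sorted(ids) for t, ids in s.items()} for s in shards]


def export_shards(shards, root):
    """Write each shard to its own file under root."""
    for i, postings in enumerate(shards):
        (Path(root) / f"shard-{i}.json").write_text(json.dumps(postings))


def import_shards(root):
    """Read every shard file under root back into memory."""
    paths = sorted(Path(root).glob("shard-*.json"))
    return [json.loads(p.read_text()) for p in paths]


def merge_shards(shards):
    """Combine the postings of all shards into a single index."""
    merged = {}
    for postings in shards:
        for term, ids in postings.items():
            merged.setdefault(term, set()).update(ids)
    return {t: sorted(ids) for t, ids in merged.items()}


docs = [(1, "spark index"), (2, "lucene index"), (3, "hdfs storage")]
with tempfile.TemporaryDirectory() as root:  # stand-in for an HDFS path
    export_shards(build_shards(docs, num_shards=2), root)
    restored = import_shards(root)
merged = merge_shards(restored)
```

Keeping each shard in its own file is what lets separate workers import, export, or merge shards independently, which is the kind of distributed control the paragraph describes.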
As shown in FIG. 3, when Spark uses SQL to search the index and return data, the Spark SQL query interface distributes the SQL and aggregates the data. Combined with the custom Lucene query interface, this gives efficient, fast retrieval while preserving Spark's ease of use.
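Distributing a query and aggregating the partial results follows a scatter-gather pattern, sketched below. The example is a simplified model under stated assumptions: in-memory lists stand in for data partitions, a thread pool stands in for Spark executors, a dictionary per partition stands in for the Lucene index, and all names (`PARTITIONS`, `build_index`, `partial_count`, `count_by_host`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical data partitions of a "logs" table.
PARTITIONS = [
    [{"level": "error", "host": "a"}, {"level": "info", "host": "a"}],
    [{"level": "error", "host": "b"}, {"level": "error", "host": "a"}],
]


def build_index(rows, column):
    """Map each value of the column to the row positions that hold it."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[column], []).append(pos)
    return index


# One index per partition, built on the filter column.
INDEXES = [build_index(rows, "level") for rows in PARTITIONS]


def partial_count(partition_id, level):
    """Use the index to find matching rows, then count them per host."""
    rows = PARTITIONS[partition_id]
    positions = INDEXES[partition_id].get(level, [])
    return Counter(rows[pos]["host"] for pos in positions)


def count_by_host(level):
    """Model of: SELECT host, COUNT(*) FROM logs WHERE level = ? GROUP BY host."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        # Scatter: each partition answers the query from its own index.
        partials = pool.map(
            lambda pid: partial_count(pid, level), range(len(PARTITIONS))
        )
        # Gather: combine the per-partition counts into the final result.
        for partial in partials:
            total.update(partial)
    return dict(total)


result = count_by_host("error")
```

The index lookup means each partition touches only the matching rows instead of scanning everything, while the final merge step plays the role of the aggregation performed by the SQL query interface.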
Similar technologies in the industry, such as Elasticsearch and Solr, also build their indexes on Lucene and can likewise search massive data sets, but the scheme set forth by the invention has quite obvious advantages in performance.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (4)
1. A Spark-based indexing method, comprising the steps of:
S1: consuming real-time data through a custom consumption interface in a Spark process, and building an index over the data through an index interface;
S2: modifying Lucene's native index interface;
S3: combining the whole indexing program, the query interface, and the Spark service;
S4: processing the data, and then querying the data.
2. The Spark-based indexing method according to claim 1, wherein in step S1, after the index is built, the merging, deletion, and updating of the index are maintained in the Spark process.
3. The Spark-based indexing method according to claim 1, wherein in step S2, after the native index interface is modified, the data is indexed onto HDFS to satisfy the storage and querying of TB-level data.
4. The Spark-based indexing method according to claim 1, wherein in step S3, after the indexing program, the query interface, and Spark are combined, SQL statements are used to query the data in Spark.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911026342.XA CN110737667A (en) | 2019-10-26 | 2019-10-26 | indexing method based on Spark |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911026342.XA CN110737667A (en) | 2019-10-26 | 2019-10-26 | indexing method based on Spark |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110737667A true CN110737667A (en) | 2020-01-31 |
Family
ID=69271592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911026342.XA Pending CN110737667A (en) | 2019-10-26 | 2019-10-26 | indexing method based on Spark |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110737667A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107122443A (en) * | 2017-04-24 | 2017-09-01 | 中国科学院软件研究所 | A kind of distributed full-text search system and method based on Spark SQL |
CN108009270A (en) * | 2017-12-18 | 2018-05-08 | 江苏润和软件股份有限公司 | A kind of text searching method calculated based on distributed memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200131 |