CN111680043A - Method for rapidly searching mass data - Google Patents

Publication number
CN111680043A
Authority
CN
China
Prior art keywords
data
index
hbase
mass data
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010505012.5A
Other languages
Chinese (zh)
Other versions
CN111680043B (en)
Inventor
徐晓贝
陈胡
陈宽
陶伟洋
叶兆裕
王远友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing LES Information Technology Co. Ltd
Original Assignee
Nanjing LES Information Technology Co. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing LES Information Technology Co. Ltd
Priority to CN202010505012.5A
Publication of CN111680043A
Application granted
Publication of CN111680043B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation

Abstract

The invention discloses a method for rapidly retrieving mass data, comprising the following steps: constructing a mass data storage system; establishing a secondary index for the data in the mass data storage system; starting a data retrieval service and listening for HTTP requests; parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result; and, according to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data, and returning it. The method can rapidly retrieve mass data under multiple conditions and return query results within a very short time, overcoming the shortcomings of the existing technical schemes at minimal cost.

Description

Method for rapidly searching mass data
Technical Field
The invention belongs to the technical field of rapid retrieval of big data, and particularly relates to a method for rapidly retrieving mass data.
Background
With the development of society and technology, huge amounts of data are generated in different fields every day, and storing and using this data has become a very challenging technical problem. For example, in the transportation industry, the video detectors of a county-level city with 3 million people may generate 1,000 thousand vehicle-passing records. A typical transactional information management system stores this data in a relational database. Retrieval works normally in the first year, but once the data has accumulated for two years or longer, the data being searched for can no longer be found in a short time, even when the query method and storage design have been optimized to the limit. How to store mass data more effectively and retrieve it rapidly has therefore become an urgent problem.
At present, most systems under construction address this by adding storage nodes to the relational database and building a large number of indexes to achieve rapid retrieval. However, the maintenance cost of these indexes is very high: once data changes, indexes are rebuilt in large batches, and because the indexes and the data run in the same database instance, rebuilding the indexes directly degrades database performance and affects queries that are executing.
At present, there are two schemes for implementing rapid queries based on big data technology:
1. The data is stored in the distributed columnar database HBase. To support rapid conditional queries, the HBase RowKey must be designed around the query conditions: all query conditions are contained in the RowKey, which then acts as a globally unique index. The significant drawback is that once the query conditions change, the RowKey must be redesigned, the original main data can no longer be used, and the data must be regenerated under the new RowKey. The same service data therefore has to be stored in multiple copies under different RowKeys, which wastes a huge amount of storage space. This is an almost fatal problem.
2. The data is again stored in the distributed columnar database HBase, but a secondary index is designed according to the query conditions. The main data is stored in the table, and the secondary index is stored in the same storage region as the main data, so that a query first locates the secondary index and then locates the main data directly in the same region. The advantage is that the index and the main data share a storage region, which saves the time of fetching the main data across nodes. Secondary indexes solve the storage waste that scheme 1 suffers when query conditions change, but they create a new problem. HBase matches RowKeys from front to back in byte (ASCII) order, so supporting arbitrary combinations of several query conditions requires a very large number of secondary indexes. When there are 7 to 8 conditions, the indexes become so numerous that they occupy more storage space than the main data.
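The front-to-back matching limitation described in the two schemes above can be illustrated with a small in-memory sketch. It is not HBase code: the sorted Python list only imitates the sorted order in which HBase keeps rows, and the composite key layout (plate, crossing, time joined by "|") is a hypothetical example. The count of condition combinations is simply the number of non-empty subsets of n conditions, given as a rough indication of how many query shapes may need serving, not as the patent's own figure.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite RowKeys: <plate>|<crossing>|<date>, kept in sorted order.
rows = sorted([
    "A12345|X01|20200601",
    "A12345|X02|20200602",
    "B67890|X01|20200601",
    "C24680|X02|20200603",
])

def prefix_scan(sorted_keys, prefix):
    """Find keys that start with `prefix` by binary search on the sorted order."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_right(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

def full_scan(sorted_keys, crossing):
    """A condition on a non-leading field cannot use the prefix: every key is examined."""
    return [k for k in sorted_keys if k.split("|")[1] == crossing]

by_plate = prefix_scan(rows, "A12345|")   # served by the key order
by_crossing = full_scan(rows, "X01")      # needs another key layout or index

# Number of non-empty combinations of n query conditions.
combinations = {n: 2 ** n - 1 for n in (7, 8)}
```

A query on the leading field touches only a contiguous slice of the keys, while a query on the second field has to inspect all of them unless a second key ordering is stored, which is the storage cost both schemes run into.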
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a method for rapidly retrieving mass data. The method can rapidly retrieve mass data under multiple conditions and return query results within a very short time, overcoming the shortcomings of the existing technical schemes at minimal cost.
To achieve this purpose, the invention adopts the following technical scheme:
The invention relates to a method for rapidly retrieving mass data, comprising the following steps:
1) constructing a mass data storage system;
2) establishing a secondary index for the data in the mass data storage system;
3) starting a data retrieval service and listening for HTTP requests;
4) parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);
5) according to the ROWKEYs in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it.
Further, the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The service data to be retrieved is stored in HBase according to the respective service design, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.
Further, the step 2) specifically includes:
21) using Elasticsearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the respective query conditions; the index contains all query fields plus a RowKey field, which holds the unique primary key of the record that matches the conditions;
22) extracting all data from HBase and inserting into the Elasticsearch index the values of the fields that the index defines; this process is referred to as indexing the data;
23) when creating the Elasticsearch index, setting a different index type for each field depending on whether queries on it use fuzzy matching or whole-word matching: fields queried by fuzzy matching use the IK analyzer, and fields queried by whole-word matching are given the type keyword.
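Step 22) can be sketched as a projection followed by a bulk request body. The sketch below only builds the newline-delimited JSON that the Elasticsearch bulk API accepts; it does not contact a cluster. The field names are the ones used in the detailed description, while the index name vehicle_pass, the IMAGE_URL column, and the choice of the RowKey as the document _id are assumptions made for illustration.

```python
import json

# Query fields named in the detailed description; everything else stays in HBase only.
QUERY_FIELDS = ["PLATE_NO", "PLATE_TYPE", "PLATE_COLOR", "CROSSING_INDEX", "PASS_TIME"]

def to_index_doc(rowkey, row):
    """Keep only the query fields of an HBase row and attach its RowKey."""
    doc = {field: row[field] for field in QUERY_FIELDS if field in row}
    doc["ROWKEY"] = rowkey
    return doc

def to_bulk_body(index_name, rows):
    """Build a bulk body: one action line and one source line per row."""
    lines = []
    for rowkey, row in rows:
        # Assumption: the RowKey doubles as the document id, so re-indexing overwrites.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rowkey}}))
        lines.append(json.dumps(to_index_doc(rowkey, row)))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

sample_rows = [("rk-001", {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "IMAGE_URL": "x"})]
bulk_body = to_bulk_body("vehicle_pass", sample_rows)
```

Because only the query fields and the RowKey are copied, the index stays much smaller than the main data, which is the property the method relies on.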
Further, step 3) specifically includes: writing a back-end service interface that other programs can call; it listens for requests and returns the data results those programs need.
Further, step 4) specifically includes:
the back-end service interface parses the request content, creates an Elasticsearch Client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and obtains the resulting set of RowKeys.
Further, step 5) specifically includes:
creating an HBase Client API instance, passing the RowKey set to it, accessing the HBase service, and packaging the returned result set into a result list;
returning the result list to the client along the call stack.
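The two-phase flow of steps 4) and 5) can be sketched end to end with in-memory stand-ins. Nothing here talks to Elasticsearch or HBase: INDEX_DOCS imitates the secondary index, MAIN_TABLE imitates the HBase table, and all names and sample values are hypothetical.

```python
from typing import Dict, List

# Stand-in for the secondary index: query fields plus ROWKEY only.
INDEX_DOCS = [
    {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "ROWKEY": "rk-001"},
    {"PLATE_NO": "B67890", "PLATE_COLOR": "blue", "ROWKEY": "rk-002"},
    {"PLATE_NO": "A12399", "PLATE_COLOR": "yellow", "ROWKEY": "rk-003"},
]

# Stand-in for the main table: full records addressed by RowKey.
MAIN_TABLE = {
    "rk-001": {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "SPEED": 62},
    "rk-002": {"PLATE_NO": "B67890", "PLATE_COLOR": "blue", "SPEED": 48},
    "rk-003": {"PLATE_NO": "A12399", "PLATE_COLOR": "yellow", "SPEED": 71},
}

def search_index(conditions: Dict[str, str]) -> List[str]:
    """Phase 1: resolve the query conditions to a list of RowKeys."""
    return [
        doc["ROWKEY"]
        for doc in INDEX_DOCS
        if all(doc.get(field) == value for field, value in conditions.items())
    ]

def fetch_rows(rowkeys: List[str]) -> List[dict]:
    """Phase 2: read the full records by RowKey."""
    return [MAIN_TABLE[rk] for rk in rowkeys if rk in MAIN_TABLE]

def retrieve(conditions: Dict[str, str]) -> List[dict]:
    """Index lookup first, then point reads on the main table."""
    return fetch_rows(search_index(conditions))

blue_cars = retrieve({"PLATE_COLOR": "blue"})
```

The main table is only ever read by key, so its RowKey design does not need to anticipate the query conditions.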
The invention has the following beneficial effects:
1. Changes to the secondary index are independent of the main data and do not affect how the main data is stored.
2. The secondary index stores only the query conditions and the RowKey of the main data, so it occupies very little storage space.
3. When the query conditions change, only a new Elasticsearch index needs to be built.
4. Because of HBase's distributed storage design, exact paging cannot be achieved when querying HBase directly; Elasticsearch provides exact paging and makes up for this limitation.
5. With this technique, query efficiency does not fluctuate greatly as the data volume changes; query times for 1 billion records and 10 billion records are almost the same.
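The paging benefit in item 4 maps onto the from and size parameters of an Elasticsearch search request. The helper below is a minimal sketch of that arithmetic; the function name and the 1-based page convention are assumptions. The max_window guard mirrors the index.max_result_window setting, whose default of 10,000 limits how deep from plus size may reach.

```python
def page_window(page, size, max_window=10_000):
    """Translate a 1-based page number into Elasticsearch `from`/`size` values."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size
    if start + size > max_window:
        # Deeper pages need a larger index.max_result_window or a different paging scheme.
        raise ValueError("requested page exceeds the result window")
    return {"from": start, "size": size}

third_page = page_window(page=3, size=20)
```

The returned dictionary can be merged into a search request body so that each page returns exactly size RowKeys.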
Drawings
FIG. 1 shows a schematic diagram of the method of the present invention.
Detailed Description
To help those skilled in the art understand the invention, it is further described below with reference to an embodiment and the drawing, which are not intended to limit the invention.
Referring to FIG. 1, the method for rapidly retrieving mass data according to the invention includes the following steps:
1) Constructing a mass data storage system;
The mass data storage system is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The service data to be retrieved is stored in HBase according to the respective service design, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.
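The text does not specify how the RowKey is designed. One commonly used way to spread writes evenly over RegionServers is to prefix the natural key with a small hash-derived salt; the sketch below assumes that approach purely for illustration, with a hypothetical key layout (plate number plus pass time) and a hypothetical bucket count of 16.

```python
import hashlib

def salted_rowkey(plate_no, pass_time, buckets=16):
    """Build a RowKey whose leading salt spreads records across `buckets` key ranges."""
    natural = f"{plate_no}_{pass_time}"
    digest = hashlib.md5(natural.encode("utf-8")).digest()
    salt = digest[0] % buckets          # deterministic, so the key can be recomputed
    return f"{salt:02d}_{natural}"

key = salted_rowkey("A12345", "20200605120000")
```

Since the complete RowKey is stored in the secondary index, a salted prefix does not complicate retrieval: the record is still fetched by an exact key read.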
2) Establishing a secondary index for the data in the mass data storage system;
21) configuring settings:
211) setting the number of shards, namely the value of the number_of_shards parameter, to the number of cluster nodes;
212) setting the number of replicas, namely the value of the number_of_replicas parameter, which is the number of redundant copies of the data, to 0, i.e. no backup;
213) setting the data compression mode, namely setting the value of the codec parameter to best_compression, which compresses the data more effectively and significantly reduces disk usage;
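The three settings in 211) to 213) correspond to the following index settings body, shown here as a Python dictionary that could be serialized and sent when the index is created. The function name and the example node count of 3 are assumptions.

```python
import json

def index_settings(cluster_nodes):
    """Settings from steps 211)-213): shards = node count, no replicas, best_compression."""
    return {
        "settings": {
            "index": {
                "number_of_shards": cluster_nodes,
                "number_of_replicas": 0,       # no redundant copies, as in 212)
                "codec": "best_compression",   # smaller stored fields on disk
            }
        }
    }

settings_json = json.dumps(index_settings(cluster_nodes=3), indent=2)
```

Setting number_of_replicas to 0 trades fault tolerance of the index for disk space, which is acceptable here only because the index can be rebuilt from the data in HBase.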
22) configuring custom analyzers (an analyzer has one tokenizer and may have zero or more token filters):
221) defining a custom tokenizer of type edge_ngram, with min_gram set to 1, max_gram set to 10, and token_chars set to letter and digit, so that tokens are made up of letters and digits and the text is split at any other character, which suits the index fields in the requirements;
222) defining a custom index analyzer for the PLATE_NO field, whose tokenizer is the custom tokenizer from the previous step and whose token filters are the built-in lowercase token filter, which suits the PLATE_NO field;
223) defining a custom query analyzer for the PLATE_NO field, whose tokenizer is the built-in keyword tokenizer and whose token filters are the built-in lowercase token filter, so that the query analyzer is matched to the index analyzer; this meets the requirements and lets queries by PLATE_NO retrieve the required data more efficiently and accurately;
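The analyzer configuration in 221) to 223) can be written as the analysis section of the index settings. The names plate_edge_ngram, plate_index_analyzer and plate_search_analyzer are hypothetical. The two helper functions are only a rough pure-Python emulation of what the analyzers produce, included to show why a prefix typed in a query matches an indexed plate number; they are not the Elasticsearch implementation.

```python
import re

ANALYSIS = {
    "analysis": {
        "tokenizer": {
            "plate_edge_ngram": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 10,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            "plate_index_analyzer": {
                "type": "custom",
                "tokenizer": "plate_edge_ngram",
                "filter": ["lowercase"],
            },
            "plate_search_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            },
        },
    }
}

def index_terms(text, min_gram=1, max_gram=10):
    """Emulate edge_ngram + lowercase: prefixes of each letter/digit run."""
    terms = []
    for word in re.findall(r"[^\W_]+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            terms.append(word[:n].lower())
    return terms

def search_terms(text):
    """Emulate keyword tokenizer + lowercase: the whole input as one term."""
    return [text.lower()]

indexed = index_terms("A12345")
matches_prefix = search_terms("A123")[0] in indexed
matches_suffix = search_terms("2345")[0] in indexed
```

Indexing every prefix once means a query term is compared as a single exact term at search time, so prefix matching costs no more than a term lookup; the trade-off is that only leading fragments match.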
23) configuring field mappings:
231) setting the type of the ROWKEY field to object with the enabled parameter set to false. The value of this field is only passed to HBase to retrieve data, and no search on this field is needed in Elasticsearch. With enabled set to false, Elasticsearch skips parsing of the field content entirely; the value can still be obtained from the _source field, but it cannot be searched and is neither indexed nor stored in any other way, which reduces disk usage;
232) setting the type of the CROSSING_INDEX, PLATE_TYPE and PLATE_COLOR fields to keyword. These fields hold structured content and are typically used for filtering, sorting and aggregation, so keyword is the more appropriate type; a keyword field can only be searched by its exact value;
233) setting the type of the PLATE_NO field to text, which suits fields that need full-text search. A text field stores normalization factors in the index to enable document scoring; if the field only needs to be matched and the resulting score is of no interest, the norms parameter can be set to false. By default, a text field also stores term frequencies and positions in the index: frequencies are used to compute scores and positions are used to run phrase queries. If phrase queries are not needed, the index_options parameter can be set to freqs so that Elasticsearch does not index positions. These settings speed up queries and reduce disk usage. The analyzer parameter of the PLATE_NO field is set to the custom index analyzer and its search_analyzer parameter to the custom query analyzer; the text field is also mapped as a keyword field through multi-fields for sorting or aggregation;
234) setting the type of the PASS_TIME field to date, with its format set through the format parameter.
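The field mappings in 231) to 234) can be collected into one mappings body. The sketch assumes a typeless mapping as used in Elasticsearch 7 and later. The analyzer names passed as defaults, the sub-field name raw, and the date format string are hypothetical choices; the text only says that a format is set.

```python
def field_mappings(index_analyzer="plate_index_analyzer",
                   search_analyzer="plate_search_analyzer",
                   date_format="yyyy-MM-dd HH:mm:ss"):
    """Mappings for the fields named in steps 231)-234)."""
    return {
        "mappings": {
            "properties": {
                # Kept in _source only; never parsed, indexed, or searched.
                "ROWKEY": {"type": "object", "enabled": False},
                "CROSSING_INDEX": {"type": "keyword"},
                "PLATE_TYPE": {"type": "keyword"},
                "PLATE_COLOR": {"type": "keyword"},
                "PLATE_NO": {
                    "type": "text",
                    "norms": False,             # scores are not needed
                    "index_options": "freqs",   # no positions, so no phrase queries
                    "analyzer": index_analyzer,
                    "search_analyzer": search_analyzer,
                    # Multi-field: exact value for sorting and aggregation.
                    "fields": {"raw": {"type": "keyword"}},
                },
                "PASS_TIME": {"type": "date", "format": date_format},
            }
        }
    }

mappings = field_mappings()
```

The two analyzer names must match analyzers defined in the index settings, otherwise index creation is rejected.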
3) Starting a data retrieval service and listening for HTTP requests;
Writing a back-end service interface that other programs can call; it listens for requests and returns the data results those programs need.
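The text does not name a web framework for the back-end service interface. A minimal sketch using only the Python standard library could look as follows; the /search path in the example, the port, and the whitelist of accepted parameters are assumptions, and the handler simply echoes the parsed conditions where a real service would run the two-phase retrieval.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical whitelist: only known query fields become retrieval conditions.
ALLOWED = {"PLATE_NO", "PLATE_TYPE", "PLATE_COLOR", "CROSSING_INDEX"}

def parse_conditions(path):
    """Turn the query string of a request path into a dict of retrieval conditions."""
    query = parse_qs(urlparse(path).query)
    return {k: v[0] for k, v in query.items() if k in ALLOWED and v[0]}

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        conditions = parse_conditions(self.path)
        # Placeholder response; a real service would query the index and then HBase.
        body = json.dumps({"conditions": conditions}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Block and listen for HTTP requests on the given port."""
    HTTPServer(("", port), SearchHandler).serve_forever()

example = parse_conditions("/search?PLATE_NO=A123&PLATE_COLOR=blue&debug=1")
```

Calling serve() starts the listener; unknown parameters such as debug are dropped before any index request is built.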
4) Parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);
The back-end service interface parses the request content, creates an Elasticsearch Client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and obtains the resulting set of RowKeys.
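How the retrieval conditions become an Elasticsearch request is not spelled out in the text. One plausible shape, sketched below without any client library, is a bool query in filter context that returns only the ROWKEY from _source. The condition keys PASS_TIME_FROM and PASS_TIME_TO and the sort on PASS_TIME are assumptions; the response layout (hits.hits with _source) is the standard search response structure.

```python
KEYWORD_FIELDS = ("CROSSING_INDEX", "PLATE_TYPE", "PLATE_COLOR")

def build_search_body(conditions, page=1, size=20):
    """Build a search request body from parsed retrieval conditions."""
    filters = []
    for field in KEYWORD_FIELDS:
        if field in conditions:
            filters.append({"term": {field: conditions[field]}})  # exact match
    if "PLATE_NO" in conditions:
        filters.append({"match": {"PLATE_NO": conditions["PLATE_NO"]}})  # analyzed match
    time_range = {}
    if "PASS_TIME_FROM" in conditions:
        time_range["gte"] = conditions["PASS_TIME_FROM"]
    if "PASS_TIME_TO" in conditions:
        time_range["lte"] = conditions["PASS_TIME_TO"]
    if time_range:
        filters.append({"range": {"PASS_TIME": time_range}})
    return {
        "query": {"bool": {"filter": filters}},
        "_source": ["ROWKEY"],               # only the RowKey is needed downstream
        "sort": [{"PASS_TIME": "desc"}],     # assumed ordering, keeps pages stable
        "from": (page - 1) * size,
        "size": size,
    }

def extract_rowkeys(response):
    """Collect the ROWKEY values from a search response."""
    return [hit["_source"]["ROWKEY"] for hit in response["hits"]["hits"]]

body = build_search_body({"PLATE_NO": "A123", "PLATE_COLOR": "blue"}, page=2, size=10)
fake_response = {"hits": {"hits": [{"_source": {"ROWKEY": "rk-001"}},
                                   {"_source": {"ROWKEY": "rk-002"}}]}}
rowkeys = extract_rowkeys(fake_response)
```

Filter context skips scoring, which fits the mapping choice of disabling norms on PLATE_NO.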
5) According to the ROWKEYs in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it.
Creating an HBase Client API instance, passing the RowKey set to it, accessing the HBase service, and packaging the returned result set into a result list;
The result list is returned to the client along the call stack.
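Packaging the HBase results into a result list can be sketched as batched reads that preserve the order returned by the index. The multi_get parameter is a hypothetical callable standing in for whatever batch read the HBase client offers; here it is stubbed with a dictionary so the sketch runs on its own. The batch size of 100 is likewise an assumption.

```python
from typing import Callable, Dict, Iterator, List

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_order(rowkeys: List[str],
                   multi_get: Callable[[List[str]], Dict[str, dict]],
                   batch_size: int = 100) -> List[dict]:
    """Read records in batches and keep them in the order of `rowkeys`."""
    results = []
    for batch in chunked(rowkeys, batch_size):
        found = multi_get(batch)  # maps RowKey -> record for the keys that exist
        results.extend(found[rk] for rk in batch if rk in found)
    return results

# Stub standing in for the HBase table.
TABLE = {f"rk-{i:03d}": {"id": i} for i in range(5)}

def stub_multi_get(keys):
    return {k: TABLE[k] for k in keys if k in TABLE}

ordered = fetch_in_order(["rk-003", "rk-000", "rk-999", "rk-004"], stub_multi_get, batch_size=2)
```

RowKeys that are present in the index but missing from the table are skipped rather than raising, which tolerates an index that is briefly ahead of or behind the main data.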
The method stores mass data in the distributed columnar database HBase and uses the distributed full-text search engine Elasticsearch to build a secondary index for the data. When retrieving data, HBase is not queried directly: the RowKeys of the matching data are first obtained through the secondary index, the data is then read from HBase by RowKey, and the results that satisfy the conditions are returned.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (6)

1. A method for rapidly retrieving mass data, characterized by comprising the following steps:
1) constructing a mass data storage system;
2) establishing a secondary index for the data in the mass data storage system;
3) starting a data retrieval service and listening for HTTP requests;
4) parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result;
5) according to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data, and returning it.
2. The method for rapidly retrieving mass data according to claim 1, wherein the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop; the service data to be retrieved is stored in HBase according to the respective service design, and a suitable RowKey is designed to serve as the unique identifier of each record; and the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance.
3. The method for rapidly retrieving mass data according to claim 1, wherein step 2) specifically comprises:
21) using Elasticsearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the respective query conditions; the index contains all query fields plus a RowKey field, which holds the unique primary key of the record that matches the conditions;
22) extracting all data from HBase and inserting into the Elasticsearch index the values of the fields that the index defines; this process is referred to as indexing the data;
23) when creating the Elasticsearch index, setting a different index type for each field depending on whether queries on it use fuzzy matching or whole-word matching: fields queried by fuzzy matching use the IK analyzer, and fields queried by whole-word matching are given the type keyword.
4. The method for rapidly retrieving mass data according to claim 1, wherein step 3) specifically comprises: writing a back-end service interface that other programs can call; it listens for requests and returns the data results those programs need.
5. The method for rapidly retrieving mass data according to claim 1, wherein step 4) specifically comprises:
the back-end service interface parses the request content, creates an Elasticsearch Client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and obtains the resulting set of RowKeys.
6. The method for rapidly retrieving mass data according to claim 1, wherein step 5) specifically comprises:
creating an HBase Client API instance, passing the RowKey set to it, accessing the HBase service, and packaging the returned result set into a result list;
returning the result list to the client along the call stack.
CN202010505012.5A 2020-06-05 2020-06-05 Method for quickly retrieving mass data Active CN111680043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010505012.5A CN111680043B (en) 2020-06-05 2020-06-05 Method for quickly retrieving mass data

Publications (2)

Publication Number Publication Date
CN111680043A 2020-09-18
CN111680043B 2023-11-28

Family

ID=72435070

Country Status (1)

Country Link
CN (1) CN111680043B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100510A (en) * 2020-11-18 2020-12-18 树根互联技术有限公司 Mass data query method and device based on Internet of vehicles platform
CN112632157A (en) * 2021-03-11 2021-04-09 全时云商务服务股份有限公司 Multi-condition paging query method under distributed system
WO2023143095A1 (en) * 2022-01-25 2023-08-03 Zhejiang Dahua Technology Co., Ltd. Method and system for data query

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682073A (en) * 2016-11-14 2017-05-17 上海轻维软件有限公司 HBase fuzzy retrieval system based on Elastic Search
CN109165222A (en) * 2018-08-20 2019-01-08 福州大学 A kind of HBase secondary index creation method and system based on coprocessor
CN109299102A (en) * 2018-10-23 2019-02-01 中国电子科技集团公司第二十八研究所 A kind of HBase secondary index system and method based on Elastcisearch
CN109800222A (en) * 2018-12-11 2019-05-24 中国科学院信息工程研究所 A kind of HBase secondary index adaptive optimization method and system

Also Published As

Publication number Publication date
CN111680043B (en) 2023-11-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant