CN111680043A - Method for rapidly searching mass data - Google Patents
Method for rapidly searching mass data
- Publication number
- CN111680043A (application CN202010505012.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- index
- hbase
- mass data
- service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/221—Column-oriented storage; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
Abstract
The invention discloses a method for rapidly searching mass data, comprising the following steps: constructing a mass data storage system; establishing a secondary index for the data in the mass data storage system; starting a data retrieval service that listens for HTTP requests; parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining the response result; and, using the ROWKEYs of the records in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it. The method can retrieve mass data quickly by multiple conditions and return the query result within a very short time, thereby overcoming the defects of the prior technical schemes at minimal cost.
Description
Technical Field
The invention belongs to the technical field of big data quick retrieval, and particularly relates to a quick retrieval method for mass data.
Background
With the development of society and technology, huge amounts of data are generated every day in many fields, and storing and using these data has become a very challenging technical problem. For example, in the transportation industry, a county-level city of 3 million people may produce on the order of 1000 thousand vehicle-passage records from its video detectors. A conventional transactional information management system stores these data in a relational database. Retrieval works normally during the first year, but once two or more years of data have accumulated, the data being searched for can no longer be returned in a short time, even after the query method and the storage design have been optimized to the utmost. How to store mass data more effectively and retrieve it quickly has therefore become an urgent problem.
At present, most systems under construction address this by adding storage nodes to the relational database and building a large number of indexes. However, the maintenance cost of these indexes is very high: once the data changes, indexes are rebuilt in large batches, and because the indexes and the data live in the same database instance, rebuilding them directly degrades database performance and affects queries that are executing.
At present, there are two schemes for implementing fast query based on big data technology, as follows:
1. The data is stored in the distributed column-oriented database HBase. To support fast conditional queries, the HBase RowKey is designed around the query conditions: all query conditions are encoded in the RowKey, which then acts as a globally unique index. The significant drawback is that once the query conditions change, the RowKey must be redesigned; the original main data can no longer be used and must be regenerated under the new RowKey. The same service data therefore has to be stored in multiple copies under different RowKeys, which wastes a huge amount of storage space. This is almost a fatal problem.
2. The data is again stored in HBase, and secondary indexes are designed according to the query conditions. The main data is stored in a table, and each secondary index is stored in the same storage region as the main data it refers to, so that a query first locates the secondary index and then locates the main data directly within the same region. The advantage is that, with index and main data co-located, no time is spent fetching the main data across nodes. This design solves the storage waste that scheme 1 incurs when query conditions change, but it creates a new problem. HBase matches RowKeys from front to back in the byte (ASCII) order of the RowKey, so supporting arbitrary combinations of several query conditions requires a very large number of secondary indexes. Once there are 7 to 8 conditions, the indexes become so numerous that they occupy more storage space than the main data.
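To make this growth concrete, the sketch below counts the distinct combinations of query conditions that a prefix-matched index scheme has to anticipate. It assumes, for illustration only, that every non-empty subset of the condition fields can appear in a query; the field names are hypothetical, and the count is an upper bound on query shapes rather than the exact number of indexes the scheme would build.

```python
from itertools import combinations


def query_shapes(fields):
    """Return every non-empty combination of condition fields."""
    shapes = []
    for size in range(1, len(fields) + 1):
        shapes.extend(combinations(fields, size))
    return shapes


# Hypothetical condition fields, for illustration only.
fields = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]
shapes = query_shapes(fields)
print(len(shapes))  # 2**7 - 1 = 127
```

Because one index ordered as (f1, f2, f3) also serves queries on f1 alone and on f1 plus f2, the real index count is lower than the number of shapes, but it still grows quickly as conditions are added.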
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a method for fast searching mass data; the method can quickly retrieve mass data according to multiple conditions and return the query result in a very short time range, thereby solving the defects of the prior technical scheme at the minimum cost.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention relates to a method for quickly searching mass data, which comprises the following steps:
1) constructing a mass data storage system;
2) establishing a secondary index for the data in the mass data storage system;
3) starting a data retrieval service that listens for HTTP requests;
4) parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining the response result (a set of ROWKEYs);
5) using the ROWKEYs in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it.
Further, the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The service data to be retrieved is stored in HBase according to the design of each service, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple region servers of HBase, which improves concurrent processing performance and avoids local hotspots.
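The text does not say how the RowKey is designed. One common way to get both uniqueness and an even spread across region servers is to prefix the natural key with a hash-derived salt; the sketch below shows that idea only as an assumption. The function name, the bucket count, and the choice of key components are all hypothetical, and uniqueness holds only if the chosen components are themselves unique per record.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of salt buckets, e.g. one per pre-split region


def make_rowkey(plate_no: str, pass_time: str, crossing: str) -> bytes:
    """Build a RowKey whose leading salt spreads writes across regions."""
    natural_key = f"{pass_time}|{crossing}|{plate_no}"
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    salt = digest[0] % NUM_BUCKETS  # deterministic, so the key can be rebuilt later
    return f"{salt:02d}|{natural_key}".encode("utf-8")


key = make_rowkey("A12345", "20200605083000", "X001")
```

Since the secondary index returns complete RowKeys, the salt does not have to be recomputed at query time; it only influences where each row is placed.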
Further, the step 2) specifically includes:
21) adopting ElasticSearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the query conditions of each service; the index contains all query fields plus a RowKey field, which holds the unique primary key of the record that matches the conditions.
22) Extracting all data from HBase and, for each record, inserting the values of the fields defined in the ElasticSearch index into that index; this process is referred to as indexing the data;
23) when creating the ElasticSearch index, setting the type of each field according to whether queries on it use fuzzy matching or whole-word matching: fuzzily matched fields use the IK analyzer, and whole-word matched fields are given the type keyword.
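As a rough illustration of step 22), the sketch below reduces each HBase row to an index document that holds only the query fields plus the RowKey, and serializes the documents in the newline-delimited format accepted by the ElasticSearch bulk API. The field names are taken from the embodiment; the index name `vehicle_pass`, the helper names, and the choice to reuse the RowKey as the document `_id` are assumptions.

```python
import json

QUERY_FIELDS = ("PLATE_NO", "PLATE_TYPE", "PLATE_COLOR", "CROSSING_INDEX", "PASS_TIME")


def to_index_doc(rowkey: str, row: dict) -> dict:
    """Keep only the query fields and attach the RowKey of the main data."""
    doc = {field: row[field] for field in QUERY_FIELDS if field in row}
    doc["ROWKEY"] = rowkey
    return doc


def to_bulk_body(index_name: str, rows: dict) -> str:
    """Serialize rows as action/document line pairs for a bulk request."""
    lines = []
    for rowkey, row in rows.items():
        # Reusing the RowKey as _id makes re-indexing the same row idempotent.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rowkey}}))
        lines.append(json.dumps(to_index_doc(rowkey, row)))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


rows = {
    "rk-001": {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "IMAGE_PATH": "/img/1.jpg"},
}
bulk_body = to_bulk_body("vehicle_pass", rows)
```

Columns that are never queried (here the hypothetical `IMAGE_PATH`) stay in HBase only, which is what keeps the index small.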
Further, the step 3) specifically includes: writing a back-end service interface that other programs can call; the interface listens for requests and returns the data those programs require.
Further, the step 4) specifically includes:
the back-end service interface parses the request content, creates an ElasticSearch Client API instance, specifies the index to use, sends the request to the ElasticSearch index service, and receives the set of RowKeys that match the query.
Further, the step 5) specifically includes:
creating an HBase Client API instance, passing the RowKey set to it, accessing the HBase service, and packaging the returned result set into a result list;
the result list is returned to the client along the call stack.
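The two-phase lookup of steps 4) and 5) can be sketched without either service by using in-memory stand-ins: a list of index documents plays the role of the ElasticSearch index and a dictionary keyed by RowKey plays the role of the HBase table. Everything here is illustrative; the function names and sample data are hypothetical, and exact-match filtering stands in for a real index query.

```python
def search_rowkeys(index_docs, conditions):
    """Stand-in for the index query: exact-match every condition."""
    return [
        doc["ROWKEY"]
        for doc in index_docs
        if all(doc.get(field) == value for field, value in conditions.items())
    ]


def fetch_rows(table, rowkeys):
    """Stand-in for the HBase read: look up each RowKey in the main data."""
    return [dict(table[key], ROWKEY=key) for key in rowkeys if key in table]


def retrieve(index_docs, table, conditions):
    """Phase 1 resolves conditions to RowKeys; phase 2 reads the main data."""
    return fetch_rows(table, search_rowkeys(index_docs, conditions))


INDEX_DOCS = [
    {"PLATE_NO": "a12345", "PLATE_COLOR": "blue", "ROWKEY": "rk-001"},
    {"PLATE_NO": "b67890", "PLATE_COLOR": "blue", "ROWKEY": "rk-002"},
    {"PLATE_NO": "a12345", "PLATE_COLOR": "yellow", "ROWKEY": "rk-003"},
]
MAIN_TABLE = {
    "rk-001": {"PLATE_NO": "a12345", "PLATE_COLOR": "blue", "SPEED": "62"},
    "rk-002": {"PLATE_NO": "b67890", "PLATE_COLOR": "blue", "SPEED": "48"},
    "rk-003": {"PLATE_NO": "a12345", "PLATE_COLOR": "yellow", "SPEED": "55"},
}

result_list = retrieve(INDEX_DOCS, MAIN_TABLE, {"PLATE_COLOR": "blue"})
```

The main table is only ever read by key, so adding or changing query conditions touches the index documents and leaves the main data as it is.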
The invention has the beneficial effects that:
1. Any change to the secondary index is independent of the main data and does not affect how the main data is stored.
2. The secondary index stores only the query conditions and the RowKey of the main data, so it occupies very little storage space.
3. When the query conditions change, only a new ElasticSearch index needs to be created.
4. Because of its distributed storage design, HBase cannot provide exact paging when it is queried directly; ElasticSearch provides an exact paging function and thus makes up for this limitation.
5. With this technique, query efficiency does not fluctuate greatly as the data size changes; query times for 1 billion records and 10 billion records are almost the same.
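The paging mentioned in point 4 can be illustrated with the `from` and `size` parameters of an ElasticSearch search request. The sketch below converts a 1-based page number into those parameters and adds a sort so that page boundaries are stable; the helper name is hypothetical, and sorting by PASS_TIME (a field from the embodiment) is an assumption.

```python
def paged_search_body(query: dict, page: int, page_size: int) -> dict:
    """Build a search body that returns one page of RowKeys."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "sort": [{"PASS_TIME": "desc"}], # assumed sort field, for stable pages
        "_source": ["ROWKEY"],
        "query": query,
    }


body = paged_search_body({"match_all": {}}, page=3, page_size=20)
```

By default ElasticSearch rejects requests where `from` plus `size` exceeds the index's `max_result_window` (10,000 unless changed), so very deep pages need a different mechanism such as `search_after`.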
Drawings
FIG. 1 shows a schematic diagram of the method of the present invention.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to fig. 1, a method for fast retrieving massive data according to the present invention includes the following steps:
1) constructing a mass data storage system;
the mass data storage system is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The service data to be retrieved is stored in HBase according to the design of each service, and a suitable RowKey is designed to serve as the unique identifier of each record; the data is distributed evenly across multiple region servers of HBase, which improves concurrent processing performance and avoids local hotspots.
2) Establishing a secondary index for the data in the mass data storage system;
21) configuring the index settings:
211) setting the number of shards, i.e., the value of the number_of_shards parameter, to the number of cluster nodes;
212) setting the number of replicas, i.e., the value of the number_of_replicas parameter, which is the number of redundant copies of the data, to 0, i.e., no replication;
213) setting the data compression mode, i.e., setting the value of the codec parameter to best_compression, which compresses the data more effectively and significantly reduces disk usage;
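The three settings above correspond to the following index-creation body, shown here as a Python dictionary that can be serialized to JSON. This is a minimal sketch: the function name and the example node count are assumptions, while the setting names and values are those described in the text.

```python
import json


def index_settings(cluster_nodes: int) -> dict:
    """Index settings described in items 211) to 213)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": cluster_nodes,  # one shard per cluster node
                "number_of_replicas": 0,            # no redundant copies
                "codec": "best_compression",        # trade CPU for smaller disk usage
            }
        }
    }


settings_json = json.dumps(index_settings(cluster_nodes=5), indent=2)
```

With zero replicas the index itself has no redundancy; since it is derived entirely from the data held in HBase, it can be rebuilt from there if a shard is lost.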
22) configuring custom analyzers (an analyzer has one tokenizer and zero or more token filters):
221) defining a custom tokenizer of type edge_ngram, with min_gram set to 1, max_gram set to 10, and token_chars set to letter and digit, so that only letters and digits are kept in tokens and any other character acts as a split point, which suits the index fields in this requirement;
222) defining a custom index analyzer for the PLATE_NO field, with its tokenizer set to the custom tokenizer from the previous step and its token filters set to the built-in lowercase token filter, which suits the PLATE_NO field;
223) defining a custom search analyzer for the PLATE_NO field, with its tokenizer set to the built-in keyword tokenizer and its token filters set to the built-in lowercase token filter; this search analyzer is paired with the custom index analyzer of the PLATE_NO field, so that queries by PLATE_NO retrieve the required data more efficiently and accurately;
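The analyzer configuration in items 221) to 223) can be written as the `analysis` section of the index settings, as sketched below. The names `plate_edge_ngram`, `plate_index_analyzer`, and `plate_search_analyzer` are hypothetical. The small `edge_ngrams` function is only a pure-Python approximation of what the index analyzer emits, included to show why a lowercased prefix typed by the user matches an indexed plate number; it is not ElasticSearch code.

```python
import re

ANALYSIS = {
    "analysis": {
        "tokenizer": {
            "plate_edge_ngram": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 10,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            "plate_index_analyzer": {
                "type": "custom",
                "tokenizer": "plate_edge_ngram",
                "filter": ["lowercase"],
            },
            "plate_search_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            },
        },
    }
}


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate the index analyzer: prefixes of each letter/digit run, lowercased."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):  # runs of letters and digits
        for length in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:length].lower())
    return tokens


indexed = edge_ngrams("AB123")
print(indexed)  # ['a', 'ab', 'ab1', 'ab12', 'ab123']
```

At search time the keyword tokenizer keeps the whole input as one token, so a query such as `AB1` becomes `ab1` and matches any plate whose indexed prefixes contain it. A query longer than max_gram (10 characters) would find no matching token under this configuration.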
23) configuring the field mappings:
231) setting the type of the ROWKEY field to object with the enabled parameter set to false. The value of this field is only passed to HBase to retrieve data and never needs to be searched in ElasticSearch. With enabled set to false, ElasticSearch skips parsing the field content entirely: the value can still be read from the _source field, but it is not searchable and is not indexed or stored in any other way, which reduces disk usage;
232) setting the type of the CROSSING_INDEX, PLATE_TYPE, and PLATE_COLOR fields to keyword. These fields hold structured content that is typically used for filtering, sorting, and aggregation, so keyword is the more appropriate type; a keyword field can only be searched by its exact value.
233) Setting the type of the PLATE_NO field to text, which suits fields that need full-text search. A text field stores normalization factors in the index so that documents can be scored; if the field only needs to be matched and the resulting score is of no interest, the norms parameter can be set to false. By default a text field also stores term frequencies and positions in the index: frequencies are used to compute scores and positions are used to run phrase queries. If phrase queries are not needed, the index_options parameter can be set to freqs so that ElasticSearch does not index positions. These settings speed up queries and reduce disk usage. The analyzer parameter of the PLATE_NO field is set to the custom index analyzer and the search_analyzer parameter to the custom search analyzer; in addition, multi-fields are used to map the text field as a keyword field for sorting or aggregation;
234) setting the type of the PASS_TIME field to date, with its format given by the format parameter.
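Put together, the field mappings of items 231) to 234) might look like the dictionary below. It is a sketch under several assumptions: typeless mappings as used by ElasticSearch 7 and later, analyzer names (`plate_index_analyzer`, `plate_search_analyzer`) that are hypothetical and assumed to be defined in the index settings, a hypothetical sub-field name `raw` for the multi-field, and an example date format that the text does not specify.

```python
import json

MAPPINGS = {
    "mappings": {
        "properties": {
            # Kept in _source only; never parsed, indexed, or searchable.
            "ROWKEY": {"type": "object", "enabled": False},
            # Exact-value fields for filtering, sorting, and aggregation.
            "CROSSING_INDEX": {"type": "keyword"},
            "PLATE_TYPE": {"type": "keyword"},
            "PLATE_COLOR": {"type": "keyword"},
            "PLATE_NO": {
                "type": "text",
                "norms": False,             # scores are not needed
                "index_options": "freqs",   # no positions, so no phrase queries
                "analyzer": "plate_index_analyzer",
                "search_analyzer": "plate_search_analyzer",
                "fields": {"raw": {"type": "keyword"}},  # for sorting/aggregation
            },
            "PASS_TIME": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
        }
    }
}

mappings_json = json.dumps(MAPPINGS, indent=2)
```

Because ElasticSearch skips parsing a disabled field, the ROWKEY value can be a plain string even though the field is declared as an object.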
3) Starting a data retrieval service that listens for HTTP requests;
writing a back-end service interface that other programs can call; the interface listens for requests and returns the data those programs require.
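A back-end interface of this kind could be as small as the standard-library HTTP server sketched below. The `/search` path, the port, and the function names are assumptions, and `run_retrieval` is a placeholder for the index lookup and HBase read described in steps 4) and 5).

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def run_retrieval(conditions: dict) -> list:
    """Placeholder for steps 4) and 5); returns no rows in this sketch."""
    return []


def handle_query(path: str) -> tuple:
    """Turn a request path into an (HTTP status, JSON body) pair."""
    parsed = urlparse(path)
    if parsed.path != "/search":
        return 404, json.dumps({"error": "not found"})
    # Keep the first value of each query parameter as a retrieval condition.
    conditions = {name: values[0] for name, values in parse_qs(parsed.query).items()}
    return 200, json.dumps({"conditions": conditions, "results": run_retrieval(conditions)})


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_query(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(host: str = "127.0.0.1", port: int = 8080) -> ThreadingHTTPServer:
    """Create the server; call serve_forever() on the result to start listening."""
    return ThreadingHTTPServer((host, port), SearchHandler)
```

Calling `make_server().serve_forever()` starts the listener, after which a request such as `GET /search?PLATE_NO=a123&PLATE_COLOR=blue` is parsed into a dictionary of conditions.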
4) Parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining the response result (a set of ROWKEYs);
the back-end service interface parses the request content, creates an ElasticSearch Client API instance, specifies the index to use, sends the request to the ElasticSearch index service, and receives the set of RowKeys that match the query.
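One way to turn the parsed request into index retrieval conditions is to build an ElasticSearch query body and then read the RowKeys out of the response, as sketched below. The field names come from the embodiment; the parameter names `PASS_TIME_FROM` and `PASS_TIME_TO`, the helper names, and the decision to place every condition in a `bool` filter are assumptions. Sending the body to the cluster is left to whichever client is in use.

```python
KEYWORD_FIELDS = ("CROSSING_INDEX", "PLATE_COLOR", "PLATE_TYPE")


def build_search_body(conditions: dict, size: int = 100) -> dict:
    """Translate request conditions into a filter-only search body."""
    filters = []
    for field in KEYWORD_FIELDS:
        if field in conditions:
            filters.append({"term": {field: conditions[field]}})  # exact match
    if "PLATE_NO" in conditions:
        filters.append({"match": {"PLATE_NO": conditions["PLATE_NO"]}})  # analyzed match
    time_range = {}
    if "PASS_TIME_FROM" in conditions:
        time_range["gte"] = conditions["PASS_TIME_FROM"]
    if "PASS_TIME_TO" in conditions:
        time_range["lte"] = conditions["PASS_TIME_TO"]
    if time_range:
        filters.append({"range": {"PASS_TIME": time_range}})
    # Only the RowKey is needed from each hit.
    return {"size": size, "_source": ["ROWKEY"], "query": {"bool": {"filter": filters}}}


def extract_rowkeys(response: dict) -> list:
    """Collect the RowKey of every hit in a search response."""
    return [hit["_source"]["ROWKEY"] for hit in response["hits"]["hits"]]


body = build_search_body({"PLATE_COLOR": "blue", "PLATE_NO": "a12",
                          "PASS_TIME_FROM": "2020-06-01 00:00:00"})
```

Filter clauses skip relevance scoring, which fits a setup where the hits are used only as keys for the subsequent HBase read.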
5) Using the ROWKEYs in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it.
Creating an HBase Client API instance, passing the RowKey set to it, accessing the HBase service, and packaging the returned result set into a result list;
the result list is returned to the client along the call stack.
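The read-and-parse part of step 5) can be sketched as a batched get followed by decoding each row's cells into a plain record. The `FakeTable` class is a hypothetical stand-in for a real HBase client; its `rows` method is modelled on the shape that the happybase library's `Table.rows` returns (pairs of row key and a dictionary of `family:qualifier` bytes to value bytes), which is an assumption about the client in use. The column family `d` is also hypothetical.

```python
class FakeTable:
    """In-memory stand-in for an HBase table handle (illustrative only)."""

    def __init__(self, data):
        self._data = data

    def rows(self, keys):
        # Return (rowkey, cells) pairs for the keys that exist.
        return [(key, self._data[key]) for key in keys if key in self._data]


def parse_row(rowkey: bytes, cells: dict) -> dict:
    """Decode one row's cells into a record keyed by column qualifier."""
    record = {"ROWKEY": rowkey.decode("utf-8")}
    for column, value in cells.items():
        qualifier = column.split(b":", 1)[1]  # drop the column family prefix
        record[qualifier.decode("utf-8")] = value.decode("utf-8")
    return record


def fetch_records(table, rowkeys):
    """Read all RowKeys in one batch and keep the order given by the index."""
    keys = [key.encode("utf-8") for key in rowkeys]
    found = dict(table.rows(keys))
    return [parse_row(key, found[key]) for key in keys if key in found]


table = FakeTable({
    b"rk-001": {b"d:PLATE_NO": b"a12345", b"d:SPEED": b"62"},
    b"rk-002": {b"d:PLATE_NO": b"b67890", b"d:SPEED": b"48"},
})
result_list = fetch_records(table, ["rk-002", "rk-404", "rk-001"])
```

Keeping the order of the RowKey set means that any sorting or paging applied by the index query carries over to the returned result list; RowKeys that no longer exist in the table are skipped.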
In this method, mass data is stored in the distributed column-oriented database HBase, and a secondary index over the data is built with the distributed full-text search engine ElasticSearch. When data is retrieved, HBase is not queried directly: the RowKeys of the matching records are first obtained from the secondary index, the records are then read from HBase by RowKey, and the results that satisfy the conditions are returned.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (6)
1. A method for rapidly searching mass data is characterized by comprising the following steps:
1) constructing a mass data storage system;
2) establishing a secondary index for the data in the mass data storage system;
3) starting a data retrieval service that listens for HTTP requests;
4) parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining the response result;
5) using the ROWKEYs in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it.
2. The method for rapidly retrieving mass data according to claim 1, wherein the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop; the service data to be retrieved is stored in HBase according to the design of each service, and a suitable RowKey is designed to serve as the unique identifier of each record; and the data is distributed evenly across multiple region servers of HBase, which improves concurrent processing performance.
3. The method for rapidly retrieving mass data according to claim 1, wherein the step 2) specifically comprises:
21) adopting ElasticSearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the query conditions of each service; the index contains all query fields plus a RowKey field, which holds the unique primary key of the record that matches the conditions;
22) extracting all data from HBase and, for each record, inserting the values of the fields defined in the ElasticSearch index into that index; this process is referred to as indexing the data;
23) when creating the ElasticSearch index, setting the type of each field according to whether queries on it use fuzzy matching or whole-word matching: fuzzily matched fields use the IK analyzer, and whole-word matched fields are given the type keyword.
4. The method for rapidly retrieving mass data according to claim 1, wherein the step 3) specifically comprises: writing a back-end service interface that other programs can call; the interface listens for requests and returns the data those programs require.
5. The method for rapidly retrieving mass data according to claim 1, wherein the step 4) specifically comprises:
the back-end service interface parses the request content, creates an ElasticSearch Client API instance, specifies the index to use, sends the request to the ElasticSearch index service, and receives the set of RowKeys that match the query.
6. The method for rapidly retrieving mass data according to claim 1, wherein the step 5) specifically comprises:
creating an HBase Client API instance, passing the RowKey set to it, accessing the HBase service, and packaging the returned result set into a result list;
the result list is returned to the client along the call stack.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010505012.5A CN111680043B (en) | 2020-06-05 | 2020-06-05 | Method for quickly retrieving mass data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680043A (en) | 2020-09-18 |
CN111680043B (en) | 2023-11-28 |
Family
ID=72435070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010505012.5A Active CN111680043B (en) | 2020-06-05 | 2020-06-05 | Method for quickly retrieving mass data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680043B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100510A (en) * | 2020-11-18 | 2020-12-18 | 树根互联技术有限公司 | Mass data query method and device based on Internet of vehicles platform |
CN112632157A (en) * | 2021-03-11 | 2021-04-09 | 全时云商务服务股份有限公司 | Multi-condition paging query method under distributed system |
WO2023143095A1 (en) * | 2022-01-25 | 2023-08-03 | Zhejiang Dahua Technology Co., Ltd. | Method and system for data query |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682073A (en) * | 2016-11-14 | 2017-05-17 | 上海轻维软件有限公司 | HBase fuzzy retrieval system based on Elastic Search |
CN109165222A (en) * | 2018-08-20 | 2019-01-08 | 福州大学 | A kind of HBase secondary index creation method and system based on coprocessor |
CN109299102A (en) * | 2018-10-23 | 2019-02-01 | 中国电子科技集团公司第二十八研究所 | A kind of HBase secondary index system and method based on Elastcisearch |
CN109800222A (en) * | 2018-12-11 | 2019-05-24 | 中国科学院信息工程研究所 | A kind of HBase secondary index adaptive optimization method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111680043B (en) | 2023-11-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||