CN111680043A

CN111680043A - Method for rapidly searching mass data

Info

Publication number: CN111680043A
Application number: CN202010505012.5A
Authority: CN
Inventors: 徐晓贝; 陈胡; 陈宽; 陶伟洋; 叶兆裕; 王远友
Original assignee: Nanjing LES Information Technology Co. Ltd
Current assignee: Nanjing LES Information Technology Co. Ltd
Priority date: 2020-06-05
Filing date: 2020-06-05
Publication date: 2020-09-18
Anticipated expiration: 2040-06-05
Also published as: CN111680043B

Abstract

The invention discloses a method for rapidly retrieving mass data, comprising the following steps: constructing a mass data storage system; establishing a secondary index for the data in the mass data storage system; starting a data retrieval service and listening for HTTP requests; parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining a response result; and, according to the ROWKEYs of the data in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it. The method can retrieve mass data quickly under multiple conditions and return the query result within a very short time, thereby overcoming the shortcomings of the existing technical schemes at minimal cost.

Description

Method for rapidly searching mass data

Technical Field

The invention belongs to the technical field of big data quick retrieval, and particularly relates to a quick retrieval method for mass data.

Background

With the development of society and technology, every field generates a huge amount of data each day, and storing and using that data has become a very challenging technical problem. For example, in the transportation industry, the video detectors of a county-level city of 3 million people may generate a million or more vehicle-passage records. An ordinary transaction-oriented information management system stores this data in a relational database. Retrieval works normally in the first year, but once two or more years of data have accumulated, the data being sought can no longer be queried within a short time, even when the query method and the storage design have been optimized to the limit. How to store mass data more effectively and retrieve it quickly has therefore become an urgent problem.

At present, most systems under construction address this by adding storage nodes to the relational database and building a large number of indexes to speed up retrieval. The maintenance cost of these indexes is very high: once data changes, indexes are rebuilt in large batches, and because the indexes and the data live in the same database instance, the rebuilding directly degrades database performance and affects queries that are executing.

At present, there are two schemes for implementing fast query based on big data technology, as follows:

1. The data is stored in the distributed column-oriented database HBase. To support fast conditional queries, the HBase RowKey must be designed around the query conditions: all query conditions are contained in the RowKey, which thereby acts as a globally unique index. The significant drawback is that once the query conditions change, the RowKey must be redesigned; the original main data can no longer be used and must be regenerated under the new RowKey. The same service data therefore has to be stored in multiple copies under different RowKeys, which wastes a huge amount of storage space. This is an almost fatal problem.

2. The data is again stored in the distributed column-oriented database HBase, and secondary indexes are designed according to the query conditions. The main data is stored in a table, and each secondary index is stored in the same storage region as the main data it refers to, so that a query first locates the secondary index and then locates the main data directly within the same region. The advantage is that, because the index and the main data are co-located, no time is spent fetching the main data across nodes. Secondary indexes solve the storage waste that scheme 1 suffers when query conditions change, but they create a new problem. HBase matches RowKeys from front to back in ASCII (byte) order, that is, by prefix. With several query conditions, supporting every combination of conditions therefore requires a very large number of secondary indexes; at 7 to 8 conditions the indexes become so numerous that they occupy more storage space than the main data.
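The front-to-back matching behavior described above can be illustrated with a small sketch. It uses a sorted Python list as a stand-in for HBase's key-ordered storage; the RowKey layout `<plate>|<crossing>|<time>` and the sample values are assumptions made only for illustration.

```python
from bisect import bisect_left

# Hypothetical composite RowKeys of the form "<plate>|<crossing>|<time>",
# kept sorted the way a key-ordered store keeps its rows.
ROWKEYS = sorted([
    "A12345|X01|20200601080000",
    "A12345|X02|20200601090000",
    "B67890|X01|20200601081500",
    "C24680|X01|20200601083000",
])


def prefix_scan(sorted_keys, prefix):
    """Return keys starting with prefix: one binary search, then a short walk."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        hits.append(key)
    return hits


def full_scan(sorted_keys, crossing):
    """A condition on a non-leading component has to inspect every key."""
    return [k for k in sorted_keys if k.split("|")[1] == crossing]


by_plate = prefix_scan(ROWKEYS, "A12345|")   # served by the key order
by_crossing = full_scan(ROWKEYS, "X01")      # not served by the key order
```

A query on the leading component touches only the matching range, while a query on the crossing alone scans everything unless a second key ordering (that is, another secondary index) is stored, which is why the number of indexes grows with the number of condition combinations.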

Disclosure of Invention

In view of the above shortcomings of the prior art, the present invention aims to provide a method for rapidly retrieving mass data. The method can retrieve mass data quickly under multiple conditions and return the query result within a very short time, thereby overcoming the shortcomings of the existing schemes at minimal cost.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

The invention provides a method for rapidly retrieving mass data, comprising the following steps:

1) constructing a mass data storage system;

2) establishing a secondary index for the data in the mass data storage system;

3) starting a data retrieval service and listening for HTTP requests;

4) parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining a response result (a set of ROWKEYs);

5) according to the ROWKEYs of the data in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it.

Further, the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The service data to be retrieved is stored in HBase according to the respective service design, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.
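The text does not say how the RowKey is designed to spread data evenly. One common approach, shown here purely as an assumption, is to prefix the natural key with a hash-derived bucket number ("salting"). The bucket count, the key layout, and the function name `make_rowkey` are hypothetical.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of pre-split regions; not taken from the source


def make_rowkey(plate_no: str, pass_time: str, crossing: str) -> str:
    """Build a unique RowKey whose leading bucket spreads writes across regions."""
    natural = f"{plate_no}|{pass_time}|{crossing}"
    # A stable hash of the natural key picks the bucket, so the same record
    # always maps to the same RowKey.
    digest = hashlib.md5(natural.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{natural}"


rowkey = make_rowkey("A12345", "20200601080000", "X01")
```

Because records are later fetched by their full RowKey, which the secondary index returns, a salted prefix does not get in the way of reads: every lookup is a point read rather than a range scan.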

Further, the step 2) specifically includes:

21) adopting ElasticSearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the respective query conditions; the index fields comprise all query fields plus a RowKey field, which holds the value of the unique primary key of the record that matches the conditions;

22) extracting all data from HBase and, for each record, inserting the values of the fields that belong to the ElasticSearch index into that index; this process is referred to as indexing the data;

23) when creating the ElasticSearch index, setting different field types according to whether each field is queried by fuzzy matching or by whole-term matching: the IK analyzer is used for fields queried by fuzzy matching, and the type of fields queried by whole-term matching is set to keyword.
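The indexing process of step 22) can be sketched as follows: each HBase record is reduced to its query fields plus the RowKey and serialized in the newline-delimited format accepted by the Elasticsearch bulk API. The field names are borrowed from the embodiment described later; the index name, the choice of the RowKey as document `_id`, and the function names are assumptions for illustration.

```python
import json

# Query fields carried into the index; everything else stays only in HBase.
QUERY_FIELDS = ("PLATE_NO", "PLATE_TYPE", "PLATE_COLOR", "CROSSING_INDEX", "PASS_TIME")


def to_index_doc(rowkey: str, row: dict) -> dict:
    """Keep only the query fields and attach the RowKey of the main data."""
    doc = {field: row[field] for field in QUERY_FIELDS if field in row}
    doc["ROWKEY"] = rowkey
    return doc


def to_bulk_lines(index: str, rows: dict) -> str:
    """Serialize rows as bulk-API NDJSON: one action line, then one source line."""
    lines = []
    for rowkey, row in rows.items():
        # Using the RowKey as _id is an assumption; it makes re-indexing idempotent.
        lines.append(json.dumps({"index": {"_index": index, "_id": rowkey}}))
        lines.append(json.dumps(to_index_doc(rowkey, row), ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


sample_rows = {
    "rk1": {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "IMAGE_PATH": "/img/1.jpg"},
}
bulk_body = to_bulk_lines("vehicle_pass", sample_rows)
```

Only the small index document leaves HBase; non-query columns such as the hypothetical `IMAGE_PATH` are not copied, which is what keeps the secondary index compact.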

Further, the step 3) specifically includes: writing a back-end service interface that other programs can call; the interface listens for requests and returns the data results those programs require.
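A minimal sketch of such a service interface, using only the Python standard library, might look like the following. The endpoint path `/search`, the handler and function names, and the placeholder `retrieve` function are all assumptions; the source does not specify a framework or URL layout.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def retrieve(conditions: dict) -> list:
    """Hypothetical placeholder for steps 4) and 5): index lookup, then HBase read."""
    return [{"echo": conditions}]


class RetrievalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/search":  # hypothetical endpoint path
            self.send_error(404)
            return
        # Keep the first value of each query parameter as a retrieval condition.
        conditions = {k: v[0] for k, v in parse_qs(parsed.query).items()}
        body = json.dumps(retrieve(conditions)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging in this sketch


def start_service(port: int = 0) -> ThreadingHTTPServer:
    """Start listening in a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), RetrievalHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_service()` and requesting `/search?PLATE_NO=A12345` on the returned server's port yields a JSON list; in a real deployment `retrieve` would be replaced by the index lookup and HBase read of the later steps.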

Further, the step 4) specifically includes:

the back-end service interface parses the request content, creates an ElasticSearch Client API instance, specifies the index to use, sends the request to the ElasticSearch index service, and obtains the result set of RowKeys returned by the query.
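As a sketch of this step, the parsed request conditions can be turned into an Elasticsearch query body and the RowKeys pulled out of the response. The field names come from the embodiment described later; the request parameter names `PASS_TIME_FROM` and `PASS_TIME_TO` and both function names are assumptions. The client call itself is omitted, so the example works on plain dictionaries.

```python
# Fields matched by exact value (keyword type in the embodiment).
EXACT_FIELDS = ("CROSSING_INDEX", "PLATE_TYPE", "PLATE_COLOR")


def build_search_body(conditions: dict) -> dict:
    """Translate request conditions into a bool/filter query that returns only ROWKEY."""
    filters = []
    for field in EXACT_FIELDS:
        if field in conditions:
            filters.append({"term": {field: conditions[field]}})
    if "PLATE_NO" in conditions:
        # Analyzed field: a match query applies the field's search analyzer.
        filters.append({"match": {"PLATE_NO": conditions["PLATE_NO"]}})
    time_range = {}
    if "PASS_TIME_FROM" in conditions:  # hypothetical request parameter
        time_range["gte"] = conditions["PASS_TIME_FROM"]
    if "PASS_TIME_TO" in conditions:    # hypothetical request parameter
        time_range["lte"] = conditions["PASS_TIME_TO"]
    if time_range:
        filters.append({"range": {"PASS_TIME": time_range}})
    # Filter context skips scoring; _source filtering returns only the RowKey.
    return {"query": {"bool": {"filter": filters}}, "_source": ["ROWKEY"]}


def extract_rowkeys(response: dict) -> list:
    """Collect the RowKey of every hit from a search response."""
    return [hit["_source"]["ROWKEY"] for hit in response["hits"]["hits"]]


body = build_search_body({"PLATE_COLOR": "blue", "PLATE_NO": "A12"})
```

Adding or dropping a condition only changes the list of filters; no new index or key layout is needed, which is the point of moving the conditions out of the RowKey.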

Further, the step 5) specifically includes:

creating an HBase Client API instance, passing the RowKey set to the HBase Client API, accessing the HBase service, and packaging the returned result set into a result list;

the result list is returned to the client along the call stack.
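The batch read and packaging can be sketched as below. The function accepts any table-like object exposing `rows(keys)` that returns `(rowkey, columns)` pairs; this interface is modeled on the happybase client's `Table.rows`, and that correspondence should be treated as an assumption. `FakeTable` is an in-memory stand-in so the example runs without an HBase cluster, and it uses strings where a real client would typically return bytes.

```python
def fetch_result_list(table, rowkeys):
    """Read rows by RowKey in one batch and package them in index order."""
    fetched = dict(table.rows(list(rowkeys)))
    results = []
    for key in rowkeys:  # keep the order produced by the index query
        columns = fetched.get(key)
        if columns is None:
            continue  # index entry without a matching row; skipped in this sketch
        record = {"ROWKEY": key}
        record.update(columns)
        results.append(record)
    return results


class FakeTable:
    """In-memory stand-in for an HBase table (hypothetical, for illustration)."""

    def __init__(self, data):
        self._data = data

    def rows(self, keys):
        # Returns rows in key order, not request order, and omits missing keys.
        return [(k, self._data[k]) for k in sorted(keys) if k in self._data]


table = FakeTable({
    "rk1": {"d:PLATE_NO": "A12345"},
    "rk2": {"d:PLATE_NO": "B67890"},
})
result_list = fetch_result_list(table, ["rk2", "missing", "rk1"])
```

Re-ordering by the RowKey list matters because any sorting or paging was decided by the index query, not by the storage layer.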

The invention has the beneficial effects that:

1. Any change to the secondary index is independent of the main data and does not affect the storage of the main data.

2. The secondary index stores only the query conditions and the RowKey of the main data, so it occupies very little storage space.

3. When the query conditions change, only a new ElasticSearch index needs to be built.

4. Because of HBase's distributed storage design, data queries against HBase cannot be paged accurately; ElasticSearch provides accurate paging, which makes up for the inability to page when HBase data is queried directly.

5. Query efficiency does not fluctuate greatly with data size: the query time for 1 billion records and for 10 billion records is almost the same.
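The paging mentioned in item 4 maps onto the `from` and `size` parameters of an Elasticsearch search request. The helper below is a minimal sketch; the function name and the 1-based page convention are assumptions, while the 10,000-hit limit reflects Elasticsearch's default `index.max_result_window`.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch's default index.max_result_window


def page_window(page: int, page_size: int) -> dict:
    """Convert a 1-based page number into from/size search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    if offset + page_size > MAX_RESULT_WINDOW:
        # Deep pages need search_after or a larger window setting.
        raise ValueError("requested page lies beyond the default result window")
    return {"from": offset, "size": page_size}


third_page = page_window(3, 20)
```

The returned dictionary can be merged into the search body, so each page of RowKeys is resolved by the index before HBase is touched.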

Drawings

FIG. 1 shows a schematic diagram of the method of the present invention.

Detailed Description

In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.

Referring to fig. 1, a method for fast retrieving massive data according to the present invention includes the following steps:

1) constructing a mass data storage system;

The mass data storage system is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The service data to be retrieved is stored in HBase according to the respective service design, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.

2) Establishing a secondary index for the data in the mass data storage system;

21) configuring settings:

211) setting the number of shards, namely the value of the number_of_shards parameter, to the number of cluster nodes;

212) setting the number of replicas, namely the value of the number_of_replicas parameter, which represents the number of redundant copies of the data, to 0, that is, no redundant copies are kept;

213) setting the data compression mode, namely setting the value of the codec parameter to best_compression, which compresses the data more effectively and significantly reduces disk usage;

22) configuring custom analyzers (an analyzer has one tokenizer and may have zero or more token filters):

221) defining a custom tokenizer of type edge_ngram, with the parameter min_gram set to 1, the parameter max_gram set to 10, and token_chars set to letter and digit, so that tokens are made up of letters and digits and the text is split at any other character, which suits the index field in this requirement;

222) defining the custom index analyzer for the field PLATE_NO: its tokenizer is the custom tokenizer from the previous step, and its token filters are set to the built-in lowercase token filter, which suits the field PLATE_NO;

223) defining the custom search analyzer for the field PLATE_NO: its tokenizer is the built-in keyword tokenizer and its token filters are set to the built-in lowercase token filter, so that the search analyzer pairs with the index analyzer; data queried by PLATE_NO can then be retrieved more efficiently and accurately;

23) configuring field mappings:

231) setting the type of the ROWKEY field to object and its enabled parameter to false. The value of this field is only handed to HBase to fetch data and never needs to be searched in ElasticSearch. With enabled set to false, ElasticSearch skips parsing of the field content entirely; the value can still be obtained from the _source field, but the field cannot be searched and is not indexed or stored in any other way, which reduces disk usage;

232) setting the type of the CROSSING_INDEX, PLATE_TYPE and PLATE_COLOR fields to keyword. These fields hold structured content and are typically used for filtering, sorting and aggregation, so keyword is the more appropriate type; a keyword field can only be searched by its exact value;

233) setting the type of the PLATE_NO field to text, which suits fields that need full-text search. A text field stores normalization factors in the index so that documents can be scored; if the field only needs to be matched and the resulting score is of no interest, the norms parameter can be set to false. By default a text field also stores term frequencies and positions in the index: frequencies are used to compute scores and positions are used to run phrase queries. If phrase queries are not needed, the index_options parameter can be set to freqs so that ElasticSearch does not index positions. These settings speed up queries and reduce disk usage. The analyzer parameter of the PLATE_NO field is set to the custom index analyzer and the search_analyzer parameter to the custom search analyzer; the text field is additionally mapped as a keyword field through multi-fields so that it can be used for sorting or aggregation;

234) setting the type of the PASS_TIME field to date, with its format specified by the format parameter.
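Steps 21) to 23) can be gathered into a single index-creation body, sketched below as a Python dictionary in the typeless mapping form used by Elasticsearch 7 and later. Several values are assumptions not stated in the text: a 3-node cluster, the tokenizer and analyzer names, the sub-field name `raw`, and the date format. The small `edge_ngrams` function is only a rough emulation of what the index analyzer emits, included to show why the search analyzer keeps the query as one lowercased term.

```python
import json

INDEX_BODY = {
    "settings": {
        "index": {
            "number_of_shards": 3,        # assumed 3-node cluster (step 211)
            "number_of_replicas": 0,      # no redundant copies (step 212)
            "codec": "best_compression",  # step 213
        },
        "analysis": {
            "tokenizer": {
                "plate_edge_ngram": {     # hypothetical name (step 221)
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "plate_index_analyzer": {   # hypothetical name (step 222)
                    "type": "custom",
                    "tokenizer": "plate_edge_ngram",
                    "filter": ["lowercase"],
                },
                "plate_search_analyzer": {  # hypothetical name (step 223)
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "ROWKEY": {"type": "object", "enabled": False},       # step 231
            "CROSSING_INDEX": {"type": "keyword"},                # step 232
            "PLATE_TYPE": {"type": "keyword"},
            "PLATE_COLOR": {"type": "keyword"},
            "PLATE_NO": {                                         # step 233
                "type": "text",
                "analyzer": "plate_index_analyzer",
                "search_analyzer": "plate_search_analyzer",
                "norms": False,
                "index_options": "freqs",
                "fields": {"raw": {"type": "keyword"}},  # hypothetical sub-field name
            },
            # The format string is an assumption; the text only says one is set.
            "PASS_TIME": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},  # step 234
        }
    },
}


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Rough emulation: split on non letter/digit, emit lowercased prefixes."""
    tokens, word = [], ""
    for ch in text + " ":  # trailing space flushes the last word
        if ch.isalpha() or ch.isdigit():
            word += ch
            continue
        tokens += [word[:n].lower() for n in range(min_gram, min(len(word), max_gram) + 1)]
        word = ""
    return tokens


index_terms = edge_ngrams("A12345")
body_json = json.dumps(INDEX_BODY)  # ready to send as the index-creation request body
```

Under this emulation a query such as `A12` is lowercased to the single term `a12`, which is one of the prefixes stored for `A12345`, so a partial plate number matches without a wildcard query.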

3) Starting a data retrieval service and listening for HTTP requests;

writing a back-end service interface that other programs can call; the interface listens for requests and returns the data results those programs require.

the result list is returned to the client along the call stack.

In the method, mass data is stored in the distributed column-oriented database HBase, and a secondary index is built for the data with the distributed full-text search engine ElasticSearch. When data is retrieved, HBase is not queried directly: the RowKeys of the data are first obtained through the secondary index, the data is then fetched from HBase by RowKey, and the results that meet the conditions are returned.

While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A method for rapidly searching mass data is characterized by comprising the following steps:

1) constructing a mass data storage system;

2) establishing a secondary index for the data in the mass data storage system;

3) starting a data retrieval service and listening for HTTP requests;

4) parsing a received HTTP request sent by a client to generate index retrieval conditions, sending an index retrieval request to the ElasticSearch index service, and obtaining a response result;

2. The method for rapidly retrieving mass data according to claim 1, wherein the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop; the service data to be retrieved is stored in HBase according to the respective service design, and a suitable RowKey is designed to serve as the unique identifier of each record; and the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance.

3. The method for rapidly retrieving mass data according to claim 1, wherein the step 2) specifically comprises:

21) adopting an ElasticSearch as a carrier of a secondary index of a mass data storage system, and designing an index field according to respective query conditions by index design; the index field comprises all query fields, and a RowKey field is added to represent the value of a corresponding unique primary key after all conditions are matched;

4. The method for rapidly retrieving mass data according to claim 1, wherein the step 3) specifically comprises: writing a back-end service interface that other programs can call, the interface listening for requests and returning the data results those programs require.

5. The method for rapidly retrieving mass data according to claim 1, wherein the step 4) specifically comprises:

6. The method for rapidly retrieving mass data according to claim 1, wherein the step 5) specifically comprises:

the result list is returned to the client along the call stack.