CN111221814A - Secondary index construction method, device and equipment - Google Patents

Secondary index construction method, device and equipment

Info

Publication number
CN111221814A
Authority
CN
China
Prior art keywords
secondary index
index table
data
primary key
key column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811426358.5A
Other languages
Chinese (zh)
Other versions
CN111221814B (en
Inventor
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811426358.5A priority Critical patent/CN111221814B/en
Publication of CN111221814A publication Critical patent/CN111221814A/en
Application granted granted Critical
Publication of CN111221814B publication Critical patent/CN111221814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method for constructing a secondary index, which comprises the following steps: reading original data in a service node through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table; and writing the data of the secondary index table into the secondary index table. By adopting the method, the problem of occupation of original service resources when the secondary index is constructed in the prior art is solved.

Description

Secondary index construction method, device and equipment
Technical Field
The application relates to the technical field of distributed databases, in particular to a method and a device for constructing a secondary index, electronic equipment and storage equipment.
Background
In a distributed database, each product usually provides its best query mode for a particular scenario. Take HBase (an open source distributed non-relational storage system) as an example: HBase is a NoSQL (non-relational) database that supports KV queries, including primary key (rowkey) queries, range (scan) queries, and full-table queries. The most common of these is the primary key query. The primary key must be determined in advance for each service and must be specified for both writes and queries, so the query mode is relatively limited. If a query is based on a non-primary key column, HBase can only perform a full-table scan, which consumes enormous resources and performs extremely poorly.
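The difference between a primary key query and a query on a non-primary key column can be illustrated with a minimal sketch. The dictionary below is a hypothetical in-memory stand-in for a table keyed by rowkey, not the HBase API; the function names and the sample rows are assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for a table keyed by rowkey (not the HBase API).
main_table = {
    "100": {"name": "Zhang San", "commodity": "Television"},
    "101": {"name": "Li Si", "commodity": "Refrigerator"},
    "110": {"name": "Wang Wu", "commodity": "Washing machine"},
}


def get_by_rowkey(table, rowkey):
    """Primary key query: a single keyed lookup."""
    return table.get(rowkey)


def find_by_column(table, column, value):
    """Query on a non-primary key column without an index: every row is examined."""
    rows_examined = 0
    matches = []
    for rowkey, row in table.items():
        rows_examined += 1
        if row.get(column) == value:
            matches.append(rowkey)
    return matches, rows_examined


matches, rows_examined = find_by_column(main_table, "name", "Li Si")
```

In this toy model the number of rows examined by `find_by_column` always equals the table size, which is the cost a secondary index is meant to remove.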
Therefore, a secondary index scheme for queries on non-primary key columns is needed, so that a secondary index can be established on a non-primary key column and queries on that column become fast. Given massive historical data, how to construct the secondary index quickly is the first problem to be solved.
In the prior art, Apache Phoenix is an open source SQL engine that provides SQL capability for HBase, supports establishing secondary indexes on HBase non-primary key columns, and supports building indexes for historical data. The scheme is as follows: when the index is created, if synchronous index construction is selected, multiple threads simultaneously read data from the main table (source data table), construct the index, and write the data into a secondary index table; if asynchronous index construction is selected, the index can be built later in batch by submitting MapReduce (MR) tasks.
Construction of the secondary index in the prior art has the following disadvantage: when the index is constructed synchronously, the multiple threads run inside the region service (RegionServer) and occupy its resources, such as its Handlers and memory, which affects normal read and write requests to other tables.
Disclosure of Invention
The application provides a method for constructing a secondary index, which aims to solve the problem that existing secondary index construction occupies the resources of the original service.
The application provides a construction method of a secondary index, which comprises the following steps:
reading original data in a service node through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
Optionally, the mapping task includes a local area service thread corresponding to a main table partition in a service node;
the reading of the original data in the service node by the mapping task includes: and reading the original data of the main table partition corresponding to the local area service thread in the service node through the local area service thread in the mapping task.
Optionally, the method further includes: carrying out snapshot operation on original data in the service node to generate snapshot data of the original data;
the reading of the original data in the service node by the mapping task includes: reading snapshot data of original data in the service node through the mapping task;
the selecting non-primary key columns in the original data as primary key columns of a secondary index table comprises: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
Optionally, the method further includes: acquiring indication information of a primary key column of a secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: according to the indication information of the primary key column of the secondary index table, searching a non-primary key column matched with the primary key column of the secondary index table from the original data, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
Optionally, the constructing data of the secondary index table according to the primary key column of the secondary index table includes:
the primary key column of the original data, together with the non-primary key columns of the original data other than the one selected as the primary key column of the secondary index table, are constructed as the non-primary key columns of the secondary index table.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
according to the data structure of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing;
and writing the data of the secondary index table after the aggregation and the sorting into an index table.
Optionally, the aggregating and sorting the data in the secondary index table according to the data structure of the secondary index table to obtain the aggregated and sorted data in the secondary index table includes:
acquiring characteristic requirement information of a primary key column of the secondary index table;
and according to the characteristic requirement information of the primary key column of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
and generating a file comprising the data of the secondary index table through a summary task, and loading the file into the secondary index table.
Optionally, the number of the summary tasks is the same as the number of partitions of the secondary index table.
Optionally, the summary task includes an index file write thread;
generating a file including data of the secondary index table through a summary task, and loading the file into the secondary index table, including: and generating a file comprising the data of the secondary index table through an index file writing thread in the summarizing task, and loading the file into the secondary index table.
Optionally, the mapping task is a mapping task running in a service node.
Optionally, the secondary index table is a secondary index table in a non-relational database.
The present application further provides a device for constructing a secondary index, including:
the original data reading unit is used for reading original data in the service node through the mapping task;
the secondary index table data construction unit is used for selecting a non-primary key column in the original data as a primary key column of a secondary index table and constructing data of the secondary index table according to the primary key column of the secondary index table;
and the index table data writing unit is used for writing the data of the secondary index table into the secondary index table.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program of a construction method of a secondary index, the apparatus performing the following steps after being powered on and running the program of the construction method of the secondary index by the processor:
reading original data in the service node through the mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
The present application also provides a storage device storing a program of a method for constructing a secondary index, the program being executed by a processor and performing the steps of:
reading original data in the service node through the mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
Compared with the prior art, the method has the following advantages:
the application provides a method and a device for constructing a secondary index, electronic equipment and storage equipment, wherein original data in a service node is read through a mapping task; the problem of occupation of original regional service resources when a secondary index is constructed is solved.
In the preferred scheme of the application, the snapshot data of the original data is generated by performing snapshot operation on the original data in the service node, and the original data in the service node is solidified, so that the interference of real-time data writing is avoided in the index construction process.
In the preferred scheme of the application, the files comprising the data of the secondary index table are generated through the summary task and then loaded into the secondary index table, so that the situation that the data of the secondary index table is directly written into the secondary index table across networks is avoided, and the interference on the service node where the secondary index table is located is reduced.
Drawings
Fig. 1 is a flowchart of a method for constructing a secondary index according to a first embodiment of the present application.
Fig. 2 is a flowchart of an example of secondary index construction provided in the first embodiment of the present application.
Fig. 3 is a schematic diagram of a scenario of secondary index construction provided in the first embodiment of the present application.
Fig. 4 is a schematic diagram of a device for constructing a secondary index according to a second embodiment of the present application.
Fig. 5 is a schematic diagram of an electronic device according to a third embodiment of the present application.
Fig. 6 is a flowchart of an offline construction method of a secondary index according to a fifth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
A first embodiment of the present application provides a method for constructing a secondary index, which is described in detail below with reference to fig. 1, fig. 2, and fig. 3.
As shown in fig. 1, in step S101, original data in a service node is read by a mapping task.
The service node refers to a physical node storing original data. The service nodes include regional service nodes (RegionServer nodes), which may be one or more, and when there are a plurality of regional service nodes, each regional service node may store a part of the original data.
The original data in the service node includes the original data of a distributed database.
The distributed database may be a NoSQL (non-relational) database, for example HBase (an open source distributed NoSQL storage system).
The mapping task may be a mapping task running in a service node, and may include a local area service thread corresponding to a main table partition in the service node. The mapping task may be a Map task in MapReduce. MapReduce is an offline execution engine of Hadoop that supports divide-and-conquer processing of massive data. The main table is the primary index table of the database; as shown in fig. 3, the table with the user id as its primary key is the main table. The main table includes at least one main table partition; because the data size of the main table is usually large, the main table may be divided into a plurality of main table partitions.
The reading of the original data in the service node by the mapping task includes: and reading the original data of the main table partition corresponding to the local area service thread in the service node through the local area service thread in the mapping task.
If a main table includes multiple main table partitions, different main table partitions may be stored in different service nodes. As shown in fig. 2, the main table is distributed over 3 regional service nodes, and each regional service node stores a different main table partition. Each main table partition corresponds to one mapping task, and each mapping task includes one local area service thread. In this case, each local area service thread corresponds to the main table partition on one regional service node and reads the original data of that main table partition.
If one main table comprises a plurality of main table partitions, the main table partitions can also be stored on the same service node, one mapping task comprises a plurality of local area service threads, each local area service thread corresponds to one main table partition in the service node, and the local area service threads read the original data of the corresponding main table partition.
The application starts a mapping task for each service node and reads the data of that service node with a local area service thread, which avoids reading the source data across multiple layers. Because the local area service thread (LocalRegionServer) has the same data reading capability as the original RegionServer, it can largely reuse the read path of the RegionServer, which avoids reimplementing the read path for HFile.
Further, in order to avoid interference from real-time data writes, before the step of reading the original data in the service node by the mapping task, a snapshot (Snapshot) operation may be performed on the original data of the service node to generate snapshot data of the original data. A Snapshot is a point-in-time view of the data; it does not involve copying the data, which makes it an efficient data backup scheme. As shown in fig. 2, a Snapshot operation is performed on the main table partition of each service node storing the main table, and a main table Snapshot (snapshot data of the main table partition) is generated. After the main table Snapshot is generated, the original data of the main table partition is fixed, so real-time data writes cannot interfere with the construction of the secondary index.
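Why a snapshot can fix the data without copying it can be sketched as follows. The model is an assumption made for illustration: a partition is a list of immutable segments (a loose analogy to immutable data files), and a snapshot only records references to the segments that exist at that moment. `PartitionStore` and `scan_snapshot` are hypothetical names, not HBase classes.

```python
class PartitionStore:
    """Toy partition made of immutable segments (a loose analogy to immutable data files)."""

    def __init__(self):
        self._segments = []  # each segment is an immutable tuple of (rowkey, row) pairs

    def write(self, rows):
        # New data always goes into a new segment; existing segments are never modified.
        self._segments.append(tuple(rows))

    def snapshot(self):
        # Record references to the current segments; no row data is copied.
        return tuple(self._segments)

    def scan(self):
        for segment in self._segments:
            yield from segment


def scan_snapshot(snapshot):
    """Read exactly the rows that existed when the snapshot was taken."""
    for segment in snapshot:
        yield from segment


store = PartitionStore()
store.write([("100", {"name": "Zhang San"}), ("101", {"name": "Li Si"})])
snap = store.snapshot()
store.write([("110", {"name": "Wang Wu"})])  # real-time write after the snapshot

snapshot_keys = [rowkey for rowkey, _ in scan_snapshot(snap)]
live_keys = [rowkey for rowkey, _ in store.scan()]
```

The index builder reads `snap` while the service keeps accepting writes; the later write is visible in the live scan but not in the snapshot scan.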
The reading of the original data in the service node by the mapping task includes: and reading snapshot data of the original data in the service node through the mapping task.
After generating the snapshot data of the original data, the local area service thread may read the snapshot data of the original data from the corresponding service node. The corresponding service node may refer to a physical node running a local area service thread. As shown in fig. 2, the RegionServer1, the RegionServer2, and the RegionServer3 correspond to one local region service thread, respectively, and each local region service thread reads the primary table partition snapshot data from the corresponding region service node.
It should be noted that, because there are usually multiple main table partitions, each main table partition corresponds to one local area service thread, and multiple local area service threads can concurrently read data of each main table partition, the throughput of data reading is ensured, and the time for constructing the secondary index is reduced.
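The concurrent reading described above can be sketched with one worker per main table partition. This is a simplified illustration using Python threads, not the MapReduce or HBase implementation; the `partitions` dictionary and the function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical main table partitions, keyed by the node that stores them.
partitions = {
    "RegionServer1": [("100", "Zhang San")],
    "RegionServer2": [("101", "Li Si")],
    "RegionServer3": [("110", "Wang Wu")],
}


def read_partition(node):
    """Stand-in for one local reader thread scanning its own partition."""
    return list(partitions[node])


def read_all(nodes):
    # One worker per partition, so the partitions are read concurrently.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = pool.map(read_partition, nodes)  # results keep the order of `nodes`
        return dict(zip(nodes, results))


rows_by_node = read_all(sorted(partitions))
```

Because each worker touches only its own partition, no coordination between readers is needed in this sketch.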
As shown in fig. 1, in step S102, a non-primary key column in the original data is selected as a primary key column of a secondary index table, and data of the secondary index table is constructed according to the primary key column of the secondary index table.
The secondary index table may refer to an index table constructed according to a metadata format of the secondary index. The secondary index can improve the efficiency of looking up, inserting and deleting data.
The data of the secondary index table may refer to index data stored in the secondary index table. In fig. 3, the data with "name" as the primary key column is the data of the constructed secondary index table, wherein the primary key column name of the secondary index table is "name", and the non-primary key column names are "user id" and "commodity".
In specific implementation, any non-primary key column in the original data can be used as the primary key column of the secondary index table, and the remaining columns are used as non-primary key columns of the secondary index table. Alternatively, indication information of the primary key column of the secondary index table can be obtained; according to this indication information, the non-primary key column matching the primary key column of the secondary index table is searched for in the original data and is determined as the primary key column of the secondary index table. For example, as shown in fig. 3, if the primary key column name in the indication information is "name", the "name" column is taken as the primary key column.
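The two ways of choosing the primary key column of the secondary index table can be sketched as a small helper. The function name, its parameters, and the rule of taking the first candidate when no indication is given are assumptions made for illustration.

```python
def select_index_key_column(columns, main_primary_key, indicated_column=None):
    """Return the main table column that becomes the index table's primary key."""
    candidates = [c for c in columns if c != main_primary_key]
    if indicated_column is None:
        if not candidates:
            raise ValueError("no non-primary key column available")
        # No indication given: any non-primary key column may be chosen; take the first.
        return candidates[0]
    if indicated_column not in candidates:
        raise ValueError(f"{indicated_column!r} is not a non-primary key column")
    return indicated_column


columns = ["user id", "name", "commodity"]
chosen = select_index_key_column(columns, "user id", indicated_column="name")
```

Rejecting an indicated column that is missing or is already the main table's primary key keeps the index definition consistent with the main table schema.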
When snapshot data of original data in a service node is read through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table, including: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
The data of the secondary index table constructed according to the primary key column of the secondary index table comprises the following steps:
the primary key column of the original data, together with the non-primary key columns of the original data other than the one selected as the primary key column of the secondary index table, are constructed as the non-primary key columns of the secondary index table. For example, as shown in fig. 3, if the primary key column name in the indication information of the primary key column of the secondary index table is "name", the "user id" column and the "commodity" column are taken as non-primary key columns.
As shown in FIG. 3, the raw data in the primary table partition is shown in Table 1:
user id    name         commodity
100        Zhang San    Television
101        Li Si        Refrigerator
110        Wang Wu      Washing machine
TABLE 1
After step S102, the generated data of the secondary index table is shown in Table 2:
name         user id    commodity
Zhang San    100        Television
Li Si        101        Refrigerator
Wang Wu      110        Washing machine
TABLE 2
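The transformation from Table 1 to Table 2 can be sketched as a per-row reordering. This is an illustrative sketch, not the implementation of the application; `build_index_row` is a hypothetical name, the names in the rows are romanized here, and each row is modeled as a plain dictionary.

```python
def build_index_row(row, index_key_column):
    """Turn one main table row into (index primary key, remaining columns)."""
    index_key = row[index_key_column]
    # Everything else, including the main table's primary key, becomes a non-key column.
    other_columns = {k: v for k, v in row.items() if k != index_key_column}
    return index_key, other_columns


table_1 = [
    {"user id": "100", "name": "Zhang San", "commodity": "Television"},
    {"user id": "101", "name": "Li Si", "commodity": "Refrigerator"},
    {"user id": "110", "name": "Wang Wu", "commodity": "Washing machine"},
]

table_2 = [build_index_row(row, "name") for row in table_1]
```

Keeping the main table's primary key ("user id") among the non-key columns is what lets a lookup by name lead back to the original row.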
As shown in fig. 1, in step S103, the data of the secondary index table is written into the secondary index table.
After the mapping task constructs the data of the secondary index table according to the read original data in the service node, the mapping task can directly write the data of the secondary index table into the index table.
Because the secondary index table may include a plurality of secondary index table partitions, in order to ensure that data belonging to the same interval is written into the same secondary index table partition, the data of the secondary index table may first be aggregated and sorted.
The writing the data of the secondary index table into the secondary index table includes:
according to the data structure of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing;
and writing the data of the secondary index table after the aggregation and the sorting into an index table.
The data structure of the secondary index table may refer to a metadata format of a directory in the secondary index table. For example, the column order in the data structure of the secondary index table in fig. 3 is: primary key column: name; non-primary key columns: user id, commodity.
For example, if the primary key of the secondary index table is numeric with values ranging from 1 to 100, the data of the secondary index table whose primary key falls in the interval 1-50 can be aggregated through a shuffle operation to obtain the aggregated and sorted data of one partition of the secondary index table, and the data whose primary key falls in the interval 51-100 can be aggregated through a shuffle operation to obtain the aggregated and sorted data of the other partition. The data of the two partitions are then written into the corresponding secondary index table partitions, so that data belonging to the same interval is written into the same secondary index table partition.
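The numeric example can be sketched as range partitioning followed by a sort within each partition. The split point and the sample rows below are assumptions chosen to match the 1-50 and 51-100 intervals; the real shuffle is performed by MapReduce, not by this code.

```python
from bisect import bisect_left
from collections import defaultdict

# Inclusive upper bound of every partition except the last:
# keys 1-50 go to partition 0 and keys 51-100 go to partition 1.
SPLIT_POINTS = [50]


def partition_for(key):
    """Map a numeric index key to the partition whose interval contains it."""
    return bisect_left(SPLIT_POINTS, key)


def shuffle(index_rows):
    """Group index rows by partition, then sort each group by key."""
    groups = defaultdict(list)
    for key, value in index_rows:
        groups[partition_for(key)].append((key, value))
    return {p: sorted(rows, key=lambda r: r[0]) for p, rows in groups.items()}


shuffled = shuffle([(73, "c"), (12, "a"), (50, "b"), (51, "d")])
```

Adding more split points to `SPLIT_POINTS` yields more partitions without changing the rest of the sketch.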
For another example, if the primary key of the secondary index table consists of Chinese characters, the aggregation can be performed according to the pinyin of the characters. For example, the data of the secondary index table whose primary key (a name) has a pinyin initial in the a-h interval can be aggregated to obtain the aggregated and sorted data of one partition of the secondary index table, and the data whose primary key has a pinyin initial in the i-z interval can be aggregated to obtain the aggregated and sorted data of the other partition; the data of the two partitions are then written into the corresponding secondary index table partitions. When the primary key of the secondary index table consists of Chinese characters, aggregation and sorting can also be performed according to the strokes of the characters instead of the pinyin initial.
As shown in fig. 3, the records whose primary key (name) has a pinyin initial in the a-w interval are aggregated together, that is, the records in the original data with the names "Chen Er", "Li Si", "Liu Yi", "Sun Qi", "Wang Wu" and "Wu Jiu" are aggregated in one secondary index table partition; the records whose primary key has a pinyin initial in the x-z interval are aggregated together, that is, the records with the names "Zhang San", "Zhao Liu", "Zheng Shi" and "Zhou Ba" are aggregated in the other secondary index table partition.
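The grouping by pinyin initial can be sketched as follows. The ten names are written in romanized form, which is an assumption about the translated names in the example, and the first letter of the romanized name stands in for the pinyin initial; converting Chinese characters to pinyin is outside the scope of this sketch.

```python
# Romanized names (an assumption); a-w and x-z are the two intervals of the example.
NAMES = ["Liu Yi", "Chen Er", "Zhang San", "Li Si", "Wang Wu",
         "Zhao Liu", "Sun Qi", "Zhou Ba", "Wu Jiu", "Zheng Shi"]

INTERVALS = [("a", "w"), ("x", "z")]


def partition_for(name):
    """Return the index of the interval that contains the name's initial."""
    initial = name[0].lower()
    for index, (low, high) in enumerate(INTERVALS):
        if low <= initial <= high:
            return index
    raise ValueError(f"no interval covers initial {initial!r}")


index_partitions = [[] for _ in INTERVALS]
for name in NAMES:
    index_partitions[partition_for(name)].append(name)

# Sort inside each partition so that every partition is ordered by its key.
index_partitions = [sorted(names) for names in index_partitions]
```

Changing `INTERVALS` to `[("a", "h"), ("i", "z")]` gives the other split mentioned in the text.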
The aggregating and sorting the data of the secondary index table according to the data structure of the secondary index table to obtain the aggregated and sorted data of the secondary index table includes:
acquiring characteristic requirement information of a primary key column of the secondary index table;
and according to the characteristic requirement information of the primary key column of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing.
The characteristic requirement information of the primary key column of the secondary index table may refer to information according to which the data of the secondary index table is aggregated and sorted according to the characteristic of the primary key column. For example, if the primary key column is a name, the feature requirement information of the primary key column of the secondary index table may be: gathering the records of the initials of the name pinyin of the primary key in the a-w interval; and gathering the records of the first letter of the name pinyin of the primary key in the x-z interval.
Further, in order to avoid the problem that the resources of the service node where the secondary index table is located are occupied when the data of the secondary index table is directly written into the secondary index table, the data of the secondary index table may be first sent to a summary task, a file including the data of the secondary index table is generated through the summary task, and then the file is loaded into the secondary index table.
It should be noted that the number of the summary tasks is determined by the number of the partitions of the secondary index table, the number of the summary tasks may be the same as the number of the partitions of the secondary index table, the number of the partitions of the secondary index table may be set according to a requirement, and the secondary index table includes at least one partition of the secondary index table. When the secondary index table comprises two or more secondary index table partitions, each secondary index table partition corresponds to one summary task. As shown in fig. 2, the secondary index table includes two secondary index table partitions, and each secondary index table partition corresponds to one summary task.
The summary task may be a ReduceTask in the MapReduce module and may include an index file write thread (e.g., IndexHFileWriter). Generating a file including the data of the secondary index table through the summary task and loading the file into the secondary index table includes: generating the file through the index file write thread in the summary task, and loading the file into the secondary index table.
If the non-relational database is HBase, the file including the data of the secondary index table is an HFile of the secondary index table. Generating a file including the data of the secondary index table by the summary task may refer to writing the constructed data of the secondary index table into the HFile of the secondary index table through the IndexHFileWriter thread. Because the ReduceTask writes the constructed index data into the file of the secondary index table through the IndexHFileWriter, writing the secondary index table through the HBase native API is avoided, along with the excessively deep call stack that the native API would produce, which reduces the interference on the original service node. At the same time, the data of the secondary index table is not written directly into the secondary index table across the network, which reduces the interference on the service node where the secondary index table is located.
After the file including the data of the secondary index table is generated by the summarization task, since the previous steps are performed offline, the file including the data of the secondary index table needs to be loaded into the corresponding index table partition.
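The idea of generating a file offline and then loading it, instead of writing rows one by one, can be sketched as below. This is only an analogy: the file here is JSON lines rather than an HFile, and `write_index_file` and `IndexTablePartition` are hypothetical names, not HBase classes or APIs.

```python
import json
import os
import tempfile


def write_index_file(directory, partition_id, index_rows):
    """Summary task stand-in: write one partition's rows, sorted by key, to a file."""
    path = os.path.join(directory, f"index-partition-{partition_id}.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in sorted(index_rows, key=lambda r: r[0]):
            handle.write(json.dumps({"key": key, "value": value}) + "\n")
    return path


class IndexTablePartition:
    """Toy index partition that serves reads from files handed to it."""

    def __init__(self):
        self.files = []

    def load(self, path):
        # Loading only registers the finished file; no row-by-row writes happen here.
        self.files.append(path)

    def scan(self):
        for path in self.files:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    record = json.loads(line)
                    yield record["key"], record["value"]


with tempfile.TemporaryDirectory() as workdir:
    partition = IndexTablePartition()
    partition.load(write_index_file(workdir, 0, [("Wang Wu", "110"), ("Li Si", "101")]))
    loaded = list(partition.scan())
```

In this sketch all sorting and file writing happen before `load` is called, so the partition that serves reads does no per-row work during the build.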
An application scenario of the first embodiment of the present application is described below with reference to figs. 2 and 3. The main table includes three main table partitions, which are stored on three regional service nodes: the RegionServer1 node, the RegionServer2 node, and the RegionServer3 node. The main table uses "user id" as its primary key, and the secondary index table uses "name" as its primary key. The three regional service nodes respectively start three independent local regional service threads (local area service servers) to read the snapshot data of the main table partition on the corresponding regional service node. The data is then aggregated and sorted in the shuffle stage and passed to two IndexFileWriter tasks, which generate HFile files containing the data of the secondary index table. Finally, the HFile files are loaded into the corresponding index table partitions (that is, the secondary index table partitions), in which the data uses "name" as the primary key.
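The scenario can be traced end to end with a small in-memory simulation. The sketch below is illustrative only: the sample rows, the "city" column, the split key "m", and the function names are assumptions, and plain Python lists stand in for the map, shuffle, and writer stages.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical main table: three partitions keyed by user id.
main_partitions = [
    {"u1": {"name": "carol", "city": "Hangzhou"}},
    {"u2": {"name": "alice", "city": "Beijing"}},
    {"u3": {"name": "peter", "city": "Shanghai"}},
]
index_split_keys = ["m"]  # two secondary index table partitions


def map_partition(partition, index_column):
    """Map side: emit (index key, index row columns) for each main table row."""
    for user_id, columns in partition.items():
        rest = {c: v for c, v in columns.items() if c != index_column}
        yield columns[index_column], {"user_id": user_id, **rest}


def shuffle(mapped_rows, split_keys):
    """Group rows by target index partition and sort each group by key."""
    buckets = defaultdict(list)
    for key, columns in mapped_rows:
        buckets[bisect_right(split_keys, key)].append((key, columns))
    return {p: sorted(rows, key=lambda r: r[0]) for p, rows in buckets.items()}


mapped = [row for part in main_partitions
          for row in map_partition(part, "name")]
index_partitions = shuffle(mapped, index_split_keys)
```

Each entry of `index_partitions` corresponds to the sorted input that one writer task would turn into a file for its index partition.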
This completes the detailed description of an implementation of the secondary index construction method provided in the first embodiment of the present application. In the first embodiment, a snapshot operation is performed on the main table partition of each service node storing the main table, which freezes the original data of the main table partition, so real-time data writes cannot interfere with the index construction process. Running a mapping service on each service node avoids occupying the resources of the region service (RegionServer), and multiple local area service threads can read the partition data of their service nodes concurrently, which preserves read throughput and shortens the time needed to construct the secondary index. The summary task generates the file including the data of the secondary index table through the index file writing thread, so the index table is not written through the HBase native API; this avoids the excessively deep call stack produced by the native API and reduces interference with the original service node.
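The isolation that the snapshot provides can be illustrated with a toy in-memory partition. This is a sketch under stated assumptions: the class `MainTablePartition` and its methods are hypothetical, and the snapshot is modelled as a deep copy, whereas a real snapshot mechanism would normally avoid copying the data.

```python
import copy


class MainTablePartition:
    """Toy in-memory main table partition that keeps accepting live writes."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def put(self, key, columns):
        """Apply a real-time write to the live data."""
        self._rows[key] = columns

    def snapshot(self):
        """Return a frozen copy of the rows as they are right now."""
        return copy.deepcopy(self._rows)


partition = MainTablePartition({"u1": {"name": "bob"}})
snap = partition.snapshot()

# Real-time writes that arrive while the index is being built.
partition.put("u2", {"name": "alice"})
partition.put("u1", {"name": "robert"})

# The index build reads only the snapshot, so it sees neither write.
index_rows = sorted((cols["name"], uid) for uid, cols in snap.items())
```

Writes that arrive after the snapshot are not reflected in the files produced by this build, so the build works against one consistent view of the partition.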
Corresponding to the method for constructing the secondary index provided in the first embodiment of the present application, a second embodiment of the present application also provides a device for constructing the secondary index.
As shown in fig. 4, the apparatus for constructing the secondary index includes:
an original data reading unit 401, configured to read original data in a service node through a mapping task;
a secondary index table data constructing unit 402, configured to select a non-primary key column in the original data as a primary key column of a secondary index table, and construct data of the secondary index table according to the primary key column of the secondary index table;
an index table data writing unit 403, configured to write data of the secondary index table into the secondary index table.
Optionally, the mapping task includes a local area service thread corresponding to a main table partition in a service node;
the original data reading unit is specifically configured to: read, through the local area service thread in the mapping task, the original data of the main table partition corresponding to that local area service thread in the service node.
Optionally, the apparatus further comprises: the original data snapshot unit is used for performing snapshot operation on original data in the service node to generate snapshot data of the original data;
the original data reading unit is specifically configured to: reading snapshot data of original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: selecting a non-primary key column in the snapshot data of the original data as the primary key column of the secondary index table.
Optionally, the apparatus further comprises: a primary key column indication information obtaining unit for obtaining indication information of a primary key column of the secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: according to the indication information of the primary key column of the secondary index table, searching a non-primary key column matched with the primary key column of the secondary index table from the original data, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
Optionally, the secondary index table data constructing unit is specifically configured to:
construct the primary key column of the original data, together with the non-primary key columns of the original data other than the column serving as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
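A minimal sketch of this row construction is given below, assuming each main table row is available as a primary key value plus a dictionary of non-primary key columns. The function name `build_index_row`, its parameters, and the sample columns are hypothetical.

```python
def build_index_row(main_key_column, main_key, columns, index_column):
    """Turn one main table row into one secondary index table row.

    The value of index_column becomes the primary key of the index row.
    The main table primary key and all remaining columns become
    non-primary key columns of the index row.
    """
    if index_column not in columns:
        raise KeyError(f"column {index_column!r} not found in the original data")
    index_key = columns[index_column]
    index_columns = {main_key_column: main_key}
    index_columns.update(
        (c, v) for c, v in columns.items() if c != index_column
    )
    return index_key, index_columns


row = build_index_row("user_id", "u1", {"name": "bob", "age": 30}, "name")
```

Keeping the main table primary key as a column of the index row is what allows a lookup by the indexed column to lead back to the original row.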
Optionally, the index table data writing unit is specifically configured to:
according to the data structure of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing;
and writing the data of the secondary index table after the aggregation and the sorting into an index table.
Optionally, the aggregating and sorting the data in the secondary index table according to the data structure of the secondary index table to obtain the aggregated and sorted data in the secondary index table includes:
acquiring characteristic requirement information of a primary key column of the secondary index table;
and according to the characteristic requirement information of the primary key column of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing.
Optionally, the index table data writing unit is specifically configured to:
and generating a file comprising the data of the secondary index table through a summary task, and loading the file into the secondary index table.
Optionally, the number of the summary tasks is the same as the number of partitions of the secondary index table.
Optionally, the summary task includes an index file write thread;
the generating a file including the data of the secondary index table through a summary task, and loading the file into the secondary index table, includes: generating the file including the data of the secondary index table through the index file writing thread in the summary task, and loading the file into the secondary index table.
Optionally, the mapping task is a mapping task running in a service node.
Optionally, the secondary index table is a secondary index table in a non-relational database.
It should be noted that, for the detailed description of the device for constructing the secondary index provided in the second embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not repeated here.
Corresponding to the method for constructing the secondary index provided in the first embodiment of the present application, a third embodiment of the present application also provides an electronic device.
As shown in fig. 5, the electronic device includes:
a processor 501; and
a memory 502 for storing a program of a construction method of a secondary index, wherein after the apparatus is powered on and the program of the construction method of the secondary index is executed by the processor, the following steps are performed:
reading original data in the service node through the mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
Optionally, the mapping task includes a local area service thread corresponding to a main table partition in a service node;
the reading of the original data in the service node by the mapping task includes: reading, through the local area service thread in the mapping task, the original data of the main table partition corresponding to that local area service thread in the service node.
Optionally, the electronic device further performs the following steps: carrying out snapshot operation on original data in the service node to generate snapshot data of the original data;
the reading of the original data in the service node by the mapping task includes: reading snapshot data of original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: selecting a non-primary key column in the snapshot data of the original data as the primary key column of the secondary index table.
Optionally, the electronic device further performs the following steps: acquiring indication information of a primary key column of a secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: according to the indication information of the primary key column of the secondary index table, searching a non-primary key column matched with the primary key column of the secondary index table from the original data, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
Optionally, the constructing data of the secondary index table according to the primary key column of the secondary index table includes:
constructing the primary key column of the original data, together with the non-primary key columns of the original data other than the column serving as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
according to the data structure of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing;
and writing the data of the secondary index table after the aggregation and the sorting into an index table.
Optionally, the aggregating and sorting the data in the secondary index table according to the data structure of the secondary index table to obtain the aggregated and sorted data in the secondary index table includes:
acquiring characteristic requirement information of a primary key column of the secondary index table;
and according to the characteristic requirement information of the primary key column of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
and generating a file comprising the data of the secondary index table through a summary task, and loading the file into the secondary index table.
Optionally, the number of the summary tasks is the same as the number of partitions of the secondary index table.
Optionally, the summary task includes an index file write thread;
the generating a file including the data of the secondary index table through a summary task, and loading the file into the secondary index table, includes: generating the file including the data of the secondary index table through the index file writing thread in the summary task, and loading the file into the secondary index table.
Optionally, the mapping task is a mapping task running in a service node.
Optionally, the secondary index table is a secondary index table in a non-relational database.
It should be noted that, for the detailed description of the electronic device provided in the third embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not repeated here.
Corresponding to the method for constructing the secondary index provided in the first embodiment of the present application, a fourth embodiment of the present application further provides a storage device, in which a program of the method for constructing the secondary index is stored, and the program is executed by a processor to perform the following steps:
reading original data in the service node through the mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
It should be noted that, for the detailed description of the storage device provided in the fourth embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, and details are not described here again.
A fifth embodiment of the present application provides an offline construction method of a secondary index, which is described below with reference to fig. 6.
As shown in fig. 6, in step S601, the original offline data in the service node is read by the mapping task.
The original offline data in the service node includes original offline data of a distributed database.
The distributed database includes NoSQL (non-relational) databases, for example HBase (an open-source distributed NoSQL storage system).
The mapping task may be a mapping task running in a service node, and may include a local area service thread corresponding to a main table partition in the service node. The mapping task may be Map task in MapReduce.
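One way to picture a mapping task that uses one local thread per main table partition is the sketch below. It is an assumption-laden illustration: Python's standard `ThreadPoolExecutor` stands in for the local area service threads, the partitions are in-memory dictionaries, and the names `read_partition` and `run_mapping_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_partition(partition):
    """Read every row of one local main table partition, ordered by key."""
    return sorted(partition.items())


def run_mapping_task(local_partitions):
    """Read all partitions stored on this node, one worker thread each."""
    workers = max(1, len(local_partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input partitions.
        return list(pool.map(read_partition, local_partitions))


partitions_on_node = [
    {"u1": {"name": "bob"}},
    {"u2": {"name": "alice"}, "u0": {"name": "eve"}},
]
results = run_mapping_task(partitions_on_node)
```

Because the threads run on the node that already stores the partitions, the reads stay local to that node in this model.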
As shown in fig. 6, in step S602, a non-primary key column in the original offline data is selected as a primary key column of a secondary index table, and the offline data of the secondary index table is constructed according to the primary key column of the secondary index table.
The implementation manner of this step is similar to that of step S102 in the first embodiment of the present application, and only the data in the secondary index table in step S102 needs to be changed into the offline data in the secondary index table, which is not described herein again.
As shown in fig. 6, in step S603, the offline data of the secondary index table is written into the secondary index table.
The writing the offline data of the secondary index table into the secondary index table includes:
and generating a file comprising the offline data of the secondary index table through a summary task, and loading the file into the secondary index table.
It should be noted that the number of summary tasks is determined by the number of partitions of the secondary index table and may be equal to it. The number of partitions of the secondary index table can be set as required, and the secondary index table includes at least one secondary index table partition. When the secondary index table includes two or more secondary index table partitions, each secondary index table partition corresponds to one summary task. As shown in fig. 2, the secondary index table includes two secondary index table partitions, each of which corresponds to one summary task.
The summary task may be a ReduceTask in the MapReduce module, which may include an index file writing thread (e.g., IndexHFileWriter). The generating a file including the offline data of the secondary index table through the summary task, and loading the file into the secondary index table, includes: generating the file including the offline data of the secondary index table through the index file writing thread in the summary task, and loading the file into the secondary index table.
According to the fifth embodiment of the present application, the offline data of the secondary index table can be sent to the summary task, a file including the offline data of the secondary index table is generated by the summary task, and the file is then loaded into the secondary index table. This realizes offline construction of the secondary index and avoids occupying the resources of the service node where the secondary index table is located, as would happen if the data of the secondary index table were written into the secondary index table online.
Although the present application has been described with reference to the preferred embodiments, these embodiments are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; the scope of protection of the present application should therefore be determined by the claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transient media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (16)

1. A construction method of a secondary index is characterized by comprising the following steps:
reading original data in the service node through the mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
2. The method of claim 1, wherein the mapping task comprises a local area service thread corresponding to a primary table partition in a service node;
the reading of the original data in the service node by the mapping task includes: reading, through the local area service thread in the mapping task, the original data of the main table partition corresponding to that local area service thread in the service node.
3. The method of claim 1, further comprising: carrying out snapshot operation on original data in the service node to generate snapshot data of the original data;
the reading of the original data in the service node by the mapping task includes: reading snapshot data of original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: selecting a non-primary key column in the snapshot data of the original data as the primary key column of the secondary index table.
4. The method of claim 1, further comprising: acquiring indication information of a primary key column of a secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: according to the indication information of the primary key column of the secondary index table, searching a non-primary key column matched with the primary key column of the secondary index table from the original data, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
5. The method of claim 1, wherein constructing data of a secondary index table from primary key columns of the secondary index table comprises:
constructing the primary key column of the original data, together with the non-primary key columns of the original data other than the column serving as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
6. The method of claim 1, wherein writing the data of the secondary index table into the secondary index table comprises:
according to the data structure of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing;
and writing the data of the secondary index table after the aggregation and the sorting into an index table.
7. The method according to claim 6, wherein the aggregating and sorting the data in the secondary index table according to the data structure of the secondary index table to obtain the aggregated and sorted data in the secondary index table comprises:
acquiring characteristic requirement information of a primary key column of the secondary index table;
and according to the characteristic requirement information of the primary key column of the secondary index table, carrying out aggregation and sorting processing on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sorting processing.
8. The method of claim 1, wherein writing the data of the secondary index table into the secondary index table comprises:
and generating a file comprising the data of the secondary index table through a summary task, and loading the file into the secondary index table.
9. The method of claim 8, wherein the number of summary tasks is the same as the number of partitions of the secondary index table.
10. The method of claim 8 or 9, wherein the summary task comprises an index file write thread;
the generating a file including the data of the secondary index table through a summary task, and loading the file into the secondary index table, includes: generating the file including the data of the secondary index table through the index file writing thread in the summary task, and loading the file into the secondary index table.
11. The method of claim 1, wherein the mapping task is a mapping task running in a service node.
12. The method of claim 1, wherein the secondary index table is a secondary index table in a non-relational database.
13. An apparatus for constructing a secondary index, comprising:
the original data reading unit is used for reading original data in the service node through the mapping task;
the secondary index table data construction unit is used for selecting a non-primary key column in the original data as a primary key column of a secondary index table and constructing data of the secondary index table according to the primary key column of the secondary index table;
and the index table data writing unit is used for writing the data of the secondary index table into the secondary index table.
14. An electronic device, comprising:
a processor; and
a memory for storing a program of a construction method of a secondary index, the apparatus performing the following steps after being powered on and running the program of the construction method of the secondary index by the processor:
reading original data in the service node through the mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
15. A storage device, characterized in that
the storage device stores a program of a construction method of a secondary index, the program being executed by a processor to perform the following steps:
reading original data in a service node through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
16. An off-line construction method of a secondary index is characterized by comprising the following steps:
reading original offline data in the service node through the mapping task;
selecting a non-primary key column in the original offline data as a primary key column of a secondary index table, and constructing the offline data of the secondary index table according to the primary key column of the secondary index table;
and writing the offline data of the secondary index table into the secondary index table.
CN201811426358.5A 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index Active CN111221814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811426358.5A CN111221814B (en) 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811426358.5A CN111221814B (en) 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index

Publications (2)

Publication Number Publication Date
CN111221814A true CN111221814A (en) 2020-06-02
CN111221814B CN111221814B (en) 2023-06-27

Family

ID=70809354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811426358.5A Active CN111221814B (en) 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index

Country Status (1)

Country Link
CN (1) CN111221814B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377786A (en) * 2021-08-16 2021-09-10 北京易鲸捷信息技术有限公司 Method for realizing on-line index creation
CN113868251A (en) * 2021-09-24 2021-12-31 北京百度网讯科技有限公司 Global secondary indexing method and device for distributed database

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060106860A1 (en) * 2004-11-17 2006-05-18 Bmc Software, Inc. Method and apparatus for building index of source data
CN101853288A (en) * 2010-05-19 2010-10-06 马晓普 Configurable full-text retrieval service system based on document real-time monitoring
WO2011011996A1 (en) * 2009-07-31 2011-02-03 中兴通讯股份有限公司 Method for backing up terminal data and system thereof
CN102193917A (en) * 2010-03-01 2011-09-21 中国移动通信集团公司 Method and device for processing and querying data
CN103020204A (en) * 2012-12-05 2013-04-03 北京普泽天玑数据技术有限公司 Method and system for carrying out multi-dimensional regional inquiry on distribution type sequence table
CN103324762A (en) * 2013-07-17 2013-09-25 陆嘉恒 Hadoop-based index creation method and indexing method thereof
CN106326239A (en) * 2015-06-18 2017-01-11 阿里巴巴集团控股有限公司 Distributed file system and file meta-information management method thereof
CN107783974A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Data handling system and method
CN108228819A (en) * 2017-12-29 2018-06-29 武汉长江仪器自动化研究所有限公司 Methods For The Prediction Ofthe Deformation of A Large Dam based on big data platform

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377786A (en) * 2021-08-16 2021-09-10 北京易鲸捷信息技术有限公司 Method for realizing on-line index creation
CN113377786B (en) * 2021-08-16 2021-11-02 北京易鲸捷信息技术有限公司 Method for realizing on-line index creation
CN113868251A (en) * 2021-09-24 2021-12-31 北京百度网讯科技有限公司 Global secondary indexing method and device for distributed database
JP7397928B2 (en) 2021-09-24 2023-12-13 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Global secondary index method and device for distributed database

Also Published As

Publication number Publication date
CN111221814B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant