CN111221814B - Method, device and equipment for constructing secondary index - Google Patents

Info

Publication number
CN111221814B
Authority
CN
China
Prior art keywords
secondary index
index table
data
primary key
key column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811426358.5A
Other languages
Chinese (zh)
Other versions
CN111221814A (en)
Inventor
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811426358.5A priority Critical patent/CN111221814B/en
Publication of CN111221814A publication Critical patent/CN111221814A/en
Application granted granted Critical
Publication of CN111221814B publication Critical patent/CN111221814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method for constructing a secondary index, comprising: reading original data in a service node through a mapping task; selecting a non-primary key column in the original data as the primary key column of a secondary index table, and constructing data of the secondary index table according to that primary key column; and writing the data of the secondary index table into the secondary index table. The method addresses the prior-art problem that constructing a secondary index occupies the resources of the original service.

Description

Method, device and equipment for constructing secondary index
Technical Field
The application relates to the technical field of distributed databases, and in particular to a method and a device for constructing a secondary index, an electronic device, and a storage device.
Background
In a distributed database, each product generally offers its best query mode for a particular scenario. Take HBase (an open-source distributed non-relational storage system) as an example: HBase is a NoSQL (non-relational) database that supports KV queries, namely primary key (rowkey) queries, range (scan) queries, and full-table queries, of which primary key queries are the most common. Each service must determine a primary key in advance and specify it when writing and querying, which limits the available query modes: a query on a non-primary key column can only be served by a full-table scan, which consumes huge resources and performs extremely poorly.
A secondary index scheme for queries on non-primary key columns is therefore needed: by building a secondary index on a non-primary key column, queries on that column become fast. Faced with massive historical data, however, the first problem to solve is how to build the secondary index quickly.
In the prior art, Apache Phoenix is an open-source SQL engine that provides SQL capability for HBase; it supports building a secondary index on HBase non-primary key columns and also supports indexing historical data. The scheme is as follows: if synchronous index construction is selected when the index is created, multiple threads read the data in the main table (source data table) simultaneously, construct the index data, and write it into the secondary index table; if asynchronous index construction is selected, the index can be built later in batch by submitting an MR (MapReduce) task.
This prior-art way of constructing the secondary index has the following disadvantage: when the index is constructed synchronously, the threads run inside the region server and occupy its resources, such as its handlers and memory, which affects normal read and write requests to other tables.
Disclosure of Invention
The application provides a method for constructing a secondary index, which aims to solve the problem of occupation of original service resources when the secondary index is constructed in the prior art.
The application provides a construction method of a secondary index, which comprises the following steps:
reading original data in a service node through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
Optionally, the mapping task includes a local area service thread corresponding to a main table partition in the service node;
the reading of the original data in the service node by the mapping task comprises the following steps: and reading the original data of the main table partition corresponding to the local regional service thread in the service node through the local regional service thread in the mapping task.
Optionally, the method further comprises: performing snapshot operation on the original data in the service node to generate snapshot data of the original data;
the reading of the original data in the service node by the mapping task comprises the following steps: reading snapshot data of the original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
Optionally, the method further comprises: obtaining indication information of the primary key column of the secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: searching the original data, according to the indication information, for the non-primary key column that matches the primary key column of the secondary index table, and determining the matched non-primary key column as the primary key column of the secondary index table.
Optionally, the constructing the data of the secondary index table according to the primary key column of the secondary index table includes:
constructing the primary key column of the original data, together with the non-primary key columns other than the one used as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
aggregating and sorting the data of the secondary index table according to the data structure of the secondary index table, to obtain the aggregated and sorted data of the secondary index table;
and writing the aggregated and sorted data of the secondary index table into the secondary index table.
Optionally, the aggregating and sorting the data of the secondary index table according to the data structure of the secondary index table, to obtain the aggregated and sorted data of the secondary index table, includes:
obtaining the feature requirement information of the primary key column of the secondary index table;
and aggregating and sorting the data of the secondary index table according to the feature requirement information of the primary key column of the secondary index table, to obtain the aggregated and sorted data of the secondary index table.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
and generating a file comprising the data of the secondary index table through a summarizing task, and loading the file into the secondary index table.
Optionally, the number of the summarizing tasks is the same as the number of the partitions of the secondary index table.
Optionally, the summarizing task includes an index file writing thread;
the generating a file including the data of the secondary index table through the summarizing task, and loading the file into the secondary index table includes: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
Optionally, the mapping task is a mapping task running in a service node.
Optionally, the secondary index table is a secondary index table in a non-relational database.
The application also provides a device for constructing the secondary index, which comprises:
the original data reading unit is used for reading the original data in the service node through the mapping task;
a secondary index table data construction unit, configured to select a non-primary key column in the original data as a primary key column of a secondary index table, and construct data of the secondary index table according to the primary key column of the secondary index table;
and the index table data writing unit is used for writing the data of the secondary index table into the secondary index table.
The application also provides an electronic device comprising:
a processor; and
a memory for storing a program of the method for constructing the secondary index; after the device is powered on and the processor runs the program, the following steps are performed:
reading original data in the service node through a mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
The present application also provides a storage device storing a program of a construction method of a secondary index, the program being executed by a processor to perform the steps of:
reading original data in the service node through a mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
Compared with the prior art, the application has the following advantages:
the application provides a method, a device, an electronic device and a storage device for constructing a secondary index, in which the original data in a service node is read through a mapping task, so that the resources of the original region service are not occupied while the secondary index is constructed.
In the preferred scheme of the application, the snapshot operation is carried out on the original data in the service node, so that the snapshot data of the original data is generated, the original data in the service node is solidified, and the interference of real-time data writing is avoided in the process of constructing the index.
In the preferred scheme of the application, the file comprising the data of the secondary index table is generated through the summarizing task and then is loaded into the secondary index table, so that the situation that the data of the secondary index table are directly written into the secondary index table across a network is avoided, and the interference to the service node where the secondary index table is located is reduced.
Drawings
Fig. 1 is a flowchart of a method for constructing a secondary index according to a first embodiment of the present application.
Fig. 2 is a flowchart of an example of secondary index construction provided in the first embodiment of the present application.
Fig. 3 is a schematic diagram of a scenario of secondary index construction provided in the first embodiment of the present application.
Fig. 4 is a schematic diagram of a device for constructing a secondary index according to a second embodiment of the present application.
Fig. 5 is a schematic diagram of an electronic device according to a third embodiment of the present application.
Fig. 6 is a flowchart of a method for offline construction of a secondary index according to a fifth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present invention may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present invention is not limited to the specific embodiments disclosed below.
A first embodiment of the present application provides a method for constructing a secondary index, which is described in detail below with reference to fig. 1, 2, and 3.
As shown in fig. 1, in step S101, original data in a service node is read by a mapping task.
The service node refers to a physical node for storing the original data. The service nodes include regional service nodes (region server nodes), which may be one or more, and each regional service node may store a portion of the original data when the regional service nodes are plural.
The raw data in the service node includes raw data of a distributed database.
The distributed database includes a NoSQL (non-relational) database, such as HBase (an open-source distributed NoSQL storage system).
The mapping task may be a mapping task running in a service node, and may include a local area service thread corresponding to a main table partition in the service node. The mapping task may be a Map task in MapReduce. MapReduce is an offline execution engine of Hadoop that processes massive data in a divide-and-conquer manner. The main table is the source data table of the database; as shown in fig. 3, the table with the user id as its primary key is the main table. The main table includes at least one main table partition and, because its data volume is generally large, may be divided into a plurality of main table partitions.
The reading of the original data in the service node by the mapping task comprises the following steps: and reading the original data of the main table partition corresponding to the local regional service thread in the service node through the local regional service thread in the mapping task.
If a main table contains multiple main table partitions, different main table partitions may be stored on different service nodes. As shown in fig. 2, the main table is distributed over three regional service nodes, each of which stores a different main table partition. Each main table partition corresponds to one mapping task, and each mapping task contains one local area service thread. Each local area service thread thus corresponds to the main table partition on one regional service node and reads the original data of that partition.
If a main table contains multiple main table partitions, several of them may also be stored on the same service node. In that case one mapping task includes multiple local area service threads, each of which corresponds to one main table partition in the service node and reads the original data of that partition.
In the present application, a mapping task is started for each service node and the data of that service node is read by a local area service thread, which avoids reading the source data across multiple layers. Because the local area service thread (LocalRegionServer) has the same data-reading capability as the original region server, it can be implemented mainly by reusing the read path of the region server, which avoids re-implementing the read path for HFile.
Furthermore, to avoid interference from real-time data writes, a snapshot (Snapshot) operation may be performed on the original data of the service node before the step of reading the original data through the mapping task, so as to generate snapshot data of the original data. A snapshot captures the state of the data without copying it, and is therefore a more efficient data backup scheme. As shown in fig. 2, a snapshot operation is performed on the main table partition of each service node storing the main table, generating a main table Snapshot (snapshot data of the main table partition). Once the main table Snapshot has been generated, the original data of the main table partition is frozen, so real-time writes cannot interfere with the construction of the secondary index.
The reading of the original data in the service node by the mapping task comprises the following steps: the snapshot data of the original data in the service node is read by the mapping task.
After the snapshot data of the original data is generated, the local area service thread may read the snapshot data from the corresponding service node, that is, the physical node on which the local area service thread runs. As shown in fig. 2, region server 1, region server 2 and region server 3 each correspond to one local area service thread, and each local area service thread reads the snapshot data of the main table partition from its own regional service node.
It should be noted that, since there are usually multiple main table partitions, each main table partition corresponds to one local area service thread, and the multiple local area service threads can concurrently read the data of each main table partition, thereby ensuring the throughput of data reading and reducing the time for constructing the secondary index.
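The combination of snapshot and per-partition concurrent reading described above can be sketched as follows. This is only an illustrative in-memory model, not the HBase or MapReduce implementation: the node names, the `take_snapshot` and `read_partition` helpers, and the late-arriving row are all assumptions made for the example, and the "snapshot" here copies rows, whereas the text notes that a real snapshot does not copy data.

```python
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType

# Hypothetical stand-in for three main table partitions on three region servers.
main_table_partitions = {
    "regionserver1": [{"user_id": 100, "name": "Zhang San", "commodity": "Television set"}],
    "regionserver2": [{"user_id": 101, "name": "Li Si", "commodity": "Refrigerator"}],
    "regionserver3": [{"user_id": 110, "name": "Wang Wu", "commodity": "Washing machine"}],
}

def take_snapshot(partitions):
    """Freeze the current rows of every partition as read-only views."""
    return {node: tuple(MappingProxyType(dict(row)) for row in rows)
            for node, rows in partitions.items()}

def read_partition(node, snapshot):
    """Work of one local reader thread: read only its own partition."""
    return [dict(row) for row in snapshot[node]]

def read_all(snapshot):
    """Run one reader per partition concurrently and collect the rows in node order."""
    nodes = sorted(snapshot)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = pool.map(lambda n: read_partition(n, snapshot), nodes)
        return [row for rows in results for row in rows]

snapshot = take_snapshot(main_table_partitions)
# A hypothetical write arriving after the snapshot is not seen by the readers.
main_table_partitions["regionserver1"].append(
    {"user_id": 999, "name": "late write", "commodity": "n/a"})
rows = read_all(snapshot)
```

Because the readers only see the frozen snapshot, `rows` contains the three original records even though the live partition has since changed.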
As shown in fig. 1, in step S102, a non-primary key column in the original data is selected as a primary key column of a secondary index table, and data of the secondary index table is constructed according to the primary key column of the secondary index table.
The secondary index table may refer to an index table constructed according to a metadata format of the secondary index. The secondary index can improve the efficiency of searching, adding and deleting data.
The data of the secondary index table may refer to index data stored in the secondary index table. In fig. 3, the data with "name" as the primary key column is the data of the constructed secondary index table, wherein the primary key column of the secondary index table is named "name", and the non-primary key column is named "user id" and "commodity".
In a specific implementation, any non-primary key column in the original data can be used as the primary key column of the secondary index table, with the other columns used as its non-primary key columns. Alternatively, indication information of the primary key column of the secondary index table can be obtained; according to this indication information, the original data is searched for the non-primary key column that matches the primary key column of the secondary index table, and the matched column is determined as the primary key column of the secondary index table. For example, if the primary key column name in the indication information is "name", the "name" column is taken as the primary key column, as shown in fig. 3.
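As a minimal sketch of this lookup, assume the indication information is a small dictionary carrying the wanted column name. The `resolve_index_column` function, the `index_column` key, and the schema list are hypothetical names chosen for illustration; the source does not specify the format of the indication information.

```python
def resolve_index_column(schema, primary_key, indication):
    """Return the non-primary key column named by the indication information.

    schema: column names of the main table.
    primary_key: name of the main table's primary key column.
    indication: dict with a hypothetical "index_column" entry.
    """
    wanted = indication["index_column"]
    if wanted not in schema:
        raise KeyError(f"column {wanted!r} not found in the main table")
    if wanted == primary_key:
        raise ValueError("the index column must be a non-primary key column")
    return wanted

schema = ["user_id", "name", "commodity"]
index_col = resolve_index_column(schema, "user_id", {"index_column": "name"})
```

Rejecting the main table's own primary key is a design choice of this sketch: a primary key query already serves that column, so indexing it again would add nothing.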
When snapshot data of original data in a service node is read through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table comprises: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
The constructing the data of the secondary index table according to the primary key column of the secondary index table comprises the following steps:
the primary key column of the original data, together with the non-primary key columns other than the one used as the primary key column of the secondary index table, is constructed as the non-primary key columns of the secondary index table. For example, as shown in fig. 3, if the primary key column name in the indication information of the primary key column of the secondary index table is "name", the "user id" column and the "commodity" column become non-primary key columns.
As shown in fig. 3, the original data in the main table partition is shown in table 1:
user id    name         commodity
100        Zhang San    Television set
101        Li Si        Refrigerator
110        Wang Wu      Washing machine
TABLE 1
The data of the secondary index table generated by step S102 is shown in table 2:
name         user id    commodity
Zhang San    100        Television set
Li Si        101        Refrigerator
Wang Wu      110        Washing machine
TABLE 2
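The transformation from table 1 to table 2 can be sketched in a few lines. The `build_index_row` helper and the dictionary row layout are assumptions for illustration only; the rows are the three records of table 1.

```python
def build_index_row(row, index_col):
    """Make index_col the key; every other column, including the
    original primary key, becomes a non-primary key (value) column."""
    key = row[index_col]
    values = {col: val for col, val in row.items() if col != index_col}
    return key, values

table_1 = [
    {"user_id": 100, "name": "Zhang San", "commodity": "Television set"},
    {"user_id": 101, "name": "Li Si", "commodity": "Refrigerator"},
    {"user_id": 110, "name": "Wang Wu", "commodity": "Washing machine"},
]
table_2 = [build_index_row(row, "name") for row in table_1]
```

The source does not say how two main table rows with the same name would be kept apart in the index; this sketch simply emits one index row per main table row and leaves that question open.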
As shown in fig. 1, in step S103, the data of the secondary index table is written into the secondary index table.
After the mapping task has constructed the data of the secondary index table from the original data read in the service node, it can write the data of the secondary index table directly into the secondary index table.
Because the secondary index table may include a plurality of secondary index table partitions, the data of the secondary index table may first be aggregated and sorted, to ensure that data belonging to the same interval is written into the same secondary index table partition.
The writing the data of the secondary index table into the secondary index table comprises the following steps:
aggregating and sorting the data of the secondary index table according to the data structure of the secondary index table, to obtain the aggregated and sorted data of the secondary index table;
and writing the aggregated and sorted data of the secondary index table into the secondary index table.
The data structure of the secondary index table may refer to a metadata format of a directory in the secondary index table. For example, the column order in the data structure of the secondary index table in fig. 3 is: primary key column: name; non-primary key columns: user id, commodity.
For example, if the primary key of the secondary index table is numeric and its values lie in the interval 1-100, the data of the secondary index table whose primary key lies in the interval 1-50 can be aggregated together through a shuffle operation, to obtain the aggregated and sorted data of one partition of the secondary index table; the data whose primary key lies in the interval 51-100 is aggregated in the same way, to obtain the aggregated and sorted data of the other partition. The data of the two partitions is then written into the corresponding secondary index table partitions, which ensures that data belonging to the same interval is written into the same secondary index table partition.
For another example, if the primary key of the secondary index table consists of Chinese characters, the data of the secondary index table whose primary key has a pinyin initial in the a-h interval can be aggregated together, to obtain the aggregated and sorted data of one partition of the secondary index table; the data whose primary key has a pinyin initial in the i-z interval is aggregated to obtain the aggregated and sorted data of the other partition, and the data of the two partitions is written into the corresponding secondary index table partitions. When the primary key consists of Chinese characters, the data can also be aggregated by the stroke count of the primary key instead of by pinyin initial.
As shown in fig. 3, the records whose primary key has a pinyin initial in the a-w interval are aggregated together, that is, the records with the names "Chen Er", "Li Si", "Liu Yi", "Sun Qi", "Wang Wu" and "Wu Jiu" in the original data are aggregated in one secondary index table partition; the records whose primary key has a pinyin initial in the x-z interval are aggregated together, that is, the records with the names "Zhang San", "Zhao Liu", "Zheng Shi" and "Zhou Ba" in the original data are aggregated in another secondary index table partition.
The aggregation and sorting processing is performed on the data of the secondary index table according to the data structure of the secondary index table, so as to obtain the data of the secondary index table after the aggregation and sorting processing, which comprises the following steps:
obtaining the feature requirement information of the primary key column of the secondary index table;
and aggregating and sorting the data of the secondary index table according to the feature requirement information of the primary key column of the secondary index table, to obtain the aggregated and sorted data of the secondary index table.
The feature requirement information of the primary key column of the secondary index table may refer to the rule, set according to the characteristics of the primary key column, by which the data of the secondary index table is aggregated and sorted. For example, if the primary key column is a name, the feature requirement information of the primary key column of the secondary index table may be: records whose primary key has a pinyin initial in the a-w interval are aggregated together, and records whose primary key has a pinyin initial in the x-z interval are aggregated together.
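The interval-based aggregation and sorting described above can be sketched with a sorted list of split points, here a single split at "x" so that initials a-w fall into partition 0 and x-z into partition 1. The `partition_of` and `shuffle_and_sort` helpers are hypothetical, the names are assumed to be already romanized as pinyin, and the rows are the three records of table 1; a MapReduce shuffle would do the equivalent grouping across machines.

```python
from bisect import bisect_right
from collections import defaultdict

def partition_of(key, split_points):
    """Index of the interval the key's initial falls into.
    split_points holds the sorted lower bounds of partitions 1..n-1."""
    return bisect_right(split_points, key[0].lower())

def shuffle_and_sort(index_rows, split_points):
    """Group index rows by partition, then sort each group by key."""
    groups = defaultdict(list)
    for key, values in index_rows:
        groups[partition_of(key, split_points)].append((key, values))
    return {p: sorted(rows, key=lambda kv: kv[0])
            for p, rows in sorted(groups.items())}

index_rows = [
    ("Zhang San", {"user_id": 100, "commodity": "Television set"}),
    ("Li Si", {"user_id": 101, "commodity": "Refrigerator"}),
    ("Wang Wu", {"user_id": 110, "commodity": "Washing machine"}),
]
# Feature requirement: initials a-w in one partition, x-z in the other.
partitions = shuffle_and_sort(index_rows, split_points=["x"])
```

The same function covers the numeric example if the key itself, rather than its first letter, is compared against numeric split points; that variant is left out to keep the sketch short.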
Further, in order to avoid the problem of occupying resources of a service node where the secondary index table is located caused when the data of the secondary index table is directly written into the secondary index table, the data of the secondary index table may be sent to a summary task, a file including the data of the secondary index table is generated by the summary task, and then the file is loaded into the secondary index table.
It should be noted that the number of summary tasks is determined by the number of partitions of the secondary index table and may be equal to it. The number of partitions of the secondary index table may be set as required, and the secondary index table includes at least one secondary index table partition. When the secondary index table includes two or more secondary index table partitions, each secondary index table partition corresponds to one summary task. As shown in fig. 2, the secondary index table includes two secondary index table partitions, each corresponding to one summary task.
The summary task may be a Reduce task in MapReduce, and may include an index file writing thread (e.g., IndexHFileWriter). The generating a file including the data of the secondary index table through the summary task, and loading the file into the secondary index table includes: generating a file including the data of the secondary index table through the index file writing thread in the summary task, and loading the file into the secondary index table.
If the non-relational database is HBase, the file including the data of the secondary index table is an HFile of the secondary index table. Generating a file including the data of the secondary index table through the summary task may refer to writing the constructed data of the secondary index table into the HFile of the secondary index table through an IndexHFileWriter thread. Because the Reduce task writes the constructed index data into the file of the secondary index table through the IndexHFileWriter, it does not write into the secondary index table through the HBase native API; this avoids the excessively deep call stack that the native API would produce and reduces the interference with the original service node. It also avoids writing the data of the secondary index table directly into the secondary index table across the network, which reduces the interference with the service node where the secondary index table is located.
After the file including the data of the secondary index table is generated by the summarizing task, since the previous step is performed offline, the file including the data of the secondary index table needs to be loaded into the corresponding index table partition.
An application scenario of the first embodiment of the present application is described below with reference to FIG. 2 and FIG. 3. As shown in FIG. 2 and FIG. 3, the main table includes three main table partitions, stored respectively on three regional service nodes: the region server1 node, the region server2 node, and the region server3 node. The main table uses "user id" as its primary key, and the secondary index table uses "name" as its primary key. Each of the three regional service nodes starts an independent local regional service thread (LocalRegionServer) to read the snapshot data of the main table partition on that node. The data is then aggregated and sorted by the shuffle and passed to two IndexFileWriter tasks, which generate HFile files containing the secondary index table data. Finally, the HFile files are loaded into the corresponding index table partitions (that is, the secondary index table partitions), so that the data in the secondary index table uses "name" as its primary key.
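The scenario above can be simulated in memory as a rough sketch. Everything concrete here is assumed for illustration: the sample rows, the `age` column, the split key `"m"`, and the use of Python threads and `deepcopy` in place of LocalRegionServer threads and HBase snapshots.

```python
import copy
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

# Hypothetical main-table partitions, one per regional service node.
# The primary key of the main table is the user id.
main_partitions = [
    {"u1": {"name": "zoe", "age": 30}},
    {"u2": {"name": "alice", "age": 25}},
    {"u3": {"name": "mike", "age": 41}},
]
index_split_keys = ["m"]  # assumed: two secondary index table partitions


def map_partition(snapshot):
    """Map step: emit (name, row) pairs; name becomes the index primary key."""
    return [(cols["name"], {"user_id": uid, "age": cols["age"]})
            for uid, cols in snapshot.items()]


# Snapshot each partition so that later writes do not affect the build.
snapshots = [copy.deepcopy(p) for p in main_partitions]
main_partitions[0]["u9"] = {"name": "late", "age": 1}  # not in the snapshot

# One reader thread per main table partition, running concurrently.
with ThreadPoolExecutor(max_workers=len(snapshots)) as pool:
    mapped = list(pool.map(map_partition, snapshots))

# Shuffle: route each pair to its index partition, then sort by index key.
buckets = [[] for _ in range(len(index_split_keys) + 1)]
for pairs in mapped:
    for name, row in pairs:
        buckets[bisect_right(index_split_keys, name)].append((name, row))
index_partitions = [sorted(b, key=lambda kv: kv[0]) for b in buckets]
```

Each entry of `index_partitions` corresponds to the sorted input that one file-writing task would receive; the row written after the snapshot never reaches the index build.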
Thus far, the method for constructing a secondary index provided in the first embodiment of the present application has been described in detail. According to the first embodiment of the application, a snapshot operation is performed on the main table partition of each service node storing the main table, so that the original data of the main table partition is frozen and real-time data writes cannot interfere with index construction. By running a mapping service on each service node, the resources of the regional service (region server) are not occupied, and multiple local regional service threads can read the partition data of the service nodes concurrently, which ensures read throughput and reduces the time needed to construct the secondary index. Because the summary task generates the file including the data of the secondary index table through the index file writing thread, writing to the index table through the HBase native API is avoided, which in turn avoids the excessively deep call stack that the native API generates and reduces interference with the original service node.
Corresponding to the method for constructing the secondary index provided in the first embodiment of the present application, the second embodiment of the present application further provides a device for constructing the secondary index.
As shown in fig. 4, the device for constructing the secondary index includes:
an original data reading unit 401, configured to read original data in the service node through a mapping task;
a secondary index table data construction unit 402, configured to select a non-primary key column in the original data as a primary key column of a secondary index table, and construct data of the secondary index table according to the primary key column of the secondary index table;
an index table data writing unit 403, configured to write data of the secondary index table into the secondary index table.
Optionally, the mapping task includes a local area service thread corresponding to a main table partition in the service node;
the original data reading unit is specifically configured to: and reading the original data of the main table partition corresponding to the local regional service thread in the service node through the local regional service thread in the mapping task.
Optionally, the apparatus further includes: the original data snapshot unit is used for carrying out snapshot operation on the original data in the service node and generating snapshot data of the original data;
The original data reading unit is specifically configured to: reading snapshot data of the original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
Optionally, the apparatus further includes: a primary key column indication information obtaining unit for obtaining indication information of a primary key column of the secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and searching a non-primary key column matched with the primary key column of the secondary index table from the original data according to the indication information of the primary key column of the secondary index table, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
Optionally, the secondary index table data construction unit is specifically configured to:
constructing the primary key column in the original data, together with the non-primary key columns other than the column used as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
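The column selection and row construction described above can be sketched for a single row. The function `build_index_row`, its parameter names, and the sample columns are hypothetical; `index_column` plays the role of the indication information of the primary key column of the secondary index table.

```python
def build_index_row(primary_key, columns, index_column):
    """Turn one main-table row into one secondary-index row.

    primary_key  -- (name, value) of the main table's primary key column
    columns      -- dict holding the row's non-primary key columns
    index_column -- indicates which column becomes the index primary key
    """
    pk_name, pk_value = primary_key
    # Only a non-primary key column of the original data may be selected.
    if index_column == pk_name or index_column not in columns:
        raise ValueError(f"{index_column!r} is not a non-primary key column")
    index_key = columns[index_column]
    # The original primary key and the remaining columns become
    # non-primary key columns of the secondary index row.
    index_columns = {pk_name: pk_value}
    index_columns.update(
        {k: v for k, v in columns.items() if k != index_column})
    return index_key, index_columns


row = build_index_row(("user_id", "u1"), {"name": "alice", "age": 25}, "name")
```

Keeping the original primary key among the index row's columns is what lets a lookup by name lead back to the full row in the main table.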
Optionally, the writing unit of the index table data is specifically configured to:
According to the data structure of the secondary index table, carrying out aggregation and sequencing treatment on the data of the secondary index table to obtain the data of the secondary index table after the aggregation and sequencing treatment;
and writing the data of the secondary index table after the aggregation and sequencing treatment into the secondary index table.
Optionally, the aggregating and sorting processing is performed on the data of the secondary index table according to the data structure of the secondary index table, so as to obtain the data of the secondary index table after the aggregating and sorting processing, including:
obtaining the feature requirement information of the primary key column of the secondary index table;
and according to the feature requirement information of the primary key column of the secondary index table, carrying out aggregation and sequencing treatment on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sequencing treatment.
Optionally, the index table data writing unit is specifically configured to:
and generating a file comprising the data of the secondary index table through a summarizing task, and loading the file into the secondary index table.
Optionally, the number of the summarizing tasks is the same as the number of the partitions of the secondary index table.
Optionally, the summarizing task includes an index file writing thread;
The generating a file including the data of the secondary index table through the summarizing task, and loading the file into the secondary index table includes: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
Optionally, the mapping task is a mapping task running in a service node.
Optionally, the secondary index table is a secondary index table in a non-relational database.
It should be noted that, for the detailed description of the device for constructing the secondary index provided in the second embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, which is not repeated here.
Corresponding to the method for constructing the secondary index provided in the first embodiment of the present application, the third embodiment of the present application further provides an electronic device.
As shown in fig. 5, the electronic device includes:
a processor 501; and
a memory 502 for storing a program of a construction method of a secondary index, and after the apparatus is powered on and runs the program of the construction method of the secondary index through the processor, the following steps are performed:
reading original data in the service node through a mapping task;
Selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
Optionally, the mapping task includes a local area service thread corresponding to a main table partition in the service node;
the reading of the original data in the service node by the mapping task comprises the following steps: and reading the original data of the main table partition corresponding to the local regional service thread in the service node through the local regional service thread in the mapping task.
Optionally, the electronic device further performs the following steps: performing snapshot operation on the original data in the service node to generate snapshot data of the original data;
the reading of the original data in the service node by the mapping task comprises the following steps: reading snapshot data of the original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
Optionally, the electronic device further performs the following steps: obtaining indication information of a main key column of a secondary index table;
The selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and searching a non-primary key column matched with the primary key column of the secondary index table from the original data according to the indication information of the primary key column of the secondary index table, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
Optionally, the constructing the data of the secondary index table according to the primary key column of the secondary index table includes:
constructing the primary key column in the original data, together with the non-primary key columns other than the column used as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
according to the data structure of the secondary index table, carrying out aggregation and sequencing treatment on the data of the secondary index table to obtain the data of the secondary index table after the aggregation and sequencing treatment;
and writing the data of the secondary index table after the aggregation and sequencing treatment into the secondary index table.
Optionally, the aggregating and sorting processing is performed on the data of the secondary index table according to the data structure of the secondary index table, so as to obtain the data of the secondary index table after the aggregating and sorting processing, including:
obtaining the feature requirement information of the primary key column of the secondary index table;
and according to the feature requirement information of the primary key column of the secondary index table, carrying out aggregation and sequencing treatment on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sequencing treatment.
Optionally, the writing the data of the secondary index table into the secondary index table includes:
and generating a file comprising the data of the secondary index table through a summarizing task, and loading the file into the secondary index table.
Optionally, the number of the summarizing tasks is the same as the number of the partitions of the secondary index table.
Optionally, the summarizing task includes an index file writing thread;
the generating a file including the data of the secondary index table through the summarizing task, and loading the file into the secondary index table includes: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
Optionally, the mapping task is a mapping task running in a service node.
Optionally, the secondary index table is a secondary index table in a non-relational database.
It should be noted that, for the detailed description of the electronic device provided in the third embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, which is not repeated here.
Corresponding to the method for constructing a secondary index provided in the first embodiment of the present application, the fourth embodiment of the present application further provides a storage device, storing a program of the method for constructing a secondary index, the program being executed by a processor, and executing the steps of:
reading original data in the service node through a mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
and writing the data of the secondary index table into the secondary index table.
It should be noted that, for the detailed description of the storage device provided in the fourth embodiment of the present application, reference may be made to the related description of the first embodiment of the present application, which is not repeated here.
A fifth embodiment of the present application provides an offline construction method of a secondary index, which is described below with reference to fig. 6.
As shown in fig. 6, in step S601, original offline data in a service node is read by a mapping task.
The original offline data in the service node includes original offline data of a distributed database.
The distributed database includes NoSQL (non-relational database), such as: HBASE (an open source distributed NoSQL storage system) and the like.
The mapping task may be a mapping task running in a service node, and may include a local area service thread corresponding to a primary table partition in the service node. The mapping task may be a Map task in MapReduce.
As shown in fig. 6, in step S602, a non-primary key column in the original offline data is selected as a primary key column of a secondary index table, and the offline data of the secondary index table is constructed according to the primary key column of the secondary index table.
The implementation of this step is similar to step S102 of the first embodiment of the present application, and only the data of the secondary index table in step S102 need be changed to the offline data of the secondary index table, which is not described in detail herein.
As shown in fig. 6, in step S603, the offline data of the secondary index table is written into the secondary index table.
The writing the offline data of the secondary index table into the secondary index table comprises the following steps:
and generating a file comprising offline data of the secondary index table through a summarizing task, and loading the file into the secondary index table.
It should be noted that the number of summary tasks is determined by the number of partitions of the secondary index table and may be equal to it. The number of partitions may be set as required, and the secondary index table includes at least one partition. When the secondary index table includes two or more partitions, each partition corresponds to one summary task. As shown in FIG. 2, the secondary index table includes two partitions, each corresponding to one summary task.
The summary task may be a ReduceTask in a MapReduce module, which may include an index file writing thread (e.g., IndexHFileWriter). Generating a file including the offline data of the secondary index table through the summary task and loading the file into the secondary index table includes: generating a file including the offline data of the secondary index table through the index file writing thread in the summary task, and loading the file into the secondary index table.
According to the fifth embodiment of the application, the offline data of the secondary index table can be sent to the summarizing task, the file comprising the offline data of the secondary index table is generated through the summarizing task, and then the file is loaded into the secondary index table, so that the offline construction of the secondary index is realized, and the problem that resources of a service node where the secondary index table is located are occupied when the data of the secondary index table is written into the secondary index table on line is solved.
While the preferred embodiment has been disclosed, it is not intended to limit the invention thereto, and any person skilled in the art may make variations and modifications which fall within the spirit and scope of the present invention, and therefore the scope of the present invention shall be defined by the appended claims.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (14)

1. A method for constructing a secondary index, comprising:
reading original data in the service node through a mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
writing the data of the secondary index table into the secondary index table;
the writing the data of the secondary index table into the secondary index table comprises the following steps: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
2. The method of claim 1, wherein the mapping task comprises a local area service thread corresponding to a primary table partition in a service node;
the reading of the original data in the service node by the mapping task comprises the following steps: and reading the original data of the main table partition corresponding to the local regional service thread in the service node through the local regional service thread in the mapping task.
3. The method as recited in claim 1, further comprising: performing snapshot operation on the original data in the service node to generate snapshot data of the original data;
the reading of the original data in the service node by the mapping task comprises the following steps: reading snapshot data of the original data in the service node through the mapping task;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and selecting a non-primary key column in the snapshot data of the original data as a primary key column of a secondary index table.
4. The method as recited in claim 1, further comprising: obtaining indication information of a main key column of a secondary index table;
the selecting the non-primary key column in the original data as the primary key column of the secondary index table includes: and searching a non-primary key column matched with the primary key column of the secondary index table from the original data according to the indication information of the primary key column of the secondary index table, and determining the non-primary key column matched with the primary key column of the secondary index table as the primary key column of the secondary index table.
5. The method of claim 1, wherein constructing the data of the secondary index table from the primary key columns of the secondary index table comprises:
constructing the primary key column in the original data, together with the non-primary key columns other than the column used as the primary key column of the secondary index table, as the non-primary key columns of the secondary index table.
6. The method of claim 1, wherein writing the data of the secondary index table into the secondary index table comprises:
according to the data structure of the secondary index table, carrying out aggregation and sequencing treatment on the data of the secondary index table to obtain the data of the secondary index table after the aggregation and sequencing treatment;
and writing the data of the secondary index table after the aggregation and sequencing treatment into the secondary index table.
7. The method of claim 6, wherein the aggregating and sorting the data of the secondary index table according to the data structure of the secondary index table, to obtain the aggregated and sorted data of the secondary index table, comprises:
obtaining the feature requirement information of the primary key column of the secondary index table;
and according to the feature requirement information of the primary key column of the secondary index table, carrying out aggregation and sequencing treatment on the data of the secondary index table to obtain the data of the secondary index table after aggregation and sequencing treatment.
8. The method of claim 1, wherein the number of summary tasks is the same as the number of partitions of the secondary index table.
9. The method of claim 1, wherein the mapping task is a mapping task running in a service node.
10. The method of claim 1, wherein the secondary index table is a secondary index table in a non-relational database.
11. A secondary index building apparatus, comprising:
the original data reading unit is used for reading the original data in the service node through the mapping task;
a secondary index table data construction unit, configured to select a non-primary key column in the original data as a primary key column of a secondary index table, and construct data of the secondary index table according to the primary key column of the secondary index table;
an index table data writing unit, configured to write data of the secondary index table into the secondary index table, includes: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
12. An electronic device, comprising:
A processor; and
a memory for storing a program of a construction method of the secondary index, the apparatus being powered on and executing the program of the construction method of the secondary index by the processor, and performing the steps of:
reading original data in the service node through a mapping task;
selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
writing the data of the secondary index table into the secondary index table;
the writing the data of the secondary index table into the secondary index table comprises the following steps: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
13. A storage device, characterized in that
it stores a program of a construction method of a secondary index, the program, when executed by a processor, performing the following steps:
reading original data in a service node through a mapping task, selecting a non-primary key column in the original data as a primary key column of a secondary index table, and constructing data of the secondary index table according to the primary key column of the secondary index table;
Writing the data of the secondary index table into the secondary index table;
the writing the data of the secondary index table into the secondary index table comprises the following steps: generating a file comprising data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
14. An offline construction method of a secondary index, comprising the steps of:
reading original offline data in the service node through a mapping task;
selecting a non-primary key column in the original offline data as a primary key column of a secondary index table, and constructing the offline data of the secondary index table according to the primary key column of the secondary index table;
writing the offline data of the secondary index table into the secondary index table;
the writing the offline data of the secondary index table into the secondary index table comprises the following steps: generating a file comprising offline data of the secondary index table through an index file writing thread in a summary task, and loading the file into the secondary index table.
CN201811426358.5A 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index Active CN111221814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811426358.5A CN111221814B (en) 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811426358.5A CN111221814B (en) 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index

Publications (2)

Publication Number Publication Date
CN111221814A CN111221814A (en) 2020-06-02
CN111221814B true CN111221814B (en) 2023-06-27

Family

ID=70809354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811426358.5A Active CN111221814B (en) 2018-11-27 2018-11-27 Method, device and equipment for constructing secondary index

Country Status (1)

Country Link
CN (1) CN111221814B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377786B (en) * 2021-08-16 2021-11-02 北京易鲸捷信息技术有限公司 Method for realizing on-line index creation
CN113868251B (en) 2021-09-24 2022-10-18 北京百度网讯科技有限公司 Global secondary indexing method and device for distributed database

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853288A (en) * 2010-05-19 2010-10-06 马晓普 Configurable full-text retrieval service system based on document real-time monitoring
WO2011011996A1 (en) * 2009-07-31 2011-02-03 中兴通讯股份有限公司 Method for backing up terminal data and system thereof
CN102193917A (en) * 2010-03-01 2011-09-21 中国移动通信集团公司 Method and device for processing and querying data
CN103020204A (en) * 2012-12-05 2013-04-03 北京普泽天玑数据技术有限公司 Method and system for carrying out multi-dimensional regional inquiry on distribution type sequence table
CN103324762A (en) * 2013-07-17 2013-09-25 陆嘉恒 Hadoop-based index creation method and indexing method thereof
CN106326239A (en) * 2015-06-18 2017-01-11 阿里巴巴集团控股有限公司 Distributed file system and file meta-information management method thereof
CN107783974A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Data handling system and method
CN108228819A (en) * 2017-12-29 2018-06-29 武汉长江仪器自动化研究所有限公司 Methods For The Prediction Ofthe Deformation of A Large Dam based on big data platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7933927B2 (en) * 2004-11-17 2011-04-26 Bmc Software, Inc. Method and apparatus for building index of source data

Also Published As

Publication number Publication date
CN111221814A (en) 2020-06-02

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant