CN115114293A - Database index creating method, related device, equipment and storage medium - Google Patents


Publication number
CN115114293A
Authority
CN
China
Prior art keywords
data
target
index
data set
threads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210752984.3A
Other languages
Chinese (zh)
Inventor
王冬慧
王鲁俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd
Priority to CN202210752984.3A
Publication of CN115114293A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Abstract

The application discloses a method for creating a database index and relates to database management technology. The method includes: acquiring a sampling point data sequence from a target data set, wherein the sampling point data sequence comprises at least one sampling point data arranged in ascending or descending order; determining at least one quantile point data from the sampling point data sequence; dividing the target data set into at least two unordered data partitions according to the at least one quantile point data; calling at least two threads to sort the data in the at least two unordered data partitions to obtain at least two ordered data partitions; and constructing a database index corresponding to a target field according to the at least two ordered data partitions. The application also provides a related apparatus, a device, and a storage medium. Because the data in the unordered data partitions is sorted in parallel and the partitions are already ordered relative to one another, no external sort is required, which improves index creation efficiency.

Description

Database index creating method, related device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies and database technologies, and in particular, to a method, a related apparatus, a device, and a storage medium for creating a database index.
Background
An index is a structure that orders the values of one or more columns in a database table. One of its main purposes is to speed up retrieval of data in a table: it is an auxiliary data structure that helps a searcher find, as quickly as possible, the identifiers of records that satisfy a given condition. Since database tables often hold large volumes of data, it is necessary to build indexes efficiently.
Currently, creating an index in the relational database management system MySQL mainly involves three stages. In the first stage, a single thread scans the entire primary key index file to generate a secondary index data file. In the second stage, a single thread performs two-way merge sort until the whole file is ordered. In the third stage, a single thread scans the index data file sorted in the second stage and inserts the entries into the index tree in order, thereby obtaining the index tree.
However, the inventors have found at least the following problem in the prior art: because all three stages are executed serially by a single thread, index creation is inefficient.
Disclosure of Invention
The embodiments of the application provide a database index creating method, a related apparatus, a device, and a storage medium. The data in a plurality of unordered data partitions is sorted in parallel, and the data partitions are ordered relative to one another, so no external sort is needed and index creation efficiency is improved.
In view of the above, an aspect of the present application provides a method for creating a database index, including:
acquiring a sampling point data sequence from a target data set, wherein the target data set comprises a data set corresponding to a target field, and the sampling point data sequence comprises at least one sampling point data arranged in ascending or descending order;
determining at least one quantile point data from the sampling point data sequence;
dividing the target data set into at least two unordered data partitions according to the at least one quantile point data;
calling at least two threads to sort the data in the at least two unordered data partitions to obtain at least two ordered data partitions, wherein the ordered data partitions and the unordered data partitions are in one-to-one correspondence;
and constructing a database index corresponding to the target field according to the at least two ordered data partitions.
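The five steps above can be illustrated with a minimal Python sketch. It is not the implementation claimed by the application: the function name `build_sorted_keys`, the slicing-based sampling, and the evenly spaced choice of quantile points are all assumptions made for illustration, and the sketch stops at the sorted key sequence rather than building an index tree.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def build_sorted_keys(target_data, num_partitions=4, sample_interval=3):
    """Return all keys in ascending order via sample, split, parallel sort."""
    # Step 1: take every `sample_interval`-th key and sort the samples.
    samples = sorted(target_data[::sample_interval])

    # Step 2: pick up to (num_partitions - 1) quantile points.
    # Even spacing over the samples is an assumption of this sketch.
    step = max(1, len(samples) // num_partitions)
    quantile_points = samples[step::step][: num_partitions - 1]

    # Step 3: route each key to the partition whose value range contains it.
    # Partition i holds keys in (quantile_points[i-1], quantile_points[i]].
    partitions = [[] for _ in range(len(quantile_points) + 1)]
    for key in target_data:
        partitions[bisect_left(quantile_points, key)].append(key)

    # Step 4: sort every unordered partition on its own thread.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        ordered_partitions = list(pool.map(sorted, partitions))

    # Step 5: the partitions are already ordered relative to each other,
    # so plain concatenation yields the full sorted key sequence.
    return [key for part in ordered_partitions for key in part]
```

Because every key in partition i is no greater than every key in partition i+1, no merge pass is needed after the per-partition sorts. In standard CPython the global interpreter lock limits the real speedup from threads, so the sketch shows the structure of the method rather than its performance.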
Another aspect of the present application provides a method for creating a database index, including:
calling N threads to scan an original data set to obtain N data subsets, wherein each data subset comprises a data set corresponding to a target field, each data subset is obtained by one thread's scan, and N is an integer greater than 1;
calling N threads to sort the data in the N data subsets to obtain N first data sequences, wherein each of the N threads sorts the data in one data subset;
calling N/2 threads to perform two-way merge sort on the N first data sequences to obtain N/2 second data sequences, and continuing to merge until a target data sequence is obtained, wherein each of the N/2 threads merges two first data sequences;
and constructing a database index corresponding to the target field according to the target data sequence.
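A minimal Python sketch of this scan, sort, and pairwise-merge scheme follows. The names `merge_pair` and `parallel_merge_sort` and the round-robin split used to stand in for the parallel scan are assumptions for illustration only.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def merge_pair(pair):
    """Two-way merge of one or two already sorted sequences."""
    return list(heapq.merge(*pair))


def parallel_merge_sort(original_data, n_threads=4):
    # Scan: split the original data into N subsets (round-robin here,
    # standing in for N threads each scanning part of the data set).
    subsets = [original_data[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # N threads each sort one subset into a first data sequence.
        sequences = list(pool.map(sorted, subsets))
        # Each round merges neighbouring pairs, halving the sequence count,
        # until a single target data sequence remains.
        while len(sequences) > 1:
            pairs = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
            sequences = list(pool.map(merge_pair, pairs))
    return sequences[0]
```

With N = 4, the first round uses two merges to produce two sequences and the second round uses one merge to produce the target data sequence, so the number of busy threads halves each round.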
Another aspect of the present application provides a method for creating a database index, including:
calling N threads to scan an original data set to obtain N first data sets, wherein each first data set comprises a data set corresponding to a target field, each first data set is obtained by one thread's scan, and N is an integer greater than 1;
calling N threads to sort the data in the N first data sets to obtain N first data sequences, wherein each of the N threads sorts the data in one first data set;
determining (N-1) pivots corresponding to each first data sequence according to the N first data sequences, wherein each pivot is used for dividing the data in the first data sequences;
dividing the N first data sequences into N second data sets according to the (N-1) pivots;
calling N threads to sort the data in the N second data sets to obtain N second data sequences, wherein each of the N threads sorts the data in one second data set;
and constructing a database index corresponding to the target field according to the N second data sequences.
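The pivot-based variant can be sketched in Python as follows. The text above does not say how the pivots are chosen, so the rule used here (regular samples from each sorted sequence, then evenly spaced picks) is an assumption borrowed from parallel sorting by regular sampling; the function name `pivot_partition_sort` is likewise hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def pivot_partition_sort(original_data, n=3):
    # N threads scan the data; a round-robin split stands in for the scan.
    subsets = [original_data[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        # N threads each sort one first data set.
        first_sequences = list(pool.map(sorted, subsets))

        # Assumed pivot rule: gather n regular samples from each non-empty
        # sorted sequence, sort them, and keep up to (n - 1) spaced values.
        candidates = sorted(
            seq[(len(seq) * k) // n]
            for seq in first_sequences if seq
            for k in range(n)
        )
        pivots = candidates[n::n][: n - 1]

        # Cut every sorted sequence at the pivots; slice j of each
        # sequence is appended to second data set j.
        second_sets = [[] for _ in range(len(pivots) + 1)]
        for seq in first_sequences:
            bounds = [0] + [bisect_right(seq, p) for p in pivots] + [len(seq)]
            for j in range(len(second_sets)):
                second_sets[j].extend(seq[bounds[j]:bounds[j + 1]])

        # N threads each sort one second data set.
        second_sequences = list(pool.map(sorted, second_sets))
    # Second data sets cover disjoint, increasing value ranges.
    return [key for seq in second_sequences for key in seq]
```

Unlike the two-way merge scheme, all N threads stay busy in the final sorting step, because each second data set covers its own value range and can be sorted independently.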
Another aspect of the present application provides an index creating apparatus, including:
the acquisition module is used for acquiring a sampling point data sequence from a target data set, wherein the target data set comprises a data set corresponding to a target field, and the sampling point data sequence comprises at least one sampling point data arranged in ascending or descending order;
the determining module is used for determining at least one quantile point data from the sampling point data sequence;
the dividing module is used for dividing the target data set into at least two unordered data partitions according to the at least one quantile point data;
the sorting module is used for calling at least two threads to sort the data in the at least two unordered data partitions to obtain at least two ordered data partitions, wherein the ordered data partitions and the unordered data partitions are in one-to-one correspondence;
and the construction module is used for constructing the database index corresponding to the target field according to the at least two ordered data partitions.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for sampling data corresponding to the target field according to a preset sampling interval when S threads are called to scan an original data set, wherein the original data set comprises a data set corresponding to the target field, and S is an integer greater than 1;
when the scanning of the original data set is finished, obtaining a target data set and at least one sampling point data to be sorted, wherein the target data set comprises at least one data subset, and each data subset is obtained by one thread's scan;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
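As an illustration of sampling during a multi-threaded scan, the following Python sketch splits the rows into contiguous chunks, lets each thread extract the target field and sample it at a fixed interval, and then sorts the gathered samples. The row representation (a list of dictionaries), the chunking scheme, and the names `scan_and_sample` and `acquire_sample_sequence` are assumptions made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_and_sample(chunk, field, interval):
    """Scan one chunk of rows; return (field values, sampled values)."""
    subset = [row[field] for row in chunk]
    return subset, subset[::interval]


def acquire_sample_sequence(rows, field, s_threads=2, interval=2,
                            descending=False):
    # Split the original data set into contiguous chunks, one per thread.
    size = max(1, -(-len(rows) // s_threads))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]

    # Each thread scans its chunk and samples while scanning.
    with ThreadPoolExecutor(max_workers=s_threads) as pool:
        results = list(
            pool.map(lambda c: scan_and_sample(c, field, interval), chunks)
        )

    # The data subsets together form the target data set.
    data_subsets = [subset for subset, _ in results]
    samples = [value for _, sampled in results for value in sampled]

    # Arrange the gathered samples in ascending or descending order.
    return data_subsets, sorted(samples, reverse=descending)
```

Sampling inside the scan avoids a second pass over the data; the alternative designs that scan first and sample afterwards trade that extra pass for a simpler scan step.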
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for sampling data corresponding to a target field according to a preset sampling interval when a target thread is called to scan an original data set, wherein the original data set comprises a data set corresponding to the target field;
when the scanning of the original data set is finished, obtaining a target data set and at least one sampling point data to be sorted;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for calling S threads to scan an original data set to obtain a target data set, wherein the original data set comprises a data set corresponding to a target field, the target data set comprises at least one data subset, each data subset is obtained by scanning one thread, and S is an integer greater than 1;
calling at least two threads to scan the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for calling a target thread to scan an original data set to obtain a target data set, wherein the original data set comprises a data set corresponding to a target field;
calling the target thread to scan the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically used for determining the total number of sampling point data according to the sampling point data sequence;
determining the total number of quantile point data according to the total number of data partitions to be divided;
dividing the total number of sampling point data by the total number of quantile point data to obtain a target quotient value;
and extracting, from the sampling point data sequence, the sampling point data located at integer multiples of the target quotient value as quantile point data, to obtain at least one quantile point data.
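The quotient rule above can be sketched in Python. The function name `pick_quantile_points` is hypothetical, and two details are assumptions of this sketch: that the number of quantile points is one less than the number of partitions, and that positions in the sequence are counted from 1. The `divisor` parameter is also an addition of the sketch so that other divisors can be tried.

```python
def pick_quantile_points(sample_sequence, num_partitions, divisor=None):
    """Pick quantile points from an already sorted sample sequence."""
    # Assumption: (num_partitions - 1) cut points separate the partitions.
    num_points = num_partitions - 1
    if divisor is None:
        # Literal reading of the text: divide by the number of quantile points.
        divisor = num_points
    quotient = len(sample_sequence) // divisor  # target quotient value

    # Take the samples at 1-based positions quotient, 2*quotient, ...
    positions = [quotient * i for i in range(1, num_points + 1)]
    return [
        sample_sequence[p - 1]
        for p in positions
        if 1 <= p <= len(sample_sequence)
    ]
```

Read literally, the last selected position equals the length of the sample sequence, so the largest sample always becomes a quantile point. Passing `divisor=num_partitions` spaces the cut points more evenly; whether the application intends that is not stated in the text above.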
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the dividing module is specifically used for determining at least two data interval ranges according to the at least one quantile point data;
and calling a target thread to divide the data corresponding to the target field in the target data set into the corresponding data interval ranges to obtain at least two unordered data partitions.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the dividing module is specifically used for determining at least two data interval ranges according to the at least one quantile point data;
and calling N threads to divide the data corresponding to the target field in N data subsets into the corresponding data interval ranges to obtain at least two unordered data partitions, wherein the N data subsets belong to the target data set, each thread is used for dividing the data in one data subset into the corresponding data interval ranges, and N is an integer greater than 1.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the sorting module is specifically configured to call N threads, sort data in the N unordered data partitions, and obtain N ordered data partitions, where each thread is used to sort data in one unordered data partition, and N is an integer greater than 1.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the building module is specifically used for calling N threads and carrying out index building on N ordered data partitions to obtain N index trees, wherein each thread is used for building a corresponding index tree according to the ordered data partitions, and N is an integer greater than 1;
and constructing a target index tree corresponding to the target field according to the N index trees, wherein the target index tree belongs to the database index.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the building module is specifically used for determining the maximum tree height among the N index trees;
according to the maximum tree height, adding root nodes to each index tree whose height is less than the maximum tree height, until N index trees of equal height are obtained;
and performing node merging on the N index trees of equal height to obtain a target index tree corresponding to the target field.
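The height-equalizing step can be illustrated with a toy tree in Python. This is a simplified sketch, not the node merging procedure of the application: the `Node` class and helper names are hypothetical, and the final step simply places one new root above the N equal-height trees instead of merging nodes level by level or handling page capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list = field(default_factory=list)
    children: list = field(default_factory=list)  # empty for a leaf


def height(node):
    """Number of layers from this node down to the leaves."""
    return 1 if not node.children else 1 + height(node.children[0])


def min_key(node):
    """Smallest key stored under this node."""
    return node.keys[0] if not node.children else min_key(node.children[0])


def leaf_keys(node):
    """All leaf keys under this node, left to right."""
    if not node.children:
        return list(node.keys)
    return [k for child in node.children for k in leaf_keys(child)]


def merge_index_trees(trees):
    """Pad shorter trees with extra roots, then join them under one root."""
    top = max(height(t) for t in trees)
    padded = []
    for tree in trees:
        # Add single-child root nodes until the tree reaches the top height.
        while height(tree) < top:
            tree = Node(keys=[], children=[tree])
        padded.append(tree)
    # Simplified combination: the smallest key of each later subtree
    # serves as the separator key in the new root.
    separators = [min_key(t) for t in padded[1:]]
    return Node(keys=separators, children=padded)
```

Padding first guarantees that all leaves of the combined tree sit at the same depth, which is the balance property a B+ tree requires. The input trees must be given in ascending key order, as they are when each tree is built from one ordered data partition.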
In one possible design, in another implementation of another aspect of an embodiment of the present application,
and the construction module is specifically used for calling the target thread and carrying out index construction on the N ordered data partitions to obtain a target index tree, wherein the target index tree belongs to the database index.
Another aspect of the present application provides an index creating apparatus, including:
the acquisition module is used for calling N threads to scan the original data set to obtain N data subsets, wherein each data subset comprises a data set corresponding to a target field, each data subset is obtained by scanning one thread, and N is an integer greater than 1;
the sorting module is used for calling N threads and sorting data in the N data subsets to obtain N first data sequences, wherein each thread in the N threads is used for sorting the data in one data subset;
the sorting module is further used for calling N/2 threads to perform two-way merge sort on the N first data sequences to obtain N/2 second data sequences, and continuing to merge until a target data sequence is obtained, wherein each of the N/2 threads merges two first data sequences;
and the construction module is used for constructing the database index corresponding to the target field according to the target data sequence.
Another aspect of the present application provides an index creating apparatus, including:
the acquisition module is used for calling N threads to scan the original data sets to obtain N first data sets, wherein each first data set comprises a data set corresponding to a target field, each first data set is obtained by scanning one thread, and N is an integer greater than 1;
the sorting module is used for calling N threads and sorting the data in the N first data sets to obtain N first data sequences, wherein each thread in the N threads is used for sorting the data in one first data set;
the determining module is used for determining (N-1) pivots corresponding to each first data sequence according to the N first data sequences, wherein each pivot is used for dividing data in the first data sequences;
a dividing module, configured to divide the N first data sequences into N second data sets according to (N-1) pivots;
the sorting module is further used for calling N threads to sort the data in the N second data sets to obtain N second data sequences, wherein each of the N threads sorts the data in one second data set;
and the construction module is used for constructing the database index corresponding to the target field according to the N second data sequences.
Another aspect of the present application provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the method of the above aspects when executing the computer program.
Another aspect of the application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the above-described aspects.
In another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method of the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiments of the application, a method for creating a database index is provided. To improve the parallelism of sorting, the data is divided according to ordered quantile point data, which yields a plurality of data partitions that are unordered internally but ordered relative to one another. A plurality of threads is then called to sort the data in the unordered data partitions in parallel, producing a plurality of ordered data partitions, so that an ordered data sequence is obtained for constructing the database index. In this way, on the one hand, the data in the unordered data partitions is sorted in parallel, which increases the parallelism of index creation and reduces the time needed to create the index. On the other hand, because the data partitions are ordered relative to one another, no external sort is needed, which further improves index creation efficiency.
Drawings
FIG. 1 is a schematic diagram of an architecture of a database index creation system according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a B+ tree in an embodiment of the present application;
FIG. 3 is a schematic flowchart of a database index creation method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of multi-threaded parallel scan sampling in an embodiment of the present application;
FIG. 5 is a diagram illustrating single-threaded serial scan sampling according to an embodiment of the present application;
FIG. 6 is a schematic diagram of partitioning according to quantile point data in an embodiment of the present application;
FIG. 7 is a schematic diagram of parallel construction of a target index tree in an embodiment of the present application;
FIG. 8 is a schematic diagram of padding the tree height of an index tree in an embodiment of the present application;
FIG. 9 is a schematic diagram of compressing the tree height of an index tree in an embodiment of the present application;
FIG. 10 is a schematic flow chart illustrating a database index creation method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of generating a target data sequence based on a two-way merging method in an embodiment of the present application;
FIG. 12 is a schematic flow chart illustrating a database index creation method according to an embodiment of the present application;
FIG. 13 is a schematic illustration of pivot determination based on data sets in an embodiment of the present application;
FIG. 14 is a diagram of an index creation apparatus according to an embodiment of the present application;
FIG. 15 is another diagram of an index creation apparatus in an embodiment of the present application;
FIG. 16 is another diagram of an index creation apparatus in an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The embodiments of the application provide a database index creating method, a related apparatus, a device, and a storage medium. The data in a plurality of unordered data partitions is sorted in parallel, and the data partitions are ordered relative to one another, so no external sort is needed and index creation efficiency is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Advances in database technology have provided powerful data organization, data management, and data storage capabilities for a variety of computer applications. With the advent of the big data and cloud computing era, data volumes are growing exponentially. While a database service system is running, an index often needs to be created for a data table. In a relational database, an index is a storage structure that sorts the values of one or more columns in a database table. It works like the table of contents of a book, from which the required content can be found quickly by page number. Since the present application relates to database technology, databases are described below.
A database can be regarded as an electronic filing cabinet, that is, a place for storing electronic files, in which a user can add, query, update, and delete data. A "database" is a collection of data that is stored together, can be shared by multiple users, has as little redundancy as possible, and is independent of applications. A Database Management System (DBMS) is computer software designed for managing databases, and generally provides basic functions such as storage, retrieval, security assurance, and backup. Database management systems may be classified by the database model they support, e.g., relational or Extensible Markup Language (XML); by the type of computer supported, e.g., server cluster or mobile phone; by the query language used, e.g., Structured Query Language (SQL) or XQuery; by performance emphasis, e.g., maximum scale or maximum operating speed; or by other schemes. Regardless of the classification used, some DBMSs span categories, for example by supporting multiple query languages simultaneously.
In machine learning, a branch of Artificial Intelligence (AI), data can be extracted according to an index for training. Machine Learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
To improve the efficiency of database index creation, the database index creating method can be applied to a Cloud Database (CDB) to accelerate database operations that require index creation, for example adding an index (ADD INDEX), optimizing a table (OPTIMIZE TABLE), or altering a table (ALTER TABLE). The improvement provided by the method is especially pronounced when the table holds a large volume of data.
The method provided by the present application is applied to the database index creating system shown in FIG. 1. As shown in the figure, the database index creating system includes a server and a terminal device, and a client is deployed on the terminal device. The client may run on the terminal device in the form of a browser or as an independent application (APP); the specific form of the client is not limited here. The server in the present application may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal device may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart television, a smart watch, a vehicle-mounted device, or a wearable device. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited here. The numbers of servers and terminal devices are also not limited. The scheme provided by the present application may be completed by the terminal device alone, by the server alone, or by the terminal device and the server in cooperation, which is not specifically limited here.
Illustratively, the server is built on the basis of a database system and has the characteristics of the database system, and the main functions comprise database management functions (such as system configuration and management, data access and update management, data integrity management and data security management), query and manipulation functions of the database (such as database retrieval and modification), database maintenance functions and database parallel operation. The user can import data or trigger data instructions and the like through the terminal equipment.
In view of the fact that this application relates to certain terms that are relevant to the art, the following explanation is provided for ease of understanding.
(1) MySQL: an open-source relational database management system. InnoDB is one of the MySQL database engines and is now the default storage engine for MySQL. Relational databases keep data in separate tables, which increases speed and flexibility. The Structured Query Language (SQL) used by MySQL is the most common standardized language for accessing databases.
(2) B+ tree: a self-balancing search tree that keeps data ordered. It is commonly used in databases and operating-system file systems, and it is the data structure InnoDB uses to organize table data files. A table typically consists of multiple B+ trees, in which each node corresponds to a page on disk. For ease of understanding, refer to FIG. 2, which is a schematic structural diagram of a B+ tree in an embodiment of the present application. As shown in the figure, the non-leaf nodes of a B+ tree store only key values and no data, whereas the leaf nodes store both key values and data.
In a B+ tree of order T, every internal node other than the root has between T/2 and T child nodes, and all leaf nodes are at the same level, so the tree is always height-balanced.
(3) Index: a structure that sorts the values of one or more columns in a database table. With an index, the database system can locate the records that satisfy a condition directly instead of scanning the whole table. In the InnoDB storage engine, one index corresponds to one B+ tree, and indexes are divided into clustered indexes (i.e., primary key indexes) and non-clustered indexes (i.e., secondary indexes).
(4) Primary key index (clustered index): in the corresponding B+ tree, non-leaf nodes store only primary key values, and leaf nodes store the contents of all columns of a row. If no primary key is explicitly specified, MySQL generates a hidden column (row_id) as the primary key column.
(5) Secondary index: also called an auxiliary index. The non-leaf and leaf nodes of the corresponding B+ tree store the indexed column values and the corresponding primary key values, and do not contain the full row record. Depending on whether index key values must be unique, secondary indexes are divided into unique indexes and non-unique indexes.
With reference to the above description, the database index creating method of the present application is described below. Referring to fig. 3, in an embodiment of the present application, the method may be executed by a computer device, which may be a server or a terminal device, and the method includes:
110. acquiring a sampling point data sequence from a target data set, wherein the target data set comprises a data set corresponding to a target field, and the sampling point data sequence comprises at least one sampling point data in an ascending order or a descending order;
In one or more embodiments, an original data set is first obtained, where the original data set includes a data set corresponding to at least one field. The original data set is then scanned as required; for example, all data containing the target field may be scanned to obtain the target data set. During scanning, a plurality of sampling point data may be taken from the original data set at a certain sampling interval, where each sampling point data is data corresponding to the target field.
It follows that the number of sampling point data is less than or equal to the number of data in the target data set, and that every sampling point data belongs to the target data set. The sampling point data are then arranged in ascending or descending order to obtain the sampling point data sequence.
In the present application, the "target field" may be one field or a field set composed of a plurality of fields, and is not limited herein.
120. Determining at least one quantile point data from the sampling point data sequence;
In one or more embodiments, at least one quantile point data is determined from the sampling point data sequence, where a quantile point data may be a sampling point data taken from the sequence, or may be a value derived from the sequence. Take the sampling point data sequence "10, 96, 290, 312, 571, 822, 1203" as an example, and assume that one quantile point data needs to be extracted. Two possible approaches are described below.
Illustratively, the middle sampling point data may be extracted directly from the sampling point data sequence as the quantile point data; for example, the sampling point data "312" is taken as the quantile point data.
Illustratively, the middle sampling point data "312" is extracted from the sampling point data sequence as a median value, and an interval of ± 10% is constructed around it, that is, the interval from 280.8 to 343.2. A value is then taken at random from this interval as the quantile point data.
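By way of illustration only, the two approaches above may be sketched as follows. The function names `middle_sample` and `jittered_quantile` and the `spread` parameter are hypothetical and do not appear in the application; a uniform random draw within the interval is also an assumption.

```python
import random

def middle_sample(sorted_samples):
    """Return the middle element of an ascending sampling point data sequence."""
    return sorted_samples[len(sorted_samples) // 2]

def jittered_quantile(sorted_samples, spread=0.10, rng=random):
    """Draw a value uniformly from the +/- spread interval around the middle sample.

    A uniform draw is an assumption; the text only says a value is taken at random.
    """
    m = middle_sample(sorted_samples)
    low, high = m * (1 - spread), m * (1 + spread)
    return rng.uniform(min(low, high), max(low, high))

samples = [10, 96, 290, 312, 571, 822, 1203]
print(middle_sample(samples))  # 312
q = jittered_quantile(samples, rng=random.Random(0))
print(q)  # some value between roughly 280.8 and 343.2
```

The first function corresponds to taking "312" directly; the second corresponds to drawing a value from the interval 280.8 to 343.2.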
130. Dividing the target data set into at least two unordered data partitions according to the at least one quantile point data;
In one or more embodiments, the target data set is partitioned according to the at least one quantile point data, resulting in at least two unordered data partitions. K quantile point data divide the target data set into (K+1) unordered data partitions. In general, the number of data partitions can be determined from the number of available threads. To bring thread utilization to 100%, the number of data partitions may be set equal to the number of available threads, and the number of quantile points is then the number of data partitions minus 1.
140. Calling at least two threads to sort the data in the at least two unordered data partitions to obtain at least two ordered data partitions, wherein the ordered data partitions and the unordered data partitions are in one-to-one correspondence;
In one or more embodiments, because the number of available threads is taken into account when dividing the data partitions, the resulting number of unordered data partitions equals the number of available threads. On this basis, one thread is called for each unordered data partition, so that N threads sort N unordered data partitions in parallel. After sorting is finished, N ordered data partitions are obtained.
150. Constructing a database index corresponding to the target field according to the at least two ordered data partitions.
In one or more embodiments, once at least two ordered data partitions have been obtained, the sorting of the data for the target field is complete, and a database index corresponding to the target field can be constructed from the sorted data. The database index is an ordered index structure.
It should be noted that the database index may be specifically represented in the form of an index tree. The types of index trees referred to in this application include, but are not limited to, B+ trees and B-trees. It is understood that steps 110 to 150 construct the database index corresponding to one field. In practical applications, database indexes corresponding to other fields can be constructed in a similar manner, and database indexes corresponding to different fields can be constructed in parallel by a plurality of threads.
In the embodiment of the application, a method for creating a database index is provided. On the one hand, the data in the unordered data partitions are sorted in parallel, which increases the parallelism of index creation and reduces the time needed to create the index. On the other hand, because the data partitions are already ordered relative to one another, no external sort across partitions is needed, which further improves index creation efficiency.
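By way of illustration only, steps 110 to 150 may be sketched end to end as follows. This is a minimal sketch under stated assumptions: the function name `build_sorted_keys` is hypothetical, the target data set is held in memory as a plain list, the quantile points are taken at integer multiples of a rounded-down quotient, and the final sorted list merely stands in for the ordered data from which an index tree would be built. In CPython a thread pool shows the structure of the parallel step rather than a guaranteed speedup.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def build_sorted_keys(values, num_threads=3, sampling_interval=2):
    """Sample, pick quantile points, partition, sort in parallel, concatenate."""
    if num_threads < 2:
        raise ValueError("at least two threads are expected")
    # Step 110: sample at a fixed interval and sort the samples ascending.
    samples = sorted(values[::sampling_interval])
    # Step 120: K = partitions - 1 quantile points (assumed selection rule).
    k = num_threads - 1
    quotient = len(samples) // k
    if quotient < 1:
        raise ValueError("too few samples for the requested partitions")
    quantiles = [samples[i * quotient - 1] for i in range(1, k + 1)]
    # Step 130: route each value to its range; values equal to a quantile
    # point go to the lower partition here.
    partitions = [[] for _ in range(num_threads)]
    for v in values:
        partitions[bisect_left(quantiles, v)].append(v)
    # Step 140: one thread sorts one unordered data partition.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        ordered = list(pool.map(sorted, partitions))
    # Step 150: partitions are ordered relative to one another, so plain
    # concatenation is already globally ordered.
    return [v for part in ordered for v in part]

values = [571, 10, 1203, 96, 822, 290, 312, 45, 700]
print(build_sorted_keys(values) == sorted(values))  # True
```

Because every value in one partition is no larger than every value in the next, no merge step is needed after the per-partition sorts.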
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, acquiring a sample point data sequence derived from a target data set may specifically include:
when S threads are called to scan an original data set, sampling the data corresponding to the target field according to a preset sampling interval, wherein the original data set comprises the data set corresponding to the target field, and S is an integer greater than 1;
when the scanning of the original data set is finished, obtaining a target data set and at least one sampling point data to be sorted, wherein the target data set comprises at least one data subset, and each data subset is obtained by the scanning of one thread;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one or more embodiments, a way of calling multiple threads to scan and sample data in parallel is presented. As can be appreciated from the foregoing embodiments, scanning the primary key index generates an unordered secondary index data file. The parallel scanning stage has two main tasks. One task is to scan the primary key records for each index to be created and generate a data file of "<secondary index, primary key index>" pairs, where "<secondary index, primary key index>" denotes the combination of the secondary index column and the primary key index column. The other task is data sampling. In a database, scanning refers to the process by which the database server searches the records of a table until all records that meet a given condition are returned.
Specifically, for ease of understanding, please refer to fig. 4, which is a schematic diagram of multi-thread parallel scan sampling in the embodiment of the present application. As shown in the figure, taking parallel scanning by 3 threads (i.e., S equals 3) as an example, 3 threads may be called to scan three different indexes in the original data set in parallel, thereby obtaining a data set corresponding to each index. Meanwhile, the data under each index can be sampled at a preset sampling interval to obtain at least one sampling point data to be sorted for each index. For example, when field A is used as the index column, field A is the target field. For another example, when field A and field B are both used as index columns, field A and field B together constitute the target field.
Based on this, after the scanning of the original data set is completed, the data set for the S indexes can be obtained. For ease of understanding, please refer to table 1, where table 1 is an illustration of an original data set.
TABLE 1
Identification    Name          Purchase amount
0                 Xiao Zhao     5000
1                 Xiao Qian     5000
2                 Xiao Sun      10000
3                 Xiao Li       8000
4                 Xiao Zhou     6000
5                 Xiao Wu       5000
6                 Xiao Zheng    12000
Based on the original data set shown in table 1, assume that only an index on the "name" column needs to be constructed. The target field is then "name", and the original data set necessarily includes the data set corresponding to the target field. At least one thread may then scan the original data set and extract all "<name, identification>" pairs, generating data that contains only "<name, identification>", which is written to a file; this file is the target data set corresponding to the target field. Alternatively, assuming that 4 threads scan simultaneously for the index on the "name" column, 4 data subsets are obtained, and these 4 data subsets form the target data set.
For ease of understanding, please refer to table 2, which is an illustration of a target data set.
TABLE 2
Name          Identification
Xiao Li       3
Xiao Qian     1
Xiao Sun      2
Xiao Wu       5
Xiao Zhao     0
Xiao Zheng    6
Xiao Zhou     4
Similarly, if an index on the "purchase amount" column also needs to be built, another thread is called to scan in parallel and extract all "<purchase amount, identification>" pairs, generating data that contains only "<purchase amount, identification>", which is then written to another file for subsequent processing.
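By way of illustration only, extracting the "<target field, identification>" pairs from rows shaped like those of table 1 may be sketched as follows. The names `ROWS`, `COLUMNS` and `project` are hypothetical, the name spellings are illustrative romanizations, and the rows are held in memory rather than read from a primary key index.

```python
# Rows shaped like table 1: (identification, name, purchase amount).
ROWS = [
    (0, "Xiao Zhao", 5000),
    (1, "Xiao Qian", 5000),
    (2, "Xiao Sun", 10000),
    (3, "Xiao Li", 8000),
    (4, "Xiao Zhou", 6000),
    (5, "Xiao Wu", 5000),
    (6, "Xiao Zheng", 12000),
]
COLUMNS = ("identification", "name", "purchase_amount")

def project(rows, target_field):
    """Return unordered (target field value, identification) pairs for one index."""
    col = COLUMNS.index(target_field)
    return [(row[col], row[0]) for row in rows]

name_pairs = project(ROWS, "name")                # data for the "name" index
amount_pairs = project(ROWS, "purchase_amount")   # data for the "purchase amount" index
print(amount_pairs[:3])  # [(5000, 0), (5000, 1), (10000, 2)]
```

Each resulting list plays the role of one unordered data file; in the described method each such file would be written out and processed separately.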
It should be noted that, for convenience of introduction, the target data set is taken as an example for description, and it is understood that data sets constructed based on other indexes are also processed in a similar manner, and are not described herein again.
While scanning the data, a thread can also sample the data under the target field at a preset sampling interval. One possible way is to extract data as sampling point data at a fixed sampling interval. Another way is to extract data at random sampling intervals: before each sampling, a value is chosen at random from a reasonable range and used as the sampling interval for the next sampling point data. Either way yields at least one sampling point data to be sorted. The sampling point data can then be sorted from large to small (i.e., descending order) or from small to large (i.e., ascending order), which yields the sampling point data sequence. The present application uses ascending order as the example, which should not be construed as limiting the present application.
It will be appreciated that the number of sampling point data is typically determined by the size of the available memory. For example, if only memory of size W is available and each record has size R, then W/R records are sampled.
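By way of illustration only, the memory-based sample count and the two sampling manners may be sketched as follows. All function names are hypothetical, and deriving the fixed interval from the budget by rounding up is an assumption rather than something stated in the application.

```python
import math
import random

def sample_budget(memory_bytes, record_bytes):
    """W / R: how many sampled records fit in the available memory."""
    return memory_bytes // record_bytes

def fixed_interval_samples(records, interval):
    """Take the interval-th, 2*interval-th, ... record."""
    return records[interval - 1::interval]

def random_interval_samples(records, min_gap, max_gap, rng):
    """Before each sample, draw a fresh gap from [min_gap, max_gap]."""
    samples, pos = [], -1
    while True:
        pos += rng.randint(min_gap, max_gap)
        if pos >= len(records):
            return samples
        samples.append(records[pos])

records = list(range(1, 101))                                # 100 scanned records
budget = sample_budget(memory_bytes=160, record_bytes=16)    # 10 records fit
interval = math.ceil(len(records) / budget)                  # assumed rule: 10
fixed = fixed_interval_samples(records, interval)
print(fixed)  # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
varied = random_interval_samples(records, 5, 15, random.Random(1))
sample_sequence = sorted(fixed)                              # ascending order
```

With a fixed interval the number of samples is known in advance, whereas with random gaps it varies between runs, so a memory budget would have to be enforced separately.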
Secondly, in the embodiment of the application, a way of calling multiple threads to scan and sample data in parallel is provided. In this way, the original data set is scanned by multiple threads in parallel. The core idea is to split the range of the index tree corresponding to the primary key index and let different threads scan the data in each sub-range in parallel, which accelerates scanning. Meanwhile, parallel scanning can also be applied within each index to obtain a plurality of data subsets, which further improves data scanning efficiency.
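By way of illustration only, splitting the primary key range among threads may be sketched as follows. This sketch assumes integer primary keys and models the table as an in-memory dictionary; the names `split_key_range`, `scan_range` and `parallel_scan` are hypothetical and do not correspond to any InnoDB interface.

```python
from concurrent.futures import ThreadPoolExecutor

def split_key_range(low, high, parts):
    """Split the half-open key range [low, high) into contiguous sub-ranges."""
    size, extra = divmod(high - low, parts)
    ranges, start = [], low
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def scan_range(table, key_range, column):
    """Scan one sub-range and emit (column value, primary key) pairs."""
    start, end = key_range
    return [(table[pk][column], pk) for pk in range(start, end) if pk in table]

def parallel_scan(table, column, num_threads):
    """Each thread scans one sub-range; each result is one data subset."""
    ranges = split_key_range(min(table), max(table) + 1, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda r: scan_range(table, r, column), ranges))

table = {pk: {"amount": (pk * 37) % 100} for pk in range(10)}
subsets = parallel_scan(table, "amount", 3)
print(split_key_range(0, 10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

The data subsets together form the target data set; the sub-ranges are disjoint, so no record is scanned twice.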
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, acquiring a sample point data sequence derived from a target data set may specifically include:
when a target thread is called to scan an original data set, sampling the data corresponding to the target field according to a preset sampling interval, wherein the original data set comprises the data set corresponding to the target field;
when the scanning of the original data set is finished, obtaining a target data set and at least one sampling point data to be sorted;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one or more embodiments, a way of calling a single thread to scan and sample data is presented. As can be appreciated from the foregoing embodiments, scanning the primary key index generates an unordered secondary index data file. The scanning stage has two main tasks: one is to scan the primary key records for each index to be created, and the other is data sampling.
Specifically, for ease of understanding, please refer to fig. 5, which is a schematic diagram of single-thread serial scan sampling in the embodiment of the present application. As shown in the figure, taking scanning by one thread (i.e., the target thread) as an example, the target thread may be called to scan three different indexes in the original data set in sequence, thereby obtaining a data set corresponding to each index. Meanwhile, the data under each index can be sampled at a preset sampling interval to obtain at least one sampling point data to be sorted for each index. Any field may be designated as the index column of an index, which is not limited herein.
Based on this, after the target thread finishes scanning the original data set, the target data set corresponding to the target field can be obtained. Optionally, data sets corresponding to other fields may also be obtained. It should be noted that, for convenience of introduction, the target data set is used as an example for description in the present application, and it may be understood that data sets constructed based on other indexes are also processed in a similar manner, and are not described herein again.
While scanning the data, the target thread can also sample the data under the target field at a preset sampling interval. One possible way is to extract data as sampling point data at a fixed sampling interval; another way is to extract data as sampling point data at random sampling intervals. Either way yields at least one sampling point data to be sorted. The sampling point data can then be sorted from large to small (i.e., descending order) or from small to large (i.e., ascending order), which yields the sampling point data sequence. The present application uses ascending order as the example, which should not be construed as limiting the present application.
Secondly, in the embodiment of the application, a way of calling a single thread to scan and sample data is provided. In this way, the original data set is scanned serially by a single thread, so the scanning work can still be completed when too few threads are available, which improves the flexibility and operability of data scanning.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, acquiring a sample point data sequence derived from a target data set may specifically include:
calling S threads to scan an original data set to obtain a target data set, wherein the original data set comprises a data set corresponding to a target field, the target data set comprises at least one data subset, each data subset is obtained by the scanning of one thread, and S is an integer greater than 1;
calling at least two threads to scan the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one or more embodiments, a way of calling multiple threads to first scan in parallel and then sample data in parallel is presented. As can be appreciated from the foregoing embodiments, scanning the primary key index generates an unordered secondary index data file. The scanning stage has two main tasks: first, the primary key records are scanned for each index to be created, and then data sampling is carried out.
Specifically, taking parallel scanning by 3 threads (i.e., S equals 3) as an example, 3 threads may be used to scan three different indexes in the original data set in parallel, thereby obtaining a data set corresponding to each index. Illustratively, taking parallel scanning by 12 threads (i.e., S equals 12) as an example, 12 threads may be called to scan three different indexes in the original data set in parallel, with each index assumed to be scanned by 4 threads in parallel, thereby obtaining a data set corresponding to each index. In that case the target data set includes 4 data subsets, each obtained by the scanning of one thread.
When the scanning of the original data set is finished, the target data set corresponding to the target field is obtained. Optionally, data sets corresponding to other fields may also be obtained. It should be noted that, for convenience of introduction, the target data set is taken as an example for description; it is understood that data sets constructed for other indexes are processed in a similar manner, which is not described herein again.
After the target data set is obtained, at least two threads can be called to scan the target data set at a preset sampling interval. For example, one thread starts from the first data in the target data set and scans toward the end, while the other thread starts from the last data in the target data set and scans toward the beginning. Scanning ends when the portions scanned by the two threads meet, thereby obtaining at least one sampling point data to be sorted.
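By way of illustration only, sampling with two threads that start from opposite ends may be sketched as follows. Fixing the meeting point at the midpoint in advance is an assumption made to keep the sketch deterministic; the application only states that scanning ends when the scanned portions meet. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def sample_forward(data, interval, stop):
    """Sample from the first data toward the end, stopping before index `stop`."""
    return [data[i] for i in range(0, stop, interval)]

def sample_backward(data, interval, stop):
    """Sample from the last data toward the beginning, down to index `stop`."""
    return [data[i] for i in range(len(data) - 1, stop - 1, -interval)]

def two_ended_samples(data, interval):
    """Run both samplers concurrently and return the ascending sample sequence."""
    mid = len(data) // 2  # assumed meeting point of the two scans
    with ThreadPoolExecutor(max_workers=2) as pool:
        front = pool.submit(sample_forward, data, interval, mid)
        back = pool.submit(sample_backward, data, interval, mid)
        return sorted(front.result() + back.result())

data = [571, 10, 1203, 96, 822, 290, 312, 45]
print(two_ended_samples(data, 2))  # [45, 290, 571, 1203]
```

Because the forward scan only touches indexes below the meeting point and the backward scan only touches indexes at or above it, no data is sampled twice.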
One possible way is to extract data as sampling point data at a fixed sampling interval; another way is to extract data as sampling point data at random sampling intervals. Either way yields at least one sampling point data to be sorted. The sampling point data can then be sorted from large to small (i.e., descending order) or from small to large (i.e., ascending order), which yields the sampling point data sequence. The present application uses ascending order as the example, which should not be construed as limiting the present application.
Secondly, in the embodiment of the application, a way of calling multiple threads to first scan in parallel and then sample data in parallel is provided. In this way, the original data set is scanned by multiple threads in parallel. The core idea is to split the range of the index tree corresponding to the primary key index and let different threads scan the data in each sub-range in parallel, which accelerates scanning. Each index is then scanned in parallel to obtain a plurality of data subsets, which provides a feasible way of implementing the scheme.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, acquiring a sample point data sequence derived from a target data set may specifically include:
calling a target thread to scan an original data set to obtain a target data set, wherein the original data set comprises a data set corresponding to a target field;
calling the target thread to scan the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and arranging the at least one sampling point data to be sorted in ascending or descending order to obtain the sampling point data sequence.
In one or more embodiments, a way of calling a single thread to first scan and then sample data is presented. As can be appreciated from the foregoing embodiments, scanning the primary key index generates an unordered secondary index data file. The scanning stage has two main tasks: first, the primary key records are scanned for each index to be created, and then data sampling is carried out.
Specifically, taking scanning by one thread (i.e., the target thread) as an example, the target thread may be called to scan different indexes in the original data set, thereby obtaining a data set corresponding to each index. To obtain the target data set, the target thread scans the data corresponding to the target field in the original data set. It should be noted that, for convenience of introduction, the target data set is taken as an example for description; it is understood that data sets constructed for other indexes are processed in a similar manner, which is not described herein again.
After the target data set is obtained, the target thread can be called to scan the target data set at a preset sampling interval, for example starting from the first data in the target data set and scanning toward the end, or starting from the last data and scanning toward the beginning. When the scanning is finished, at least one sampling point data to be sorted is obtained.
One possible way is to extract data as sampling point data at a fixed sampling interval; another way is to extract data as sampling point data at random sampling intervals. Either way yields at least one sampling point data to be sorted. The sampling point data can then be sorted from large to small (i.e., descending order) or from small to large (i.e., ascending order), which yields the sampling point data sequence. The present application uses ascending order as the example, which should not be construed as limiting the present application.
Secondly, in the embodiment of the application, a way of calling a single thread to first scan and then sample data is provided. In this way, the original data set is scanned serially by a single thread, so the scanning work can still be completed when too few threads are available, which improves the flexibility and operability of data scanning.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, determining at least one quantile point data from the sampling point data sequence may specifically include:
determining the total number of sampling point data according to the sampling point data sequence;
determining the total amount of quantile point data according to the total amount of data partitions to be divided;
dividing the total number of sampling point data by the total number of quantile point data to obtain a target quotient value;
and extracting, at integer multiples of the target quotient value, sampling point data from the sampling point data sequence as quantile point data to obtain at least one quantile point data.
In one or more embodiments, a way of extracting quantile point data from a sampling point data sequence is presented. As can be seen from the foregoing embodiments, K quantile point data divide the target data set into (K+1) unordered data partitions. For the data set corresponding to each index, the corresponding sampling point data sequence can be obtained, and at least one quantile point data is then determined from that sequence. The process of determining at least one quantile point data from a sampling point data sequence is described below, taking the target data set as an example.
Exemplarily, the data obtained after sampling is assumed to be:
290,571,822,312,10,1203,96
the sampling point data sequence obtained after ascending order arrangement is as follows:
10,96,290,312,571,822,1203
Based on this, the total number of sampling point data is determined from the sampling point data sequence to be 7. If the total number of data partitions to be divided is 3, the total number of quantile point data is 2, that is, the total number of data partitions minus 1. The total number of sampling point data (e.g., 7) is then divided by the total number of quantile point data (e.g., 2), and the result is rounded down to obtain the target quotient value (e.g., 3). Sampling point data are then extracted from the sampling point data sequence as quantile point data at integer multiples of the target quotient value. With a target quotient value of 3, the 3rd sampling point data (i.e., 290) is extracted as one quantile point data, and the 6th sampling point data (i.e., 822) is extracted as the other. The two quantile point data may be considered to represent the global data distribution.
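By way of illustration only, the quotient-based extraction in this example may be sketched as follows. The function name `extract_quantile_points` is hypothetical, and rounding the quotient down is inferred from the example values (7 divided by 2 giving 3).

```python
def extract_quantile_points(sample_sequence, num_partitions):
    """Pick quantile point data at integer multiples of the target quotient value."""
    k = num_partitions - 1                   # total number of quantile point data
    if k < 1:
        raise ValueError("at least two data partitions are expected")
    quotient = len(sample_sequence) // k     # target quotient value, rounded down
    if quotient < 1:
        raise ValueError("not enough sampling point data")
    # The (i * quotient)-th item, counted from 1, sits at index i * quotient - 1.
    return [sample_sequence[i * quotient - 1] for i in range(1, k + 1)]

raw = [290, 571, 822, 312, 10, 1203, 96]
sequence = sorted(raw)                       # 10, 96, 290, 312, 571, 822, 1203
print(extract_quantile_points(sequence, 3))  # [290, 822]
```

Since the sequence is ascending and the selected positions increase, the returned quantile point data are themselves in ascending order.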
Secondly, in the embodiment of the application, a way of extracting quantile point data from a sampling point data sequence is provided. In this way, quantile point data are extracted from the sampling point data sequence at reasonable intervals, which makes the data distribution across partitions more uniform to a certain extent, avoids unbalanced data volumes among the unordered data partitions, and therefore better suits parallel processing.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, dividing the target data set into at least two unordered data partitions according to the at least one quantile point data may specifically include:
determining at least two data interval ranges according to the at least one quantile point data;
and calling a target thread, and dividing data corresponding to a target field in a target data set into corresponding data interval ranges to obtain at least two unordered data partitions.
In one or more embodiments, a manner of implementing data partitioning based on single threading is presented. As can be seen from the foregoing embodiments, the target data set may be divided into (K +1) unordered data partitions according to the K quantile data, where each unordered data partition corresponds to a data interval range.
Specifically, in connection with the above example, assume that the sample point data sequence is:
10,96,290,312,571,822,1203
Assuming that the total number of data partitions to be divided is 3, two quantile point data, 290 and 822, can be extracted from the sampling point data sequence, which gives three data interval ranges. The first data interval range is less than or equal to 290. The second data interval range is greater than 290 and less than or equal to 822. The third data interval range is greater than 822. One thread (i.e., the target thread) may then be called to divide the data corresponding to the target field in the target data set. The target thread may divide the data starting from the first data in the target data set and working toward the end, or starting from the last data and working toward the beginning. After the data corresponding to the target field in the target data set have been divided, the unordered data partition corresponding to each data interval range is obtained.
Taking a quantile point data Q1 as an example, in the first division manner one data interval range is less than or equal to Q1 and the other is greater than Q1. In the second division manner, one data interval range is less than Q1 and the other is greater than or equal to Q1. It is to be understood that the present application may adopt either the first or the second division manner, which is not limited herein.
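By way of illustration only, the two division manners may be sketched with a binary search over the quantile point data. The names `partition_index`, `partition` and `upper_inclusive` are hypothetical; `bisect_left` and `bisect_right` are functions of Python's standard `bisect` module.

```python
from bisect import bisect_left, bisect_right

def partition_index(value, quantile_points, upper_inclusive=True):
    """Return the index of the data interval range that `value` falls into.

    upper_inclusive=True  -> first manner:  (.., Q1], (Q1, Q2], (Q2, ..)
    upper_inclusive=False -> second manner: (.., Q1), [Q1, Q2), [Q2, ..)
    """
    if upper_inclusive:
        return bisect_left(quantile_points, value)
    return bisect_right(quantile_points, value)

def partition(values, quantile_points, upper_inclusive=True):
    """Divide values into len(quantile_points) + 1 unordered data partitions."""
    parts = [[] for _ in range(len(quantile_points) + 1)]
    for v in values:
        parts[partition_index(v, quantile_points, upper_inclusive)].append(v)
    return parts

data = [571, 290, 10, 1203, 822, 96, 312]
print(partition(data, [290, 822]))
# [[290, 10, 96], [571, 822, 312], [1203]]
print(partition(data, [290, 822], upper_inclusive=False))
# [[10, 96], [571, 290, 312], [1203, 822]]
```

The two manners differ only in where values equal to a quantile point data are placed; whichever is chosen should be applied consistently to every data in the target data set.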
Secondly, in the embodiment of the application, a way of partitioning data with a single thread is provided. In this way, after the quantile point data have been generated for each index, the data corresponding to the target field in the target data set are divided into the corresponding data interval ranges according to the quantile point data. Because the sampling point data sequence is ordered, the extracted quantile point data are also ordered, and the unordered data partitions are therefore ordered relative to one another. The external sorting stage is consequently simple: multiple threads sort the data inside the unordered data partitions completely independently and in parallel. The finally generated ordered data partitions are thus globally ordered, and sorting efficiency is improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, dividing the target data set into at least two unordered data partitions according to the at least one quantile point data may specifically include:
determining at least two data interval ranges according to the at least one quantile point data;
and calling N threads to divide the data corresponding to the target field in N data subsets into the corresponding data interval ranges to obtain at least two unordered data partitions, wherein the N data subsets belong to the target data set, each thread is used for dividing the data in one data subset into the corresponding data interval ranges, and N is an integer greater than 1.
In one or more embodiments, a manner of implementing data partitioning based on multiple threads is presented. As can be seen from the foregoing embodiments, the target data set may be divided into (K +1) unordered data partitions according to the K quantile data, where each unordered data partition corresponds to a data interval range.
Specifically, in connection with the above example, assume that the sample point data sequence is:
10,96,290,312,571,822,1203
Assuming that the total number of data partitions to be divided is 3, two quantile point data, 290 and 822, can be extracted from the sampling point data sequence, which gives three data interval ranges. The first data interval range is less than or equal to 290. The second data interval range is greater than 290 and less than or equal to 822. The third data interval range is greater than 822. N threads (e.g., 3 threads) may then be called to divide the data corresponding to the target field in the target data set.
For ease of understanding, please refer to fig. 6, which is a schematic diagram of partitioning according to the quantile point data in the embodiment of the present application. As shown in the figure, it is assumed that the target data set includes 3 data subsets, each obtained by the scanning of one thread. N threads (i.e., 3 threads) may then be called to divide the data corresponding to the target field in the 3 data subsets respectively. Illustratively, thread 1 divides the data in data subset 1, thread 2 divides the data in data subset 2, and thread 3 divides the data in data subset 3, each obtaining data within the three data interval ranges. The white area represents the first data interval range, the gray area represents the second data interval range, and the black area represents the third data interval range.
Based on this, the local partition results obtained by processing each thread are merged to obtain a global unordered data partition, which is to say 3 unordered data partitions as illustrated in fig. 6. All data in the unordered data partition 1 is smaller than the first partition data. All data in the unordered data partition 2 is greater than or equal to the first quantile point data and less than the second quantile point data. All data in the unordered data partition 3 is larger than the second sub-bitmap data.
It should be noted that, the dividing of the data interval range based on the quantile point data is as described in the foregoing embodiments, and details are not described here.
Secondly, in the embodiment of the application, a mode for realizing data partitioning based on multiple threads is provided. By this method, after the quantile point data are generated for each index, a plurality of threads can be used to divide the data corresponding to the target field in the target data set into the corresponding data interval ranges in parallel according to the quantile point data, thereby improving index creation efficiency. In addition, because the sampling point data sequence is ordered, the extracted quantile point data is also ordered. Thus, the unordered data partitions are ordered with respect to one another, which simplifies the external sorting stage and improves sorting efficiency.
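The partitioning described above can be sketched as follows. This is a minimal illustration rather than the implementation of the present application: the function names `partition_subset` and `partition_parallel` and the contents of the three data subsets are assumptions introduced here, while the quantile point data 290 and 822 come from the example above. Python threads are used only to show the structure of the work split and do not by themselves guarantee a CPU speedup.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def partition_subset(subset, quantiles):
    """Split one thread's data subset into len(quantiles) + 1 local buckets."""
    buckets = [[] for _ in range(len(quantiles) + 1)]
    for value in subset:
        # bisect_left gives the first interval whose upper bound is >= value:
        # bucket 0 holds value <= q[0], bucket 1 holds q[0] < value <= q[1], ...
        buckets[bisect_left(quantiles, value)].append(value)
    return buckets

def partition_parallel(subsets, quantiles):
    """One thread per data subset, then merge the local buckets per interval."""
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        local = list(pool.map(lambda s: partition_subset(s, quantiles), subsets))
    return [
        [value for buckets in local for value in buckets[i]]
        for i in range(len(quantiles) + 1)
    ]

# Hypothetical data subsets, one per scanning thread.
subsets = [[312, 10, 900], [290, 822, 1203], [96, 571, 5]]
quantiles = [290, 822]  # quantile point data from the example above
partitions = partition_parallel(subsets, quantiles)
```

Each resulting partition is internally unordered, but every value in one partition is no greater than every value in the next.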
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, at least two threads are called to sort data in at least two unordered data partitions, so as to obtain at least two ordered data partitions, which may specifically include:
calling N threads to sort the data in the N unordered data partitions to obtain N ordered data partitions, wherein each thread is used for sorting the data in one unordered data partition, and N is an integer greater than 1.
In one or more embodiments, a manner of implementing parallel ordering based on multiple threads is presented. As can be seen from the foregoing embodiments, after N out-of-order data partitions are obtained, the same number of threads may be invoked to order the data within the N out-of-order data partitions.
Specifically, assuming that N is 3, after 3 unordered data partitions are obtained, thread 1 may be invoked to sort the data corresponding to the target field in unordered data partition 1, and ordered data partition 1 is obtained after the sorting is completed. Similarly, thread 2 is invoked to sort the data corresponding to the target field in unordered data partition 2, and ordered data partition 2 is obtained after the sorting is completed. Thread 3 is invoked to sort the data corresponding to the target field in unordered data partition 3, and ordered data partition 3 is obtained after the sorting is completed. It should be noted that thread 1, thread 2, and thread 3 work in parallel, so that the data sorting efficiency can be improved.
As can be seen, once the data within each data partition has been sorted, the ordered data partitions are obtained. Because the data partitions are already ordered with respect to one another, the finally generated N ordered data partitions are also globally ordered.
Secondly, in the embodiment of the application, a mode for realizing parallel sorting based on multiple threads is provided. By the method, after at least two unordered data partitions are obtained according to the quantile point data division, one thread can be respectively adopted for sorting aiming at the data in each unordered data partition. Therefore, parallel sequencing is carried out on the multiple unordered data partitions, the parallelism degree of index creation is improved, and the efficiency of index creation is improved.
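The parallel sorting described above can be sketched as follows, assuming the unordered data partitions are already ordered with respect to one another. The helper name `sort_partitions` and the sample values are hypothetical and are used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_partitions(unordered_partitions):
    """Sort each unordered data partition in its own thread."""
    with ThreadPoolExecutor(max_workers=len(unordered_partitions)) as pool:
        return list(pool.map(sorted, unordered_partitions))

# Hypothetical unordered data partitions (N = 3), ordered among themselves.
unordered = [[10, 290, 96, 5], [312, 822, 571], [900, 1203]]
ordered = sort_partitions(unordered)

# Splicing the ordered partitions in partition order needs no further sorting.
spliced = [value for partition in ordered for value in partition]
```

Because no value in one partition exceeds any value in the next, the spliced result is already globally ordered.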
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, constructing the database index corresponding to the target field according to at least two ordered data partitions specifically may include:
calling N threads, and performing index construction on N ordered data partitions to obtain N index trees, wherein each thread is used for constructing a corresponding index tree according to the ordered data partitions, and N is an integer greater than 1;
and constructing a target index tree corresponding to the target field according to the N index trees, wherein the target index tree belongs to the database index.
In one or more embodiments, a way to implement parallel treeing based on multiple threads is presented. It can be seen from the foregoing embodiment that N ordered data partitions have been obtained in the previous stage, based on which, N threads may be invoked to construct their corresponding index trees for the N ordered data partitions, and finally, the index trees are synthesized into one target index tree, which is a database index for a target field. The construction of the B + tree is taken as an example for explanation, and it can be understood that in practical application, the construction of other ordered data structures is also supported.
Specifically, for convenience of understanding, please refer to fig. 7, fig. 7 is a schematic diagram of parallel construction of a target index tree in the embodiment of the present application, and as shown in the figure, it is assumed that there are 3 ordered data partitions, namely, an ordered data partition 1, an ordered data partition 2, and an ordered data partition 3, and based on this, a thread 1 may be invoked to construct a corresponding index tree 1 for data in the ordered data partition 1. Thread 2 may be invoked to build a corresponding index tree 2 for the data within the ordered data partition 2. Thread 3 may be invoked to build a corresponding index tree 3 for the data within the ordered data partition 3. Index tree 1, index tree 2, and index tree 3 are three subtrees of the target index tree, and thus, the three index trees are merged to form one target index tree.
It will be appreciated that in the stage of building the B + tree (i.e., the target index tree), the N threads are respectively responsible for the N ordered data partitions produced by the sorting stage, and respectively build sub B + trees (i.e., the index trees) for them. Since the N ordered data partitions are globally ordered, merging all the sub B + trees (i.e., the N index trees) yields an overall B + tree (i.e., the target index tree), which is the desired database index structure.
Secondly, in the embodiment of the application, a mode for realizing parallel creation of the index tree based on multithreading is provided. By this method, the index trees are created in parallel by multiple threads, so that the parallelism of index creation is improved and the efficiency of index creation is improved. Meanwhile, multithreaded parallelism can be applied when scanning the whole primary key index file, when sorting each unordered data partition, and when constructing the index trees. Therefore, for database operations that involve index creation or rebuilding, the time for creating the index can be significantly reduced, thereby reducing the latency of user operations.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, constructing a target index tree corresponding to a target field according to N index trees may specifically include:
determining the highest tree layer number according to the N index trees;
according to the highest tree layer number, adding root nodes to the index trees with the tree layer number smaller than the highest tree layer number in the N index trees until N index trees with the same tree layer number are obtained;
and carrying out node combination processing on the N index trees with the equal tree layer number to obtain a target index tree corresponding to the target field.
In one or more embodiments, a manner of merging multiple index trees is presented. As can be seen from the foregoing embodiments, after N index trees are constructed, the index trees need to be further processed according to the characteristics of the target index tree (for example, a characteristic of the B + tree is that adjacent left and right nodes on the same layer can be connected by pointers).
Specifically, each layer of the N index trees may be horizontally concatenated. If the tree heights of the index trees are inconsistent, the index trees of insufficient height are padded to the height of the highest index tree, and then a new tree root is generated. Padding the tree height can be realized by adding a root node. Finally, an attempt is made to compress the tree height at the root level of the index tree. To compress the tree height, it is determined whether the child nodes under the root node can be merged into the root node; if their data can be stored in one node, the tree height can be compressed.
Illustratively, the manner in which the tree height is padded will be described below in conjunction with FIG. 8. Assume that the highest number of tree layers determined among the N index trees is three. For convenience of understanding, please refer to fig. 8, which is a schematic diagram of padding the height of an index tree in the embodiment of the present application. As shown in the figure, the index tree on the left side has only two layers, so the original root node can be copied, and the copied node is used as the new root node of the index tree. For example, the original root node is "258", and after the tree height is padded, the root node is still "258".
Illustratively, the manner in which the tree height is compressed will be described below in conjunction with FIG. 9. For easy understanding, please refer to fig. 9, which is a schematic diagram of compressing the height of an index tree in the embodiment of the present application. As shown in the figure, it is assumed that the upper diagram is the index tree obtained by merging two index trees. Based on this, node "268" and node "1115" may be merged, and thus leaf node "8" and leaf node "1011" may also be merged into one leaf node. The original three-layer tree height is thereby compressed to a two-layer tree height. Finally, the target index tree corresponding to the target field can be obtained.
Third, in the embodiment of the present application, a manner of merging a plurality of index trees is provided. By the mode, on one hand, the tree heights of all the index trees can be filled, so that the left child node and the right child node can be connected by the pointer, and the construction requirement of the B + tree is met. On the other hand, after the tree height is filled, the tree height can be compressed, so that the aim of optimizing the data structure is fulfilled.
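The padding, merging, and compression steps can be sketched on a simplified tree model. This is an illustrative assumption rather than the B + tree layout of the present application: the `Node` class, the `capacity` parameter, and the convention that an internal node stores the smallest key of each child are all introduced here for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    keys: List[int]                 # leaf: stored keys; internal: min key of each child
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    next: Optional["Node"] = None   # right-sibling pointer on the leaf layer

def height(node):
    """Number of layers from this node down to the leaf layer."""
    levels = 1
    while node.children:
        node = node.children[0]
        levels += 1
    return levels

def pad_height(root, target_height):
    """Add single-child root nodes until the tree has target_height layers."""
    while height(root) < target_height:
        root = Node(keys=[root.keys[0]], children=[root])
    return root

def leaves(node):
    """Yield the leaf nodes from left to right."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)

def merge_trees(trees, capacity):
    """Merge index trees whose key ranges are already in ascending order."""
    tallest = max(height(tree) for tree in trees)
    padded = [pad_height(tree, tallest) for tree in trees]
    # Generate a new root over the equal-height trees.
    root = Node(keys=[tree.keys[0] for tree in padded], children=padded)
    # Compress: fold the root's children into one node while they fit in it.
    while root.children and sum(len(c.keys) for c in root.children) <= capacity:
        root = Node(
            keys=[key for c in root.children for key in c.keys],
            children=[grand for c in root.children for grand in c.children],
        )
    # Connect adjacent leaves with pointers, left to right.
    leaf_list = list(leaves(root))
    for left, right in zip(leaf_list, leaf_list[1:]):
        left.next = right
    return root

leaf_a1, leaf_a2 = Node([1, 2]), Node([3, 4])
tree_a = Node([1, 3], [leaf_a1, leaf_a2])  # two layers
tree_b = Node([5, 6])                      # one layer (a single leaf)
merged = merge_trees([tree_a, tree_b], capacity=4)
```

In this example, `tree_b` is first padded to two layers, the new root makes the tree three layers high, and compression then folds the two second-layer nodes into one node, which becomes the root of a two-layer tree.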
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, constructing the database index corresponding to the target field according to at least two ordered data partitions specifically may include:
and calling a target thread, and performing index construction on the N ordered data partitions to obtain a target index tree, wherein the target index tree belongs to the database index.
In one or more embodiments, a manner of building an index tree based on a single thread is presented. As can be seen from the foregoing embodiments, N ordered data partitions have been obtained at the last stage, and based on this, one thread (i.e., target thread) may be invoked to build its corresponding target index tree for the N ordered data partitions. The construction of the B + tree is taken as an example for explanation, and it can be understood that in practical application, the construction of other ordered data structures is also supported.
Specifically, assume that there are 3 ordered data partitions, namely, ordered data partition 1, ordered data partition 2, and ordered data partition 3. Since the ordered data partitions are also ordered with respect to one another, the data of ordered data partition 1, the data of ordered data partition 2, and the data of ordered data partition 3 may be spliced together in the corresponding order, thereby obtaining a target data sequence. Thus, the target thread may be invoked to build the corresponding target index tree for the data of the target data sequence.
Illustratively, if the data within the ordered data partitions satisfies the rule of ascending order, then the rule of ascending order should also be satisfied between the various ordered data partitions. For example, the sequential data in the ordered data partition 1 is "15, 18, 23, 36, 42", the sequential data in the ordered data partition 2 is "48, 55, 66, 79, 87, 100", and based on this, the obtained target data sequence is "15, 18, 23, 36, 42, 48, 55, 66, 79, 87, 100".
Illustratively, if the data within the ordered data partitions satisfies the descending order rule, then the descending order rule should also be satisfied between the various ordered data partitions. For example, the sequential data in the ordered data partition 1 is "100, 87, 79, 66, 55", and the sequential data in the ordered data partition 2 is "48, 42, 36, 23, 18, 15", and based on this, the obtained target data sequence is "100, 87, 79, 66, 55, 48, 42, 36, 23, 18, 15".
Secondly, in the embodiment of the application, a mode for establishing an index tree based on a single thread is provided. By the method, the target index tree corresponding to the target field is constructed in a single thread mode, so that the feasibility and operability of the scheme are improved.
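The splicing of ordered data partitions into one target data sequence can be sketched as follows, using the ascending and descending values from the examples above. The helper name `splice_partitions` and its order check are assumptions added for illustration.

```python
from itertools import chain

def splice_partitions(ordered_partitions, descending=False):
    """Concatenate ordered data partitions and verify the result is globally ordered."""
    target = list(chain.from_iterable(ordered_partitions))
    in_order = all(
        (a >= b) if descending else (a <= b)
        for a, b in zip(target, target[1:])
    )
    if not in_order:
        raise ValueError("partitions are not ordered with respect to one another")
    return target

ascending_target = splice_partitions([[15, 18, 23, 36, 42], [48, 55, 66, 79, 87, 100]])
descending_target = splice_partitions(
    [[100, 87, 79, 66, 55], [48, 42, 36, 23, 18, 15]], descending=True
)
```

A single thread can then build the target index tree from the spliced sequence in one pass, since no further sorting is required.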
With reference to the above description, another database index creation method in the present application will be described below, and referring to fig. 10, a database index creation method in an embodiment of the present application includes:
210. calling N threads to scan the original data set to obtain N data subsets, wherein each data subset comprises a data set corresponding to a target field, each data subset is obtained by scanning one thread, and N is an integer greater than 1;
in one or more embodiments, an original data set is obtained, and N threads are called as required to scan the original data set, that is, all data sets including target fields are scanned, where the original data set includes a data set corresponding to at least one field. Assuming that 8 threads are invoked to scan the original data set, 8 data subsets are obtained. In the present application, the "target field" may be one field or a field set composed of a plurality of fields, and is not limited herein.
220. Calling N threads to sort the data in the N data subsets to obtain N first data sequences, wherein each thread in the N threads is used for sorting the data in one data subset;
in one or more embodiments, N threads may be called to sort the data in the N data subsets, that is, each thread separately sorts the data in one data subset in an ascending order or a descending order, so as to achieve the purpose of parallel sorting. Thereby, N first data sequences are obtained.
230. Calling N/2 threads to perform two-way merge sorting on the N first data sequences to obtain N/2 second data sequences, and repeating the merging until a target data sequence is obtained, wherein each thread in the N/2 threads is used for merging two first data sequences;
in one or more embodiments, after obtaining the N first data sequences, the N first data sequences may be sorted by a two-way merging method. Since each thread is responsible for merging two of the first data sequences, N/2 threads may be used here. After this merging, N/2 second data sequences are obtained, and the merging is repeated in the same manner until a global target data sequence is obtained. It will be appreciated that if there are only two first data sequences, invoking one thread is sufficient to obtain the target data sequence.
Specifically, the process of generating the target data sequence will be described below with reference to the drawings. For convenience of understanding, please refer to fig. 11, which is a schematic diagram of generating a target data sequence based on a two-way merging method in the embodiment of the present application. As shown in the figure, 8 threads are first called to sort 8 data subsets, so as to obtain 8 ordered first data sequences. Based on this, thread 1 is invoked to perform two-way merge sorting on the first data sequence 1 and the first data sequence 2 to generate a second data sequence 1. Thread 2 is invoked to perform two-way merge sorting on the first data sequence 3 and the first data sequence 4 to generate a second data sequence 2. Thread 3 is invoked to perform two-way merge sorting on the first data sequence 5 and the first data sequence 6 to generate a second data sequence 3. Thread 4 is invoked to perform two-way merge sorting on the first data sequence 7 and the first data sequence 8 to generate a second data sequence 4. 4 parallel threads are used in this round.
Then, in the next round, thread 1 is invoked to perform two-way merge sorting on the second data sequence 1 and the second data sequence 2 to generate a third data sequence 1. Thread 2 is invoked to perform two-way merge sorting on the second data sequence 3 and the second data sequence 4 to generate a third data sequence 2. 2 parallel threads are used in this round.
Finally, in the last round, thread 1 is invoked to perform two-way merge sorting on the third data sequence 1 and the third data sequence 2 to generate the target data sequence. 1 thread is used in this round.
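The rounds of two-way merging described above can be sketched as follows. The function names and the 8 first data sequences are hypothetical. Carrying an unpaired sequence over to the next round when the count is odd is an assumption added here, since the example above only covers a count that halves evenly.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def merge_pair(pair):
    """Two-way merge of two sorted sequences."""
    left, right = pair
    return list(merge(left, right))

def merge_rounds(sequences):
    """Merge sorted sequences pairwise, one thread per pair, until one remains."""
    rounds = 0
    while len(sequences) > 1:
        pairs = [
            (sequences[i], sequences[i + 1])
            for i in range(0, len(sequences) - 1, 2)
        ]
        # Assumption: an unpaired last sequence is carried into the next round.
        leftover = [sequences[-1]] if len(sequences) % 2 else []
        with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
            sequences = list(pool.map(merge_pair, pairs)) + leftover
        rounds += 1
    return sequences[0], rounds

# Hypothetical: 8 first data sequences, each already sorted by its own thread.
first_sequences = [
    [3, 41], [7, 19], [1, 88], [5, 60],
    [2, 73], [9, 14], [6, 52], [4, 30],
]
target_sequence, rounds = merge_rounds(first_sequences)
```

With 8 first data sequences, the rounds use 4, 2, and 1 threads respectively, so the number of busy threads halves in each round.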
240. And constructing a database index corresponding to the target field according to the target data sequence.
In one or more embodiments, since the ordered target data sequence has already been acquired, the database index corresponding to the target field may be constructed based on the target data sequence. It should be noted that the present application may invoke a single thread to build the database index corresponding to the target field, or invoke multiple threads to build the database index corresponding to the target field. The database index may be represented as a target index tree.
Specifically, one way is to invoke one thread (i.e., the target thread) to build its corresponding target index tree for the target data sequence. Another way is to divide the target data sequence into N sub-data sequences, where the data in each sub-data sequence is also ordered, and N is an integer greater than 1. At this time, N threads are called to construct corresponding index trees for the N sub-data sequences, and finally, the index trees are synthesized into a target index tree.
In the embodiment of the application, another database index creation method is provided. Firstly, N threads are called to scan an original data set to obtain N data subsets, and the data in each data subset is then sorted, so that a plurality of internally ordered data sequences are obtained. A plurality of threads can then be called to perform two-way merge sorting on these data sequences, so that a globally ordered target data sequence is obtained and used as the basis for constructing the database index. By sorting a plurality of data sequences in parallel, the parallelism of index creation can be improved, and the time for creating the index is reduced.
With reference to the above description, another database index creation method in the present application will be described below, and referring to fig. 12, a database index creation method in an embodiment of the present application includes:
310. calling N threads to scan the original data sets to obtain N first data sets, wherein each first data set comprises a data set corresponding to a target field, each first data set is obtained by scanning one thread, and N is an integer greater than 1;
in one or more embodiments, an original data set is obtained, and N threads are called as required to scan the original data set, that is, all data sets including target fields are scanned, where the original data set includes a data set corresponding to at least one field. Assuming that 4 threads are invoked to scan the original data set, 4 first data sets are available. It should be noted that the "target field" in the present application may be one field, or may be a field set composed of a plurality of fields, and is not limited herein.
320. Calling N threads to sort the data in the N first data sets to obtain N first data sequences, wherein each thread in the N threads is used for sorting the data in one first data set;
in one or more embodiments, N threads may be invoked to sort data in N first data sets, that is, each thread separately sorts data in one first data set in an ascending order or a descending order, so as to achieve the purpose of parallel sorting. Thereby, N first data sequences are obtained.
330. Determining (N-1) pivots corresponding to each first data sequence according to the N first data sequences, wherein each pivot is used for dividing data in the first data sequences;
in one or more embodiments, after obtaining the N first data sequences, if it is desired to invoke N threads to perform N-way merging on the N first data sequences (i.e., to fully utilize the N threads), each first data sequence may be divided into N ranges, such that in each first data sequence all of the data in the first range is smaller than the data in the second range, all of the data in the second range is smaller than the data in the third range, and so on. The N threads can then execute the N-way merging in parallel, and a globally ordered target data sequence is finally obtained. The boundary position between two adjacent ranges is the pivot.
Specifically, in order to make the data distribution within each range uniform, the manner of determining the pivot will be described below with reference to an example. For easy understanding, please refer to fig. 13, where fig. 13 is a schematic diagram of determining a pivot based on a data set in the embodiment of the present application, as shown in the figure, 4 first data sequences are obtained after sorting, and if 4 threads are used to merge and sort the 4 first data sequences, each first data sequence needs to be divided into 4 ranges.
Based on this, the first pivot of the first data sequence 1 is determined. The initial stage may use an equal division approach: assuming that the first data sequence 1 has 10 data and needs 3 pivots, the position of the first pivot may be selected after the third data. From the first pivot of the first data sequence 1, the first pivots of the first data sequence 2, the first data sequence 3, and the first data sequence 4 can be found. In this example, 30 is used as the pivot. If the total number of values delimited by the first pivots of the respective first data sequences is larger than the average number of values that each range should have, the first pivot in the first data sequence 1 is reduced and the first pivots in all first data sequences are calculated again, until the difference between the total number of values delimited by the first pivots and the average number that each range should have falls within a small threshold (e.g., 5), whereupon the pivot is considered to have been found. It is understood that the positions of the second pivot and the third pivot can be calculated by similar methods, which are not described herein.
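The pivot search can be sketched as follows. This is a minimal illustration under assumptions: the four first data sequences are hypothetical, the function name `find_pivot` and its parameters are introduced here, and moving the candidate to the right when too few values are delimited is an addition, since the text above only describes reducing the pivot.

```python
from bisect import bisect_right

def find_pivot(sequences, target_count, start_pos, tolerance):
    """Search sequences[0] for a pivot value delimiting about target_count values.

    Returns (pivot_value, cuts), where cuts[i] is the number of values in
    sequences[i] that are less than or equal to pivot_value.
    """
    base = sequences[0]
    pos = start_pos
    visited = set()
    while True:
        pivot = base[pos]
        # Locate the matching cut position in every sorted sequence.
        cuts = [bisect_right(seq, pivot) for seq in sequences]
        total = sum(cuts)
        visited.add(pos)
        if abs(total - target_count) <= tolerance:
            break
        # Too many values: reduce the pivot. Too few: increase it (assumption).
        candidate = pos - 1 if total > target_count else pos + 1
        if candidate < 0 or candidate >= len(base) or candidate in visited:
            break  # no better candidate within sequences[0]
        pos = candidate
    return pivot, cuts

# Hypothetical first data sequences, each already sorted.
first_sequences = [
    [15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 205, 300, 400],
    [25, 35, 55, 85, 95, 105, 120, 125, 200],
    [5, 10, 15, 50, 85, 100, 130, 150, 200],
    [5, 10, 45, 90, 100],
]
total_values = sum(len(seq) for seq in first_sequences)  # 36 values
average = total_values // len(first_sequences)           # 9 values per range
pivot, cuts = find_pivot(first_sequences, average, start_pos=4, tolerance=1)
```

Starting from the value 50, the candidate delimits 14 values and then 11 values before settling on 30, which delimits 9 values across the four sequences.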
340. Dividing the N first data sequences into N second data sets according to (N-1) pivots;
in one or more embodiments, each first data sequence is divided into N second data sets based on the (N-1) pivots to which it corresponds.
Specifically, for ease of understanding, please refer to fig. 13 again. After the pivots corresponding to each first data sequence are determined, the 4 second data sets are obtained by dividing according to the pivot positions. The second data set 1 includes "15, 20, 30, 25, 5, 10, 15, 5, 10", the second data set 2 includes "40, 50, 60, 70, 80, 35, 55, 50, 45", the second data set 3 includes "90, 100, 85, 95, 105, 85, 100, 90, 100", and the second data set 4 includes "205, 300, 400, 120, 125, 200, 130, 150, 200". It can be seen that although the data in each second data set is unordered, the second data sets are ordered with respect to one another, that is, the data in second data set 1 is smaller than the data in second data set 2, the data in second data set 2 is smaller than the data in second data set 3, and the data in second data set 3 is smaller than the data in second data set 4.
350. Calling N threads to sort the data in the N second data sets to obtain N second data sequences, wherein each thread in the N threads is used for sorting the data in one second data set;
in one or more embodiments, N threads are invoked to sort the data in the N second data sets in parallel, that is, each thread sorts the data in one second data set, thereby obtaining N ordered second data sequences.
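Dividing by pivots and then sorting the second data sets in parallel can be sketched as follows. The first data sequences and pivot values are hypothetical inputs, and the function names are introduced here for illustration only.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

def split_by_pivots(first_sequences, pivots):
    """Cut every sorted sequence at the pivots and gather range j into second set j."""
    second_sets = [[] for _ in range(len(pivots) + 1)]
    for seq in first_sequences:
        bounds = [0] + [bisect_right(seq, p) for p in pivots] + [len(seq)]
        for j in range(len(second_sets)):
            second_sets[j].extend(seq[bounds[j]:bounds[j + 1]])
    return second_sets

def sort_and_splice(second_sets):
    """Sort each second data set in its own thread, then splice in set order."""
    with ThreadPoolExecutor(max_workers=len(second_sets)) as pool:
        second_sequences = list(pool.map(sorted, second_sets))
    return [value for seq in second_sequences for value in seq]

# Hypothetical sorted first data sequences and ascending pivot values.
first_sequences = [
    [15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 205, 300, 400],
    [25, 35, 55, 85, 95, 105, 120, 125, 200],
    [5, 10, 15, 50, 85, 100, 130, 150, 200],
    [5, 10, 45, 90, 100],
]
pivots = [30, 80, 105]
second_sets = split_by_pivots(first_sequences, pivots)
target_sequence = sort_and_splice(second_sets)
```

Each second data set here holds 9 values, so the N sorting threads receive equal amounts of work, which is the purpose of choosing the pivots by count.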
360. And constructing a database index corresponding to the target field according to the N second data sequences.
In one or more embodiments, the N second data sequences are spliced according to the order of the second data sets to obtain an ordered target data sequence, and a database index corresponding to the target field can therefore be constructed based on the target data sequence. It should be noted that the present application may invoke a single thread to build the database index corresponding to the target field, or invoke multiple threads to build the database index corresponding to the target field. The database index may be represented as a target index tree.
Specifically, one way is to invoke one thread (i.e., the target thread) to build its corresponding target index tree for the target data sequence. The other mode is that N threads are called to construct corresponding index trees for the N second data sequences in parallel, and finally the index trees are synthesized into a target index tree.
In the embodiment of the application, another database index creation method is provided. Firstly, N threads are called to scan an original data set to obtain N first data sets, and the data in each first data set is then sorted, so that a plurality of internally ordered data sequences are obtained. The pivots corresponding to each data sequence are then determined. Because dividing by pivots makes it possible to know exactly how many data each data set contains, the problem of uneven data distribution can be solved.
Referring to fig. 14, fig. 14 is a schematic diagram of an embodiment of an index creating apparatus in an embodiment of the present application, and the index creating apparatus 40 includes:
an obtaining module 410, configured to obtain a sampling point data sequence derived from a target data set, where the target data set includes a data set corresponding to a target field, and the sampling point data sequence includes at least one sampling point data in an ascending order or a descending order;
a determining module 420, configured to determine at least one quantile point data from the sampling point data sequence;
a partitioning module 430, configured to partition the target data set into at least two unordered data partitions according to the at least one quantile point data;
the sorting module 440 is configured to call at least two threads to sort data in at least two unordered data partitions to obtain at least two ordered data partitions, where the ordered data partitions and the unordered data partitions have a one-to-one correspondence relationship;
the building module 450 is configured to build a database index corresponding to the target field according to the at least two ordered data partitions.
In the embodiment of the application, an index creating device is provided. By adopting the device, on one hand, the data in the unordered data partitions are sequenced in a parallel mode, so that the parallelism of index creation is improved, and the time for creating the index is reduced. On the other hand, since the data partitions are ordered externally, external sorting is not needed, and the index creation efficiency is further improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the obtaining module 410 is specifically configured to, when S threads are called to scan an original data set, perform sampling processing on data corresponding to the target field according to a preset sampling interval, where the original data set includes a data set corresponding to the target field, and S is an integer greater than 1;
when the scanning of the original data set is finished, obtaining a target data set and at least one sampling point data to be sequenced, wherein the target data set comprises at least one data subset, and each data subset is obtained by scanning of one thread;
and performing ascending arrangement or descending arrangement on at least one sampling point data to be sequenced to obtain a sampling point data sequence.
In the embodiment of the application, an index creating device is provided. By adopting the device, a multithreaded parallel scanning mode is used when an original data set is scanned. The core idea is to split the range of the index tree corresponding to the primary key index, with different threads responsible for scanning the data in the respective ranges in parallel, thereby accelerating the scan. Meanwhile, a parallel scanning mode can be adopted for each index to obtain a plurality of data subsets, which further improves the data scanning efficiency.
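Sampling during a multithreaded scan can be sketched as follows. This is an assumption-laden illustration: the text does not specify how the preset sampling interval is applied, so the sketch takes every `interval`-th scanned value within each thread's range, and the function names and scan ranges are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_and_sample(scan_range, interval):
    """Scan one range, collecting its data subset and every interval-th value."""
    subset, samples = [], []
    for count, value in enumerate(scan_range, start=1):
        subset.append(value)
        if count % interval == 0:
            samples.append(value)
    return subset, samples

def build_sample_sequence(scan_ranges, interval):
    """One thread per scan range; sort the pooled samples into one sequence."""
    with ThreadPoolExecutor(max_workers=len(scan_ranges)) as pool:
        results = list(pool.map(lambda r: scan_and_sample(r, interval), scan_ranges))
    subsets = [subset for subset, _ in results]
    sample_sequence = sorted(v for _, samples in results for v in samples)
    return subsets, sample_sequence

# Hypothetical target-field values split into S = 3 scan ranges.
scan_ranges = [[312, 10, 900, 42], [290, 822, 1203, 7], [96, 571, 5, 64]]
subsets, sample_sequence = build_sample_sequence(scan_ranges, interval=2)
```

The sorted sample sequence is the input from which the quantile point data would then be chosen.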
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the obtaining module 410 is specifically configured to, when a target thread is called to scan an original data set, perform sampling processing on data corresponding to a target field according to a preset sampling interval, where the original data set includes a data set corresponding to the target field;
when the scanning of the original data set is finished, obtaining a target data set and at least one sampling point data to be sequenced;
and performing ascending arrangement or descending arrangement on at least one sampling point data to be sequenced to obtain a sampling point data sequence.
In the embodiment of the application, an index creating device is provided. By adopting the device, a single-thread serial scanning mode is used when the original data set is scanned, so that the data scanning work can be completed under the condition of insufficient thread quantity, and the flexibility and operability of data scanning are improved.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the obtaining module 410 is specifically configured to invoke S threads to scan an original data set to obtain a target data set, where the original data set includes a data set corresponding to a target field, the target data set includes at least one data subset, each data subset is obtained by scanning one thread, and S is an integer greater than 1;
calling at least two threads, and scanning the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and sorting the at least one sampling point data to be sorted in ascending or descending order to obtain a sampling point data sequence.
In the embodiment of the application, an index creating device is provided. With this device, the original data set is scanned by multiple threads in parallel. The core idea is to split the key range of the index tree corresponding to the primary key index into sub-ranges and to have a different thread scan the data in each sub-range, which accelerates the scan. The parallel scan is then applied for each index to obtain a plurality of data subsets, which provides a feasible way of implementing the scheme.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the obtaining module 410 is specifically configured to invoke a target thread to scan an original data set to obtain a target data set, where the original data set includes a data set corresponding to a target field;
calling a target thread, and scanning a target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and sorting the at least one sampling point data to be sorted in ascending or descending order to obtain a sampling point data sequence.
In the embodiment of the application, an index creating device is provided. With this device, the original data set is scanned serially by a single thread, so that the scan can still be completed when too few threads are available, which improves the flexibility and operability of data scanning.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
a determining module 420, specifically configured to determine the total number of sampling point data according to the sampling point data sequence;
determining the total number of quantile point data according to the total number of data partitions to be divided;
dividing the total number of sampling point data by the total number of quantile point data to obtain a target quotient value;
and extracting, from the sampling point data sequence, the sampling point data at positions that are integer multiples of the target quotient value as quantile point data, to obtain at least one quantile point data.
In the embodiment of the application, an index creating device is provided. With this device, quantile point data are extracted from the sampling point data sequence at regular intervals, so that the data are distributed more evenly to a certain extent, unbalanced data volumes across the unordered data partitions are avoided, and the conditions for parallel processing are better met.
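The quotient-based extraction can be illustrated with a minimal sketch. The function name `pick_quantile_points` is hypothetical, and the use of 1-based positions q, 2q, ... is an assumption, since the indexing convention is not fixed by the text. How the number of quantile points is derived from the number of partitions is likewise left open here, so it is passed in directly.

```python
def pick_quantile_points(sample_sequence, num_quantiles):
    """Pick quantile points at integer multiples of the target quotient.

    `sample_sequence` must already be sorted. Positions are 1-based
    (assumption): the k-th quantile point is the (k * quotient)-th sample.
    """
    total = len(sample_sequence)
    quotient = total // num_quantiles  # target quotient value
    if quotient == 0:
        raise ValueError("fewer sampling points than quantile points")
    return [sample_sequence[k * quotient - 1] for k in range(1, num_quantiles + 1)]


samples = list(range(10, 130, 10))           # 12 sorted sampling points: 10, 20, ..., 120
quantiles = pick_quantile_points(samples, 3)  # quotient is 4, so positions 4, 8 and 12
```

Because the sampling point data sequence is sorted, the returned quantile points are sorted as well.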
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
a dividing module 430, configured to determine at least two data interval ranges according to at least one quantile point data;
and calling a target thread, and dividing data corresponding to a target field in a target data set into corresponding data interval ranges to obtain at least two unordered data partitions.
In the embodiment of the application, an index creating device is provided. With this device, after the quantile point data are generated for each index, the data corresponding to the target field in the target data set are divided into the corresponding data interval ranges according to the quantile point data. Since the sampling point data sequence is ordered, the extracted quantile point data are also ordered, and the unordered data partitions are therefore ordered relative to one another. The external sorting stage is then simple: multiple threads sort the data in the unordered data partitions completely independently and in parallel. The resulting ordered data partitions are globally ordered, which improves sorting efficiency.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
a dividing module 430, configured to determine at least two data interval ranges according to at least one quantile point data;
and calling N threads, and dividing data corresponding to target fields in N data subsets into corresponding data interval ranges to obtain at least two unordered data partitions, wherein the N data subsets belong to a target data set, each thread is used for dividing data in one data subset into corresponding data interval ranges, and N is an integer greater than 1.
In the embodiment of the application, an index creating device is provided. By adopting the device, after the quantile point data are generated for each index, according to the quantile point data, the data corresponding to the target field in the target data set can be divided into the corresponding data interval range in parallel by using a plurality of threads, so that the index creation efficiency is improved. And because the sampling point data sequence is ordered, the extracted quantile point data is also ordered. Thus, the unordered data partitions are also ordered among themselves. The external sorting stage is simpler, so that the sorting efficiency is improved.
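As a sketch under stated assumptions, dividing N data subsets into data interval ranges with one thread per subset might look as follows. The helper names `split_subset` and `partition_parallel` are hypothetical, and the choice that a value equal to a quantile point falls into the higher range is an assumption of this sketch.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def split_subset(subset, quantiles):
    """Place each value of one data subset into the bucket of its interval range."""
    buckets = [[] for _ in range(len(quantiles) + 1)]
    for value in subset:
        # bisect_right returns how many quantile points are <= value,
        # which is the index of the interval range the value belongs to.
        buckets[bisect_right(quantiles, value)].append(value)
    return buckets


def partition_parallel(subsets, quantiles):
    """One thread per data subset; then gather bucket i of every thread into partition i."""
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        per_thread = list(pool.map(lambda s: split_subset(s, quantiles), subsets))
    return [
        [value for buckets in per_thread for value in buckets[i]]
        for i in range(len(quantiles) + 1)
    ]


partitions = partition_parallel([[5, 42, 17], [88, 3, 60]], quantiles=[20, 50])
# partitions == [[5, 17, 3], [42], [88, 60]]
```

Each partition is still unordered internally, but every value in partition i is smaller than every value in partition i + 1.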
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the sorting module 440 is specifically configured to call N threads, sort data in the N unordered data partitions, and obtain N ordered data partitions, where each thread is used to sort data in one unordered data partition, and N is an integer greater than 1.
In the embodiment of the application, an index creating device is provided. By adopting the device, after at least two unordered data partitions are obtained according to the quantile point data division, one thread can be respectively adopted for sorting aiming at the data in each unordered data partition. Therefore, parallel sequencing is carried out on the multiple unordered data partitions, the parallelism degree of index creation is improved, and the efficiency of index creation is improved.
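For illustration only, the per-partition sort can be sketched with one thread per unordered data partition. The function name `sort_partitions` is hypothetical, and Python's built-in `sorted` stands in for whatever sorting routine an implementation would actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def sort_partitions(unordered_partitions):
    """Sort every unordered data partition independently, one thread per partition."""
    with ThreadPoolExecutor(max_workers=len(unordered_partitions)) as pool:
        # pool.map keeps the input order, so ordered partition i corresponds
        # to unordered partition i (one-to-one correspondence).
        return list(pool.map(sorted, unordered_partitions))


ordered = sort_partitions([[5, 17, 3], [42], [88, 60]])
flattened = [value for partition in ordered for value in partition]
# flattened == [3, 5, 17, 42, 60, 88]
```

Since the partitions were split at ordered quantile points, concatenating the ordered partitions yields a globally ordered sequence without a further merge step.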
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the constructing module 450 is specifically configured to invoke N threads, and perform index construction on N ordered data partitions to obtain N index trees, where each thread is configured to construct a corresponding index tree according to the ordered data partitions, and N is an integer greater than 1;
and constructing a target index tree corresponding to the target field according to the N index trees, wherein the target index tree belongs to the database index.
In the embodiment of the application, an index creating device is provided. With this device, the index trees are created in parallel by a plurality of threads, which improves the parallelism and therefore the efficiency of index creation. Meanwhile, multithreaded parallelism can be applied when scanning the whole primary key index file, when sorting each unordered data partition, and when constructing the index trees. Therefore, for database operations that involve index creation and reconstruction, the time for creating the index can be significantly reduced, thereby reducing the latency of user operations.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
a constructing module 450, specifically configured to determine the highest tree level according to the N index trees;
according to the highest tree layer number, adding root nodes to the index trees with the tree layer number smaller than the highest tree layer number in the N index trees until N index trees with the same tree layer number are obtained;
and carrying out node combination processing on the N index trees with the equal tree layer number to obtain a target index tree corresponding to the target field.
In the embodiment of the application, an index creating device is provided. By adopting the device, on one hand, the tree heights of all the index trees can be filled up, so that the left child node and the right child node can be connected by using the pointer, and the construction requirement of the B + tree is met. On the other hand, after the tree height is filled, the tree height can be compressed, so that the aim of optimizing the data structure is fulfilled.
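The padding and merging of the N index trees can be illustrated with a deliberately simplified model. The `Node` class, the omission of separator keys in internal nodes, and the absence of any node capacity limit are all assumptions of this sketch; a real B+ tree implementation would need to handle them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list = field(default_factory=list)      # entries of a leaf node
    children: list = field(default_factory=list)  # empty for a leaf node


def height(node):
    """Number of tree layers; a single leaf has height 1."""
    return 1 if not node.children else 1 + height(node.children[0])


def pad_to(node, target_height):
    """Add new root nodes above a shorter tree until it reaches the target height."""
    while height(node) < target_height:
        node = Node(children=[node])
    return node


def merge_trees(trees):
    """Pad all trees to the highest layer count, then combine their root nodes."""
    top = max(height(tree) for tree in trees)
    padded = [pad_to(tree, top) for tree in trees]
    if top == 1:
        return Node(keys=[key for tree in padded for key in tree.keys])
    return Node(children=[child for tree in padded for child in tree.children])


def leaf_keys(node):
    """Read all keys from left to right."""
    if not node.children:
        return list(node.keys)
    return [key for child in node.children for key in leaf_keys(child)]


tall = Node(children=[Node(keys=[1, 2]), Node(keys=[3, 4])])  # two layers
short = Node(keys=[5, 6])                                      # one layer
target = merge_trees([tall, short])
# height(target) == 2 and leaf_keys(target) == [1, 2, 3, 4, 5, 6]
```

After padding, all leaves sit at the same depth, which is what allows neighbouring leaf nodes of different trees to be linked.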
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the index creating apparatus 40 provided in the embodiment of the present application,
the building module 450 is specifically configured to invoke a target thread, and perform index building on the N ordered data partitions to obtain a target index tree, where the target index tree belongs to a database index.
In the embodiment of the application, an index creating device is provided. By adopting the device, the target index tree corresponding to the target field is constructed in a single thread mode, so that the feasibility and operability of the scheme are improved.
Referring to fig. 15, fig. 15 is a schematic view of another embodiment of the index creating apparatus in the embodiment of the present application, and the index creating apparatus 50 includes:
an obtaining module 510, configured to invoke N threads to scan an original data set, so as to obtain N data subsets, where each data subset includes a data set corresponding to a target field, each data subset is obtained by scanning one thread, and N is an integer greater than 1;
the sorting module 520 is configured to invoke N threads, sort data in the N data subsets, and obtain N first data sequences, where each thread in the N threads is used to sort data in one data subset;
the sorting module 520 is further configured to call N/2 threads to perform two-way merge sorting on the N first data sequences to obtain N/2 second data sequences, and to repeat the merging until a target data sequence is obtained, where each thread of the N/2 threads is used to merge two first data sequences;
a constructing module 530, configured to construct a database index corresponding to the target field according to the target data sequence.
In the embodiment of the application, an index creating device is provided. With this device, N threads are first called to scan the original data set to obtain N data subsets, and the data in each data subset is then sorted, yielding a plurality of internally ordered data sequences. A plurality of threads can then be called to perform two-way merge sorting on these data sequences, so that a globally ordered target data sequence is obtained and used as the basis for constructing the database index. Sorting the data sequences in parallel improves the parallelism of index creation and reduces the time required to create the index.
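A minimal sketch of this scheme, assuming in-memory lists and at least one data subset, is given below. The names `merge_pair` and `parallel_merge_sort` are hypothetical, and `heapq.merge` from the Python standard library stands in for the two-way merge step.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def merge_pair(pair):
    """Two-way merge of one or two already sorted sequences."""
    return list(heapq.merge(*pair))


def parallel_merge_sort(subsets):
    """Sort each subset in its own thread, then merge pairwise round by round."""
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        runs = list(pool.map(sorted, subsets))  # N first data sequences
        while len(runs) > 1:
            # Each round halves the number of sequences: N -> N/2 -> N/4 -> ...
            pairs = [runs[i:i + 2] for i in range(0, len(runs), 2)]
            runs = list(pool.map(merge_pair, pairs))
    return runs[0]


target_sequence = parallel_merge_sort([[9, 1], [4, 7], [3, 8], [2, 6]])
# target_sequence == [1, 2, 3, 4, 6, 7, 8, 9]
```

With N sequences, about log2(N) merge rounds are needed, and the last round runs on a single thread, which limits the parallelism of this approach.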
Referring to fig. 16, fig. 16 is a schematic view of another embodiment of the index creating apparatus in the embodiment of the present application, and the index creating apparatus 60 includes:
an obtaining module 610, configured to invoke N threads to scan an original data set to obtain N first data sets, where each first data set includes a data set corresponding to a target field, and each first data set is obtained by scanning one thread, and N is an integer greater than 1;
a sorting module 620, configured to invoke N threads, and sort data in N first data sets to obtain N first data sequences, where each thread in the N threads is used to sort data in one first data set;
a determining module 630, configured to determine (N-1) pivots corresponding to each first data sequence according to the N first data sequences, where each pivot is used to divide data in the first data sequence;
a dividing module 640, configured to divide the N first data sequences into N second data sets according to (N-1) pivots;
the sorting module 620 is further configured to invoke N threads, sort the data in the N second data sets, and obtain N second data sequences, where each thread in the N threads is used to sort the data in one second data set;
and a constructing module 650, configured to construct a database index corresponding to the target field according to the N second data sequences.
In the embodiment of the application, an index creating device is provided. With this device, N threads are first called to scan the original data set to obtain N first data sets, and the data in each first data set is then sorted, yielding a plurality of internally ordered first data sequences. The pivots corresponding to each first data sequence are then determined. Because dividing by pivots makes it possible to know exactly how many data items fall into each data set, the problem of uneven data distribution can be addressed.
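One possible reading of the pivot-based scheme is sketched below. The pivot selection shown here (regular sampling of each sorted sequence, then picking N-1 evenly spaced samples) is an assumption borrowed from the well-known parallel sorting by regular sampling technique and is not confirmed by the text; the function name `pivot_sort` is also hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def pivot_sort(first_sets):
    """Sort N non-empty data sets into N globally ordered sequences using N-1 pivots."""
    n = len(first_sets)
    with ThreadPoolExecutor(max_workers=n) as pool:
        runs = list(pool.map(sorted, first_sets))  # N first data sequences

        # Assumed pivot choice: n regular samples per sequence, then n-1 pivots.
        samples = sorted(run[(len(run) * k) // n] for run in runs if run for k in range(n))
        pivots = [samples[k * len(samples) // n] for k in range(1, n)]

        def split(run):
            # Cut one sorted sequence at the pivots into n contiguous pieces.
            cuts = [0] + [bisect_right(run, p) for p in pivots] + [len(run)]
            return [run[cuts[i]:cuts[i + 1]] for i in range(n)]

        pieces = list(pool.map(split, runs))
        # Second data set i gathers piece i of every first data sequence.
        second_sets = [[v for p in pieces for v in p[i]] for i in range(n)]
        return list(pool.map(sorted, second_sets))  # N second data sequences


second_sequences = pivot_sort([[9, 1, 5], [4, 7, 2], [3, 8, 6]])
# second_sequences == [[1, 2, 3, 4], [5, 6, 7], [8, 9]]
```

Because each cut position is known after the split, the size of every second data set is known before it is sorted, which is the property the embodiment relies on to balance the load.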
The application also provides an index creating device, which is deployed in a server. For ease of understanding, referring to fig. 17, fig. 17 is a schematic structural diagram of a server provided in the embodiment of the present application. The server 700 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 722 (e.g., one or more processors), a memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) storing an application program 742 or data 744. The memory 732 and the storage medium 730 may be transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 722 may be configured to communicate with the storage medium 730, and execute, on the server 700, the series of instruction operations in the storage medium 730.
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input-output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 17.
The embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the processor implements the steps of the methods described in the foregoing embodiments.
In an embodiment of the present application, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the methods described in the foregoing embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the methods described in the foregoing embodiments.
It is understood that the specific implementations of the present application involve data corresponding to the target field and similar data. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing computer programs.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (20)

1. A method for creating a database index, comprising:
acquiring a sampling point data sequence from a target data set, wherein the target data set comprises a data set corresponding to a target field, and the sampling point data sequence comprises at least one sampling point data in an ascending order or a descending order;
determining at least one quantile point data from the sampling point data sequence;
dividing the target data set into at least two unordered data partitions according to the at least one quantile point data;
calling at least two threads to sort data in the at least two unordered data partitions to obtain at least two ordered data partitions, wherein the ordered data partitions and the unordered data partitions have one-to-one correspondence;
and constructing a database index corresponding to the target field according to the at least two ordered data partitions.
2. The method of creating according to claim 1, wherein said obtaining a sequence of sample point data derived from a target data set comprises:
when S threads are called to scan an original data set, sampling data corresponding to the target field according to a preset sampling interval, wherein the original data set comprises the data set corresponding to the target field, and S is an integer greater than 1;
when the scanning of the original data set is finished, obtaining the target data set and at least one sampling point data to be sorted, wherein the target data set comprises at least one data subset, and each data subset is obtained by scanning of one thread;
and sorting the at least one sampling point data to be sorted in ascending order or descending order to obtain the sampling point data sequence.
3. The method of creating according to claim 1, wherein said obtaining a sequence of sample point data derived from a target data set comprises:
when a target thread is called to scan an original data set, sampling data corresponding to a target field according to a preset sampling interval, wherein the original data set comprises a data set corresponding to the target field;
when the scanning of the original data set is finished, obtaining the target data set and at least one sampling point data to be sorted;
and sorting the at least one sampling point data to be sorted in ascending order or descending order to obtain the sampling point data sequence.
4. The method of creating according to claim 1, wherein said obtaining a sequence of sample point data derived from a target data set comprises:
calling S threads to scan an original data set to obtain a target data set, wherein the original data set comprises a data set corresponding to the target field, the target data set comprises at least one data subset, each data subset is obtained by scanning one thread, and S is an integer greater than 1;
calling at least two threads, and scanning the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and sorting the at least one sampling point data to be sorted in ascending order or descending order to obtain the sampling point data sequence.
5. The method of creating according to claim 1, wherein said obtaining a sequence of sample point data derived from a target data set comprises:
calling a target thread to scan an original data set to obtain the target data set, wherein the original data set comprises a data set corresponding to the target field;
calling the target thread, and scanning the target data set according to a preset sampling interval to obtain at least one sampling point data to be sorted;
and sorting the at least one sampling point data to be sorted in ascending order or descending order to obtain the sampling point data sequence.
6. The method of creating according to claim 1, wherein said determining at least one quantile point data from said sampling point data sequence comprises:
determining the total number of sampling point data according to the sampling point data sequence;
determining the total amount of quantile point data according to the total amount of data partitions to be divided;
dividing the total number of the sampling point data by the total number of the quantile point data to obtain a target quotient value;
and extracting, from the sampling point data sequence, the sampling point data at positions that are integer multiples of the target quotient value as quantile point data, to obtain the at least one quantile point data.
7. The method of creating as claimed in claim 1, wherein said partitioning said target data set into at least two unordered data partitions according to said at least one quantile point data comprises:
determining at least two data interval ranges according to the at least one quantile point data;
and calling a target thread, and dividing the data corresponding to the target field in the target data set into corresponding data interval ranges to obtain the at least two unordered data partitions.
8. The method of creating as claimed in claim 1, wherein said dividing said target data set into at least two unordered data partitions according to said at least one quantile point data comprises:
determining at least two data interval ranges according to the at least one quantile point data;
and calling N threads, and dividing data corresponding to the target fields in N data subsets into corresponding data interval ranges to obtain the at least two unordered data partitions, wherein the N data subsets belong to the target data set, each thread is used for dividing data in one data subset into corresponding data interval ranges, and N is an integer greater than 1.
9. The method of creating according to any of claims 1 to 8, wherein said invoking at least two threads to sort data within said at least two unordered data partitions, resulting in at least two ordered data partitions, comprises:
calling N threads, and sorting data in the N unordered data partitions to obtain N ordered data partitions, wherein each thread is used for sorting data in one unordered data partition, and N is an integer greater than 1.
10. The creating method according to claim 1, wherein the constructing the database index corresponding to the target field according to the at least two ordered data partitions comprises:
calling N threads, and performing index construction on N ordered data partitions to obtain N index trees, wherein each thread is used for constructing a corresponding index tree according to the ordered data partitions, and N is an integer greater than 1;
and constructing a target index tree corresponding to the target field according to the N index trees, wherein the target index tree belongs to the database index.
11. The creating method according to claim 10, wherein the building a target index tree corresponding to the target field according to the N index trees includes:
determining the highest tree layer number according to the N index trees;
according to the highest tree layer number, adding root nodes to the index trees with the tree layer number smaller than the highest tree layer number in the N index trees until N index trees with the same tree layer number are obtained;
and carrying out node combination processing on the N index trees with the equal tree layer number to obtain a target index tree corresponding to the target field.
12. The creating method according to claim 1, wherein the constructing the database index corresponding to the target field according to the at least two ordered data partitions comprises:
and calling a target thread, and performing index construction on the N ordered data partitions to obtain a target index tree, wherein the target index tree belongs to the database index.
13. A method for creating a database index, comprising:
calling N threads to scan an original data set to obtain N data subsets, wherein each data subset comprises a data set corresponding to a target field, each data subset is obtained by scanning one thread, and N is an integer greater than 1;
calling the N threads, and sorting data in the N data subsets to obtain N first data sequences, wherein each thread in the N threads is used for sorting data in one data subset;
calling N/2 threads, performing two-way merge sorting on the N first data sequences to obtain N/2 second data sequences, until a target data sequence is obtained, wherein each thread in the N/2 threads is used for sorting two first data sequences;
and constructing a database index corresponding to the target field according to the target data sequence.
14. A method for creating a database index, comprising:
calling N threads to scan an original data set to obtain N first data sets, wherein each first data set comprises a data set corresponding to a target field, each first data set is obtained by scanning one thread, and N is an integer greater than 1;
calling the N threads, and sorting data in the N first data sets to obtain N first data sequences, wherein each thread in the N threads is used for sorting data in one first data set;
determining (N-1) pivots corresponding to each first data sequence according to the N first data sequences, wherein each pivot is used for dividing data in the first data sequences;
dividing the N first data sequences into N second data sets according to the (N-1) pivots;
calling the N threads, and sorting data in the N second data sets to obtain N second data sequences, wherein each thread in the N threads is used for sorting data in one second data set;
and constructing a database index corresponding to the target field according to the N second data sequences.
15. An index creation apparatus, comprising:
an acquisition module, configured to acquire a sampling point data sequence from a target data set, wherein the target data set comprises a data set corresponding to a target field, and the sampling point data sequence comprises at least one sampling point data arranged in ascending order or descending order;
a determining module, configured to determine at least one quantile point data from the sampling point data sequence;
a dividing module, configured to divide the target data set into at least two unordered data partitions according to the at least one quantile point data;
the sorting module is used for calling at least two threads to sort the data in the at least two unordered data partitions to obtain at least two ordered data partitions, wherein the ordered data partitions and the unordered data partitions have one-to-one correspondence;
and the construction module is used for constructing the database index corresponding to the target field according to the at least two ordered data partitions.
16. An index creation apparatus, comprising:
an acquisition module, configured to call N threads to scan an original data set to obtain N data subsets, wherein each data subset comprises a data set corresponding to a target field, each data subset is obtained by scanning one thread, and N is an integer greater than 1;
a sorting module, configured to call the N threads and sort the data in the N data subsets to obtain N first data sequences, wherein each thread in the N threads is used for sorting the data in one data subset;
the sorting module is further configured to call N/2 threads, perform two-way merge sorting on the N first data sequences to obtain N/2 second data sequences, until a target data sequence is obtained, wherein each thread of the N/2 threads is used to sort two first data sequences;
and the construction module is used for constructing the database index corresponding to the target field according to the target data sequence.
17. An index creation apparatus, comprising:
an acquisition module, configured to call N threads to scan an original data set to obtain N first data sets, wherein each first data set comprises a data set corresponding to a target field, each first data set is obtained by scanning one thread, and N is an integer greater than 1;
the sorting module is used for calling the N threads and sorting the data in the N first data sets to obtain N first data sequences, wherein each thread in the N threads is used for sorting the data in one first data set;
a determining module, configured to determine (N-1) pivots corresponding to each first data sequence according to the N first data sequences, where each pivot is used to divide data in the first data sequences;
a dividing module for dividing the N first data sequences into N second data sets according to the (N-1) pivots;
the sorting module is further configured to invoke the N threads, sort the data in the N second data sets, and obtain N second data sequences, where each thread in the N threads is used to sort the data in one second data set;
and the construction module is used for constructing the database index corresponding to the target field according to the N second data sequences.
18. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the creation method of any one of claims 1 to 12, or implements the steps of the creation method of claim 13, or implements the steps of the creation method of claim 14.
19. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the creation method of any one of claims 1 to 12, or the steps of the creation method of claim 13, or the steps of the creation method of claim 14.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, carries out the steps of the creation method of any one of claims 1 to 12, or the steps of the creation method of claim 13, or the steps of the creation method of claim 14.
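
For illustration only, the two parallel sorting schemes recited in claims 16 and 17 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names (scan_chunks, sort_by_pairwise_merge, sort_by_pivots), the use of a thread pool, the even chunking that stands in for the N scanning threads, and the regular-sampling rule for choosing the (N-1) pivots are all assumptions introduced here. The final step of building the index structure from the ordered data is omitted.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def scan_chunks(rows, n):
    """Split rows into n nearly equal chunks (stand-in for N scanning threads)."""
    size, extra = divmod(len(rows), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def merge_pair(pair):
    """Two-way merge of two already sorted sequences."""
    left, right = pair
    return list(merge(left, right))


def sort_by_pairwise_merge(rows, n):
    """Claim 16 style: sort N subsets, then merge them two at a time."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        runs = list(pool.map(sorted, scan_chunks(rows, n)))
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(merge_pair, pairs))
            if len(runs) % 2:  # odd number of runs: carry the last one forward
                merged.append(runs[-1])
            runs = merged
    return runs[0]


def sort_by_pivots(rows, n):
    """Claim 17 style: sort N sets, split them by N-1 pivots, sort each part."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        runs = list(pool.map(sorted, scan_chunks(rows, n)))
        # Assumption: pivots come from regularly spaced samples of each run.
        samples = sorted(x for run in runs for x in run[:: max(1, len(run) // n)])
        if not samples:
            return [[] for _ in range(n)]
        pivots = [samples[(i * len(samples)) // n] for i in range(1, n)]
        buckets = [[] for _ in range(n)]
        for run in runs:
            # Cut positions of this sorted run at each pivot.
            cuts = [0] + [bisect_right(run, p) for p in pivots] + [len(run)]
            for b in range(n):
                buckets[b].extend(run[cuts[b]:cuts[b + 1]])
        # Each bucket covers a disjoint key range, so buckets sort independently.
        return list(pool.map(sorted, buckets))
```

In the pivot-based variant, concatenating the returned partitions in order yields the fully sorted key sequence, so no final merge pass is needed. In CPython the threads mainly show the structure of the schemes; real speedups would depend on the runtime and workload.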
CN202210752984.3A 2022-06-29 2022-06-29 Database index creating method, related device, equipment and storage medium Pending CN115114293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210752984.3A CN115114293A (en) 2022-06-29 2022-06-29 Database index creating method, related device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210752984.3A CN115114293A (en) 2022-06-29 2022-06-29 Database index creating method, related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115114293A true CN115114293A (en) 2022-09-27

Family

ID=83329585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210752984.3A Pending CN115114293A (en) 2022-06-29 2022-06-29 Database index creating method, related device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115114293A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117453682A (en) * 2023-09-26 2024-01-26 广州海量数据库技术有限公司 Method and system for parallel creation of column store table btree index on openGauss database

Similar Documents

Publication Publication Date Title
US20190258625A1 (en) Data partitioning and ordering
Blanas et al. Parallel data analysis directly on scientific file formats
Raman et al. DB2 with BLU acceleration: So much more than just a column store
Jindal et al. Trojan data layouts: right shoes for a running elephant
Aji et al. Hadoop-GIS: A high performance spatial data warehousing system over MapReduce
US9043310B2 (en) Accessing a dimensional data model when processing a query
US9256665B2 (en) Creation of inverted index system, and data processing method and apparatus
US10713255B2 (en) Spool file for optimizing hash join operations in a relational database system
Junghanns et al. Gradoop: Scalable graph data management and analytics with hadoop
US20150006509A1 (en) Incremental maintenance of range-partitioned statistics for query optimization
Bugiotti et al. RDF data management in the Amazon cloud
CN104239377A (en) Platform-crossing data retrieval method and device
US20180276264A1 (en) Index establishment method and device
Fotache et al. NoSQL and SQL Databases for Mobile Applications. Case Study: MongoDB versus PostgreSQL.
El Alami et al. Supply of a key value database redis in-memory by data from a relational database
CN115114293A (en) Database index creating method, related device, equipment and storage medium
Lawson et al. Using a robust metadata management system to accelerate scientific discovery at extreme scales
Bugiotti et al. SPARQL Query Processing in the Cloud.
CN112445776A (en) Presto-based dynamic barrel dividing method, system, equipment and readable storage medium
Kvet et al. Relational pre-indexing layer supervised by the DB_index_consolidator Background Process
Liroz-Gistau et al. Dynamic workload-based partitioning algorithms for continuously growing databases
US11847121B2 (en) Compound predicate query statement transformation
Son et al. SSFile: A novel column-store for efficient data analysis in Hadoop-based distributed systems
Eze et al. Database system concepts, implementations and organizations-a detailed survey
Stockinger et al. Zns-efficient query processing with zurichnosql

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination