WO2020057272A1 - Method, device, and storage medium for storing and retrieving index data - Google Patents

Method, device, and storage medium for storing and retrieving index data

Info

Publication number
WO2020057272A1
Authority
WO
WIPO (PCT)
Prior art keywords
key
sets
equal
heap
ordered
Prior art date
Application number
PCT/CN2019/099125
Other languages
English (en)
French (fr)
Inventor
朱智佳
张永光
王海滨
周成祖
杜新胜
Original Assignee
厦门市美亚柏科信息股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 厦门市美亚柏科信息股份有限公司
Priority to EP19797156.7A (EP3654195A4)
Publication of WO2020057272A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G06F16/2264: Multidimensional index structures

Definitions

  • The present disclosure relates to the technical field of data storage and retrieval, and in particular to a method, device, and storage medium for storing and retrieving index data.
  • A database index is a storage structure created to speed up the retrieval of data in database tables. It consists of the values of one or more columns of a table together with a list of logical pointers to the locations in the table where those values are physically stored.
  • An index works like the table of contents of a book: the page numbers listed there let the reader quickly find the content that meets the requirements.
  • BTree (B/B+ tree)
  • Hash (hash table)
  • TrieTree (dictionary tree)
  • bitMap (bitmap)
  • Besides queries with a single search condition, databases generally support queries that combine multiple search conditions with AND, OR, and NOT logic. For example, retrieving from an e-commerce order table the records since yesterday whose transaction amount exceeds one thousand yuan is a compound query with two search conditions: the transaction time is after yesterday, and the transaction amount exceeds one thousand yuan. There are generally two ways to implement this compound query using index retrieval:
  • Scheme one: query each condition through the index of its field to obtain the record number sets S1 and S2, then merge the two sets. Scheme two: query the index only for the condition expected to match fewer records to obtain S1, then read each record in S1 and delete its number from S1 if it fails the other condition; in the end, S1 is the record number set of the compound query.
  • For compound queries with more than two conditions, the two schemes may be used together: first use the indexes to query the record number set that meets one or more search conditions, then read the records in that set one by one and check whether their field values meet the other search conditions.
  • For a compound search with n search conditions, suppose the record number result sets of the individual queries are S1, S2 ... Sn, with sizes N1, N2 ... Nn, respectively.
  • The final record number result set that meets all conditions is S, with size N. Without loss of generality, let S1 be the smallest set and Sn the largest, i.e.
  • $N_1 = \min\{N_i \mid 1 \le i \le n\}$
  • $N_n = \max\{N_i \mid 1 \le i \le n\}$
  • The time complexity of scheme two is better than that of scheme one, so its retrieval should be faster. In some databases, however, the index data is in memory, where index queries are fast, while the data needed to verify the remaining search conditions record by record is on disk, which is slower. Scheme two is therefore generally more efficient when $N_1$ is much smaller than $N_n$, and scheme one is more efficient when $N_1$ and $N_n$ do not differ much.
  • An index data storage method includes:
  • a dividing step: combine each record number and the value of the corresponding field into a key-value pair, for a total of P key-value pairs; sort all key-value pairs by the size of the value element; and divide the key elements of the P sorted key-value pairs into K sets, each containing L key elements;
  • a sorting step: sort each of the K key element sets by key element size, obtaining K corresponding ordered key element sequences;
  • a storing step: store the P ordered key-value pairs together with the corresponding K sorted key element sets;
  • where L is an integer greater than 0 and K is an integer greater than 0.
  • Further, the difference between L and K is less than a threshold, and $L = [P/K]$.
  • Further, when the total number P of key-value pairs is not an integer multiple of K, the last set contains $P - (K-1) \cdot L$ key-value pairs, where P is an integer greater than 1.
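The storage method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_segmented_index` and the in-memory list layout are hypothetical, `[P/K]` is read as integer division, and ties between equal values are broken by record number, which the disclosure does not specify.

```python
def build_segmented_index(records: dict[int, float], k: int):
    """Build the two-part index from a mapping {record number: field value}.

    Returns (pairs, segments):
      pairs    - all (key, value) pairs sorted by value (the P ordered pairs)
      segments - K lists of keys, each sorted by key (the K ordered key sets)
    """
    # Dividing step: sort the key-value pairs by value.
    # Assumption: ties are broken by key so the result is deterministic.
    pairs = sorted(records.items(), key=lambda kv: (kv[1], kv[0]))
    p = len(pairs)
    seg_len = p // k  # L = [P/K], read here as integer division (assumption)
    if seg_len == 0:
        raise ValueError("K must not exceed the number of key-value pairs")

    segments = []
    for i in range(k):
        start = i * seg_len
        # The last set takes the remainder: P - (K-1)*L pairs.
        end = p if i == k - 1 else start + seg_len
        # Sorting step: order the keys of this segment by key size.
        segments.append(sorted(key for key, _ in pairs[start:end]))

    # Storing step: both structures are kept side by side.
    return pairs, segments
```

With seven records and K = 3, the segments hold 2, 2, and 3 keys, which matches the $P - (K-1) \cdot L$ rule for the last set.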
  • The present disclosure also proposes a data retrieval method, which includes:
  • a receiving step: receive the input first search condition and second search condition;
  • a retrieval step: search the database according to the first search condition and the second search condition to obtain the key-sorted sets S1 and S2 corresponding to the respective search results;
  • a retrieval result obtaining step: process the key-sorted sets S1 and S2 to obtain the retrieval results.
  • Further, the data in the database is stored using the index data storage method according to any one of the above.
  • Further, when the processing of the key-sorted sets S1 and S2 is an AND operation, it proceeds as follows: take the smallest element m1 of S1; find the first element m2 in S2 that is greater than or equal to m1; if m2 equals m1, add m2 to the set S and let S2 advance to the next element, which replaces m2; then find the next element of S1 that is greater than or equal to m2 and take it as m1; if m1 equals m2, add m1 to S and let S1 advance to the next element, which replaces m1; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2. The set S is the intersection of the ordered sets S1 and S2, and the corresponding retrieval results are obtained from the key values in S.
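The AND procedure amounts to intersecting two sorted lists while letting each side jump ahead to the other side's current element. The following is a minimal sketch, assuming both inputs are sorted Python lists of distinct record numbers; the function name `intersect_sorted` and the use of `bisect_left` for the "first element greater than or equal to" lookups are illustrative choices, not taken from the disclosure.

```python
from bisect import bisect_left


def intersect_sorted(s1: list[int], s2: list[int]) -> list[int]:
    """Intersect two sorted lists of distinct keys by alternating seeks."""
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        m1 = s1[i]
        # Find the first element m2 of S2 that is >= m1.
        j = bisect_left(s2, m1, j)
        if j == len(s2):
            break  # S2 is exhausted
        m2 = s2[j]
        if m2 == m1:
            # Common element: record it and advance both sides.
            result.append(m1)
            i += 1
            j += 1
        else:
            # m2 > m1: jump S1 forward to its first element >= m2.
            i = bisect_left(s1, m2, i)
    return result
```

Because each seek skips a whole run of non-matching elements, the loop does not need to visit every element of S1 and S2 when the intersection is small.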
  • Further, when the processing of the key-sorted sets S1 and S2 is a NOT operation, it proceeds as follows: take the smallest element m1 of S1; find the first element m2 in S2 that is greater than or equal to m1; if m2 is not equal to m1, add m1 to the set S; if m2 equals m1, let S2 advance to the next element, which replaces m2; then find all subsequent elements of S1 that are greater than m1 and less than m2, add them to the result set S, and replace m1 with the first element of S1 that is greater than m2. Repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2.
  • The set S is the result of the NOT operation of the ordered set S1 against S2, and the corresponding retrieval results are obtained from the key values in S.
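The NOT operation can be read as the sorted difference S1 minus S2. The sketch below follows the same seek pattern as the description, with two assumptions that are mine rather than the disclosure's: the name `difference_sorted` is hypothetical, and when S2 runs out of elements the remaining tail of S1 is copied to the result, a case the text does not spell out.

```python
from bisect import bisect_left


def difference_sorted(s1: list[int], s2: list[int]) -> list[int]:
    """Return the elements of sorted list s1 that are absent from sorted list s2."""
    result = []
    i = j = 0
    while i < len(s1):
        m1 = s1[i]
        # Find the first element m2 of S2 that is >= m1.
        j = bisect_left(s2, m1, j)
        if j == len(s2):
            # Assumption: nothing left in S2 can exclude the rest of S1.
            result.extend(s1[i:])
            break
        m2 = s2[j]
        if m2 == m1:
            # m1 also occurs in S2, so it is excluded; advance both sides.
            i += 1
            j += 1
        else:
            # Every element of S1 in [m1, m2) is missing from S2: keep the run.
            nxt = bisect_left(s1, m2, i)
            result.extend(s1[i:nxt])
            i = nxt
    return result
```

The run copy `s1[i:nxt]` corresponds to "find all elements greater than m1 and less than m2 and add them to the result set".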
  • Further, when the processing of the key-sorted sets S1 and S2 is an OR operation, the OR is computed by building a maximal heap.
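A heap-based OR can be sketched with Python's standard `heapq.merge`, which merges already sorted inputs through a heap. Note the assumption: `heapq` is a min-heap, whereas the text speaks of a "maximal heap"; since the disclosure elsewhere places the smallest element at the top of the heap, a min-heap appears to match the intended behavior. The name `union_sorted` is hypothetical.

```python
import heapq


def union_sorted(*sorted_sets: list[int]) -> list[int]:
    """Union of any number of sorted key lists, returned in sorted order."""
    result = []
    # heapq.merge keeps one cursor per input on a heap and always yields
    # the smallest pending element next.
    for key in heapq.merge(*sorted_sets):
        # Equal keys from different inputs arrive adjacently; keep one copy.
        if not result or result[-1] != key:
            result.append(key)
    return result
```

For example, `union_sorted([1, 3, 5], [2, 3, 6])` yields `[1, 2, 3, 5, 6]`.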
  • Further, the retrieval step operates as follows: query the key value range of S1 according to the first search condition, and split that range along the sets that were sorted when the data was stored, obtaining multiple ordered sets and X unordered sets; sort the X unordered sets in S1; build a maximal heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the subset with the smallest element is at the top of the heap and the smallest element of S1 is the first element of the heap-top set. To find the first element greater than or equal to m1, check whether the first element of the heap-top set is less than m1.
  • If it is, the heap-top set discards its elements that are less than m1 by binary search over the ordered set, so that its first element becomes greater than or equal to m1; the heap is then readjusted, and the check is repeated on the new heap-top set until the first element of the heap-top set is greater than or equal to m1. At that point, the first element of the heap top is the first element greater than or equal to m1 among all sets in the heap. S2 is obtained in the same way, where X is 0, 1, or 2.
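The heap over sorted subsets can be sketched as a forward-only cursor. This is an illustration under assumptions: the class name `MultiSetCursor` and its `seek` method are hypothetical, the subsets are assumed to be sorted already (the boundary sets having been sorted beforehand), and Python's `heapq` min-heap stands in for the heap whose top holds the subset with the smallest first element.

```python
import heapq
from bisect import bisect_left


class MultiSetCursor:
    """Treats several sorted key lists as one set that supports forward seeks."""

    def __init__(self, subsets: list[list[int]]):
        self._subsets = [s for s in subsets if s]
        # Heap entry: (current first element, subset index, position in subset).
        self._heap = [(s[0], idx, 0) for idx, s in enumerate(self._subsets)]
        heapq.heapify(self._heap)

    def seek(self, m: int):
        """Return the first element >= m over all subsets, or None if none is left.

        Seeks are forward-only: elements skipped by one call are not revisited.
        """
        while self._heap and self._heap[0][0] < m:
            _, idx, pos = self._heap[0]
            subset = self._subsets[idx]
            # Binary search discards the elements of the top subset below m.
            pos = bisect_left(subset, m, pos)
            if pos == len(subset):
                heapq.heappop(self._heap)  # subset exhausted
            else:
                # Readjust the heap with the subset's new first element.
                heapq.heapreplace(self._heap, (subset[pos], idx, pos))
        return self._heap[0][0] if self._heap else None
```

A cursor like this can supply the "first element greater than or equal to" lookups that the AND and NOT procedures need, without first merging the subsets into one sorted list.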
  • The present disclosure also proposes an index data storage device.
  • The device includes:
  • a dividing module, configured to combine each record number and the value of the corresponding field into a key-value pair, for a total of P key-value pairs, sort all key-value pairs by the size of the value element, and divide the key elements of the P sorted key-value pairs into K sets, each containing L key elements;
  • a sorting module, configured to sort each of the K key element sets by key element size, obtaining K corresponding ordered key element sequences;
  • a storage module, configured to store the P ordered key-value pairs together with the corresponding K sorted key element sets; where L is an integer greater than 0 and K is an integer greater than 0.
  • Further, the difference between L and K is less than a threshold, and $L = [P/K]$.
  • Further, when the total number P of key-value pairs is not an integer multiple of K, the last set contains $P - (K-1) \cdot L$ key-value pairs, where P is an integer greater than 1.
  • The present disclosure also proposes a data retrieval device, which includes:
  • a receiving module, configured to receive the input first search condition and second search condition;
  • a retrieval module, configured to search the database according to the first search condition and the second search condition to obtain the key-sorted sets S1 and S2 corresponding to the respective search results;
  • a retrieval result obtaining module, configured to process the key-sorted sets S1 and S2 to obtain the retrieval results.
  • Further, the data in the database is stored using the index data storage method according to any one of the above.
  • Further, when the processing of the key-sorted sets S1 and S2 is an AND operation, it proceeds as follows: take the smallest element m1 of S1; find the first element m2 in S2 that is greater than or equal to m1; if m2 equals m1, add m2 to the set S and let S2 advance to the next element, which replaces m2; then find the next element of S1 that is greater than or equal to m2 and take it as m1; if m1 equals m2, add m1 to S and let S1 advance to the next element, which replaces m1; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2. The set S is the intersection of the ordered sets S1 and S2, and the corresponding retrieval results are obtained from the key values in S.
  • Further, when the processing is a NOT operation, it proceeds as follows: take the smallest element m1 of S1; find the first element m2 in S2 that is greater than or equal to m1; if m2 is not equal to m1, add m1 to the set S; if m2 equals m1, let S2 advance to the next element, which replaces m2; then find all subsequent elements of S1 that are greater than m1 and less than m2, add them to the result set S, and replace m1 with the first element of S1 that is greater than m2. Repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2.
  • The set S is the result of the NOT operation of the ordered set S1 against S2, and the corresponding retrieval results are obtained from the key values in S.
  • Further, when the processing is an OR operation, the OR is computed by building a maximal heap.
  • Further, the retrieval module operates as follows: query the key value range of S1 according to the first search condition, and split that range along the sets that were sorted when the data was stored, obtaining multiple ordered sets and X unordered sets; sort the X unordered sets in S1; build a maximal heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the subset with the smallest element is at the top of the heap and the smallest element of S1 is the first element of the heap-top set. To find the first element greater than or equal to m1, check whether the first element of the heap-top set is less than m1.
  • If it is, the heap-top set discards its elements that are less than m1 by binary search over the ordered set, so that its first element becomes greater than or equal to m1; the heap is then readjusted, and the check is repeated on the new heap-top set until the first element of the heap-top set is greater than or equal to m1. At that point, the first element of the heap top is the first element greater than or equal to m1 among all sets in the heap. S2 is obtained in the same way, where X is 0, 1, or 2.
  • the present disclosure also proposes a computer-readable storage medium, characterized in that a computer program code is stored on the storage medium, and when the computer program code is executed by a computer, any one of the methods described above is performed.
  • The technical effect of the present disclosure is as follows. To improve retrieval efficiency under multiple search conditions, when data (that is, key-value pairs) is stored, the pairs are not only sorted by the size of the value element; the sorted data sequence is also divided into multiple segments, the key values within each segment are sorted, and the data sequence is stored together with the corresponding key-value ordering. Both the value elements and the key values (also called record numbers) are thus stored in order, which constructs a new index structure.
  • For any range query, the result set can be expressed as the union of one or more sets, most of which are ordered, with at most the two sets at the boundaries unordered. This improves the efficiency of AND, OR, NOT, and similar operations in multi-condition queries.
  • FIG. 1 is a flowchart of an index data storage method according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart of a data retrieval method according to an embodiment of the present disclosure.
  • FIG. 3 is a structural diagram of an index data storage device according to an embodiment of the present disclosure.
  • FIG. 4 is a structural diagram of a data retrieval device according to an embodiment of the present disclosure.
  • FIG. 1 illustrates an index data storage method of the present disclosure.
  • the method includes:
  • Dividing step S101: combine each record number and the value of the corresponding field into a key-value pair, for a total of P key-value pairs; sort all key-value pairs by the size of the value element; and divide the key elements of the P sorted key-value pairs into K sets, each containing L key elements (also referred to as key values; in this disclosure, key values and key elements are the same).
  • Sorting step S102: sort each of the K key element sets by key element size, obtaining K corresponding ordered key element sequences.
  • Storing step: store the P ordered key-value pairs together with the corresponding K sorted key element sets.
  • P, L, and K are integers greater than 0, as will be apparent to those skilled in the art.
  • $L = [P/K]$.
  • When the key-value pairs cannot be divided evenly into the sets, that is, when the total number P of key-value pairs is not an integer multiple of K, the last set contains $P - (K-1) \cdot L$ key-value pairs, where P is an integer greater than 1.
  • In a specific embodiment, to keep both the values and the record numbers ordered, the method trades space for time. It uses a multi-level sequential table: in addition to the record numbers listed in the order of the sorted field values, it stores sorted copies of the record numbers, kept in batches of segments according to the configured level. For example, a column can be added that sorts every five consecutive record numbers.
  • This data storage method is the establishment of the index structure. It is one of the important disclosure points of the present disclosure, and is the basis of the retrieval method proposed by the present disclosure.
  • Figure 2 illustrates a data retrieval method, which includes:
  • Receiving step S201: a first search condition and a second search condition are received.
  • For example, the first search condition is that the transaction time is after yesterday,
  • and the second search condition is that the transaction amount exceeds one thousand yuan.
  • Retrieval step S202: a search is performed in the database according to the first search condition and the second search condition to obtain the key-sorted sets S1 and S2 corresponding to the respective search results.
  • S1 is the set of record numbers whose transaction time is after yesterday,
  • S2 is the set of record numbers whose transaction amount exceeds one thousand yuan,
  • and S1 and S2 are both ordered sets.
  • Retrieval result obtaining step S203: the key-sorted sets S1 and S2 are processed to obtain the retrieval results.
  • the data in the database in the present disclosure is stored using the method shown in FIG. 1.
  • $T(n) = N \cdot \log_2(N_1 \cdot N_2) / (2a)$
  • N is the size of the S set
  • N1 is the size of the S1 set
  • N2 is the size of the S2 set
  • a is the coefficient parameter for the intersection of the two sets of S1 and S2.
  • the storage structures of S1 and S2 are ordered, and the result set S can be quickly calculated without traversing S1 and S2, so the algorithm will achieve very good results when N1 and N2 are relatively large and N is relatively small.
  • the set S is the intersection of the ordered sets S1 and S2 of key values, and corresponding search results are obtained according to the key values in S.
  • The NOT operation is basically similar to the AND operation and proceeds, for example, as follows:
  • $S_{34}$, obtained by combining $S_3$ and $S_4$, is regarded as a single search condition.
  • In a specific implementation, four search conditions are combined as follows:
  • The four-condition search is converted into an AND operation on $S_{12}$ and $S_{34}$; that is, a multi-condition operation is split into pairwise operations whose results are then combined.
  • the above-mentioned specific multi-condition retrieval method is one of the important disclosure points of the present disclosure, and in combination with the data storage manner of the present disclosure, the retrieval efficiency can be greatly improved.
  • The retrieval step S202 operates as follows: query the key value range of S1 according to the first search condition, and split that range along the sets that were sorted when the data was stored, obtaining multiple ordered sets and X unordered sets; sort the X unordered sets in S1; build a maximal heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the subset with the smallest element is at the top of the heap and the smallest element of S1 is the first element of the heap-top set.
  • To find the first element greater than or equal to m1, check whether the first element of the heap-top set is less than m1.
  • If it is, the heap-top set discards its elements that are less than m1 by binary search over the ordered set, so that its first element becomes greater than or equal to m1; the heap is then readjusted, and the check is repeated on the new heap-top set until the first element of the heap-top set is greater than or equal to m1. At that point, the first element of the heap top is the first element greater than or equal to m1 among all sets in the heap. S2 is obtained in the same way, where X is 0, 1, or 2; that is, only the sets at the left and right boundaries may be unordered.
  • For example, the set of record numbers whose transaction amount exceeds 1000, {14, 2, 9, 12, 1, 13, 15}, is the union of an unordered set {14, 2} (the left boundary is an unordered set) and an ordered set {1, 9, 12, 13, 15}.
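The decomposition in this example can be reproduced with a short sketch. Everything concrete here is an assumption made for illustration: the function name `range_query_subsets`, the list-based index layout, and the sample amounts, which are invented so that the record numbers match the example (the original table is not reproduced in this text).

```python
from bisect import bisect_left, bisect_right


def range_query_subsets(pairs, segments, low, high):
    """Split the keys whose value lies in [low, high] into ordered and unordered subsets.

    pairs    - (key, value) pairs sorted by value
    segments - key lists, one per consecutive slice of pairs, each sorted by key
    """
    values = [value for _, value in pairs]
    lo, hi = bisect_left(values, low), bisect_right(values, high)
    ordered, unordered = [], []
    start = 0
    for seg in segments:
        end = start + len(seg)
        a, b = max(lo, start), min(hi, end)
        if a < b:
            if a == start and b == end:
                ordered.append(seg)  # fully covered: reuse the stored key order
            else:
                # Partially covered boundary segment: keys come out in value order.
                unordered.append([key for key, _ in pairs[a:b]])
        start = end
    return ordered, unordered


# Hypothetical data: ten records, segments of five consecutive record numbers.
pairs = [(3, 100), (7, 200), (5, 300), (14, 1100), (2, 1200),
         (9, 1300), (12, 1400), (1, 1500), (13, 1600), (15, 1700)]
segments = [[2, 3, 5, 7, 14], [1, 9, 12, 13, 15]]
ordered, unordered = range_query_subsets(pairs, segments, 1001, float("inf"))
# ordered == [[1, 9, 12, 13, 15]], unordered == [[14, 2]]
```

Only a partially covered segment can produce an unordered subset, and a contiguous value range can cut at most the first and the last segment it touches, which is why X is at most 2.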
  • S1 is thus the union of ordered sets and unordered sets. The goal is to combine several algorithms so that the position of an element of a specified size can be found quickly in the whole set.
  • A single ordered set can be searched by binary search. Multiple ordered sets can be organized together with a heap, which finds the position of an element of a specified size more efficiently.
  • $T(n) = 4t \log_2(t) + N \cdot N_1 \cdot N_2 \cdot \log_2(N_1 \cdot N_2) / (2 t^2 a b_1 b_2)$
  • t is the size of the ordered set
  • b1 is the coefficient for the overlap among the subsets of S1
  • b2 is the coefficient for the overlap among the subsets of S2
  • $4t \log_2(t)$ is the complexity of sorting the unordered sets. Increasing t raises the sorting cost but reduces the number of ordered subsets and therefore the cost of the subsequent merge. A further optimization is to build a multi-level sorted sequence, so that the unordered sets can use a smaller t while the ordered subsets use a combination of several sizes t, which reduces the number of subsets and improves efficiency. Each additional level of ordering, however, increases the space occupied by the record numbers, so the number of ordering levels to build is a trade-off. When N is relatively small, $t^2 b_1 b_2$ will in most cases be very close to $N_1 \cdot N_2$, and the overall time complexity will be good. This is one of the important disclosure points of this disclosure.
  • the present disclosure provides an embodiment of an index data storage device.
  • the device embodiment corresponds to the method embodiment shown in FIG. 1.
  • The device can be included in various electronic devices.
  • FIG. 3 illustrates an index data storage device, which includes:
  • Dividing module 301: combines each record number and the value of the corresponding field into a key-value pair, for a total of P key-value pairs; sorts all key-value pairs by the size of the value element; and divides the key elements of the P sorted key-value pairs into K sets, each containing L key elements (also referred to as key values; in this disclosure, key values and key elements are the same).
  • Sorting module 302: sorts each of the K key element sets by key element size, obtaining K corresponding ordered key element sequences.
  • Storage module 303: stores the P ordered key-value pairs together with the corresponding K sorted key element sets.
  • P, L, and K are integers greater than 0, as will be apparent to those skilled in the art.
  • $L = [P/K]$.
  • When the key-value pairs cannot be divided evenly into the sets, that is, when the total number P of key-value pairs is not an integer multiple of K, the last set contains $P - (K-1) \cdot L$ key-value pairs, where P is an integer greater than 1.
  • In a specific embodiment, to keep both the values and the record numbers ordered, the device trades space for time. It uses a multi-level sequential table: in addition to the record numbers listed in the order of the sorted field values, it stores sorted copies of the record numbers, kept in batches of segments according to the configured level. For example, a column can be added that sorts every five consecutive record numbers.
  • This data storage method is the establishment of the index structure. It is one of the important disclosure points of the present disclosure, and is the basis of the retrieval method proposed by the present disclosure.
  • the present disclosure provides an embodiment of a data retrieval device.
  • the device embodiment corresponds to the method embodiment shown in FIG. 2.
  • The device can be included in various electronic devices.
  • FIG. 4 illustrates a data retrieval device, which includes:
  • Receiving module 401: receives the input first search condition and second search condition. For example, the first search condition is that the transaction time is after yesterday, and the second search condition is that the transaction amount exceeds one thousand yuan.
  • Retrieval module 402: searches the database according to the first search condition and the second search condition to obtain the key-sorted sets S1 and S2 corresponding to the respective search results.
  • S1 is the set of record numbers whose transaction time is after yesterday,
  • S2 is the set of record numbers whose transaction amount exceeds one thousand yuan,
  • and S1 and S2 are both ordered sets.
  • Retrieval result obtaining module 403: processes the key-sorted sets S1 and S2 to obtain the retrieval results.
  • the data in the database in the present disclosure is stored using the method shown in FIG. 1.
  • $T(n) = N \cdot \log_2(N_1 \cdot N_2) / (2a)$
  • N is the size of the S set
  • N1 is the size of the S1 set
  • N2 is the size of the S2 set
  • a is the coefficient parameter in the case where the two sets of S1 and S2 cross.
  • the storage structures of S1 and S2 are ordered, and the result set S can be quickly calculated without traversing S1 and S2, so the algorithm will achieve very good results when N1 and N2 are relatively large and N is relatively small.
  • the set S is the intersection of the ordered sets S1 and S2 of key values, and corresponding search results are obtained according to the key values in S.
  • The NOT operation is basically similar to the AND operation and proceeds, for example, as follows:
  • $S_{34}$, obtained by combining $S_3$ and $S_4$, is regarded as a single search condition.
  • In a specific implementation, four search conditions are combined as follows:
  • The four-condition search is converted into an AND operation on $S_{12}$ and $S_{34}$; that is, a multi-condition operation is split into pairwise operations whose results are then combined.
  • the above-mentioned specific multi-condition retrieval method is one of the important disclosure points of the present disclosure, and in combination with the data storage manner of the present disclosure, the retrieval efficiency can be greatly improved.
  • The retrieval module 402 operates as follows: query the key value range of S1 according to the first search condition, and split that range along the sets that were sorted when the data was stored, obtaining multiple ordered sets and X unordered sets; sort the X unordered sets in S1; build a maximal heap over all subsets of S1, using the smallest element of each subset to order the subsets.
  • The subset with the smallest element is at the top of the heap.
  • The smallest element of S1 is the first element of the heap-top set. To find the first element greater than or equal to m1, check whether the first element of the heap-top set is less than m1.
  • If it is, the heap-top set discards its elements that are less than m1 by binary search over the ordered set, so that its first element becomes greater than or equal to m1; the heap is then readjusted, and the check is repeated on the new heap-top set until the first element of the heap-top set is greater than or equal to m1. At that point, the first element of the heap top is the first element greater than or equal to m1 among all sets in the heap. S2 is obtained in the same way, where X is 0, 1, or 2; that is, only the sets at the left and right boundaries may be unordered.
  • For example, the set of record numbers whose transaction amount exceeds 1000, {14, 2, 9, 12, 1, 13, 15}, is the union of an unordered set {14, 2} (the left boundary is an unordered set) and an ordered set {1, 9, 12, 13, 15}.
  • S1 is thus the union of ordered sets and unordered sets. The goal is to combine several algorithms so that the position of an element of a specified size can be found quickly in the whole set.
  • A single ordered set can be searched by binary search. Multiple ordered sets can be organized together with a heap, which finds the position of an element of a specified size more efficiently.
  • $T(n) = 4t \log_2(t) + N \cdot N_1 \cdot N_2 \cdot \log_2(N_1 \cdot N_2) / (2 t^2 a b_1 b_2)$
  • t is the size of the ordered set
  • b1 is the coefficient for the overlap among the subsets of S1
  • b2 is the coefficient for the overlap among the subsets of S2
  • $4t \log_2(t)$ is the complexity of sorting the unordered sets. Increasing t raises the sorting cost but reduces the number of ordered subsets and therefore the cost of the subsequent merge. A further optimization is to build a multi-level sorted sequence, so that the unordered sets can use a smaller t while the ordered subsets use a combination of several sizes t, which reduces the number of subsets and improves efficiency. Each additional level of ordering, however, increases the space occupied by the record numbers, so the number of ordering levels to build is a trade-off. When N is relatively small, $t^2 b_1 b_2$ will in most cases be very close to $N_1 \cdot N_2$, and the overall time complexity will be good. This is one of the important disclosure points of this disclosure.
  • the present disclosure can be implemented by means of software plus a necessary universal hardware platform.
  • The technical solution of the present disclosure, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product, which can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and which includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or in portions of the embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

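The present disclosure provides a method, device, and storage medium for storing and retrieving index data. When data (that is, key-value pairs) is stored, the data storage method not only sorts the pairs by the size of the value element but also divides the sorted data sequence into multiple segments, sorts the key values within each segment, and stores the data sequence together with the corresponding key-value ordering, so that both the value elements and the key values (also called record numbers) are stored in order. This constructs a new index structure, and a multi-condition retrieval method suited to that index structure is proposed. For any range query, the result set can be expressed as the union of one or more sets, most of which are ordered, with at most the two sets at the boundaries unordered, which improves the efficiency of AND, OR, NOT, and similar operations in multi-condition queries.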

Description

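Method, device, and storage medium for storing and retrieving index data
Related application
This application claims priority to Chinese patent application No. 201811091065.6, filed on September 18, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of data storage and retrieval, and in particular to a method, device, and storage medium for storing and retrieving index data.
Background
A database index is a storage structure created to speed up the retrieval of data in database tables. It consists of the values of one or more columns of a table together with a list of logical pointers to the locations in the table where those values are physically stored. An index works like the table of contents of a book: the page numbers listed there let the reader quickly find the content that meets the requirements.
Commonly used index storage structures include BTree (B/B+ tree), Hash, TrieTree (dictionary tree), bitMap (bitmap), sequential tables, and skip lists.
Besides queries with a single search condition, databases generally support queries that combine multiple search conditions with AND, OR, and NOT logic. For example, retrieving from an e-commerce order table the records since yesterday whose transaction amount exceeds one thousand yuan is a compound query with two search conditions: the transaction time is after yesterday, and the transaction amount exceeds one thousand yuan. There are generally two index-based schemes for implementing this compound query: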
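Scheme one:
1. Use the index on the "transaction time" field to query the record numbers whose transaction time is after yesterday; call this set S1.
2. Use the index on the "transaction amount" field to query the record numbers whose transaction amount exceeds one thousand yuan; call this set S2.
3. Merge the two sets S1 and S2 to obtain the set of record numbers that meet both conditions, read the records by number, and return the results.
Scheme two:
1. Judge which of the two search conditions is likely to yield the smaller result set, and first use the index to query the records that meet that condition. For instance, if the records whose transaction time is after yesterday are estimated to be fewer than the records whose transaction amount exceeds one thousand, use the index on the "transaction time" field to query the record numbers whose transaction time is after yesterday; call this set S1.
2. For each record number in S1, fetch the record content and read the transaction amount to check whether it exceeds one thousand yuan; if it does not, delete that record number from S1. In the end, S1 is the record number set of the compound query.
For compound queries with more than two conditions, the two schemes may be used together: first use the indexes to query the record number set that meets one or more search conditions, then read the records in that set one by one and check whether their field values meet the other search conditions.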
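The defects of the prior art are as follows:
For a compound search with n search conditions, suppose the record number result sets of the individual queries are S1, S2 ... Sn, with sizes N1, N2 ... Nn, respectively, and the final record number result set that meets all conditions is S, with size N. Without loss of generality, let S1 be the smallest set and Sn the largest, i.e.
$N_1 = \min\{N_i \mid 1 \le i \le n\}$
$N_n = \max\{N_i \mid 1 \le i \le n\}$
The time complexities of the two schemes are, respectively:
Scheme one: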
Figure PCTCN2019099125-appb-000001
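Scheme two:
$T(n) = N_1$
The time complexity of scheme two is better than that of scheme one, so its retrieval should be faster. In some databases, however, the index data is in memory, where index queries are fast, while the data needed to verify the remaining search conditions record by record is on disk, which is slower. Scheme two is therefore generally more efficient when $N_1$ is much smaller than $N_n$, and scheme one is more efficient when $N_1$ and $N_n$ do not differ much.
Although these two schemes work well for most queries, they share a weakness: when the set S1 is large but the final merged result set S is small, neither performs well. That is, when the result sets of the individual search conditions are fairly large while the overall result set of the compound search is small, databases often support such queries poorly and they are time-consuming.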
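Summary of the disclosure
In view of the above defects in the prior art, the present disclosure proposes the following technical solutions.
An index data storage method, comprising:
a dividing step: combine each record number and the value of the corresponding field into a key-value pair, for a total of P key-value pairs; sort all key-value pairs by the size of the value element; and divide the key elements of the P sorted key-value pairs into K sets, each containing L key elements;
a sorting step: sort each of the K key element sets by key element size, obtaining K corresponding ordered key element sequences;
a storing step: store the P ordered key-value pairs together with the corresponding K sorted key element sets;
where L is an integer greater than 0 and K is an integer greater than 0.
Further, the difference between L and K is less than a threshold,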
Figure PCTCN2019099125-appb-000002
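$L = [P/K]$.
Further, when the total number P of key-value pairs is not an integer multiple of K, the last set contains $P - (K-1) \cdot L$ key-value pairs, where P is an integer greater than 1.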
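The present disclosure also proposes a data retrieval method, comprising:
a receiving step: receive the input first search condition and second search condition;
a retrieval step: search the database according to the first search condition and the second search condition to obtain the key-sorted sets S1 and S2 corresponding to the respective search results;
a retrieval result obtaining step: process the key-sorted sets S1 and S2 to obtain the retrieval results.
Further, the data in the database is stored using the index data storage method according to any one of the above.
Further, when the processing of the key-sorted sets S1 and S2 is an AND operation, it proceeds as follows: take the smallest element m1 of S1; find the first element m2 in S2 that is greater than or equal to m1; if m2 equals m1, add m2 to the set S and let S2 advance to the next element, which replaces m2; then find the next element of S1 that is greater than or equal to m2 and take it as m1; if m1 equals m2, add m1 to S and let S1 advance to the next element, which replaces m1; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2. The set S is the intersection of the ordered sets S1 and S2, and the corresponding retrieval results are obtained from the key values in S.
Further, when the processing of the key-sorted sets S1 and S2 is a NOT operation, it proceeds as follows: take the smallest element m1 of S1; find the first element m2 in S2 that is greater than or equal to m1; if m2 is not equal to m1, add m1 to the set S; if m2 equals m1, let S2 advance to the next element, which replaces m2; then find all subsequent elements of S1 that are greater than m1 and less than m2, add them to the result set S, and replace m1 with the first element of S1 that is greater than m2. Repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2. The set S is the result of the NOT operation of the ordered set S1 against S2, and the corresponding retrieval results are obtained from the key values in S.
Further, when the processing of the key-sorted sets S1 and S2 is an OR operation, the OR is computed by building a maximal heap.
Further, the retrieval step operates as follows: query the key value range of S1 according to the first search condition, and split that range along the sets that were sorted when the data was stored, obtaining multiple ordered sets and X unordered sets; sort the X unordered sets in S1; build a maximal heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the subset with the smallest element is at the top of the heap and the smallest element of S1 is the first element of the heap-top set. To find the first element greater than or equal to m1, check whether the first element of the heap-top set is less than m1; if it is, the heap-top set discards its elements that are less than m1 by binary search over the ordered set, so that its first element becomes greater than or equal to m1; the heap is then readjusted, and the check is repeated on the new heap-top set until the first element of the heap-top set is greater than or equal to m1. At that point, the first element of the heap top is the first element greater than or equal to m1 among all sets in the heap. S2 is obtained in the same way, where X is 0, 1, or 2.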
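The present disclosure further proposes an index data storage device, the device comprising:
a dividing module, configured to form a key-value pair from each record number and the value of the corresponding field, P key-value pairs in total, to sort all key-value pairs by the size of the value element, and to divide all key elements of the P sorted key-value pairs into K sets, each containing L key elements;
a sorting module, configured to sort each of the K key-element sets by the size of the key elements to obtain K corresponding ordered key-element sets;
a storing module, configured to store the P ordered key-value pairs and the corresponding K sorted key-element sets in correspondence with each other; where L is an integer greater than 0 and K is an integer greater than 0.
Further, the difference between L and K is less than a threshold,
Figure PCTCN2019099125-appb-000003
$L = [P/K]$.
Further, when the total number P of key-value pairs is not an integer multiple of K, the last set contains P-(K-1)*L key-value pairs, where P is an integer greater than 1.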
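The present disclosure further proposes a data retrieval device, the device comprising:
a receiving module, configured to receive an input first retrieval condition and second retrieval condition;
a retrieving module, configured to search the database according to the first retrieval condition and the second retrieval condition to obtain the key-sorted sets S1 and S2 corresponding to the respective retrieval results;
a result obtaining module, configured to process the key-sorted sets S1 and S2 to obtain the retrieval result.
Further, the data in the database is stored using any one of the index data storage methods described above.
Further, when the processing of the key-sorted sets S1 and S2 is an AND operation, the operation is: take the smallest element m1 of S1; find in S2 the first element m2 greater than or equal to m1; if m2 equals m1, add m2 to the set S, and S2 moves on to its next element, which replaces m2; continue in S1 to find the first element m1 greater than or equal to m2; if m1 equals m2, add m1 to the set S, and S1 moves on to its next element, which replaces m1; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2. The set S is the intersection of the ordered sets S1 and S2, and the corresponding retrieval result is obtained from the key values in S.
Further, when the processing of the key-sorted sets S1 and S2 is a NOT operation, the operation is: take the smallest element m1 of S1; find in S2 the first element m2 greater than or equal to m1; if m2 does not equal m1, add m1 to the set S; if m2 equals m1, S2 moves on to its next element, which replaces m2; continue in S1 to find all elements greater than m1 and less than m2 and add them to the result set S, and replace m1 with the first element of S1 greater than m2; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2. The set S is the result of applying the NOT operation to the ordered set S1 with respect to S2, and the corresponding retrieval result is obtained from the key values in S.
Further, when the processing of the key-sorted sets S1 and S2 is an OR operation, the operation is: perform the OR operation by building a max-heap.
Further, the specific operation performed by the retrieving module is: query the key-value range of S1 according to the first retrieval condition, and split that range along the sets sorted at storage time to obtain a plurality of ordered sets and X unordered sets; sort the X unordered sets of S1; build a max-heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the smallest set is at the top of the heap and the smallest element of S1 is the first element of the top set; to find the first element greater than or equal to m1, check whether the first element of the top set is less than m1; if it is, use binary search on the ordered top set to discard its elements less than m1, so that its first element is greater than or equal to m1, then readjust the heap and check the top set again, until the first element of the top set is greater than or equal to m1, at which point that element is the first element greater than or equal to m1 among all sets in the heap; S2 is obtained in the same way; X is 0, 1 or 2.
The present disclosure further proposes a computer-readable storage medium, characterized in that computer program code is stored on the storage medium, and when the computer program code is executed by a computer, any one of the methods described above is performed.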
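The technical effect of the present disclosure is as follows. To improve retrieval efficiency under multiple retrieval conditions, when data (that is, key-value pairs) is stored, it is not only sorted by the size of the value element; the sorted data sequence is also divided into multiple segments, the key values within each segment are sorted, and the data sequence is stored together with the sorted key values. Both the value elements and the key values (also called record numbers) are thus stored in order, which forms a new index structure. For any range query, the result set can be expressed as the union of one or more sets, most of which are ordered; at most the two sets at the boundaries are unordered. This improves the efficiency of AND, OR, NOT and similar operations in multi-condition queries.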
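Brief Description of the Drawings
Other features, objects and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings.
FIG. 1 is a flowchart of an index data storage method according to an embodiment of the present disclosure.
FIG. 2 is a flowchart of a data retrieval method according to an embodiment of the present disclosure.
FIG. 3 is a structural diagram of an index data storage device according to an embodiment of the present disclosure.
FIG. 4 is a structural diagram of a data retrieval device according to an embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the disclosure and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the disclosure.
It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features in them may be combined with one another. The disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.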
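FIG. 1 shows an index data storage method of the present disclosure, the method comprising:
Dividing step S101: form a key-value pair from each record number and the value of the corresponding field, P key-value pairs in total; sort all key-value pairs by the size of the value element; divide all key elements of the P sorted key-value pairs into K sets, each containing L key elements (also called key values; in this disclosure, key value and key element mean the same thing).
Sorting step S102: sort each of the K key-element sets by the size of the key elements to obtain K corresponding ordered key-element sets.
Storing step S103: store the P ordered key-value pairs and the corresponding K sorted key-element sets in correspondence with each other.
The parameters P, L and K in this disclosure are integers greater than 0, as will be evident to those skilled in the art.
Experiments show that the closer the values of L and K, the better the retrieval performance; that is, the difference between L and K is less than a threshold, for example 3 or 5. A preferred embodiment is:
Figure PCTCN2019099125-appb-000004
$L = [P/K]$, in which case retrieval is most efficient. This is one of the important points of the present disclosure.
In some embodiments, the key-value pairs cannot be divided evenly among the sets; that is, when the total number P of key-value pairs is not an integer multiple of K, the last set contains P-(K-1)*L key-value pairs, where P is an integer greater than 1.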
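The set sizes implied by the two formulas above can be checked with a short Python sketch. It assumes that the square brackets in L=[P/K] mean rounding down, which the text does not state, and it leaves the choice of K to the caller because the preferred formula for K appears only in the figure. The function name `partition_sizes` is hypothetical.

```python
def partition_sizes(p: int, k: int) -> list[int]:
    """Sizes of the K key sets for P key-value pairs.

    Assumption: L = floor(P / K); the first K-1 sets hold L keys each and
    the last set holds the remaining P - (K-1)*L keys.
    """
    if p < 1 or k < 1 or k > p:
        raise ValueError("need 1 <= K <= P")
    l = p // k
    return [l] * (k - 1) + [p - (k - 1) * l]


# P = 15, K = 3 divides evenly; P = 17, K = 4 leaves a larger last set.
even = partition_sizes(15, 3)
uneven = partition_sizes(17, 4)
```

Under this reading the last set is never smaller than the others, since the remainder is folded into it rather than forming an extra short set.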
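A specific embodiment of the method is as follows. To keep both the values and the record numbers ordered, space is traded for time by using a multi-level sequential table: in addition to the record numbers listed in field-value order, a sorted copy of the record numbers is stored, and this copy can be stored in batches and segments at configurable levels. As in the table below, a column is added in which every five consecutive record numbers are sorted.
Figure PCTCN2019099125-appb-000005
L=5 means that every 5 consecutive record numbers are sorted as one set, so all record numbers are made up of 3 ordered sets of size 5: {5,7,8,10,11}, {2,3,4,6,14} and {1,9,12,13,15}. In this embodiment P=15, K=3 and L=5. This way of storing data, that is, of building the index structure, is one of the important points of the present disclosure and is the basis of the retrieval method proposed below.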
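A minimal Python sketch of steps S101 to S103 follows. The field values in `EXAMPLE_VALUES` are hypothetical: the original table is available only as an image, so the amounts were chosen merely so that the three sorted sets match the ones listed above. The name `build_index` and the in-memory return format are likewise assumptions, and the sketch simply cuts blocks of `block_len` keys, so a remainder would form a short final block instead of being folded into the last set.

```python
def build_index(values: dict[int, float], block_len: int):
    """Build the two-part index: value-ordered pairs plus key-sorted blocks.

    values maps record number (key) -> field value.
    """
    # S101: sort key-value pairs by value (ties broken by key for determinism).
    pairs = sorted(values.items(), key=lambda kv: (kv[1], kv[0]))
    keys_by_value = [key for key, _ in pairs]
    # S101/S102: cut the key sequence into blocks of block_len, sort each block.
    sorted_blocks = [
        sorted(keys_by_value[i:i + block_len])
        for i in range(0, len(keys_by_value), block_len)
    ]
    # S103: a real system would persist both structures; here they are returned.
    return pairs, sorted_blocks


# Hypothetical transaction amounts keyed by record number (P = 15).
EXAMPLE_VALUES = {
    7: 100, 5: 200, 10: 300, 8: 400, 11: 500,
    3: 600, 6: 700, 4: 800, 14: 1100, 2: 1200,
    9: 1300, 12: 1400, 1: 1500, 13: 1600, 15: 1700,
}
pairs, blocks = build_index(EXAMPLE_VALUES, block_len=5)
```

The extra storage is one additional copy of the record numbers per sorting level, which is the space-for-time trade the embodiment describes.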
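FIG. 2 shows a data retrieval method, the method comprising:
Receiving step S201: receive an input first retrieval condition and second retrieval condition; for example, the first retrieval condition is records whose transaction time is later than yesterday, and the second retrieval condition is records whose transaction amount exceeds one thousand yuan.
Retrieving step S202: search the database according to the first retrieval condition and the second retrieval condition to obtain the key-sorted sets S1 and S2 corresponding to the respective retrieval results. In one embodiment, S1 is the set of record numbers whose transaction time is later than yesterday, S2 is the set of record numbers whose transaction amount exceeds one thousand yuan, and both S1 and S2 are ordered sets.
Result obtaining step S203: process the key-sorted sets S1 and S2 to obtain the retrieval result.
The data in the database of this disclosure is stored using the method shown in FIG. 1.
AND is a common operation in multi-condition retrieval. When the processing of the key-sorted sets S1 and S2 is an AND operation, the operation is:
1. Take the smallest element m1 of S1. The time complexity is 1.
2. Find in S2 the first element m2 greater than or equal to m1; if m2 equals m1, add m2 to the set S, and S2 moves on to its next element, which replaces m2. An ordered set can be binary-searched, so the complexity is $\log_2(N_{t2})$, where $N_{t2}$ is the size of S2 before this step.
3. Continue in S1 to find the first element m1 greater than or equal to m2; if m1 equals m2, add m1 to the set S, and S1 moves on to its next element, which replaces m1. The complexity is $\log_2(N_{t1})$, where $N_{t1}$ is the size of S1 before this step.
4. Repeat steps 2 and 3 until m1 or m2 cannot be found.
The time complexity of this algorithm is
$T(n) = N \cdot \log_2(N_1 \cdot N_2)/(2a)$
N is the size of S, N1 is the size of S1, N2 is the size of S2, and a is a coefficient describing how the two sets S1 and S2 interleave. Because S1 and S2 are stored in order, the result set S can be computed quickly without traversing S1 and S2, so the algorithm performs very well when N1 and N2 are large and N is small. The set S is the intersection of the key-ordered sets S1 and S2, and the corresponding retrieval result is obtained from the key values in S.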
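The AND steps above can be sketched in Python with the standard `bisect` module. This is an illustrative reading rather than the applicant's implementation: S1 and S2 are assumed to be plain sorted lists of distinct record numbers, and the function name `intersect_sorted` is hypothetical.

```python
from bisect import bisect_left


def intersect_sorted(s1: list[int], s2: list[int]) -> list[int]:
    """Intersection of two sorted lists by alternating binary searches."""
    result = []
    i = j = 0  # current positions of m1 in s1 and m2 in s2
    while i < len(s1) and j < len(s2):
        m1 = s1[i]
        # Step 2: first element of s2 that is >= m1, searching right of j only.
        j = bisect_left(s2, m1, j)
        if j == len(s2):
            break  # m2 cannot be found
        m2 = s2[j]
        if m2 == m1:
            result.append(m1)  # common element
            i += 1
            j += 1
            continue
        # Step 3: first element of s1 that is >= m2, searching right of i only.
        i = bisect_left(s1, m2, i)
    return result


common = intersect_sorted([1, 3, 5, 7], [3, 4, 7, 9])
```

Each binary search skips a whole run of non-matching elements, which is why the cost tracks the size of the result and the degree of interleaving rather than the sizes of S1 and S2.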
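The NOT operation is broadly similar to the AND operation. For example:
Figure PCTCN2019099125-appb-000006
The steps are as follows:
1. Take the smallest element m1 of S1.
2. Find in S2 the first element m2 greater than or equal to m1; if m2 does not equal m1, add m1 to the result set $S_{12}$. If m2 equals m1, S2 moves on to its next element, which replaces m2.
3. Continue in S1 to find all elements greater than m1 and less than m2, add them to the result set $S_{12}$, and replace m1 with the first element of S1 greater than m2.
4. Repeat steps 2 and 3 until m1 or m2 cannot be found.
The union of the sets of two retrieval conditions is handled in the same way as the union of multiple ordered subsets: build a max-heap, and use the result of the union of the two sets as if it were a single retrieval condition. One difference from ordered subsets is that record numbers never repeat across ordered subsets, whereas the sets of several retrieval conditions may share record numbers, so duplicates must be removed while taking the minimum.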
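The NOT steps listed above can be sketched in Python as follows. The sketch assumes sorted lists of distinct record numbers and a hypothetical function name, `difference_sorted`. It also makes one step explicit that the text leaves implicit: once S2 is exhausted, every remaining element of S1 belongs to the result.

```python
from bisect import bisect_left


def difference_sorted(s1: list[int], s2: list[int]) -> list[int]:
    """Elements of sorted list s1 that do not occur in sorted list s2."""
    result = []
    i = j = 0
    while i < len(s1):
        m1 = s1[i]
        # Step 2: first element of s2 that is >= m1.
        j = bisect_left(s2, m1, j)
        if j == len(s2):
            # Assumption: with s2 exhausted, the rest of s1 survives.
            result.extend(s1[i:])
            break
        m2 = s2[j]
        if m2 == m1:
            i += 1  # m1 is in s2, so it is excluded
            continue
        # Step 3: every s1 element in [m1, m2) is absent from s2.
        nxt = bisect_left(s1, m2, i)
        result.extend(s1[i:nxt])
        i = nxt
    return result


kept = difference_sorted([1, 2, 3, 5, 8], [2, 5, 6])
```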
That is, $S_{34} = S_3 \cup S_4$, and $S_{34}$ is treated as a single retrieval condition.
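A heap-based union with duplicate removal can be sketched with Python's standard `heapq.merge`, which keeps a heap of the current head of each input. Note that `heapq` is a min-heap, which matches the requirement of repeatedly taking the smallest element; the function name `union_sorted` is hypothetical.

```python
import heapq


def union_sorted(*sets: list[int]) -> list[int]:
    """Union of several sorted lists, dropping record numbers seen twice."""
    result = []
    for x in heapq.merge(*sets):  # yields all elements in ascending order
        if not result or result[-1] != x:  # skip duplicates across sets
            result.append(x)
    return result


s34 = union_sorted([1, 3, 5], [3, 4], [5, 9])
```

In a full implementation the union would more likely be consumed lazily, one element at a time, so that it can feed the AND and NOT routines without being materialized.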
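A specific embodiment, combining four retrieval conditions, is as follows:
Figure PCTCN2019099125-appb-000007
Figure PCTCN2019099125-appb-000008
denotes NOT, ∩ denotes AND (intersection), and ∪ denotes OR (union).
Following the NOT and OR operations described above, the four-condition retrieval is converted into an AND operation on $S_{12}$ and $S_{34}$; that is, a multi-condition operation is split, regrouped and then evaluated. This multi-condition retrieval approach is one of the important points of the present disclosure; combined with the data storage method of the disclosure, it can greatly improve retrieval efficiency.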
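The regrouping can be illustrated with Python's built-in sets, used here only as a reference for the expected result and not as the sorted-list algorithms of the disclosure. The sketch assumes that $S_{12}$ is S1 with the members of S2 removed and that $S_{34}$ is the union of S3 and S4; the exact four-condition formula is shown only in the figure, and the function name `combine_four` is hypothetical.

```python
def combine_four(s1, s2, s3, s4) -> list[int]:
    """Reference result for (S1 NOT S2) AND (S3 OR S4), returned sorted."""
    s12 = set(s1) - set(s2)   # NOT: members of S1 absent from S2
    s34 = set(s3) | set(s4)   # OR: union of S3 and S4
    return sorted(s12 & s34)  # AND: intersection of the two groups


hits = combine_four([1, 2, 3, 5, 8], [2, 5], [3, 9], [8, 10])
```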
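The specific operation of retrieving step S202 is: query the key-value range of S1 according to the first retrieval condition, and split that range along the sets sorted at storage time to obtain a plurality of ordered sets and X unordered sets; sort the X unordered sets of S1; build a heap over all subsets of S1 (termed a max-heap here, although with the smallest set at the top it is ordered as a min-heap), using the smallest element of each subset to order the subsets, so that the smallest set is at the top of the heap and the smallest element of S1 is the first element of the top set; to find the first element greater than or equal to m1, check whether the first element of the top set is less than m1; if it is, use binary search on the ordered top set to discard its elements less than m1, so that its first element is greater than or equal to m1, then readjust the heap and check the top set again, until the first element of the top set is greater than or equal to m1, at which point that element is the first element greater than or equal to m1 among all sets in the heap; S2 is obtained in the same way; X is 0, 1 or 2. That is, only the sets at the left and right boundaries may be unordered.
With reference to the table above, a specific embodiment is: the records with a transaction amount over 1000, {14,2,9,12,1,13,15}, are the union of one unordered set {14,2} (the left boundary is an unordered set) and one ordered set {1,9,12,13,15}.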
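The splitting of a value range into pre-sorted blocks and boundary fragments can be sketched as follows. The key order in `KEYS_BY_VALUE` is hypothetical (the original table is an image); it was chosen to be consistent with the sets quoted in the text. The function name `split_range` and the use of half-open positions in value order are assumptions of the sketch.

```python
KEYS_BY_VALUE = [7, 5, 10, 8, 11, 3, 6, 4, 14, 2, 9, 12, 1, 13, 15]
SORTED_BLOCKS = [[5, 7, 8, 10, 11], [2, 3, 4, 6, 14], [1, 9, 12, 13, 15]]


def split_range(keys_by_value, sorted_blocks, block_len, start, stop):
    """Split positions start..stop-1 (in value order) into sorted key subsets.

    Returns (presorted, boundary): whole blocks reuse the stored sorted copy,
    while partially covered blocks at the edges are sorted at query time.
    """
    presorted, boundary = [], []
    pos = start
    while pos < stop:
        b = pos // block_len
        block_start = b * block_len
        block_end = min(block_start + block_len, len(keys_by_value))
        seg_end = min(stop, block_end)
        if pos == block_start and seg_end == block_end:
            presorted.append(sorted_blocks[b])
        else:
            boundary.append(sorted(keys_by_value[pos:seg_end]))
        pos = seg_end
    return presorted, boundary


# Amounts over 1000 occupy positions 8..14 in value order.
presorted, boundary = split_range(KEYS_BY_VALUE, SORTED_BLOCKS, 5, 8, 15)
```

Because a contiguous range can cut through at most its first and last block, at most two fragments ever need sorting, which corresponds to X being 0, 1 or 2.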
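S1 is now the union of one ordered set and one unordered set. What is needed is an algorithm that treats multiple sets as a single whole and can quickly locate an element of a given size within them. An ordered set can be binary-searched; multiple ordered sets can be organized with a heap, which locates the position of an element of a given size fairly efficiently.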
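One way to treat several sorted subsets as a single ordered sequence is sketched below in Python. The class name `MultiSet` and its `seek` method are hypothetical. The sketch uses `heapq`, which is a min-heap keyed on each subset's current first element, so the subset with the smallest head sits at the top as the text describes. Subsets are assumed to be sorted and mutually disjoint.

```python
import heapq
from bisect import bisect_left


class MultiSet:
    """Several sorted, disjoint subsets viewed as one sorted sequence."""

    def __init__(self, subsets):
        self._subsets = [s for s in subsets if s]
        # Heap entries: (current head element, subset index, position).
        self._heap = [(s[0], i, 0) for i, s in enumerate(self._subsets)]
        heapq.heapify(self._heap)

    def seek(self, m):
        """Return the first element >= m, or None; smaller ones are discarded."""
        while self._heap and self._heap[0][0] < m:
            _, i, pos = heapq.heappop(self._heap)
            s = self._subsets[i]
            pos = bisect_left(s, m, pos)  # binary search inside the top subset
            if pos < len(s):
                heapq.heappush(self._heap, (s[pos], i, pos))  # readjust heap
        return self._heap[0][0] if self._heap else None


ms = MultiSet([[2, 14], [1, 9, 12, 13, 15]])
first = ms.seek(0)
at_two = ms.seek(2)
at_ten = ms.seek(10)
```

For integer record numbers, calling `seek(x + 1)` after consuming `x` advances to the next element, so the same object can stand in for S1 or S2 in the AND and NOT procedures.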
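The time complexity then becomes
$T(n) = 4t\log_2(t) + N \cdot N_1 \cdot N_2 \cdot \log_2(N_1 \cdot N_2)/(2t^2 a b_1 b_2)$
t is the size of an ordered set, b1 is the coefficient describing how the subsets of S1 interleave, b2 is the coefficient describing how the subsets of S2 interleave, and $4t\log_2(t)$ is the complexity of sorting the unordered sets. Increasing t raises the sorting cost but reduces the number of ordered subsets and hence the cost of the merging part. A further optimization is therefore to build multi-level sorted sequences: the unordered sets can then use a smaller t, while the ordered subsets can use combinations of different sizes of t, which reduces the number of subsets and improves efficiency. Each additional sorting level, however, stores one more copy of the record numbers, so the number of sorted copies to build is a trade-off. When N is small, $t^2 b_1 b_2$ will in most cases be very close to $N_1 \cdot N_2$, and the overall time complexity is very favorable. This is one of the important points of the present disclosure.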
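With further reference to FIG. 3, as an implementation of the method shown in FIG. 1, the present disclosure provides an embodiment of an index data storage device. This device embodiment corresponds to the method embodiment shown in FIG. 1, and the device may be included in various electronic devices.
FIG. 3 shows an index data storage device, the device comprising:
Dividing module 301: forms a key-value pair from each record number and the value of the corresponding field, P key-value pairs in total; sorts all key-value pairs by the size of the value element; divides all key elements of the P sorted key-value pairs into K sets, each containing L key elements (also called key values; in this disclosure, key value and key element mean the same thing).
Sorting module 302: sorts each of the K key-element sets by the size of the key elements to obtain K corresponding ordered key-element sets.
Storing module 303: stores the P ordered key-value pairs and the corresponding K sorted key-element sets in correspondence with each other.
The parameters P, L and K in this disclosure are integers greater than 0, as will be evident to those skilled in the art.
Experiments show that the closer the values of L and K, the better the retrieval performance; that is, the difference between L and K is less than a threshold, for example 3 or 5. A preferred embodiment is:
Figure PCTCN2019099125-appb-000009
$L = [P/K]$, in which case retrieval is most efficient. This is one of the important points of the present disclosure.
In some embodiments, the key-value pairs cannot be divided evenly among the sets; that is, when the total number P of key-value pairs is not an integer multiple of K, the last set contains P-(K-1)*L key-value pairs, where P is an integer greater than 1.
A specific embodiment of the method is as follows. To keep both the values and the record numbers ordered, space is traded for time by using a multi-level sequential table: in addition to the record numbers listed in field-value order, a sorted copy of the record numbers is stored, and this copy can be stored in batches and segments at configurable levels. As in the table below, a column is added in which every five consecutive record numbers are sorted.
Figure PCTCN2019099125-appb-000010
Figure PCTCN2019099125-appb-000011
L=5 means that every 5 consecutive record numbers are sorted as one set, so all record numbers are made up of 3 ordered sets of size 5: {5,7,8,10,11}, {2,3,4,6,14} and {1,9,12,13,15}. In this embodiment P=15, K=3 and L=5. This way of storing data, that is, of building the index structure, is one of the important points of the present disclosure and is the basis of the retrieval method proposed below.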
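With further reference to FIG. 4, as an implementation of the method shown in FIG. 2, the present disclosure provides an embodiment of a data retrieval device. This device embodiment corresponds to the method embodiment shown in FIG. 2, and the device may be included in various electronic devices.
FIG. 4 shows a data retrieval device, the device comprising:
Receiving module 401: receives an input first retrieval condition and second retrieval condition; for example, the first retrieval condition is records whose transaction time is later than yesterday, and the second retrieval condition is records whose transaction amount exceeds one thousand yuan.
Retrieving module 402: searches the database according to the first retrieval condition and the second retrieval condition to obtain the key-sorted sets S1 and S2 corresponding to the respective retrieval results. In one embodiment, S1 is the set of record numbers whose transaction time is later than yesterday, S2 is the set of record numbers whose transaction amount exceeds one thousand yuan, and both S1 and S2 are ordered sets.
Result obtaining module 403: processes the key-sorted sets S1 and S2 to obtain the retrieval result.
The data in the database of this disclosure is stored using the method shown in FIG. 1.
AND is a common operation in multi-condition retrieval. When the processing of the key-sorted sets S1 and S2 is an AND operation, the operation is:
1. Take the smallest element m1 of S1. The time complexity is 1.
2. Find in S2 the first element m2 greater than or equal to m1; if m2 equals m1, add m2 to the set S, and S2 moves on to its next element, which replaces m2. An ordered set can be binary-searched, so the complexity is $\log_2(N_{t2})$, where $N_{t2}$ is the size of S2 before this step.
3. Continue in S1 to find the first element m1 greater than or equal to m2; if m1 equals m2, add m1 to the set S, and S1 moves on to its next element, which replaces m1. The complexity is $\log_2(N_{t1})$, where $N_{t1}$ is the size of S1 before this step.
4. Repeat steps 2 and 3 until m1 or m2 cannot be found.
The time complexity of this algorithm is
$T(n) = N \cdot \log_2(N_1 \cdot N_2)/(2a)$
N is the size of S, N1 is the size of S1, N2 is the size of S2, and a is a coefficient describing how the two sets S1 and S2 interleave. Because S1 and S2 are stored in order, the result set S can be computed quickly without traversing S1 and S2, so the algorithm performs very well when N1 and N2 are large and N is small. The set S is the intersection of the key-ordered sets S1 and S2, and the corresponding retrieval result is obtained from the key values in S.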
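The NOT operation is broadly similar to the AND operation. For example:
Figure PCTCN2019099125-appb-000012
The steps are as follows:
1. Take the smallest element m1 of S1.
2. Find in S2 the first element m2 greater than or equal to m1; if m2 does not equal m1, add m1 to the result set $S_{12}$. If m2 equals m1, S2 moves on to its next element, which replaces m2.
3. Continue in S1 to find all elements greater than m1 and less than m2, add them to the result set $S_{12}$, and replace m1 with the first element of S1 greater than m2.
4. Repeat steps 2 and 3 until m1 or m2 cannot be found.
The union of the sets of two retrieval conditions is handled in the same way as the union of multiple ordered subsets: build a max-heap, and use the result of the union of the two sets as if it were a single retrieval condition. One difference from ordered subsets is that record numbers never repeat across ordered subsets, whereas the sets of several retrieval conditions may share record numbers, so duplicates must be removed while taking the minimum.
That is, $S_{34} = S_3 \cup S_4$, and $S_{34}$ is treated as a single retrieval condition.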
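A specific embodiment, combining four retrieval conditions, is as follows:
Figure PCTCN2019099125-appb-000013
Figure PCTCN2019099125-appb-000014
denotes NOT, ∩ denotes AND (intersection), and ∪ denotes OR (union).
Following the NOT and OR operations described above, the four-condition retrieval is converted into an AND operation on $S_{12}$ and $S_{34}$; that is, a multi-condition operation is split, regrouped and then evaluated. This multi-condition retrieval approach is one of the important points of the present disclosure; combined with the data storage method of the disclosure, it can greatly improve retrieval efficiency.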
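The specific operation performed by retrieving module 402 is: query the key-value range of S1 according to the first retrieval condition, and split that range along the sets sorted at storage time to obtain a plurality of ordered sets and X unordered sets; sort the X unordered sets of S1; build a max-heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the smallest set is at the top of the heap and the smallest element of S1 is the first element of the top set; to find the first element greater than or equal to m1, check whether the first element of the top set is less than m1; if it is, use binary search on the ordered top set to discard its elements less than m1, so that its first element is greater than or equal to m1, then readjust the heap and check the top set again, until the first element of the top set is greater than or equal to m1, at which point that element is the first element greater than or equal to m1 among all sets in the heap; S2 is obtained in the same way; X is 0, 1 or 2. That is, only the sets at the left and right boundaries may be unordered.
With reference to the table above, a specific embodiment is: the records with a transaction amount over 1000, {14,2,9,12,1,13,15}, are the union of one unordered set {14,2} (the left boundary is an unordered set) and one ordered set {1,9,12,13,15}.
S1 is now the union of one ordered set and one unordered set. What is needed is an algorithm that treats multiple sets as a single whole and can quickly locate an element of a given size within them. An ordered set can be binary-searched; multiple ordered sets can be organized with a heap, which locates the position of an element of a given size fairly efficiently.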
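The time complexity then becomes
$T(n) = 4t\log_2(t) + N \cdot N_1 \cdot N_2 \cdot \log_2(N_1 \cdot N_2)/(2t^2 a b_1 b_2)$
t is the size of an ordered set, b1 is the coefficient describing how the subsets of S1 interleave, b2 is the coefficient describing how the subsets of S2 interleave, and $4t\log_2(t)$ is the complexity of sorting the unordered sets. Increasing t raises the sorting cost but reduces the number of ordered subsets and hence the cost of the merging part. A further optimization is therefore to build multi-level sorted sequences: the unordered sets can then use a smaller t, while the ordered subsets can use combinations of different sizes of t, which reduces the number of subsets and improves efficiency. Each additional sorting level, however, stores one more copy of the record numbers, so the number of sorted copies to build is a trade-off. When N is small, $t^2 b_1 b_2$ will in most cases be very close to $N_1 \cdot N_2$, and the overall time complexity is very favorable. This is one of the important points of the present disclosure.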
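For convenience of description, the above devices are described in terms of units divided by function. When implementing the present disclosure, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.
From the description of the above embodiments, those skilled in the art will clearly understand that the present disclosure can be implemented by software together with a necessary general-purpose hardware platform. On that understanding, the technical solution of the disclosure, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product may be stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the disclosure or in parts of them.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present disclosure. Although the disclosure has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the disclosure may still be modified or equivalently substituted, and any modification or partial substitution that does not depart from the spirit and scope of the disclosure shall fall within the scope of its claims.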

Claims (10)

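  1. An index data storage method, characterized in that the method comprises:
    a dividing step: form a key-value pair from each record number and the value of the corresponding field, P key-value pairs in total; sort all key-value pairs by the size of the value element; divide all key elements of the P sorted key-value pairs into K sets, each containing L key elements;
    a sorting step: sort each of the K key-element sets by the size of the key elements to obtain K corresponding ordered key-element sets;
    a storing step: store the P ordered key-value pairs and the corresponding K sorted key-element sets in correspondence with each other;
    where L is an integer greater than 0 and K is an integer greater than 0.
  2. The method according to claim 1, characterized in that the difference between L and K is less than a threshold,
    Figure PCTCN2019099125-appb-100001
    $L = [P/K]$.
  3. The method according to claim 1, characterized in that, when the total number P of key-value pairs is not an integer multiple of K, the last set contains P-(K-1)*L key-value pairs, where P is an integer greater than 1.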
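  4. A data retrieval method, characterized in that the method comprises:
    a receiving step: receive an input first retrieval condition and second retrieval condition;
    a retrieving step: search the database according to the first retrieval condition and the second retrieval condition to obtain the key-sorted sets S1 and S2 corresponding to the respective retrieval results;
    a result obtaining step: process the key-sorted sets S1 and S2 to obtain the retrieval result.
  5. The method according to claim 4, characterized in that the data in the database is stored using the method according to any one of claims 1-3.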
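  6. The method according to claim 5, characterized in that, when the processing of the key-sorted sets S1 and S2 is an AND operation, the operation is: take the smallest element m1 of S1; find in S2 the first element m2 greater than or equal to m1; if m2 equals m1, add m2 to the set S, and S2 moves on to its next element, which replaces m2; continue in S1 to find the first element m1 greater than or equal to m2; if m1 equals m2, add m1 to the set S, and S1 moves on to its next element, which replaces m1; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2, where the set S is the intersection of the ordered sets S1 and S2, and the corresponding retrieval result is obtained from the key values in S.
  7. The method according to claim 5, characterized in that, when the processing of the key-sorted sets S1 and S2 is a NOT operation, the operation is: take the smallest element m1 of S1; find in S2 the first element m2 greater than or equal to m1; if m2 does not equal m1, add m1 to the set S; if m2 equals m1, S2 moves on to its next element, which replaces m2; continue in S1 to find all elements greater than m1 and less than m2 and add them to the result set S, and replace m1 with the first element of S1 greater than m2; repeat taking m2 from S2 and m1 from S1 until no element can be obtained for m1 or m2, where the set S is the result of applying the NOT operation to the ordered set S1 with respect to S2, and the corresponding retrieval result is obtained from the key values in S.
  8. The method according to claim 5, characterized in that, when the processing of the key-sorted sets S1 and S2 is an OR operation, the operation is: perform the OR operation by building a max-heap.
  9. The method according to any one of claims 6-8, characterized in that the specific operation of the retrieving step is: query the key-value range of S1 according to the first retrieval condition, and split that range along the sets sorted at storage time to obtain a plurality of ordered sets and X unordered sets; sort the X unordered sets of S1; build a max-heap over all subsets of S1, using the smallest element of each subset to order the subsets, so that the smallest set is at the top of the heap and the smallest element of S1 is the first element of the top set; to find the first element greater than or equal to m1, check whether the first element of the top set is less than m1; if it is, use binary search on the ordered top set to discard its elements less than m1, so that its first element is greater than or equal to m1, then readjust the heap and check the top set again, until the first element of the top set is greater than or equal to m1, at which point that element is the first element greater than or equal to m1 among all sets in the heap; S2 is obtained in the same way; X is 0, 1 or 2.
  10. A computer-readable storage medium, characterized in that computer program code is stored on the storage medium, and when the computer program code is executed by a computer, the method according to any one of claims 1-9 is performed.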
PCT/CN2019/099125 2018-09-18 2019-08-02 Index data storage and retrieval method, device and storage medium WO2020057272A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP19797156.7A EP3654195A4 (en) 2018-09-18 2019-08-02 METHODS AND DEVICES FOR STORING AND RECALLING INDEX DATA AND STORAGE MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811091065.6A CN109325032B (zh) 2018-09-18 2018-09-18 Index data storage and retrieval method, device and storage medium
CN201811091065.6 2018-09-18

Publications (1)

Publication Number Publication Date
WO2020057272A1 true WO2020057272A1 (zh) 2020-03-26

Family

ID=65264838

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099125 WO2020057272A1 (zh) 2018-09-18 2019-08-02 Index data storage and retrieval method, device and storage medium

Country Status (3)

Country Link
EP (1) EP3654195A4 (zh)
CN (1) CN109325032B (zh)
WO (1) WO2020057272A1 (zh)

Also Published As

Publication number Publication date
EP3654195A1 (en) 2020-05-20
EP3654195A4 (en) 2021-04-28
CN109325032B (zh) 2020-10-27
CN109325032A (zh) 2019-02-12

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019797156

Country of ref document: EP

Effective date: 20191111

NENP Non-entry into the national phase

Ref country code: DE