WO2019165763A1 - A method for querying data - Google Patents

A method for querying data

Info

Publication number
WO2019165763A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
index
column
key
identifier
Prior art date
Application number
PCT/CN2018/100565
Inventor
毕杰山
钟超强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019165763A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Definitions

  • the present application relates to the field of storage, and more particularly to a method and apparatus for querying data in the field of storage.
  • Data queries can be implemented by means of an inverted index.
  • An inverted index represents a correspondence between a keyword and a list of data entities, where a data entity is an object that has the keyword (for example, a user) and the data entity list is the collection of data entities that have the keyword.
  • The system assigns an integer (Int) identifier (ID) to each data entity, and data can be searched through the constructed correspondence between a keyword and multiple IDs.
  • For example, in the correspondence Address: Longgang -> {1, 2}, the keyword is "Address: Longgang" and the IDs are 1 and 2; the correspondence indicates that the data entities whose IDs are 1 and 2 have the keyword.
  • When data is queried, the corresponding IDs are determined from the keyword, and the corresponding data entities are determined from the IDs.
  • However, the correspondence between data entities and IDs may change, in which case the correspondence between keywords and IDs becomes invalid.
  • As a result, when a query is actually executed, the data in the underlying database has to be read before the data that meets the conditions can be found, which seriously reduces query efficiency. When the query conditions contain many keywords, the query may even fail.
  • the application provides a method for querying data, which can effectively improve the query efficiency of the data.
  • In a first aspect, a method for querying data is provided, comprising:
  • acquiring first data;
  • generating P index keys according to L columns of data in the first data, where L and P are integers greater than or equal to 1;
  • updating, according to the P index keys, the row primary key of the first data, and the internal data identifier of the first data, first index information in a first index partition corresponding to the first data, where the row primary key of the first data is used to look up the first data in the data area, the internal data identifier of the first data is unique in the first index partition, and the first index information includes a first correspondence and a second correspondence for M stored pieces of data, wherein
  • the first correspondence is a one-to-one correspondence between N index keys generated from the M pieces of data and N sets of internal data identifiers, each set of internal data identifiers includes the internal data identifier of at least one of the M pieces of data, and each set identifies the data that satisfies the corresponding index key;
  • the second correspondence is a one-to-one correspondence between M row primary keys generated from the M pieces of data and the M internal data identifiers of the M pieces of data; and
  • M and N are integers greater than or equal to 1.
  • In the method for querying data provided by the embodiments of the present application, after data (for example, the first data) is acquired, the first index information in the first index partition corresponding to the first data is updated according to the index keys generated from at least part of the first data (for example, L columns of data), the row primary key of the first data, and the internal data identifier of the first data. The first index information includes a first correspondence and a second correspondence for the M stored pieces of data: the first correspondence is between the N index keys generated from the M pieces of data and N sets of internal data identifiers, and the second correspondence is between the M row primary keys generated from the M pieces of data and the M internal data identifiers.
  • Because the internal data identifier of a piece of data is unique in the index partition corresponding to that data, the correspondence between the row primary key and the internal data identifier does not change when multiple data files are merged into one data file, and therefore the correspondence between the index keys generated from the data and the internal data identifiers does not change either. Data can thus be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.
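The two correspondences described above can be pictured with a minimal in-memory sketch. The class name `IndexPartition`, its methods, and the sample row primary keys and index keys are hypothetical and chosen only for illustration; the application does not prescribe this layout.

```python
from collections import defaultdict


class IndexPartition:
    """Hypothetical in-memory model of the index information of one index partition."""

    def __init__(self):
        # First correspondence: index key -> set of internal data identifiers.
        self.postings = defaultdict(set)
        # Second correspondence: row primary key <-> internal data identifier.
        self.row_key_to_id = {}
        self.id_to_row_key = {}
        self._next_id = 1  # integer IDs, unique only within this partition

    def update(self, row_key, index_keys):
        """Register one piece of data and return its internal data identifier."""
        internal_id = self.row_key_to_id.get(row_key)
        if internal_id is None:
            internal_id = self._next_id
            self._next_id += 1
            self.row_key_to_id[row_key] = internal_id
            self.id_to_row_key[internal_id] = row_key
        for key in index_keys:
            self.postings[key].add(internal_id)
        return internal_id

    def lookup(self, index_key):
        """Resolve an index key to row primary keys through the internal IDs."""
        ids = sorted(self.postings.get(index_key, ()))
        return [self.id_to_row_key[i] for i in ids]


partition = IndexPartition()
partition.update("A0001", ["Address:Longgang"])
partition.update("A0002", ["Address:Longgang", "Address:Jinan"])
print(partition.lookup("Address:Longgang"))
```

In this sketch the internal identifier is bound to the row primary key rather than to a position in a data file, which is why merging data files would leave both mappings untouched.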
  • Optionally, generating the P index keys according to the L columns of data in the first data includes:
  • for each i in the range [1, L], generating the P index keys through the following steps:
  • extracting at least one keyword from the i-th column of data of the first data, where each keyword includes the word segment corresponding to that keyword, or
  • each keyword includes the word segment corresponding to that keyword and a keyword in the row primary key of the first data, or
  • each keyword includes the word segment corresponding to that keyword and the column name of the i-th column of data of the first data;
  • generating, from each of the at least one keyword, the index key corresponding to that keyword.
  • In the method for querying data provided by the embodiments of the present application, at least one keyword corresponding to at least one word segment of any column (for example, the i-th column) of the L columns of data is extracted from any one of the following: the at least one word segment alone; the at least one word segment together with the row primary key of the first data; or the at least one word segment together with the column name of the i-th column of data. This makes keyword extraction more flexible and thereby improves query efficiency.
  • Optionally, generating, from each of the at least one keyword, the index key corresponding to that keyword includes:
  • generating the index key corresponding to each keyword from the keyword, the column name of the i-th column of data of the first data, and a first index partition identifier that identifies the first index partition.
  • Alternatively, generating, from each of the at least one keyword, the index key corresponding to that keyword includes:
  • generating the index key corresponding to each keyword from the keyword and the first index partition identifier that identifies the first index partition.
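The two ways of composing an index key can be sketched as follows. The text only states which components an index key is generated from; the field order, the separator `SEP`, and the function name `make_index_key` are assumptions made for this illustration.

```python
SEP = "\x00"  # hypothetical separator; the application does not specify one


def make_index_key(partition_id, keyword, column_name=None):
    """Compose an index key from the index partition identifier, an optional
    column name, and the keyword (field order is an assumption)."""
    parts = [partition_id]
    if column_name is not None:
        parts.append(column_name)
    parts.append(keyword)
    return SEP.join(parts)


# Variant with the column name, and variant without it.
key_with_column = make_index_key("P1", "Jinan", column_name="Address")
key_without_column = make_index_key("P1", "Jinan")
print(key_with_column.split(SEP), key_without_column.split(SEP))
```

Putting the partition identifier first would keep all index keys of one index partition adjacent in a lexicographically sorted key-value store, which is one plausible reason to include it.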
  • Optionally, the first index information is stored in a first storage area, the M pieces of data are stored in a second storage area, and the first storage area is isolated from the second storage area.
  • By isolating the first storage area, which stores the index information, from the second storage area, which stores the data, the method for querying data provided by the embodiments of the present application allows the data partitions of a data table to be changed without affecting the content of the index information, and allows the index information to be rebuilt without affecting the data in the data table, which effectively improves data processing speed.
  • a method for querying data comprising:
  • each set of internal data identifiers includes the internal data identifier of at least one of the multiple pieces of data, and each set identifies the data that satisfies the corresponding index key;
  • the second correspondence is a one-to-one correspondence between multiple row primary keys generated from the multiple pieces of data and the multiple internal data identifiers of the multiple pieces of data, and a row primary key is used to look up data in the data area;
  • In the method for querying data provided by the embodiments of the present application, the index information includes the first correspondence and the second correspondence. The first correspondence is a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple sets of internal data identifiers, and the second correspondence is a one-to-one correspondence between multiple row primary keys generated from the multiple pieces of data and multiple internal data identifiers. Because the internal data identifier of a piece of data is unique in the index partition corresponding to that data, the second correspondence does not change when multiple data files are merged into one data file, and therefore the first correspondence does not change either. When data satisfying a query condition is queried, it can be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.
  • Optionally, the index information of the S index partitions is stored in the first storage area, the data corresponding to the S index partitions is stored in the second storage area, and the first storage area is isolated from the second storage area.
  • By isolating the first storage area, which stores the index information, from the second storage area, which stores the data, the method for querying data provided by the embodiments of the present application allows the data partitions of a data table to be changed without affecting the content of the index information, and allows the index information to be rebuilt without affecting the data in the data table, which effectively improves data processing speed.
  • An apparatus for querying data is provided for performing the method of the first aspect or any possible implementation of the first aspect. Specifically, the apparatus comprises units for performing the method of the first aspect or any possible implementation of the first aspect.
  • An apparatus for querying data is provided for performing the method of the second aspect or any possible implementation of the second aspect. Specifically, the apparatus comprises units for performing the method of the second aspect or any possible implementation of the second aspect.
  • A fifth aspect provides an apparatus for querying data. The apparatus comprises a processor and a memory; the memory is configured to store computer-executable instructions, and the processor and the memory communicate with each other through an internal connection path.
  • When the apparatus is running, the processor executes the computer-executable instructions stored in the memory to cause the apparatus to perform the method of the first aspect or any possible implementation of the first aspect.
  • A sixth aspect provides an apparatus for querying data. The apparatus comprises a processor and a memory; the memory is configured to store computer-executable instructions, and the processor and the memory communicate with each other through an internal connection path.
  • When the apparatus is running, the processor executes the computer-executable instructions stored in the memory to cause the apparatus to perform the method of the second aspect or any possible implementation of the second aspect.
  • A computer storage medium is provided, comprising computer-executable instructions. When a processor of a computer executes the computer-executable instructions, the computer performs the method of the first aspect or the second aspect, or of any possible implementation of those aspects.
  • A chip is provided, comprising a processor and a memory, where the processor is configured to execute instructions stored in the memory, and when the instructions are executed, the processor implements the method of the first aspect or the second aspect, or of any possible implementation of those aspects.
  • A computer program is provided which, when executed on a computer, causes the computer to implement the method of the first aspect or the second aspect, or of any possible implementation of those aspects.
  • FIG. 1 is a schematic diagram of a data storage system suitable for use in embodiments of the present application.
  • FIG. 2 is a schematic flowchart of a method for querying data according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of storing internal data identifiers in an underlying database according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for querying data according to an embodiment of the present application.
  • FIGS. 5 and 6 are schematic block diagrams of an apparatus for querying data according to an embodiment of the present application.
  • FIGS. 7 and 8 are schematic structural diagrams of an apparatus for querying data according to an embodiment of the present application.
  • The correspondence between data entities and IDs may change, and as a result the correspondence between keywords and IDs may become invalid.
  • For example, the data written by the system at time t1 is {data 1, data 2, data 5, data 8, data 9, data 19}, and the index data between data entities and IDs is {1: data 1, ...}.
  • The index data between the keyword "shopping talent" and IDs is: shopping talent -> 1, 3, 5. When data is queried with the keyword "shopping talent", the qualifying IDs {1, 3, 5} are first found from the index data between the keyword and the IDs, and the data is then found through the corresponding data entities {data 1, data 5, data 9}.
  • At time t2 the system newly writes the data {data 3, data 12, data 15, data 28}, for which the index data between data entities and IDs is {1: data 3, 2: data 12, 3: data 15, 4: data 28}. The data entities data 3 and data 15 both contain the keyword "shopping talent", so the index data between the keyword and IDs is: shopping talent -> 1, 3.
  • At time t3 the system merges the data written at time t1 and time t2, and the index data between data entities and IDs changes to {1: data 1, 2: data 2, 3: data 3, 4: data 5, 5: data 8, 6: data 9, 7: data 12, 8: data 15, 9: data 19, 10: data 28}. Correspondingly, the index data between the keyword "shopping talent" and IDs becomes: shopping talent -> 1, 3, 4, 6, 8.
  • The index data between the keyword and IDs and the index data between data entities and IDs that the system stored at time t1 and time t2 are therefore invalidated.
  • Index data occupies a large amount of system resources, and merging data is inevitable and frequent because it improves read performance. Once the index data is invalidated, the data in the underlying database has to be read during an actual query before the data that meets the conditions can be found, which seriously reduces query efficiency. When the query conditions contain many keywords, the query may even fail.
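The invalidation in this example can be reproduced with a short sketch. It assumes, consistently with the ID lists above, that IDs are assigned by position in the sorted data of each file; the function names are hypothetical.

```python
def assign_ids(entities):
    """Assign position-based IDs (1, 2, 3, ...) to the sorted entities of one file."""
    return {i: e for i, e in enumerate(sorted(entities), start=1)}


def postings(id_map, holders):
    """Return the IDs of the entities that contain the keyword."""
    return [i for i, e in id_map.items() if e in holders]


t1 = [1, 2, 5, 8, 9, 19]     # data written at time t1
t2 = [3, 12, 15, 28]         # data written at time t2
holders = {1, 5, 9, 3, 15}   # entities containing the keyword "shopping talent"

ids_t1 = assign_ids(t1)
ids_t2 = assign_ids(t2)
ids_merged = assign_ids(t1 + t2)  # data files merged at time t3

print(postings(ids_t1, holders))      # [1, 3, 5]
print(postings(ids_t2, holders))      # [1, 3]
print(postings(ids_merged, holders))  # [1, 3, 4, 6, 8]
```

After the merge, ID 3 no longer refers to data 5 but to data 3, so any cached keyword-to-ID list built before the merge points at the wrong entities.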
  • the embodiment of the present application provides a method for querying data, which can effectively solve the above problem.
  • FIG. 1 is a schematic diagram of a data storage system suitable for use in an embodiment of the present application.
  • The data storage system 100 includes a terminal device 110 and a device 120 for querying data. The terminal device 110 can be connected to the device 120 over a wired or wireless network.
  • The terminal device 110 has a function of requesting data queries and a function of requesting data storage.
  • A client capable of requesting data queries and data storage, for example a browser, may be installed on the terminal device 110.
  • The terminal device 110 may be a mobile phone, a tablet computer, an e-reader, a personal computer, an in-vehicle device, a wearable device, or the like.
  • The device 120 for querying data has a data query function and a data storage function. It can store data based on a data storage request sent by a user through the client of the terminal device 110, and query the stored data based on a query request sent from the terminal device 110.
  • The device 120 for querying data can be any device capable of querying and storing data, such as a computing device, a storage device, or a server.
  • The database deployed in the device 120 is used to store data.
  • The database may be a distributed database such as HBase, MongoDB, Distributed Relational Database Service (DRDS), VoltDB, or Cassandra.
  • FIG. 1 is for illustrative purposes only and should not be construed as limiting the embodiments of the present application.
  • For example, the data storage system may include only the device 120 for querying data, in which case the device 120 has not only a query function but also a function of requesting data queries.
  • In that case the device 120 can receive the query condition that a user enters through a client of the device 120.
  • The embodiments of the present application are described by taking a storage device as an example of the device 120 for querying data.
  • The method for querying data described in the embodiments of the present application can be applied to a distributed storage system that supports key-value (KV) storage.
  • In a KV distributed storage system, data is stored as key-value pairs, and multiple key-value pairs are stored in corresponding files. The value of a piece of data can be determined quickly by looking up its key, which makes it possible to process services in real time at large scale. If a row of data has multiple columns, each column is stored as a separate key-value pair, and the key-value pairs of the same row have the same key.
  • When data is saved to the distributed storage system, it is sorted by the lexicographic order of its keys. This ensures that the parts of the same piece of data (or the different data of one data entity) are stored adjacently. To query the parts of a piece of data, the indexing mechanism of the distributed storage system can be used to quickly find the content that meets the criteria.
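Why lexicographic ordering keeps related entries adjacent can be sketched with a sorted list and a prefix scan. The key layout (`user code` + `#` + `date`) and the function name `prefix_scan` are hypothetical; a real KV store would expose its own range-scan interface.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with `prefix`, using one binary search
    followed by a sequential read over the adjacent keys."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        result.append(key)
    return result


keys = sorted(["U002#20180103", "U001#20180102", "U001#20180101", "U003#20180101"])
print(prefix_scan(keys, "U001"))
```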
  • For example, each data record includes a user code, a transaction time, a transaction amount, and transaction remark information.
  • Key: user code + transaction time
  • Value: transaction details.
  • Each column of data is stored as a separate key-value pair, and the key-value pairs of the same row have the same key. Therefore, eight key-value pairs, as shown in Table 2, can be generated from the two pieces of data in Table 1.
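The row-to-key-value expansion can be sketched as below. The sample rows, the column names, the `+` key separator, and the assumption that all four columns of a row (including the key columns) become separate entries are illustrative only; they are not the contents of Table 1 or Table 2.

```python
def row_to_key_values(row, key_columns):
    """Split one row into one (key, column, value) entry per column.
    All entries of the same row share the same key."""
    key = "+".join(row[c] for c in key_columns)
    return [(key, column, value) for column, value in row.items()]


# Hypothetical sample rows with four columns each.
rows = [
    {"user_code": "U001", "transaction_time": "20180101", "amount": "100", "remark": "books"},
    {"user_code": "U002", "transaction_time": "20180102", "amount": "250", "remark": "food"},
]

key_values = [
    kv for row in rows for kv in row_to_key_values(row, ["user_code", "transaction_time"])
]
print(len(key_values))  # 8
```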
  • the storage device can store the data to be stored in different data partitions.
  • the index information for the data can also be stored in different index partitions.
  • the storage device can pre-set the data partition and the index partition for the data to be stored.
  • The storage device may set data partitions based on preset data partition information, which is preset by the user and indicates how the data is partitioned; the preset data partition information may include at least one of the number of split nodes and the data partitions.
  • The storage device may set index partitions based on preset index partition information, which indicates how the index is partitioned. The preset index partition information may be generated from the preset data partition information alone, or from the preset data partition information together with configuration information that configures the index partitions, for example the number of index partitions.
  • the data to be stored is multiple pieces of data, and the data to be stored is stored in the form of Key Value.
  • Each piece of data has a key, and multiple data partitions can be set for the data to be stored according to the key of the data.
  • the commonly used method of distributed Key Value data partitioning is Range partitioning. Below, the method of Range partitioning is briefly described.
  • In Range partitioning, data is partitioned by ranges of the lexicographic order of keys: a piece of data belongs to the partition whose range contains its key. In other words, a data partition stores the data whose keys fall within one range. This storage mechanism preserves the original order of the data and effectively improves read performance.
  • For example, if the preset data partition information is A, B, C, D, E, F, G, H, and I, where each letter indicates a key value, nine data partitions can be set for the data to be stored.
  • The nine data partitions are:
  • Partition 1 [A, B)
  • Partition 2 [B, C)
  • Partition 3 [C, D)
  • Partition 4 [D, E)
  • Partition 5 [E, F)
  • Partition 6 [F, G)
  • Partition 7 [G, H)
  • Partition 8 [H, I)
  • Partition 9 [I, )
  • Each data partition is a left-closed, right-open interval.
  • Taking partition 1 as an example, [A, B) indicates that data whose key is greater than or equal to A and smaller than B is stored in partition 1. Taking partition 9 as an example, [I, ) indicates that data whose key is greater than or equal to I is stored in partition 9.
  • each data partition may also be a left-closed right-closed interval, which is not limited in this embodiment of the present application.
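Locating the data partition for a key under Range partitioning with left-closed, right-open intervals can be sketched with a binary search over the split keys. The function name `partition_for` and the sample keys are hypothetical.

```python
from bisect import bisect_right

# Split keys from the example; partition n covers [BOUNDARIES[n-1], BOUNDARIES[n]).
BOUNDARIES = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]


def partition_for(key):
    """Return the 1-based number of the partition whose range contains `key`."""
    if key < BOUNDARIES[0]:
        raise ValueError("key sorts before the first partition")
    # bisect_right counts the split keys that are <= key, which is exactly
    # the partition number for left-closed, right-open ranges.
    return bisect_right(BOUNDARIES, key)


print(partition_for("A"), partition_for("Apple"), partition_for("B"), partition_for("Zed"))
```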
  • The storage device may divide the data to be stored evenly into N data partitions according to the value range of the keys of the data.
  • Each data partition can be split or expanded automatically during data storage. For example, after N data partitions are set according to the preset data partition information, more and more data is stored over time. To avoid a situation in which subsequent data can no longer be stored in a data partition whose storage space is full, the server can split that data partition.
  • The preset index partition information may be generated from the preset data partition information and the configuration information. Assume that the number of index partitions configured in the configuration information is i and that the number of data partitions obtained from the preset data partition information is j. For example, when i ≤ j, each index partition corresponds to [j/i] data partitions, and any remaining data partitions are assigned to an index partition in turn. For example, if the data partition information is A, B, C, D, E, F, G, H, I and the configuration information specifies three index partitions, the index partitions corresponding to the preset index partition information are [A, D), [D, G), and [G, ). As another example, if the configuration information specifies four index partitions, the index partitions are [A, C), [C, E), [E, G), and [G, ).
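The derivation of index partitions from data partitions can be sketched as follows. The sketch assumes integer division for the number of data partitions per index partition and lets the last index partition absorb any remainder, which reproduces both examples above; the text itself does not spell out the remainder rule, and the function names are hypothetical.

```python
def index_partition_starts(data_boundaries, index_count):
    """Pick the start key of each index partition from the data split keys."""
    data_count = len(data_boundaries)  # j data partitions
    if not 1 <= index_count <= data_count:
        raise ValueError("index_count must be between 1 and the number of data partitions")
    per_index = data_count // index_count  # data partitions per index partition
    return [data_boundaries[n * per_index] for n in range(index_count)]


def as_ranges(starts):
    """Render start keys as left-closed, right-open ranges; the last is unbounded."""
    ends = starts[1:] + [""]
    return [f"[{s}, {e})" for s, e in zip(starts, ends)]


boundaries = list("ABCDEFGHI")
print(as_ranges(index_partition_starts(boundaries, 3)))
print(as_ranges(index_partition_starts(boundaries, 4)))
```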
  • data exists in the form of data tables.
  • Data in one data table can be stored in multiple data partitions.
  • index information of data in the same data table can also be stored in multiple index partitions.
  • the index partition stores index information of a part of the data.
  • The global data identifier of a piece of data is defined with respect to the multiple index partitions corresponding to the data table to which the data belongs; that is, it is unique across those index partitions.
  • The internal data identifier of a piece of data is defined with respect to the index partition corresponding to the data; that is, it is unique within that index partition.
  • For example, data #1 is a piece of data in data table #1, the index information of data table #1 is stored in index partition #1 and index partition #2, and the index information of data #1 is stored in index partition #1. The global data identifier of data #1 is then unique across the two index partitions, and the internal data identifier of data #1 is unique within index partition #1.
  • the internal data identifier of the data is an integer type identifier.
  • The internal data identifier of a piece of data may be allocated according to the order in which data is written. For example, if data #1 is the first data written to index partition #1, its internal data identifier may be 1; if data #2 is the second data written, its internal data identifier may be 2.
  • The row primary key is the key of a piece of data; the row corresponding to a row primary key can be found quickly in the data area where the data is stored.
  • Because the global data identifier of a piece of data is unique across the index partitions corresponding to the data table to which the data belongs, in an optional implementation the global data identifier is used as the row primary key of the data.
  • the index key is a Key generated according to a keyword extracted from the data, and at least one piece of data satisfying the query condition can be queried by using an index key.
  • FIG. 2 is a schematic flowchart of a method 200 for querying data according to an embodiment of the present application.
  • the execution body of the method 200 may be a storage device or a processor within the storage device.
  • In step S210, the first data is acquired.
  • the first data includes multiple columns of data, and each column of data includes a column name and a corresponding column value, and each column of data represents different content.
  • the first data can be any row of data in a data table.
  • For example, the first data may be any row of data shown in Table 3. The first data includes seven columns of data: column 1 is the ID of the data entity, column 2 is the name of the data entity, column 3 is the telephone number of the data entity, column 4 is the address of the data entity, column 5 is the gender of the data entity, column 6 is the education level of the data entity, and column 7 is the marital status of the data entity.
  • the ID of the data entity can be understood as the global data identifier of the first data, and the object identified by the global data identifier is the data entity described above, and the data entity can be a user.
  • the global data identifier of the first data may be used as a row primary key of the first data, and each column of data is a Value corresponding to a row primary key of the first data.
  • a combination of the global data identification of the first data + any column of data may be used as the row primary key of the first data.
  • In step S210, the storage device may obtain the first data in any of multiple ways, which is not limited in the embodiments of the present application:
  • the storage device may receive the data storage request sent by the terminal device to obtain the first data, where the data storage request includes the first data;
  • the storage device may also acquire the first data from the terminal device autonomously;
  • the storage device may also obtain the first data from a database that stores the first data.
  • In step S220, P index keys are generated according to L columns of data of the first data, where L and P are integers greater than or equal to 1.
  • the L column data is all column data or partial column data of the first data.
  • the L column data of the first data is determined according to the index configuration information.
  • the index configuration information may include indication information for indicating the construction of the index, for example, the indication information may specify which columns or column families to build the index.
  • the index configuration information may be stored in the metadata of the data table, or the index configuration information may be stored as a separate file.
  • a column family is a collection of one or more columns. Data of the same column family is located in the same storage path, and data of different column families is isolated in different storage paths.
  • Table 3 shows the data in the data table to be stored.
  • the first 4 columns of data belong to column family I, and the last 3 columns belong to column family F.
  • the indication information in the index configuration information indicates that the index is built for the first column data and the fourth column data in the column family I, and the index is indexed for all the column data in the column family F, then L is 6.
  • the storage device can generate an index key through steps S221 and S222. Next, step S220 will be described from these two steps, respectively.
  • Each column of data of the first data includes at least one word segment, and the word extractor may extract the at least one word segment from each column of data.
  • Generating the P index keys according to the L columns of data in the first data includes:
  • extracting at least one keyword from the i-th column of data of the first data, where each keyword includes the word segment corresponding to that keyword, or
  • each keyword includes the word segment corresponding to that keyword and a keyword in the row primary key of the first data, or
  • each keyword includes the word segment corresponding to that keyword and the column name of the i-th column of data of the first data.
  • The i-th column of data is any one of the L columns of data used to build the index, and the keywords corresponding to the i-th column of data can be extracted in the following three modes (mode 1, mode 2, and mode 3).
  • the first data in the second row of Table 3 is taken as an example for illustration.
  • the word segmentation in the i-th column data of the first data is taken as a keyword corresponding to the i-th column data, that is, each keyword includes a word segment corresponding to each of the keywords.
  • the fourth column data in the second row of data in Table 3 includes two participles: Shandong, Jinan. Then, the extracted keywords are: Shandong, Jinan.
  • Mode 2: the word segments in the i-th column data of the first data, together with a keyword in the row primary key of the first data, are used as the keywords corresponding to the i-th column data, that is, each keyword includes the word segment corresponding to that keyword and a keyword in the row primary key of the first data.
  • The keyword in the row primary key may be the row primary key of the first data itself, in which case the keyword corresponding to the i-th column data is formed from the row primary key and the word segment of the i-th column data.
  • For example, if the global data identifier of the second row of data, A0002, is used as the row primary key of that data, and the third column data includes the word segment 13555552222, the extracted keyword is: A000213555552222.
  • Still taking the third column data in the second row of Table 3 as an example: if the row primary key of the second row of data is A0002 ⁇ 20180101, the row primary key includes two keywords, A0002 and 20180101, and the third column data includes the word segment 13555552222.
  • The keyword "20180101" can then be extracted from the row primary key, and the keyword of the third column data is generated from "20180101" and "13555552222", that is, the extracted keyword is: 2018010113555552222.
  • For another example, if the global data identifier of the second row of data, A0002, is used as the row primary key of that data, and the fourth column data includes two word segments, Shandong and Jinan, the extracted keywords are: A0002 Shandong, A0002 Jinan.
  • Still taking the fourth column data in the second row of Table 3 as an example: if the row primary key of the second row of data is A0002 ⁇ 20180101, the row primary key includes two keywords, A0002 and 20180101, and the fourth column data includes two word segments, Shandong and Jinan.
  • The keyword "20180101" can then be extracted from the row primary key; one keyword of the fourth column data is generated from "20180101" and "Shandong", and another from "20180101" and "Jinan", that is, the extracted keywords are: 20180101 Shandong, 20180101 Jinan.
  • Mode 3: the word segments in the i-th column data of the first data, together with the column name of the i-th column data of the first data, are used as the keywords corresponding to the i-th column data, that is, each keyword includes the word segment corresponding to that keyword and the column name of the i-th column data of the first data.
  • For example, the column name of the third column data is Phone, and the third column data includes the word segment 13555552222, so the extracted keyword is: Phone:13555552222.
  • For another example, the column name of the fourth column data is Address, and the fourth column data includes two word segments, Shandong and Jinan, so the extracted keywords are: Address:Shandong, Address:Jinan.
  • The mode used to extract keywords from each of the L columns of data may be determined according to the index configuration information.
  • That is, the index configuration information further includes information indicating the mode used to extract keywords.
  • The index configuration information may set different extraction modes for different columns of data.
  • Therefore, the at least one keyword corresponding to the at least one word segment can be extracted from any one of the following: the at least one word segment of any column (for example, the i-th column data) of the L columns of data of the first data; the at least one word segment and the row primary key of the first data; or the at least one word segment and the column name of the i-th column data. This effectively improves the flexibility with which the system extracts keywords, thereby improving data query efficiency.
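  • By way of a non-limiting illustration, the three keyword extraction modes described above can be sketched as follows. The function name `extract_keywords` and its parameters are hypothetical and are not taken from the application; joining the row primary key keyword and the word segment without a separator in mode 2 is an assumption that follows the A000213555552222 example.

```python
def extract_keywords(segments, mode, row_key_part=None, column_name=None):
    """Build the keywords of one column from its word segments.

    mode 1: the word segment itself
    mode 2: a keyword taken from the row primary key, followed by the segment
    mode 3: the column name, a colon, and the segment
    """
    if mode == 1:
        return list(segments)
    if mode == 2:
        if row_key_part is None:
            raise ValueError("mode 2 needs a keyword from the row primary key")
        # Assumption: plain concatenation, as in the A000213555552222 example.
        return [row_key_part + seg for seg in segments]
    if mode == 3:
        if column_name is None:
            raise ValueError("mode 3 needs the column name")
        return [f"{column_name}:{seg}" for seg in segments]
    raise ValueError(f"unknown mode: {mode}")


address_segments = ["Shandong", "Jinan"]
print(extract_keywords(address_segments, 1))
print(extract_keywords(["13555552222"], 2, row_key_part="A0002"))
print(extract_keywords(address_segments, 3, column_name="Address"))
```

  • Passing the mode per call mirrors the statement that the index configuration information may set a different extraction mode for each column.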
  • In step S222, the P index keys may be generated from the P keywords in two modes (mode A and mode B). The two modes are described separately below, taking the at least one keyword generated from the i-th column data of the first data as an example.
  • The first index partition described below is the index partition that the system pre-configures for data including the first data, that is, the index generated from the first data is stored in the first index partition.
  • The first data may be any piece of data in a data table, and the first index partition may be any one of the multiple index partitions corresponding to the data in the data table.
  • Mode A: the at least one index key is generated according to the at least one keyword, the column name of the i-th column data of the first data, and the first index partition identifier, which identifies the first index partition.
  • For example, if the keywords extracted in mode 1 above are Shandong and Jinan, and the first index partition identifier is A, the index key corresponding to the keyword "Shandong" is "A ⁇ Address ⁇ Shandong", and the index key corresponding to the keyword "Jinan" is "A ⁇ Address ⁇ Jinan".
  • For another example, if the keyword extracted in mode 3 above is Gender:Male, and the first index partition identifier is A, the index key corresponding to the keyword "Gender:Male" is "A ⁇ Gender ⁇ Gender:Male".
  • The system can assign an alias to a column name such as Address, which reduces the number of bytes to be stored.
  • The content that joins the keyword, the column name, and the first index partition identifier may be referred to as a connector, for example, " ⁇ " in the above examples.
  • Mode B: the at least one index key is generated according to the at least one keyword and the first index partition identifier, which identifies the first index partition.
  • For example, if the keywords extracted in mode 1 above are Shandong and Jinan, and the first index partition identifier is A, the index key corresponding to the keyword "Shandong" is "A ⁇ Shandong", and the index key corresponding to the keyword "Jinan" is "A ⁇ Jinan".
  • For another example, if the keyword extracted in mode 3 above is Gender:Male, and the first index partition identifier is A, the index key corresponding to the keyword "Gender:Male" is "A ⁇ Gender:Male".
  • The content that joins the keyword and the first index partition identifier is referred to as a connector, for example, " ⁇ " in the above examples.
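  • As a minimal sketch of mode A and mode B, an index key can be assembled by joining the first index partition identifier, optionally the column name, and the keyword with a connector. The function `build_index_key` is hypothetical, and the connector "|" is only an assumed stand-in for the connector character used in the examples above.

```python
CONNECTOR = "|"  # assumed stand-in for the connector character


def build_index_key(partition_id, keyword, column_name=None, connector=CONNECTOR):
    """Mode A when a column name is supplied, mode B otherwise."""
    if column_name is None:
        parts = [partition_id, keyword]               # mode B
    else:
        parts = [partition_id, column_name, keyword]  # mode A
    return connector.join(parts)


print(build_index_key("A", "Shandong", column_name="Address"))  # A|Address|Shandong
print(build_index_key("A", "Gender:Male"))                      # A|Gender:Male
```

  • Prefixing every key with the index partition identifier keeps keys from different index partitions distinct even when they share one underlying key space.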
  • In step S230, the first index information in the first index partition corresponding to the first data is updated according to the P index keys, the row primary key of the first data, and the internal data identifier of the first data.
  • The row primary key of the first data is used to look up the first data in the data area, and the internal data identifier of the first data is unique within the first index partition.
  • The first index information includes a first correspondence and a second correspondence for the stored M pieces of data.
  • The first correspondence represents a one-to-one correspondence between the N index keys generated from the M pieces of data and N sets of internal data identifiers. Each set of internal data identifiers includes the internal data identifier of at least one of the M pieces of data and identifies the data that satisfies the corresponding index key.
  • The second correspondence represents a one-to-one correspondence between the M row primary keys generated from the M pieces of data and the M internal data identifiers of the M pieces of data.
  • M and N are both integers greater than or equal to 1.
  • After obtaining the internal data identifier of the first data (for example, an identifier that the system configures for the first data in advance or on demand) and the row primary key of the first data, and after generating the P index keys of the first data, the storage device builds the index of the first data and updates the first index information in the first index partition corresponding to the first data.
  • The first index information includes the index of the stored M pieces of data, where the M pieces of data are the data in the data partition corresponding to the first index partition.
  • The first correspondence is the correspondence between the N index keys generated from the M pieces of data and the N sets of internal data identifiers: each index key is the index key of the data identified by the corresponding set of internal data identifiers, and the index key of a piece of data is generated from the keywords extracted from that data.
  • The second correspondence represents the correspondence between the M row primary keys generated from the M pieces of data and the M internal data identifiers of the M pieces of data.
  • During a query, the storage device may use the first correspondence in the first index partition to find, according to the query condition, the internal data identifiers of all data that satisfies an index key, then use the second correspondence to find the row primary keys corresponding to those internal data identifiers, and finally locate the corresponding data by the row primary keys.
  • The correspondence between an index key and a set of internal data identifiers is the inverted index described in the embodiments of the present application, and the set of internal data identifiers corresponding to an index key is the inverted index list.
  • For updating the first correspondence, the embodiments of the present application provide the following implementations for three cases.
  • Case 1 (the index keys of the first data already exist in the first correspondence): the internal data identifier of the first data is added to each of the Q sets of internal data identifiers corresponding to the Q index keys, to update the first correspondence.
  • Case 2 (only Q of the P index keys already exist in the first correspondence): the internal data identifier of the first data is added to each of the Q sets of internal data identifiers corresponding to the Q index keys, and a correspondence between each of the P index keys other than the Q index keys and the internal data identifier of the first data is added, to update the first correspondence, where Q is an integer greater than or equal to 1 and less than P.
  • Case 3 (none of the P index keys exists in the first correspondence): the correspondence between the P index keys and the internal data identifier of the first data is added to the first correspondence.
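  • The three cases can be handled by a single update routine, sketched below under stated assumptions: the class `IndexPartition`, its attribute names, and the use of in-memory dictionaries are hypothetical, the row keys and identifiers are illustrative only, and "|" is an assumed stand-in for the connector.

```python
class IndexPartition:
    """In-memory sketch of the index information of one index partition."""

    def __init__(self):
        self.first = {}   # first correspondence: index key -> set of internal ids
        self.second = {}  # second correspondence: internal id -> row primary key

    def add(self, index_keys, row_key, internal_id):
        """Index one piece of data; return Q, the number of keys that already existed."""
        self.second[internal_id] = row_key
        existing = 0
        for key in index_keys:
            if key in self.first:
                existing += 1  # key already present: extend its identifier set
            # A missing key gets a new identifier set (new correspondence).
            self.first.setdefault(key, set()).add(internal_id)
        return existing


part = IndexPartition()
# Case 3: none of the P = 2 index keys exists yet.
print(part.add(["A|Address|Longgang", "A|Gender:Male"], "r1", 1))  # 0
# Case 2: Q = 1 of the P = 3 index keys exists.
print(part.add(["A|Address|Shandong", "A|Address|Jinan", "A|Gender:Male"], "r2", 2))  # 1
# Case 1: every index key already exists (Q = P = 1).
print(part.add(["A|Address|Jinan"], "r3", 3))  # 1
print(sorted(part.first["A|Address|Jinan"]))   # [2, 3]
```

  • The returned count distinguishes the cases: 0 corresponds to case 3, a value equal to the number of index keys to case 1, and anything in between to case 2.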
  • The storage device may configure, for the data in the data table, internal data identifiers within the corresponding index partitions. The internal data identifiers of the data in Table 3 are shown in Table 4.
  • Taking index partition #1 as an example: the data with global data identifiers A0001 and A0002 is stored in one data partition (referred to as data partition #1 for ease of distinction and understanding), where the row primary keys of the data stored in data partition #1 fall in the range [A, B). The index information of the data in data partition #1 is stored in index partition #1; the internal data identifier of the data with global data identifier A0001 is 1, and the internal data identifier of the data with global data identifier A0002 is 2.
  • The data with global data identifiers B0001 and B0002 is stored in another data partition (referred to as data partition #2 for ease of distinction and understanding), where the row primary keys of the data stored in data partition #2 fall in the range [B, C). The index information of the data in data partition #2 is also stored in index partition #1, and the internal data identifier of the data with global data identifier B0001 is 3.
  • Although the internal data identifiers of the data with global data identifiers A0001 and D0001 are both 1, the index information of these two pieces of data is stored in different index partitions.
  • Data is queried based on the index information of each index partition, so the internal data identifiers in different index partitions do not interfere with one another. That is, the internal data identifier of a piece of data is unique within its corresponding index partition.
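  • One simple way to obtain identifiers that are unique only within an index partition is to keep a separate counter per index partition. The following sketch assumes sequential allocation starting at 1; the class `InternalIdAllocator` and the partition names "index#1" and "index#2" are hypothetical.

```python
import itertools


class InternalIdAllocator:
    """Hands out internal data identifiers that are unique per index partition."""

    def __init__(self):
        self._counters = {}  # index partition name -> its own counter

    def next_id(self, index_partition):
        counter = self._counters.setdefault(index_partition, itertools.count(1))
        return next(counter)


alloc = InternalIdAllocator()
ids = {
    "A0001": alloc.next_id("index#1"),
    "A0002": alloc.next_id("index#1"),
    "B0001": alloc.next_id("index#1"),
    "D0001": alloc.next_id("index#2"),  # a different index partition
}
print(ids)  # {'A0001': 1, 'A0002': 2, 'B0001': 3, 'D0001': 1}
```

  • A0001 and D0001 both receive the identifier 1 without conflict, because each identifier is only ever interpreted inside its own index partition.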
  • For example, suppose the first index partition is index partition #1, the first data is the data with global data identifier B0002 (that is, the fourth row of data in Table 3), and the M pieces of data are the data with global data identifiers A0001, A0002, and B0001, that is, M = 3. The second correspondence is then as shown in Table 5.
  • Here, the global data identifier is used as the row primary key of the data.
  • Suppose the storage device needs to build indexes on the columns named "Address", "Gender", "Education", and "Marital Status" in Table 3, generating keywords in mode 1 and index keys in mode A for the column named "Address", and generating keywords in mode 3 and index keys in mode B for the remaining columns. The first correspondence shown in Table 6 is then generated.
  • The first column of Table 6 contains the index keys generated from the M pieces of data. The second column contains the corresponding internal data identifiers: one index key corresponds to one set of internal data identifiers, and one set includes at least one internal data identifier. The third column contains multiple Key Values, where the content within each pair of ⁇ is one Key Value.
  • After the index of the first data is built, the first index information in the first index partition needs to be updated. The second correspondence in the updated first index information is shown in Table 7.
  • The first correspondence in the updated first index information is shown in Table 8.
  • The storage device starts to establish the first index information in the first index partition based on the P index keys of the first data, the row primary key of the first data, and the internal data identifier of the first data.
  • It should be understood that the first data is any piece of data in the data table, and the first index partition corresponding to the first data is any one of the multiple index partitions corresponding to the data table.
  • The description above takes one piece of data (the first data) and its corresponding index partition as an example. For any piece of data in the data table, the index can be built and the index information in the corresponding index partition can be updated through step S230.
  • Therefore, in the method for querying data provided by the embodiments of the present application, the first index information in the first index partition corresponding to the first data is updated according to the index keys generated from at least part of the first data (for example, the L columns of data), the row primary key of the first data, and the internal data identifier of the first data.
  • The first index information includes a first correspondence and a second correspondence for the stored M pieces of data: the first correspondence represents the relationship between the N index keys generated from the M pieces of data and the N sets of internal data identifiers, and the second correspondence represents the correspondence between the M row primary keys generated from the M pieces of data and the M internal data identifiers.
  • Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the correspondence between the row primary key and the internal data identifier does not change when multiple data files are merged into one data file, and therefore the correspondence between the index keys generated from the data and the internal data identifiers does not change either. The data can thus be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves data query efficiency.
  • Optionally, the set of internal data identifiers corresponding to one index key is stored in one row of data and may be stored in a Base+Delta manner; in other words, the inverted index list corresponding to one index key can be stored in a Base+Delta manner.
  • That is, a set of internal data identifiers includes a Base part and a Delta part. The Base+Delta storage manner is described in detail below.
  • The Base part includes at least one set of internal data identifiers.
  • The Base part does not exist in the initial state and exists only after the first merge.
  • The Delta part includes at least one Key Value. Each Key Value is newly added on top of the Base part and is associated with one internal data identifier or a small batch of internal data identifiers.
  • Each Key Value corresponds to one change operation. The change operation is either an add operation, which adds the internal data identifier of the corresponding Key Value to the Base part, or a delete operation, which deletes the internal data identifier of the corresponding Key Value from the Base part.
  • Through a merge mechanism, the internal data identifiers of the stored Base part may be merged with the internal data identifiers of the incremental Key Values to generate a new Base part.
  • The new Base part replaces the internal data identifiers of the original Base part and the merged Key Values in the Delta part, which helps speed up queries.
  • FIG. 3 is a schematic diagram of storing, in the underlying database, the set of internal data identifiers (the inverted index list) of one index key.
  • As shown in FIG. 3, the Base part of the set includes the internal data identifiers {1, 3, 4, 7, 9, 10, 20}, and the Delta part includes five Key Values, where "+" indicates an add operation and "-" indicates a delete operation. The third Key Value indicates that three internal data identifiers are added to the Base part, that is, the small batch of internal data identifiers mentioned above.
  • The Base part and the Delta part are merged by the merge mechanism to generate the updated set of internal data identifiers (inverted index list), namely {1, 3, 5, 7, 9, 10, 22, 24, 25, 26, 27, 28}.
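  • The merge of the Base part and the Delta part can be illustrated with the following sketch. The function `merge_base_delta` is hypothetical, and the five Delta entries are an assumed sequence chosen only so that the result matches the merged set stated above; they are not read from FIG. 3.

```python
def merge_base_delta(base, deltas):
    """Apply add ("+") and delete ("-") Key Values to the Base identifiers."""
    merged = set(base)
    for op, ids in deltas:
        if op == "+":
            merged.update(ids)             # add operation
        elif op == "-":
            merged.difference_update(ids)  # delete operation
        else:
            raise ValueError(f"unknown change operation: {op}")
    return sorted(merged)


base = [1, 3, 4, 7, 9, 10, 20]
# Assumed Delta part: five Key Values, the third one adding a small batch of three ids.
deltas = [
    ("+", [5]),
    ("-", [4]),
    ("+", [22, 24, 25]),
    ("-", [20]),
    ("+", [26, 27, 28]),
]
print(merge_base_delta(base, deltas))
# [1, 3, 5, 7, 9, 10, 22, 24, 25, 26, 27, 28]
```

  • Writes only append Key Values to the Delta part, while a read sees the Base part with the Delta entries applied in order; merging folds the Delta entries into a new Base part so that later reads need no replay.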
  • When Key Value data is stored in the Base+Delta manner, newly written data goes to the Delta part on disk and does not affect the data already stored in the Base part, which can effectively improve the write speed. In addition, merging the data of the Base part and the Delta part through the merge mechanism can effectively improve the read speed, thereby reducing query latency and improving query efficiency.
  • the Base and the Delta are merged, and the merged Base is a storage path of the merged file, and the merged file is stored in the merged Delta.
  • The embodiments of the present application also provide an optional implementation: the first index information is stored in a first storage area, and the M pieces of data are stored in a second storage area, where the first storage area is isolated from the second storage area.
  • The first storage area is used to store the index information corresponding to the data table, and the second storage area is used to store the data in the data table. Because the second storage area is isolated from the first storage area, that is, the index partitions are isolated from the data partitions, a change to the data partitions of the data table does not affect the content of the index information, and rebuilding the index information does not affect the data in the data table, which effectively improves the data processing speed.
  • Therefore, in the method for querying data provided by the embodiments of the present application, on the one hand, after data (for example, the first data) is acquired, the first index information in the first index partition corresponding to the first data is updated according to the index keys generated from at least part of the first data (for example, the L columns of data), the row primary key of the first data, and the internal data identifier of the first data. The first index information includes a first correspondence and a second correspondence for the stored M pieces of data: the first correspondence represents the relationship between the N index keys generated from the M pieces of data and the N sets of internal data identifiers, and the second correspondence represents the correspondence between the M row primary keys generated from the M pieces of data and the M internal data identifiers.
  • Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the correspondence between the row primary key and the internal data identifier does not change when multiple data files are merged into one data file, and therefore the correspondence between the index keys generated from the data and the internal data identifiers does not change either. The data can thus be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves data query efficiency.
  • On the other hand, the at least one keyword corresponding to the at least one word segment can be extracted from any one of the following: the at least one word segment of any column (for example, the i-th column data) of the L columns of data of the first data; the at least one word segment and the row primary key of the first data; or the at least one word segment and the column name of the i-th column data. This effectively improves the flexibility with which the system extracts keywords, thereby improving data query efficiency.
  • In addition, because the data in the data table and the corresponding index information are stored in isolation, a change to the data partitions of the data table does not affect the content of the index information, and rebuilding the index information does not affect the data in the data table, which effectively improves the data processing speed.
  • The storage device may query data according to a query condition sent by a user.
  • FIG. 4 is a schematic flowchart of a method 300 for querying data according to an embodiment of the present application.
  • the execution body of the method 300 may be a storage device in a device for querying data, or may be a processor in the storage device.
  • step S310 a query condition is acquired.
  • the storage server may receive a query condition sent by a client of the terminal device, where the query condition includes X keywords, and X is an integer greater than or equal to 1.
  • If the query condition includes multiple keywords (that is, X is greater than 1), the query condition further includes logical operators that connect adjacent keywords, where the logical operators include "and", "not", and "or".
  • “and” can be expressed as "&&”, "not” can be represented as "!, and "or” can be expressed as "
  • For example, the query condition may be: Address:Longgang && Gender:Male, that is, the data to be queried must satisfy both keywords in the query condition at the same time.
  • In step S320, the internal data identifier of the target data that satisfies the query condition is queried according to the first correspondence in the index information of each of S index partitions, where the internal data identifier of the target data is unique within the index partition corresponding to the target data, and the S index partitions are determined according to the query condition.
  • The first correspondence represents a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple sets of internal data identifiers. Each set of internal data identifiers includes the internal data identifier of at least one of the multiple pieces of data and identifies the data that has the corresponding index key.
  • Specifically, the query condition carries indication information indicating the metadata of the data table to be queried, where the metadata of the data table includes information indicating the index partitions that store the index information of the data table.
  • The storage device may therefore determine, according to the query condition, the S index partitions that need to be queried; the S index partitions correspond to the data table queried by the query condition.
  • Each index partition stores the index information of its corresponding data, and the index information of each index partition includes a first correspondence and a second correspondence. The first correspondence represents a one-to-one correspondence between the multiple index keys generated from the pieces of data corresponding to that index partition and multiple sets of internal data identifiers. The second correspondence represents a one-to-one correspondence between the multiple row primary keys generated from those pieces of data and the multiple internal data identifiers of those pieces of data.
  • For details about the first correspondence and the second correspondence of each index partition, refer to the foregoing description of the first correspondence and the second correspondence in the first index information in the first index partition. For brevity, they are not repeated here.
  • In a specific implementation, the word extractor extracts the X keywords from the query condition. In each of the S index partitions, based on the first correspondence in the index information of that index partition and the X keywords, the X index keys corresponding to the X keywords are looked up in the first correspondence. After the X index keys are found, the X sets of internal data identifiers corresponding to the X index keys are determined and combined according to the logical operators of the query condition, which yields the internal data identifiers of the target data that satisfies the query condition.
  • Taking the first correspondence in the first index partition shown in Table 8 above as an example, the query process within one index partition in step S320 is briefly described.
  • For the query condition Address:Longgang && Gender:Male, the decomposed keywords are "Address:Longgang" and "Gender:Male".
  • The index key corresponding to the keyword "Address:Longgang" is "A ⁇ Address ⁇ Longgang", and the set of internal data identifiers corresponding to the index key "A ⁇ Address ⁇ Longgang" is {1}.
  • The index key corresponding to the keyword "Gender:Male" is "A ⁇ Gender:Male", and the set of internal data identifiers corresponding to the index key "A ⁇ Gender:Male" is {1, 2}.
  • The internal data identifiers that satisfy both keywords are therefore {1}, that is, the internal data identifier of the target data that satisfies the query condition is {1}.
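  • The lookup within one index partition can be sketched as follows for a query whose keywords are all joined by "and". The names `COLUMN_MODES`, `to_index_key`, and `query_ids` are hypothetical, the connector "|" is an assumed stand-in, and only the two index keys quoted in the example above are included in the first correspondence.

```python
CONNECTOR = "|"  # assumed stand-in for the connector character

# Per-column configuration: (keyword mode, index key mode), mirroring the example
# in which Address uses mode 1 with mode A and Gender uses mode 3 with mode B.
COLUMN_MODES = {"Address": (1, "A"), "Gender": (3, "B")}


def to_index_key(partition_id, term):
    """Turn a query keyword such as 'Gender:Male' into its index key."""
    column, segment = term.split(":", 1)
    keyword_mode, key_mode = COLUMN_MODES[column]
    keyword = segment if keyword_mode == 1 else f"{column}:{segment}"
    if key_mode == "A":
        return CONNECTOR.join([partition_id, column, keyword])
    return CONNECTOR.join([partition_id, keyword])


def query_ids(first, partition_id, terms):
    """Intersect the identifier sets of all keywords (logical 'and')."""
    groups = [first.get(to_index_key(partition_id, t), set()) for t in terms]
    return set.intersection(*groups) if groups else set()


first = {"A|Address|Longgang": {1}, "A|Gender:Male": {1, 2}}
print(query_ids(first, "A", ["Address:Longgang", "Gender:Male"]))  # {1}
```

  • An "or" between keywords would correspond to a set union instead of the intersection used here.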
  • After the internal data identifier of the target data is obtained, the row primary key corresponding to that internal data identifier is looked up in the second correspondence, the target data is queried in the data area corresponding to the data table according to the row primary key, and the query result is generated.
  • The query condition may carry indication information indicating the metadata of the data table to be queried, where the metadata of the data table further includes information indicating the data area that stores the data table.
  • The storage device can therefore determine the data area corresponding to the data table based on the query condition and query the target data in that data area.
  • For example, if the internal data identifier of the target data is 1, the corresponding row primary key can be determined as A0001 from the second correspondence shown in Table 7 above, and the data content of A0001 is then looked up in the data area.
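  • The second stage of the query, from internal data identifiers to row primary keys and then to the data itself, can be sketched as follows. The function `fetch_rows` is hypothetical, the data area is modelled as a plain dictionary, and the row contents are illustrative placeholders.

```python
def fetch_rows(target_ids, second, data_area):
    """Resolve internal data identifiers to row primary keys, then to row data."""
    rows = []
    for internal_id in sorted(target_ids):
        row_key = second[internal_id]                   # second correspondence
        rows.append((row_key, data_area[row_key]))      # lookup in the data area
    return rows


# Second correspondence of the example: internal data identifier -> row primary key.
second = {1: "A0001", 2: "A0002", 3: "B0001"}
# Illustrative data area contents (placeholder column values).
data_area = {
    "A0001": {"Address": "Longgang", "Gender": "Male"},
    "A0002": {"Address": "Shandong Jinan", "Gender": "Male"},
    "B0001": {"Address": "(placeholder)", "Gender": "(placeholder)"},
}
print(fetch_rows({1}, second, data_area))
# [('A0001', {'Address': 'Longgang', 'Gender': 'Male'})]
```

  • Because the data area is reached only through the row primary key, the index partition never needs to know where a row is physically stored.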
  • Optionally, in order to improve the processing speed, the query process can be implemented by constructing a bitmap index.
  • In this case, the index information may include a bitmap index and index positions. The bitmap index includes the correspondence between index keys and bitmap vectors; each bitmap vector includes, for each piece of data, a bit indicating whether that piece of data satisfies the corresponding index key. The index positions give the position of each piece of data in the bitmap vector.
  • The index positions are analogous to the second correspondence between the multiple row primary keys and the multiple internal data identifiers, and the bitmap index is analogous to the first correspondence between the multiple index keys and the multiple sets of internal data identifiers.
  • Table 11 shows the index positions corresponding to the second correspondence of Table 7, and Table 12 shows the bitmap vectors corresponding to the first correspondence of Table 8.
  • For the query condition Address:Longgang && Gender:Male, the decomposed keywords are "Address:Longgang" and "Gender:Male".
  • The index key corresponding to the keyword "Address:Longgang" is "A ⁇ Address ⁇ Longgang", and the bitmap vector corresponding to the index key "A ⁇ Address ⁇ Longgang" is {1000}.
  • The index key corresponding to the keyword "Gender:Male" is "A ⁇ Gender:Male", and the bitmap vector corresponding to the index key "A ⁇ Gender:Male" is {1100}.
  • A logical "and" operation is performed on the bitmap vectors {1000} and {1100}, so the target data that satisfies the query condition is the data at the first bit of the bitmap vector.
  • The row primary key of that data is determined from the index positions in Table 11 as A0001, and the data content of A0001 is then looked up in the data area.
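  • The bitmap variant of the same query can be sketched as follows. The functions `and_bitmaps` and `matching_row_keys` are hypothetical, bitmap vectors are written as bit strings whose first character is the first bit, and the list of index positions is an assumed ordering in which only the first entry (A0001) is confirmed by the example above.

```python
def and_bitmaps(vectors):
    """Bitwise AND of equally long bitmap vectors given as bit strings."""
    width = len(vectors[0])
    result = (1 << width) - 1  # start with all bits set
    for vector in vectors:
        result &= int(vector, 2)
    return format(result, f"0{width}b")


def matching_row_keys(bitmap, index_positions):
    """Map every set bit to the row primary key stored at that position."""
    return [index_positions[i] for i, bit in enumerate(bitmap) if bit == "1"]


# Assumed index positions: position in the bitmap vector -> row primary key.
index_positions = ["A0001", "A0002", "B0001", "B0002"]

combined = and_bitmaps(["1000", "1100"])
print(combined)                                      # 1000
print(matching_row_keys(combined, index_positions))  # ['A0001']
```

  • Combining conditions then costs a few machine-word operations per bitmap vector, which is the source of the speed-up over intersecting explicit identifier lists.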
  • Therefore, in the method for querying data provided by the embodiments of the present application, the index information of each index partition includes the first correspondence and the second correspondence, where the first correspondence represents a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple sets of internal data identifiers, and the second correspondence represents a one-to-one correspondence between the multiple row primary keys generated from those pieces of data and multiple internal data identifiers.
  • Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the second correspondence does not change when multiple data files are merged into one data file, and therefore the first correspondence does not change either. When data that satisfies the query condition is queried, it can thus be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves data query efficiency.
  • the method for querying data in the embodiment of the present application is described in detail above with reference to FIG. 2 to FIG. 4 .
  • the apparatus for querying data according to the embodiment of the present application is described in detail below with reference to FIG. 5 to FIG.
  • The technical features described in the method embodiments are equally applicable to the following apparatus embodiments.
  • the apparatus for querying data in the embodiment of the present application may be deployed on at least one node in the distributed storage system.
  • FIG. 5 is a schematic block diagram of an apparatus for querying data according to an embodiment of the present application.
  • the apparatus includes a processing unit 410 and a storage unit 420, wherein the storage unit 420 is configured to store data and index information, and the processing unit 410 is configured to:
  • The first index information includes a first correspondence and a second correspondence for the stored M pieces of data, where:
  • the first correspondence represents a one-to-one correspondence between the N index keys generated from the M pieces of data and N sets of internal data identifiers, and each set of internal data identifiers includes the internal data identifier of at least one of the M pieces of data and identifies the data that satisfies the corresponding index key;
  • the second correspondence represents a one-to-one correspondence between the M row primary keys generated from the M pieces of data and the M internal data identifiers of the M pieces of data; and
  • M and N are both integers greater than or equal to 1.
  • Therefore, after acquiring data (for example, the first data), the apparatus for querying data provided by the embodiments of the present application updates the first index information in the first index partition corresponding to the first data according to the index keys generated from at least part of the first data (for example, L columns of data), the row primary key of the first data, and the internal data identifier of the first data. The first index information includes a first correspondence and a second correspondence for the M stored pieces of data: the first correspondence is the relationship between the N index keys generated from the M pieces of data and the N groups of internal data identifiers, and the second correspondence is the relationship between the M row primary keys generated from the M pieces of data and the M internal data identifiers.
  • Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the correspondence between the row primary key and the internal data identifier does not change when multiple data files are merged into one, and so the correspondence between the index keys generated from the data and the internal data identifiers does not change either. The data can therefore be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.
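  • Optionally, the processing unit 410 is specifically configured to:
  • let i take each value in the range [1, L] and, for each i, extract at least one keyword from any one of the following items: at least one word segment in the i-th column of the first data; the at least one word segment and the row primary key of the first data; or the at least one word segment and the column name of the i-th column of the first data, where the at least one keyword is in one-to-one correspondence with the at least one word segment;
  • if the item is the at least one word segment, each keyword includes the word segment corresponding to that keyword; or
  • if the item is the at least one word segment and the row primary key of the first data, each keyword includes the word segment corresponding to that keyword and a keyword from the row primary key of the first data; or
  • if the item is the at least one word segment and the column name of the i-th column of the first data, each keyword includes the word segment corresponding to that keyword and the column name of the i-th column of the first data;
  • and generate, from each of the at least one keyword, the index key corresponding to that keyword.
  • Therefore, the apparatus for querying data provided by the embodiments of the present application extracts the at least one keyword from any one of the following: at least one word segment in any column (for example, the i-th column) of the L columns of the first data; the at least one word segment and the row primary key of the first data; or the at least one word segment and the column name of the i-th column. This makes keyword extraction more flexible and thereby improves query efficiency.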
  • Optionally, the processing unit 410 is specifically configured to:
  • generate the index key corresponding to each keyword from that keyword, the column name of the i-th column of the first data, and a first index partition identifier that identifies the first index partition.
  • Optionally, the processing unit 410 is specifically configured to:
  • generate the index key corresponding to each keyword from that keyword and a first index partition identifier that identifies the first index partition.
  • the first index information is stored in the first storage area, and the M pieces of data are stored in the second storage area, and the first storage area is isolated from the second storage area.
  • Therefore, by isolating the first storage area that stores the index information from the second storage area that stores the data, the apparatus for querying data provided by the embodiments of the present application ensures that changes to the data partitions of a data table do not affect the content of the index information and that rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.
  • The apparatus 400 may correspond to (for example, may be configured in, or may itself be) the device for querying data (for example, a storage device) described in the above method 200, and each module or unit in the apparatus 400 is configured to perform the operations or processes performed by that device in the method 200. Detailed descriptions are omitted here to avoid redundancy.
  • The apparatus 400 may be a device for querying data (for example, a storage device).
  • FIG. 7 is a schematic structural diagram of a device 600 for querying data according to an embodiment of the present application.
  • The device 600 for querying data may include a processor 610 and a memory 620, and the processor 610 and the memory 620 are communicatively connected.
  • The memory 620 may be used to store instructions, and the processor 610 is used to execute the instructions stored in the memory 620.
  • The processing unit 410 in the apparatus 400 shown in FIG. 5 may correspond to the processor 610 in the device 600 shown in FIG. 7, and the storage unit 420 in the apparatus 400 shown in FIG. 5 may correspond to the memory 620 in the device 600 shown in FIG. 7.
  • Alternatively, the apparatus 400 may be a chip (or a chip system) installed in a device for querying data (for example, a storage device).
  • In this case, the apparatus 400 may include a processor and a memory that is communicatively connected to the processor.
  • The memory may be used to store instructions, and the processor is used to execute the instructions stored in the memory.
  • the processing unit 410 in the apparatus 400 shown in FIG. 5 can correspond to the processor, and the storage unit 420 in the apparatus 400 shown in FIG. 5 can correspond to the memory.
  • With the apparatus for querying data provided by the embodiments of the present application, first, after data (for example, the first data) is acquired, the first index information in the first index partition corresponding to the first data is updated according to the index keys generated from at least part of the first data (for example, L columns of data), the row primary key of the first data, and the internal data identifier of the first data. The first index information includes a first correspondence and a second correspondence for the M stored pieces of data: the first correspondence is the relationship between the N index keys generated from the M pieces of data and the N groups of internal data identifiers, and the second correspondence is the relationship between the M row primary keys generated from the M pieces of data and the M internal data identifiers. Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the correspondence between the row primary key and the internal data identifier does not change when multiple data files are merged into one, and so the correspondence between the index keys and the internal data identifiers does not change either. The data can therefore be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.
  • Second, the at least one keyword is extracted from any one of the following: at least one word segment in any column (for example, the i-th column) of the L columns of the first data; the at least one word segment and the row primary key of the first data; or the at least one word segment and the column name of the i-th column. This makes keyword extraction more flexible and thereby improves query efficiency.
  • Third, because the first storage area that stores the index information is isolated from the second storage area that stores the data, changes to the data partitions of a data table do not affect the content of the index information, and rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.
  • FIG. 6 is a schematic block diagram of an apparatus for querying data according to an embodiment of the present application.
  • the apparatus includes a processing unit 510 and a storage unit 520, wherein the storage unit 520 is configured to store data and index information, and the processing unit 510 is configured to:
  • acquire a query condition;
  • query, according to the first correspondence in the index information of each of S index partitions, the internal data identifier of target data that satisfies the query condition, where the internal data identifier is unique within the index partition corresponding to the target data, and the S index partitions are index partitions determined according to the query condition; the first correspondence is a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple groups of internal data identifiers, and each group of internal data identifiers includes the internal data identifier of at least one of the multiple pieces of data and identifies the data that satisfies the corresponding index key;
  • look up the row primary key of the target data according to the internal data identifier of the target data and the second correspondence in the index information of each index partition, and generate, according to the row primary key of the target data, a query result that includes the target data, where the second correspondence is a one-to-one correspondence between multiple row primary keys generated from the multiple pieces of data and the internal data identifiers of the multiple pieces of data, and the row primary key is used to look up data in the data area;
  • feed back the query result.
  • Therefore, in the apparatus for querying data provided by the embodiments of the present application, the index information of each constructed index partition includes the first correspondence and the second correspondence, and the internal data identifier of a piece of data is unique within the index partition corresponding to that data. When multiple data files are merged into one, the second correspondence does not change, and therefore the first correspondence does not change either. As a result, data satisfying the query condition can be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.
  • the index information of the S index partitions is stored in the first storage area, and the data corresponding to the S index partitions is stored in the second storage area, where the first storage area is isolated from the second storage area.
  • Therefore, by isolating the first storage area that stores the index information from the second storage area that stores the data, the apparatus for querying data provided by the embodiments of the present application ensures that changes to the data partitions of a data table do not affect the content of the index information and that rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.
  • The apparatus 500 may correspond to (for example, may be configured in, or may itself be) the device for querying data (for example, a storage device) described in the above method 300, and each module or unit in the apparatus 500 is configured to perform the operations or processes performed by that device in the method 300. Detailed descriptions are omitted here to avoid redundancy.
  • The apparatus 500 may be a device for querying data (for example, a storage device).
  • FIG. 8 is a schematic structural diagram of a device 700 for querying data according to an embodiment of the present application.
  • The device 700 for querying data may include a processor 710 and a memory 720, and the processor 710 and the memory 720 are communicatively connected.
  • The memory 720 may be used to store instructions, and the processor 710 is used to execute the instructions stored in the memory 720.
  • The processing unit 510 in the apparatus 500 shown in FIG. 6 may correspond to the processor 710 in the device 700 shown in FIG. 8, and the storage unit 520 in the apparatus 500 shown in FIG. 6 may correspond to the memory 720 in the device 700 shown in FIG. 8.
  • Alternatively, the apparatus 500 may be a chip (or a chip system) installed in a device for querying data (for example, a storage device).
  • In this case, the apparatus 500 may include a processor and a memory that is communicatively connected to the processor.
  • The memory may be used to store instructions, and the processor is used to execute the instructions stored in the memory.
  • the processing unit 510 in the apparatus 500 shown in FIG. 6 can correspond to the processor, and the storage unit 520 in the apparatus 500 shown in FIG. 6 can correspond to the memory.
  • First, the index information of each constructed index partition includes a first correspondence and a second correspondence, where the first correspondence is a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple groups of internal data identifiers, and the second correspondence is a one-to-one correspondence between multiple row primary keys generated from those pieces of data and their internal data identifiers. Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the second correspondence does not change when multiple data files are merged into one, and therefore the first correspondence does not change either. As a result, data satisfying the query condition can be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.
  • Second, because the first storage area that stores the index information is isolated from the second storage area that stores the data, changes to the data partitions of a data table do not affect the content of the index information, and rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the foregoing method embodiment may be completed by an integrated logic circuit of hardware in a processor or an instruction in a form of software.
  • The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Such a processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • The steps of the method disclosed in the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor.
  • The software module may be located in a conventional storage medium such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • The volatile memory may be a random access memory (RAM), which is used as an external cache.
  • Many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function.
  • In an actual implementation there may be other ways of dividing them; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

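The present application provides a method and an apparatus for querying data. The method includes: acquiring first data; generating P index keys from L columns of data in the first data; and updating first index information in a first index partition corresponding to the first data according to the P index keys, the row primary key of the first data, and the internal data identifier of the first data. The internal data identifier of the first data is unique within the first index partition. The first index information includes a first correspondence and a second correspondence for M stored pieces of data, where the first correspondence is a one-to-one correspondence between N index keys generated from the M pieces of data and N groups of internal data identifiers, and the second correspondence is a one-to-one correspondence between M row primary keys generated from the M pieces of data and the M internal data identifiers of the M pieces of data. The efficiency of data queries can thereby be effectively improved.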

Description

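A Method for Querying Data
Technical Field
The present application relates to the field of storage, and more particularly, to a method and an apparatus for querying data in the field of storage.
Background
In data querying, an inverted index can be used to look up data by its content. An inverted index represents the correspondence between keywords and lists of data entities, where a data entity is an object that has the keyword (for example, a user), and a data entity list is the set of data entities that have the keyword.
In the prior art, the system assigns each data entity an integer (Int) identifier (ID), and data can be found through a constructed correspondence between a keyword and multiple IDs. For example, in the correspondence Address:Longgang->{1,2}, the keyword is Address:Longgang and the IDs are 1 and 2; the correspondence indicates that the entities with IDs 1 and 2 have the keyword. During a query, the corresponding IDs are determined from the keyword, and the corresponding data entities are then determined from the IDs.
However, when the underlying data files are merged, the correspondence between data entities and IDs may change, so the correspondence between the keyword and the IDs may become invalid. In an actual query, the data in the underlying database may then have to be read before data satisfying the condition can be found, which severely reduces query efficiency. In particular, when the query condition includes many keywords, the query may fail.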
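Summary
The present application provides a method for querying data that can effectively improve the efficiency of data queries.
According to a first aspect, a method for querying data is provided, the method including:
acquiring first data;
generating P index keys from L columns of data in the first data, where L is an integer greater than or equal to 1 and P is an integer greater than 1;
updating first index information in a first index partition corresponding to the first data according to the P index keys, the row primary key of the first data, and the internal data identifier of the first data, where the row primary key of the first data is used to look up the first data in a data area, the internal data identifier of the first data is unique within the first index partition, and the first index information includes a first correspondence and a second correspondence for M stored pieces of data, wherein
the first correspondence is a one-to-one correspondence between N index keys generated from the M pieces of data and N groups of internal data identifiers, each group of internal data identifiers includes the internal data identifier of at least one of the M pieces of data and identifies the data that satisfies the corresponding index key, the second correspondence is a one-to-one correspondence between M row primary keys generated from the M pieces of data and the M internal data identifiers of the M pieces of data, and M and N are both integers greater than or equal to 1.
Therefore, in the method for querying data provided by the embodiments of the present application, after data (for example, the first data) is acquired, the first index information in the first index partition corresponding to the first data is updated according to the index keys generated from at least part of the first data (for example, L columns of data), the row primary key of the first data, and the internal data identifier of the first data. The first index information includes a first correspondence and a second correspondence for the M stored pieces of data: the first correspondence is the relationship between the N index keys generated from the M pieces of data and the N groups of internal data identifiers, and the second correspondence is the relationship between the M row primary keys generated from the M pieces of data and the M internal data identifiers. Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the correspondence between the row primary key and the internal data identifier does not change when multiple data files are merged into one, and so the correspondence between the index keys generated from the data and the internal data identifiers does not change either. The data can therefore be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.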
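Optionally, generating the P index keys from the L columns of data in the first data includes:
letting i take each value in the range [1, L], and generating the P index keys through the following steps:
extracting at least one keyword from any one of the following items: at least one word segment in the i-th column of the first data; or the row primary key of the first data; or the column name of the i-th column of the first data, where the at least one keyword is in one-to-one correspondence with the at least one word segment in the i-th column,
if the item includes the at least one word segment in the i-th column of the first data, each keyword includes the word segment corresponding to that keyword; or
if the item includes the at least one word segment in the i-th column of the first data and the row primary key of the first data, each keyword includes the word segment corresponding to that keyword and a keyword from the row primary key of the first data; or
if the item includes the at least one word segment in the i-th column of the first data and the column name of the i-th column of the first data, each keyword includes the word segment corresponding to that keyword and the column name of the i-th column of the first data;
generating, from each of the at least one keyword, the index key corresponding to that keyword.
Therefore, in the method for querying data provided by the embodiments of the present application, the at least one keyword corresponding to the at least one word segment is extracted from any one of the following: at least one word segment in any column (for example, the i-th column) of the L columns of the first data; the at least one word segment and the row primary key of the first data; or the at least one word segment and the column name of the i-th column. This makes keyword extraction more flexible and thereby improves query efficiency.
Optionally, generating, from each of the at least one keyword, the index key corresponding to that keyword includes:
generating the index key corresponding to each keyword from that keyword, the column name of the i-th column of the first data, and a first index partition identifier that identifies the first index partition.
Optionally, generating, from each of the at least one keyword, the index key corresponding to that keyword includes:
generating the index key corresponding to each keyword from that keyword and a first index partition identifier that identifies the first index partition.
Optionally, the first index information is stored in a first storage area, the M pieces of data are stored in a second storage area, and the first storage area is isolated from the second storage area.
Therefore, in the method for querying data provided by the embodiments of the present application, isolating the first storage area that stores the index information from the second storage area that stores the data ensures that changes to the data partitions of a data table do not affect the content of the index information and that rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.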
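According to a second aspect, a method for querying data is provided, the method including:
acquiring a query condition;
querying, according to the first correspondence in the index information of each of S index partitions, the internal data identifier of target data that satisfies the query condition, where the internal data identifier is unique within the index partition corresponding to the target data, the S index partitions are index partitions determined according to the query condition, the first correspondence is a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple groups of internal data identifiers, and each group of internal data identifiers includes the internal data identifier of at least one of the multiple pieces of data and identifies the data that satisfies the corresponding index key;
looking up the row primary key of the target data according to the internal data identifier of the target data and the second correspondence in the index information of each index partition, and generating, according to the row primary key of the target data, a query result that includes the target data, where the second correspondence is a one-to-one correspondence between multiple row primary keys generated from the multiple pieces of data and the internal data identifiers of the multiple pieces of data, and the row primary key is used to look up data in the data area;
feeding back the query result.
Therefore, in the method for querying data provided by the embodiments of the present application, the index information of each constructed index partition includes a first correspondence and a second correspondence, where the first correspondence is a one-to-one correspondence between multiple index keys generated from multiple pieces of data and multiple groups of internal data identifiers, and the second correspondence is a one-to-one correspondence between multiple row primary keys generated from those pieces of data and their internal data identifiers. Because the internal data identifier of a piece of data is unique within the index partition corresponding to that data, the second correspondence does not change when multiple data files are merged into one, and therefore the first correspondence does not change either. As a result, data satisfying the query condition can be read quickly from the index information already cached in memory, without re-reading the index information from the underlying database, which improves query efficiency.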
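The two-step lookup of the second aspect can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: the query condition is assumed to be a conjunction of keywords, index keys are assumed to have the form partition identifier + "^" + keyword, and the names `query_row_keys`, `query`, `key_to_ids`, and `id_to_row_key` are hypothetical.

```python
def query_row_keys(index_partitions, keywords, sep="^"):
    """Return the row primary keys of rows matching ALL keywords (assumed AND semantics)."""
    row_keys = []
    for partition_id, info in index_partitions.items():
        # First correspondence: index key -> group of internal data identifiers.
        id_groups = [
            info["key_to_ids"].get(f"{partition_id}{sep}{kw}", set())
            for kw in keywords
        ]
        if not id_groups:
            continue
        hits = set.intersection(*id_groups)
        # Second correspondence: internal data identifier -> row primary key.
        row_keys.extend(info["id_to_row_key"][i] for i in sorted(hits))
    return row_keys


def query(index_partitions, data_area, keywords):
    """Fetch the target rows from the data area by row primary key."""
    return [data_area[rk] for rk in query_row_keys(index_partitions, keywords)]


# Hypothetical index content; note that internal ID 1 appears in both
# partitions, because it only has to be unique within one index partition.
index_partitions = {
    "A": {
        "key_to_ids": {"A^Shandong": {1, 2}, "A^Male": {2}},
        "id_to_row_key": {1: "A0001", 2: "A0002"},
    },
    "B": {
        "key_to_ids": {"B^Shandong": {1}, "B^Male": {1}},
        "id_to_row_key": {1: "B0001"},
    },
}
data_area = {
    "A0001": {"ID": "A0001"},
    "A0002": {"ID": "A0002"},
    "B0001": {"ID": "B0001"},
}
result = query(index_partitions, data_area, ["Shandong", "Male"])
```

Because the data area is only touched through row primary keys, the cached index dictionaries stay valid however the data files behind `data_area` are reorganized.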
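Optionally, the index information of the S index partitions is stored in a first storage area, the data corresponding to the S index partitions is stored in a second storage area, and the first storage area is isolated from the second storage area.
Therefore, in the method for querying data provided by the embodiments of the present application, isolating the first storage area that stores the index information from the second storage area that stores the data ensures that changes to the data partitions of a data table do not affect the content of the index information and that rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.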
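According to a third aspect, an apparatus for querying data is provided, configured to perform the method in the first aspect or any possible implementation of the first aspect. Specifically, the apparatus includes units for performing the method in the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, an apparatus for querying data is provided, configured to perform the method in the second aspect or any possible implementation of the second aspect. Specifically, the apparatus includes units for performing the method in the second aspect or any possible implementation of the second aspect.
According to a fifth aspect, a device for querying data is provided. The device includes a processor and a memory; the memory is configured to store computer-executable instructions, and the processor and the memory communicate with each other through an internal connection path. When the device runs, the processor executes the computer-executable instructions stored in the memory, so that the device performs the method in the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, a device for querying data is provided. The device includes a processor and a memory; the memory is configured to store computer-executable instructions, and the processor and the memory communicate with each other through an internal connection path. When the device runs, the processor executes the computer-executable instructions stored in the memory, so that the device performs the method in the second aspect or any possible implementation of the second aspect.
According to a seventh aspect, a computer storage medium is provided. The computer storage medium includes computer-executable instructions, and when a processor of a computer executes the computer-executable instructions, the computer performs the method in any possible implementation of the first aspect to the second aspect.
According to an eighth aspect, a chip is provided. The chip includes a processor and a memory, the processor is configured to execute instructions stored in the memory, and when the instructions are executed, the processor can implement the method in any possible implementation of the first aspect to the second aspect.
According to a ninth aspect, a computer program is provided. When the computer program is executed on a computer, it causes the computer to implement the method in any possible implementation of the first aspect to the second aspect.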
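Brief Description of the Drawings
FIG. 1 is a schematic diagram of a data storage system to which the embodiments of the present application are applicable.
FIG. 2 is a schematic flowchart of a method for querying data according to an embodiment of the present application.
FIG. 3 is a schematic diagram of storing internal data identifiers in the underlying database according to an embodiment of the present application.
FIG. 4 is a schematic flowchart of a method for querying data according to an embodiment of the present application.
FIG. 5 and FIG. 6 are schematic block diagrams of apparatuses for querying data according to embodiments of the present application.
FIG. 7 and FIG. 8 are schematic structural diagrams of devices for querying data according to embodiments of the present application.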
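Detailed Description
The problems of the prior art are first described briefly with reference to the Background.
As described in the Background, when the underlying data files are merged, the correspondence between data entities and IDs may change, so the correspondence between a keyword and multiple IDs may become invalid.
For example, the data written by the system at time t1 is {data 1, data 2, data 5, data 8, data 9, data 19}, and the index data representing the correspondence between data entities and IDs is {1:data 1, 2:data 2, 3:data 5, 4:data 8, 5:data 9, 6:data 19}. The data entities data 1, data 5, and data 9 all contain the keyword "shopping expert", so the index data representing the correspondence between the keyword "shopping expert" and IDs is: shopping expert->1,3,5. When the keyword "shopping expert" is entered in a query, the matching IDs {1,3,5} are first found from this index data, and the data is then found through the corresponding data entities {data 1, data 5, data 9}. Later, at time t2, the system writes new data {data 3, data 12, data 15, data 28}, for which the index data between data entities and IDs is {1:data 3, 2:data 12, 3:data 15, 4:data 28}. The data entities data 3 and data 15 both contain the keyword "shopping expert", so the index data between the keyword and IDs is: shopping expert->1,3. At time t3, the system merges the data of time t1 and time t2, and the index data between data entities and IDs changes to {1:data 1, 2:data 2, 3:data 3, 4:data 5, 5:data 8, 6:data 9, 7:data 12, 8:data 15, 9:data 19, 10:data 28}. Correspondingly, the index data between the keyword "shopping expert" and IDs becomes: shopping expert->1,3,4,6,8.
As a result, both the keyword-to-ID index data and the entity-to-ID index data stored by the system at time t1 and time t2 become invalid. In practice, index data occupies a large amount of system resources, and merging data is both inevitable and frequent because it is needed to improve read performance. When index data is invalidated, an actual query may therefore have to read the data in the underlying database before data satisfying the condition can be found, which severely reduces query efficiency. In particular, when the query condition includes many keywords, the query may fail.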
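The t1/t2/t3 example above can be replayed in a few lines. The following is a minimal sketch of the prior-art behaviour only, under the assumption that each data file numbers its entities 1..n in sorted order; the function names are hypothetical and entities are written as plain integers (1 stands for "data 1").

```python
def assign_sequential_ids(entities):
    """Number the entities of one data file 1..n in sorted order."""
    return {i: e for i, e in enumerate(sorted(entities), start=1)}


def posting_list(id_to_entity, matching_entities):
    """Return the sorted IDs whose entity contains the keyword."""
    return sorted(i for i, e in id_to_entity.items() if e in matching_entities)


# Entities that contain the keyword "shopping expert".
shoppers = {1, 3, 5, 9, 15}

file_t1 = assign_sequential_ids([1, 2, 5, 8, 9, 19])
file_t2 = assign_sequential_ids([3, 12, 15, 28])
merged = assign_sequential_ids(list(file_t1.values()) + list(file_t2.values()))

before_t1 = posting_list(file_t1, shoppers)
before_t2 = posting_list(file_t2, shoppers)
after_merge = posting_list(merged, shoppers)

# Resolving the posting list cached at t1 against the merged file returns the
# wrong entities, which is why the cached index data has to be discarded.
stale_hits = [merged[i] for i in before_t1]
```

After the merge, the cached list built at t1 resolves to data 1, data 3, and data 8 rather than data 1, data 5, and data 9, which illustrates the invalidation described above.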
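In view of the above problems, the embodiments of the present application provide a method for querying data that can effectively solve them.
FIG. 1 is a schematic diagram of a data storage system to which the embodiments of the present application are applicable. The data storage system 100 includes a terminal device 110 and a device 120, and the terminal device can be connected to the device 120 through a wired or wireless network.
The terminal device 110 can request data queries and request data storage. Specifically, a client capable of requesting data queries and data storage, such as a browser, may be installed on the terminal device 110. The terminal device 110 may be a mobile phone, a tablet computer, an e-reader, a personal computer, an in-vehicle device, a wearable device, or the like. Optionally, the terminal device 110 has the function of requesting data storage.
The device 120 for querying data provides data query and data storage functions. It can store data in response to a data storage request sent by a user through the client of the terminal device 110, and can query the stored data in response to a query request sent by the terminal device 110. The device 120 for querying data may be a computing device, a storage device, a server, or another device used to query and store data. A database provided in the device 120 is used to store the data. Optionally, the database may be a distributed database such as HBase, Mongo Database (Mongo DB), Distribute Relational Database Service (DRDS), Volt Database (Volt DB), or Cassandra.
It should be understood that the data storage system shown in FIG. 1 is only illustrative and should not limit the embodiments of the present application.
For example, the data storage system may include only the device 120 for querying data, which then not only performs queries but can also request them. In this case, the device 120 for querying data can receive a query condition entered by the user through a client on the device 120 itself.
For ease of description, the embodiments of the present application are described by taking the case in which the device 120 for querying data is a storage device as an example.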
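To aid understanding, the concepts and terms used in the following embodiments are first introduced briefly.
1. Key Value
The method for querying data described in the embodiments of the present application can be applied to a distributed storage system that supports key-value (Key Value, KV) pairs. In such a storage system, data is stored in units of key-value pairs, and many key-value pairs are saved in a corresponding file. The value corresponding to a Key can be determined quickly by looking up the Key, which makes large-scale real-time processing possible. If a row of data has multiple columns, each column is stored as a separate Key Value, and the Key Values of the same row share the same Key.
In addition, when data is saved to the distributed storage system, it is sorted naturally in the lexicographic order of its Keys. This guarantees that the parts of one piece of data (in other words, the different data of one data entity) are stored adjacently, so that when the parts of a piece of data need to be queried, the indexing mechanism of the distributed storage system can quickly find the content that satisfies the condition.
As shown in Table 1, take two records of an online transaction system as an example, and assume that each record includes a user code, a transaction time, a transaction amount, and a transaction remark. The Key and Value can be designed as follows: Key: user code + transaction time; Value: the details of the transaction.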
Table 1
Figure PCTCN2018100565-appb-000001
Each column of data is stored as a separate Key Value, and the Key Values of the same row share the same Key. The two records in Table 1 therefore yield the 8 Key Values shown in Table 2.
Table 2
Key:U00001201711110056->Value:[user code:U00001]
Key:U00001201711110056->Value:[transaction time:201711110056]
Key:U00001201711110056->Value:[transaction amount:99]
Key:U00001201711110056->Value:[transaction remark:clothes]
Key:U00002201711110120->Value:[user code:U00002]
Key:U00002201711110120->Value:[transaction time:201711110120]
Key:U00002201711110120->Value:[transaction amount:198]
Key:U00002201711110120->Value:[transaction remark:books]
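The expansion from rows to per-column Key Values can be sketched as below. This is an illustrative sketch rather than the storage system's actual code: the function name `row_to_key_values` and the English column names are assumptions, and the row values are taken from Table 2.

```python
def row_to_key_values(row, key_columns):
    """Expand one row into one (key, (column, value)) pair per column."""
    key = "".join(str(row[c]) for c in key_columns)
    return [(key, (column, value)) for column, value in row.items()]


rows = [
    {"user_code": "U00001", "trade_time": "201711110056", "amount": 99, "remark": "clothes"},
    {"user_code": "U00002", "trade_time": "201711110120", "amount": 198, "remark": "books"},
]

# Key = user code + transaction time, as in the design described above.
key_values = [
    kv for row in rows for kv in row_to_key_values(row, ("user_code", "trade_time"))
]
```

Sorting `key_values` by key keeps the four entries of each row adjacent, which mirrors the lexicographic ordering the storage system relies on.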
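2. Data partitions and index partitions
In a distributed storage system, the storage device can store the data to be stored in different data partitions. Likewise, the index information for that data can be stored in different index partitions.
Therefore, before storing data, the storage device can set up data partitions and index partitions for the data to be stored in advance.
Specifically, the storage device can set up the data partitions according to preset data partition information, configured in advance by the user, that describes how the data is partitioned; the preset data partition information may include at least one of split points and the number of data partitions. The storage device can set up the index partitions according to preset index partition information that describes how the index is partitioned. The preset index partition information may be generated from the preset data partition information, or from the preset data partition information together with configuration information that configures the index partitioning, for example, the number of index partitions.
The process of setting up data partitions is described first.
Assume that the data to be stored consists of multiple pieces of data stored as Key Values, each with a key. Multiple data partitions can then be set up for the data according to its keys. The partitioning method commonly used for distributed Key Value data today is Range partitioning, which is described briefly below.
Range partitioning divides data by ranges of keys in lexicographic order: a piece of data belongs to the partition whose interval contains its key. In other words, one data partition stores the data within one range of key values. This storage mechanism preserves the original order of the data and effectively improves read performance.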
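For example, if the preset data partition information is A, B, C, D, E, F, G, H, I, where the letters represent key values in order, 9 data partitions can be set up for the data to be stored:
Partition 1: [A,B)
Partition 2: [B,C)
Partition 3: [C,D)
Partition 4: [D,E)
Partition 5: [E,F)
Partition 6: [F,G)
Partition 7: [G,H)
Partition 8: [H,I)
Partition 9: [I,)
Each of these data partitions is a left-closed, right-open interval. Taking partition 1 as an example, [A,B) means that data whose key is greater than or equal to A and less than B is stored in partition 1. Taking partition 9 as an example, [I,) means that data whose key is greater than or equal to I is stored in partition 9. Optionally, each data partition may instead be a left-open, right-closed interval, which is not limited in the embodiments of the present application.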
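Locating the partition for a key under Range partitioning amounts to a binary search over the split points. The sketch below assumes the left-closed, right-open intervals listed above and Python's default string ordering as the lexicographic order; `partition_for_key` is a hypothetical helper name.

```python
from bisect import bisect_right

SPLIT_POINTS = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]


def partition_for_key(key, split_points=SPLIT_POINTS):
    """Return the 1-based number of the partition [s_k, s_k+1) that contains key."""
    index = bisect_right(split_points, key)
    if index == 0:
        # The example defines no partition for keys below the first split point.
        raise KeyError(f"key {key!r} is below the first split point")
    return index


partition_of_apple = partition_for_key("Apple")  # falls in [A,B)
partition_of_b = partition_for_key("B")          # lower bound is inclusive
partition_of_zoo = partition_for_key("Zoo")      # falls in the open-ended [I,)
```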
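It should be noted that the above way of setting up data partitions is only illustrative, and data partitions may also be set up in other ways in the embodiments of the present application. For example, when the preset data partition information includes the number N of data partitions, the storage device can divide the data to be stored evenly into N data partitions according to the value range of the keys.
In addition, each data partition can split or expand automatically while data is being stored. For example, after N data partitions have been created according to the preset data partition information, more and more data is stored over time. To avoid the situation in which a data partition fills up and subsequent data can no longer be stored in it, the server can split the data partition.
The process of setting up index partitions is described next.
Take the case in which the preset index partition information is generated from the preset data partition information and the configuration information. Assume that the configuration information specifies i index partitions and that the preset data partition information yields j data partitions. Then, when i<j, each index partition corresponds to [j/i] data partitions (j/i rounded down), and any remaining data partitions are assigned in turn to an index partition. For example, if the data partition information is A, B, C, D, E, F, G, H, I and the configuration information specifies 3 index partitions, the partitions described by the preset index partition information are [A,D), [D,G), and [G,). As another example, if the configuration information specifies 4 index partitions, the partitions described by the preset index partition information are [A,C), [C,E), [E,G), and [G,).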
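One way to derive the index partition boundaries from the data split points is sketched below. The text does not spell out how leftover data partitions are distributed; the sketch assumes the last index partition absorbs them, which is an assumption chosen because it reproduces both examples above. The function name is hypothetical.

```python
def index_partition_bounds(split_points, num_index_partitions):
    """Group j data partitions into i index partitions of floor(j / i) each.

    Assumption: any leftover data partitions go to the last index partition.
    Returns (lower, upper) pairs; upper is None for the open-ended last range.
    """
    j = len(split_points)
    i = num_index_partitions
    if not 0 < i <= j:
        raise ValueError("need 0 < number of index partitions <= number of data partitions")
    step = j // i
    starts = [split_points[k * step] for k in range(i)]
    return [(starts[k], starts[k + 1] if k + 1 < i else None) for k in range(i)]


three = index_partition_bounds(list("ABCDEFGHI"), 3)
four = index_partition_bounds(list("ABCDEFGHI"), 4)
```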
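3. Global data identifier and internal data identifier of data
Generally, data exists in the form of data tables. The data in one data table can be stored in multiple data partitions, and the index information of the data in the same data table can likewise be stored in multiple index partitions, each of which stores the index information of part of the data.
In this case, the global data identifier of a piece of data is defined with respect to all the index partitions corresponding to the data table to which the data belongs; that is, the global data identifier is unique across those index partitions. The internal data identifier of a piece of data is defined with respect to the index partition corresponding to the data; that is, the internal data identifier is unique within that index partition.
For example, data #1 is a piece of data in data table #1, the index information of data table #1 is stored in index partition #1 and index partition #2, and the index information of data #1 is stored in index partition #1. The global data identifier of data #1 is then unique across the two index partitions, and the internal data identifier of data #1 is unique within index partition #1.
Optionally, the internal data identifier of a piece of data is an integer identifier.
The internal data identifiers can be assigned according to the order in which the data is written. For example, in index partition #1, if data #1 is the first data written, its internal data identifier can be 1, and if data #2 is the second data written, its internal data identifier can be 2.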
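Assigning internal data identifiers in write order, independently for each index partition, can be sketched as follows. The class name `InternalIdAllocator` and the counter starting at 1 are illustrative assumptions based on the example above.

```python
class InternalIdAllocator:
    """Hand out integer identifiers in write order, separately per index partition."""

    def __init__(self):
        self._next_id = {}

    def allocate(self, index_partition):
        internal_id = self._next_id.get(index_partition, 1)
        self._next_id[index_partition] = internal_id + 1
        return internal_id


allocator = InternalIdAllocator()
id_of_data_1 = allocator.allocate("index partition #1")  # first write in partition #1
id_of_data_2 = allocator.allocate("index partition #1")  # second write in partition #1
id_in_other = allocator.allocate("index partition #2")   # counters are independent
```

The same integer can therefore appear in two index partitions, which is why the identifier is described as unique only within its own index partition.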
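4. Row primary key and index key
The row primary key is the Key of a piece of data. Using the row primary key, the corresponding data can be found quickly in the data area in which the data is stored.
In addition, because the global data identifier of a piece of data is unique across the index partitions corresponding to the data table to which the data belongs, in an optional implementation the global data identifier of the data is used as its row primary key.
An index key is a Key generated from a keyword extracted from the data. Using an index key, at least one piece of data satisfying the query condition can be found.
For details of the row primary key and the index key, refer to the description below.
The method for querying data in the embodiments of the present application is described in detail below with reference to FIG. 2 and FIG. 3.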
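FIG. 2 is a schematic flowchart of a method 200 for querying data according to an embodiment of the present application. The method 200 may be performed by a storage device or by a processor in a storage device.
In step S210, first data is acquired.
The first data includes multiple columns of data; each column includes a column name and a corresponding column value, and each column represents different content. The first data may be any row of a data table.
Taking Table 3 as an example, the first data may be any row shown in Table 3. The first data includes 7 columns: column 1 is the ID of the data entity, column 2 is its name, column 3 is its phone number, column 4 is its address, column 5 is its gender, column 6 is its education level, and column 7 is its marital status.
Table 3
Figure PCTCN2018100565-appb-000002
Here, the ID of the data entity can be understood as the global data identifier of the first data. The object identified by the global data identifier is the data entity described above, which may be a user.
Optionally, the global data identifier of the first data can be used as the row primary key of the first data, and each column of data is a Value corresponding to the row primary key of the first data.
By way of example and not limitation, the combination of the global data identifier of the first data and any column of data (for example, the phone number) can be used as the row primary key of the first data.
In step S210, the storage device can acquire the first data in multiple ways, which are not limited in the embodiments of the present application:
Optionally, the storage device can acquire the first data by receiving a data storage request sent by the terminal device, where the data storage request includes the first data;
Optionally, the storage device can acquire the first data from the terminal device on its own initiative;
Optionally, the storage device can acquire the first data from a database in which the first data is stored.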
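In step S220, P index keys are generated from L columns of data in the first data, where L and P are integers greater than or equal to 1.
The L columns are all or some of the columns of the first data.
Optionally, the L columns of the first data are determined according to index configuration information.
The index configuration information may include indication information for building the index; for example, the indication information may specify the columns or column families for which an index is to be built. The index configuration information may be stored in the metadata of the data table or as a separate file.
Here, a column family is a set of one or more columns. Data of the same column family is located in the same storage path, whereas data of different column families is isolated in different storage paths.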
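Continuing with Table 3 as an example, Table 3 shows the data in the data table to be stored, in which the first 4 columns belong to column family I and the last 3 columns belong to column family F. If the indication information in the index configuration information specifies that an index is to be built for columns 1 and 4 of column family I and for all columns of column family F, then L is 5.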
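The storage device can then generate the index keys through steps S221 and S222. Step S220 is described below in terms of these two steps.
S221: Extract P keywords from the L columns of data.
For a piece of data (for example, the first data), each column of the first data includes at least one word segment. A term extractor can be used to extract, from each column, keywords that consist at least of the word segments in that column, giving P keywords in total.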
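In an optional implementation, generating the P index keys from the L columns of data in the first data includes:
extracting at least one keyword from any one of the following items: at least one word segment in the i-th column of the first data; or the row primary key of the first data; or the column name of the i-th column of the first data, where the at least one keyword is in one-to-one correspondence with the at least one word segment in the i-th column,
if the item includes the at least one word segment in the i-th column of the first data, each keyword includes the word segment corresponding to that keyword; or
if the item includes the at least one word segment in the i-th column of the first data and the row primary key of the first data, each keyword includes the word segment corresponding to that keyword and a keyword from the row primary key of the first data; or
if the item includes the at least one word segment in the i-th column of the first data and the column name of the i-th column of the first data, each keyword includes the word segment corresponding to that keyword and the column name of the i-th column of the first data.
Specifically, the i-th column is any one of the L columns used to build the index, and the keywords corresponding to the i-th column can be extracted in the following 3 modes (mode 1, mode 2, and mode 3). The examples below take the 2nd row of Table 3 as the first data.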
方式1
从该第一数据的第i列数据的至少一个分词中提取对应的至少一个关键词。
即,将该第一数据的第i列数据中的分词作为对应于该第i列数据的关键词,也就是说,每个关键词包括对应于该每个关键词的分词。
例如,以表3中的第2行数据中的第4列数据为例,第4列数据包括两个分词:山东、济南。那么,提取的关键词即为:山东、济南。
方式2
从该第一数据的第i列数据的至少一个分词和该第一数据的行主键中,提取对应于该至少一个分词的至少一个关键词。
即,将该第一数据的第i列数据中的分词和该第一数据的行主键中的关键词作为对应于该第i列数据的关键词,也就是说,每个关键词包括对应于该每个关键词的分词和该第一数据的行主键中的关键词。
其中,当该第一数据的行主键只有一个关键词时,这个关键词就是该第一数据的行主键,则,对应于该第i列数据的关键词由该第一数据的第i列数据中的分词和该第一数据的行主键组成。
例如,以表3中的第2行数据中的第3列数据为例,第2行数据的全局数据标识为A0002,将全局数据标识作为数据的行主键,第3列数据包括分词:13555552222,那么,提取的关键词即为:A000213555552222。
再例如,同样以表3中的第2行数据中的第3列数据为例,第2行数据的行主键为A0002^20180101,该行主键包括两个关键词:A0002、20180101,第3列数据包括分词:13555552222,则,可以从该行主键中提取关键词“20180101”,由“20180101”和“13555552222”生成第3列数据中的关键词,即,提取的关键词为:20180101济南。
再例如,以表3中的第2行数据中的第4列数据为例,第2行数据的全局数据标识为A0002,将全局数据标识作为数据的行主键,第4列数据包括两个分词:山东、济南,那么,提取的关键词为:A0002山东、A0002济南。
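As another example, again taking the 4th column of data in the 2nd row of Table 3, the row primary key of the 2nd row of data is A0002^20180101, which includes two keywords: A0002 and 20180101, and the 4th column of data includes two word segments: 山东 and 济南. The keyword “20180101” may then be extracted from the row primary key; one keyword for the 4th column of data is generated from “20180101” and “山东”, and another from “20180101” and “济南”; that is, the extracted keywords are: 20180101山东 and 20180101济南.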
方式3
从该第一数据的第i列数据的至少一个分词和该第一数据的第i列数据的列名中,提取对应于该至少一个分词的至少一个关键词。
即,将该第一数据的第i列数据中的分词和该第一数据的第i列数据的列名作为对应于该第i列数据的关键词,也就是说,每个关键词包括对应于该每个关键词的分词和该第一数据的第i列数据的列名。
例如,以表3中的第2行数据中的第3列数据为例,第3列数据的列名为Phone,第3列数据包括的分词:13555552222,那么,提取的关键词即为:Phone:13555552222。
再例如,以表3中的第2行数据中的第4列数据为例,第4列数据的列名为Address,第4列数据包括两个分词:山东、济南,那么,提取的关键词为:Address:山东、Address:济南。
可选地,根据索引配置信息,确定基于L列数据中的每列数据提取关键词的提取方式。
也就是说,该索引配置信息中还包括用于指示提取关键词的提取方式。其中,针对不同列的数据,该索引配置信息可以设置不同的提取方式。
例如,继续以表3中的数据为例,假设需要对第3-5列数据提取关键词,可以为第3列数据设置方式3的提取方式,为第4列数据设置方式1的提取方式,为第5列数据设置方式2的提取方式。
因此,通过从第一数据的L列数据中的任一列数据(例如,第i列数据)中的至少一个分词、该至少一个分词和该第一数据的行主键、该至少一个分词和该第i列数据的列名中的任意一项中,提取对应于该至少一个分词的至少一个关键词,可以有效地提高系统提取关键词的灵活性,进而提高数据的查询效率。
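The three keyword extraction modes of step S221 can be sketched as follows. This is a minimal sketch, not the implementation of the application: the function names are hypothetical, word segmentation is assumed to have been done already, and the concatenation formats (direct concatenation for mode 2, a colon for mode 3) are inferred from the worked examples above.

```python
def extract_mode1(segments):
    """Mode 1: every word segment of the column is itself a keyword."""
    return list(segments)


def extract_mode2(segments, row_key_keyword):
    """Mode 2: a keyword taken from the row primary key is prepended to
    every word segment (plain concatenation, as in the examples)."""
    return [row_key_keyword + segment for segment in segments]


def extract_mode3(segments, column_name):
    """Mode 3: the column name and a colon are prepended to every segment."""
    return [f"{column_name}:{segment}" for segment in segments]


# Values from the 2nd row of Table 3 as quoted in the description.
print(extract_mode1(["山东", "济南"]))
print(extract_mode2(["13555552222"], "A0002"))
print(extract_mode2(["山东", "济南"], "20180101"))
print(extract_mode3(["13555552222"], "Phone"))
```

Because each mode is a plain function of the column's word segments, a per-column configuration only needs to record which function to apply.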
S222,基于该P个关键词生成该P个索引键。
在步骤S222中,基于该P个关键词生成该P个索引键的方式有两种(即,方式A和方式B),下面,以基于该第一数据的第i列数据生成的至少一个关键词为例,分别对两种方式进行说明。
此外,下文描述的第一索引分区为系统为包括该第一数据在内的数据预先配置的索引分区,即,基于该第一数据生成的索引存储在该第一索引分区中。该第一数据可以为一个数据表中的任一条数据,相应的,该第一索引分区也可以为该数据表中的数据对应的多个索引分区中的任一个索引分区。
方式A
根据该至少一个关键词、该第一数据的第i列数据的列名和用于标识该第一索引分区的第一索引分区标识生成该至少一个索引键。
例如,以表3中的第2行数据中的第4列数据为例,基于上述方式1提取的关键词为:山东,济南,该第一索引分区标识为A,那么,对应于关键词“山东”的索引键为“A^Address^山东”,对应于关键词“济南”的索引键为“A^Address^济南”。
再例如,以表3中的第2行数据中的第5列数据为例,基于上述方式3提取的关键词为:Gender:Male,该第一索引分区标识为A,那么,对应于关键词“Gender:Male”的索引键为“A^Gender^Gender:Male”。
可选地,系统可以为Address分配一个别名,这样,可以减少要存储的字节数量。
在一个索引键中,连接关键词、列名和第一索引分区标识的内容可以称为连接符,例如,上述例子中的“^”。
方式B
根据该至少一个关键词和用于标识该第一索引分区的第一索引分区标识生成该至少一个索引键。
例如,继续以表3中的第2行数据中的第4列数据为例,基于上述方式1提取的关键词为:山东,济南,该第一索引分区标识为A,那么,对应于关键词“山东”的索引键为“A^山东”,对应于关键词“济南”的索引键为“A^济南”。
再例如,以表3中的第2行数据中的第5列数据为例,基于上述方式3提取的关键词为:Gender:Male,该第一索引分区标识为A,那么,对应于关键词“Gender:Male”的索引键为“A^Gender:Male”。
同理,在一个索引键中,连接关键词和第一索引分区标识的内容称为连接符,例如,上述例子中的“^”。
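Index key generation in modes A and B can be sketched as follows. The function names are hypothetical; the connector “^” and the ordering of the parts follow the examples given for the two modes.

```python
CONNECTOR = "^"  # connector used in the examples of this description


def index_key_mode_a(partition_id, column_name, keyword):
    """Mode A: index partition identifier + column name + keyword."""
    return CONNECTOR.join([partition_id, column_name, keyword])


def index_key_mode_b(partition_id, keyword):
    """Mode B: index partition identifier + keyword."""
    return CONNECTOR.join([partition_id, keyword])


print(index_key_mode_a("A", "Address", "山东"))
print(index_key_mode_b("A", "Gender:Male"))
```

Mode B yields shorter keys, and when the keyword was extracted with mode 3 the column name is still present inside the keyword itself.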
在步骤S230中,根据该P个索引键、该第一数据的行主键和该第一数据的内部数据标识,在该第一数据对应的第一索引分区中更新第一索引信息,该第一数据的行主键用于在数据区中查找该第一数据,该第一数据的内部数据标识在该第一索引分区中是唯一的,该第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,其中,
该第一对应关系表示基于该M条数据生成的N个索引键与N组内部数据标识之间的一一对应关系,每组内部数据标识包括该M条数据中的至少一条数据的内部数据标识,该每组内部数据标识是用于标识满足对应的索引键的数据的标识,该第二对应关系表示基于该M条数据生成的M个行主键和M个内部数据标识之间的一一对应关系,该M和该N都为大于或等于1的整数。
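Specifically, the internal data identifier of the first data (for example, configured by the system for the first data in advance or in real time) and the row primary key of the first data are obtained; after the P index keys of the first data are generated, the index of the first data is built and the first index information in the first index partition corresponding to the first data is updated.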
该第一索引信息包括已存储的M条数据的索引,该M条数据是对应于该第一索引分区的数据分区中的数据;该第一对应关系是基于该M条数据生成的N个索引键与N组内部数据标识之间的对应关系,一个索引键即为对应的一组内部数据标识所标识的数据对应的索引键,数据对应的索引键即为基于数据提取的关键词生成的索引键;该第二对应关系表示基于该M条数据生成的M条行主键和该M条数据的M个内部数据标识之间的对应关系。
这样,存储设备可以基于查询条件,在该第一索引分区中,通过该第一对应关系查询满足索引键的所有数据的内部数据标识,进而,通过该第二对应关系查询对应于内部数据标识的行主键,从而,通过行主键查找对应的数据。
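It should be noted that, in the first correspondence, the correspondence between one index key and one group of internal data identifiers is the inverted index described in the embodiments of this application, and the group of internal data identifiers corresponding to one index key is the inverted index list.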
针对如何更新该第一索引信息的过程,本申请实施例提供了如下3种情况下的实现方式。
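Case 1
If the P index keys are Q index keys among the N index keys, the internal data identifier of the first data is added to each of the Q groups of internal data identifiers corresponding to the Q index keys, so as to update the first correspondence.
In this case, P = Q, and it is only necessary to add the internal data identifier of the first data to each of the Q groups of internal data identifiers corresponding to the Q index keys.
Case 2
If Q index keys among the N index keys exist in the P index keys, the internal data identifier of the first data is added to each of the Q groups of internal data identifiers corresponding to the Q index keys, and correspondences between the index keys among the P index keys other than the Q index keys and the internal data identifier of the first data are added, so as to update the first correspondence, where Q is an integer greater than or equal to 1 and less than P.
That is, in this case, not only is the internal data identifier of the first data added to each of the Q groups of internal data identifiers, but correspondences between the index keys among the P index keys other than the Q index keys and the internal data identifier of the first data also need to be added.
It should be understood that, when the correspondences between the index keys among the P index keys other than the Q index keys and the internal data identifier of the first data are added, each correspondence takes the form: one index key corresponds to the one internal data identifier of the first data.
Case 3
If none of the N index keys exists in the P index keys, correspondences between the P index keys and the internal data identifier of the first data are added to the first correspondence.
That is, if the P index keys and the N index keys have no intersection, correspondences between the P index keys and the internal data identifier of the first data are added to the first correspondence.
Similarly, when the correspondences between the P index keys and the internal data identifier of the first data are added to the first correspondence, each correspondence takes the form: one index key corresponds to the one internal data identifier of the first data.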
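The three update cases can be handled by one routine, sketched here under the assumption that the first correspondence is held as a dictionary from index key to a list of internal data identifiers and the second correspondence as a dictionary from internal data identifier to row primary key. The function name, the placeholder keys "k1" to "k4", and the row key "C0001" are hypothetical.

```python
def update_index_info(first, second, index_keys, row_key, internal_id):
    """Update both correspondences for one newly written piece of data.

    first:  dict mapping index key -> list of internal data identifiers
    second: dict mapping internal data identifier -> row primary key
    Returns which of the three cases applied, for illustration only.
    """
    hit = [key for key in index_keys if key in first]       # the Q existing keys
    miss = [key for key in index_keys if key not in first]  # keys not yet indexed

    for key in hit:
        first[key].append(internal_id)   # extend an existing group
    for key in miss:
        first[key] = [internal_id]       # one key -> the one new identifier

    second[internal_id] = row_key

    if not miss:
        return "case 1"
    if hit:
        return "case 2"
    return "case 3"


first = {"k1": [1], "k2": [1, 2]}
second = {1: "A0001", 2: "A0002"}
print(update_index_info(first, second, ["k1", "k2"], "B0001", 3))
print(update_index_info(first, second, ["k2", "k3"], "B0002", 4))
print(update_index_info(first, second, ["k4"], "C0001", 5))
```

The cases differ only in which keys already exist, so the write path does not need to branch on them explicitly.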
继续以表3中的数据为例,基于数据表中数据的分区情况,存储设备可以为数据表中的数据在对应的索引分区中配置内部数据标识,其中,针对表3中的数据的内部数据标识的情况如表4所示。
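In index partition #1, the data whose global data identifiers are A0001 and A0002 is stored in one data partition (denoted as data partition #1 for ease of distinction and understanding), where the row primary keys of the data stored in data partition #1 fall within the range [A, B), the index information of the data in data partition #1 is all stored in index partition #1, the internal data identifier of the data whose global data identifier is A0001 is 1, and the internal data identifier of the data whose global data identifier is A0002 is 2. The data whose global data identifiers are B0001 and B0002 is stored in another data partition (denoted as data partition #2 for ease of distinction and understanding), where the row primary keys of the data stored in data partition #2 fall within the range [B, C), the index information of the data in data partition #2 is also all stored in index partition #1, the internal data identifier of the data whose global data identifier is B0001 is 3, and the internal data identifier of the data whose global data identifier is B0002 is 4. For index partition #2, refer to the explanation of index partition #1; details are not repeated here.
Take the data whose global data identifiers are A0001 and D0001 as an example. As can be seen from Table 4, although the internal data identifiers of both pieces of data are 1, their index information is stored in different index partitions, and data is queried based on the index information of each index partition. Therefore, the internal data identifiers of data in different index partitions do not interfere with each other; that is, the internal data identifier of a piece of data is unique within its corresponding index partition.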
表4
Figure PCTCN2018100565-appb-000003
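The per-partition numbering summarized in Table 4 can be sketched with one counter per index partition. This is an assumption about how such identifiers might be allocated (a simple increasing counter); the class name and the partition labels "#1" and "#2" are hypothetical.

```python
from collections import defaultdict
from itertools import count


class InternalIdAllocator:
    """Allocate internal data identifiers that are unique per index partition."""

    def __init__(self):
        # One independent counter, starting at 1, for each index partition.
        self._counters = defaultdict(lambda: count(1))
        # Second correspondence per partition: internal id -> row primary key.
        self.id_to_row_key = defaultdict(dict)

    def allocate(self, index_partition, row_key):
        internal_id = next(self._counters[index_partition])
        self.id_to_row_key[index_partition][internal_id] = row_key
        return internal_id


allocator = InternalIdAllocator()
for key in ["A0001", "A0002", "B0001", "B0002"]:
    allocator.allocate("#1", key)
print(allocator.allocate("#2", "D0001"))  # same number as A0001, different partition
```

A0001 and D0001 both receive the identifier 1, yet no lookup is ambiguous because every lookup is scoped to a single index partition.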
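Assume that the first index partition is index partition #1 and the first data is the data whose global data identifier is B0002 (that is, the 4th row of data in Table 3). Then the M pieces of data are the data whose global data identifiers are A0001, A0002, and B0001, that is, M = 3, and the second correspondence is as shown in Table 5, where the global data identifier is the row primary key of the data.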
表5
Figure PCTCN2018100565-appb-000004
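Assume that the storage device needs to build indexes for the data in the columns named “Address”, “Gender”, “Education”, and “Marital Status” in Table 3, where, for the data in the column named “Address”, keywords are generated using mode 1 above and index keys using mode A, and for the data in the columns named “Gender”, “Education”, and “Marital Status”, keywords are generated using mode 3 above and index keys using mode B. The first correspondence shown in Table 6 is then generated.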
其中,表6中的第1列数据即为基于该M条数据生成的索引键;第2列数据即为对应的内部数据标识,一个索引键对应一组内部数据标识,一组内部数据标识包括至少一个内部数据标识;第3列数据即为多个Key Value,{}中内容为一个Key Value。
表6
Figure PCTCN2018100565-appb-000005
Figure PCTCN2018100565-appb-000006
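When an index needs to be built for the first data (that is, the data whose global data identifier is “B0002”), the first index information in the first index partition needs to be updated; the second correspondence in the updated first index information is shown in Table 7.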
表7
Figure PCTCN2018100565-appb-000007
更新后的该第一索引信息中的第一对应关系如表8所示。
表8
Figure PCTCN2018100565-appb-000008
需要说明的是,当对应于该第一索引分区的数据还未写入,且该第一数据是对应于该第一索引分区的第一次被写入的数据时,存储设备基于该第一数据的P个索引键、该第一数据的行主键和该第一数据的内部数据标识,开始建立该第一索引分区中的第一索引信息。
如前所述,该第一数据为数据表中的任一条数据,相应地,该第一数据对应的第一索引分区为对应于数据表的多个索引分区中的任一个索引分区。为了描述方便,本申请 实施例中以一个数据(即,第一数据)以及对应的一个索引分区为例进行说明。因此,针对数据表中任一条数据,都可以通过步骤S230构建索引以及更新索引分区中的索引信息。
因此,本申请实施例提供的用于查询数据的方法,在获取第一数据后,根据基于该第一数据的至少部分数据(例如,L列数据)生成的索引键、该第一数据的行主键和该第一数据的内部数据标识更新对应该第一数据的第一索引分区中的第一索引信息,其中,该第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,该第一对应关系表示基于该M条数据生成的N个索引键与N组内部数据标识之间的关系,该第二对应关系表示基于M条数据生成的M个行主键和M个内部数据标识之间的对应关系。由于数据的内部数据标识在数据对应的索引分区中是唯一的,因此,当多个数据文件合并为一个数据文件时,数据的行主键与内部数据标识之间的对应关系不会发生变化,从而基于数据生成的索引键和内部数据标识之间的对应关系也不会发生变化,从而,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
在本申请实施例中,对应于一个索引键的一组内部数据标识被存储在一行数据中,并且,可以采用Base+Delta的方式存储对应于一个索引键的一组内部数据标识,或者说,可以采用Base+Delta的方式存储对应于一个索引键的倒排索引列表。一组内部数据标识包括Base部分和Delta部分,下面,对Base+Delta的存储方式进行详细说明。
Base部分包括至少一个内部数据标识的集合,初始状态时刻不存在Base部分,只有经过第一次合并后才会存在。
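The Delta part includes at least one Key Value, which is a Key Value newly added on top of the Base part; each Key Value is associated with one internal data identifier or a small batch of internal data identifiers. Each Key Value corresponds to one change operation: either an add operation, which adds the internal data identifier(s) of the Key Value to the Base part, or a delete operation, which deletes the internal data identifier(s) of the Key Value from the Base part.
When there are multiple incremental Key Values, a merge mechanism may merge the internal data identifiers already stored in the Base part with the internal data identifiers of the incremental Key Values to generate a new Base part. The new Base part replaces the original Base part and the internal data identifiers of some of the Key Values in the Delta part, which helps speed up queries.
FIG. 3 is a schematic diagram of storing, in the underlying database, a group of internal data identifiers (an inverted index list) for one index key. As shown in FIG. 3, the Base part of the group includes the identifiers {1,3,4,7,9,10,20}, and the Delta part includes the internal data identifiers of 5 Key Values, where “+” denotes an add operation and “-” denotes a delete operation. The 3rd Key Value adds 3 internal data identifiers to the Base part, which is the small-batch change described above. The merge mechanism merges the Base part and the Delta part to generate the updated group of internal data identifiers (inverted index list), namely {1,3,5,7,9,10,22,24,25,26,27,28}.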
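The Base+Delta merge can be sketched as follows. The Base set and the merged result are the ones quoted for FIG. 3; the individual Delta entries are an assumption, since the figure itself is not reproduced here, and were chosen only so that five Key Values (the third adding three identifiers) lead from that Base to that result. The function name and the tuple encoding of a Key Value are hypothetical.

```python
def merge_base_delta(base, deltas):
    """Apply the Delta Key Values to the Base part, in order, and return the
    new Base part as a sorted list of internal data identifiers."""
    merged = set(base)
    for operation, ids in deltas:
        if operation == "+":
            merged.update(ids)             # add operation
        elif operation == "-":
            merged.difference_update(ids)  # delete operation
        else:
            raise ValueError(f"unknown operation: {operation!r}")
    return sorted(merged)


BASE = [1, 3, 4, 7, 9, 10, 20]
# Assumed Delta part: (operation, identifiers) per Key Value.
DELTAS = [
    ("-", [4]),
    ("+", [5]),
    ("+", [22, 24, 25]),  # a small batch of three identifiers
    ("-", [20]),
    ("+", [26, 27, 28]),
]

print(merge_base_delta(BASE, DELTAS))
```

Writes only append small Key Values to the Delta part, and the cost of rebuilding the full list is paid once at merge time.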
同时,为了更清楚地描述方案,继续以表3中的数据为例。当继续向表3中添加新的数据时,表3中的数据量会越来越多,一般来说,每个索引键对应的内部数据标识的个数会越来越多(即,同一行的Key Value的数据会越来越多),系统会将多个Key Value 进行自动合并。以索引键“A^Marital Status:Married”为例,表9所示为未合并前的索引键和内部数据标识之间的对应关系,表10所示为合并后的索引键和内部数据标识之间的对应关系。
表9
Figure PCTCN2018100565-appb-000009
表10
Figure PCTCN2018100565-appb-000010
这样,以Base+Delta的存储方式存储Key value数据,由于Delta部分的数据是写入磁盘的,可以使得新写入至Delta部分的数据不会影响已经存储的Base部分的数据,可以有效地提高数据的写入速度;并且,通过合并机制将Base与Delta的数据进行合并,可以有效地提高数据的读取速度,进而可以有效地减少了查询数据的时延,提高了查询效率。
可选地,当Base的数据大小达到第一大小时,则合并Base和Delta,合并后的Base为合并后的文件的存储路径,合并后的Delta中存储有合并后的文件。
为了加快处理速度,本申请实施例也提供了一种可选的实现方式:第一索引信息存储在第一存储区中,该M条数据存储在第二存储区中,该第一存储区与该第二存储区是隔离的。
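In other words, the first storage area stores the index information corresponding to the data table, and the second storage area stores the data in the data table. The first storage area is isolated from the second storage area; that is, the index partitions are isolated from the data partitions. Because the data in the data table and the corresponding index partitions are stored in isolation, a change in the data partitions of the data table does not affect the content of the index information, and rebuilding the index information does not affect the data in the data table, which effectively improves data processing speed.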
因此,本申请实施例提供的用于查询数据的方法,一方面,在获取数据(例如,第一数据)后,根据基于该第一数据的至少部分数据(例如,L列数据)生成的索引键、该第一数据的行主键和该第一数据的内部数据标识更新对应该第一数据的第一索引分区中的第一索引信息,其中,该第一索引信息包括针对已存储的M条数据的第一对应关系和 第二对应关系,该第一对应关系表示基于该M条数据生成的N个索引键与N组内部数据标识之间的关系,该第二对应关系表示基于M条数据生成的M个行主键和M个内部数据标识之间的对应关系。由于数据的内部数据标识在数据对应的索引分区中是唯一的,因此,当多个数据文件合并为一个数据文件时,数据的行主键与内部数据标识之间的对应关系不会发生变化,从而基于数据生成的索引键和内部数据标识之间的对应关系也不会发生变化,从而,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
另一方面,通过从第一数据的L列数据中的任一列数据(例如,第i列数据)中的至少一个分词、该至少一个分词和该第一数据的行主键、该至少一个分词和该第i列数据的列名中的任意一项中,提取对应于该至少一个分词的至少一个关键词,可以有效地提高系统提取关键词的灵活性,进而提高数据的查询效率。
再一方面,通过将存储索引信息的第一存储区和存储数据的第二存储区隔离,可以使得数据表的数据分区变化并不会影响索引信息的内容,并且,在重建索引信息时也不会影响数据表中的数据,有效地提高了数据的处理速度。
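The process of building indexes for data and updating the index information in real time, as part of the method for querying data in the embodiments of this application, has been described in detail above with reference to FIG. 2 and FIG. 3. Based on this index information, the storage device can query data according to a query condition sent by a user.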
因此,本申请实施例还提供了一种用于查询数据的方法300,图4是根据本申请实施例的用于查询数据的方法300的示意性流程图。同理,该方法300的执行主体可以为用于查询数据的设备中的存储设备,也可以为存储设备内的处理器。
在步骤S310中,获取查询条件。
具体而言,存储服务器可以接收终端设备的客户端发送的查询条件,该查询条件包括X个关键词,X为大于或等于1的整数。当该查询条件包括多个(即,X大于1)关键词时,该查询条件还包括用于连接相邻两个关键词之间的逻辑运算符,其中,逻辑运算符包括“与”、“非”和“或”。例如,“与”可以表示为“&&”,“非”可以表示为“!”,“或”可以表示为“||”。
例如,查询条件可以为:Address:龙岗&&Gender:Male,即,表示需要查询的对象必须同时满足查询条件中的两个关键词。
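Splitting a query condition into keywords and logical operators can be sketched as follows. This is a simplified sketch: the function name is hypothetical, and it assumes that keywords never contain the characters “&&”, “||” or “!”, and that the condition has no parentheses.

```python
import re

# The three operators named in the description: AND "&&", OR "||", NOT "!".
_OPERATOR = re.compile(r"(&&|\|\||!)")


def tokenize_condition(condition):
    """Split a query condition into keywords and operators, in order."""
    parts = _OPERATOR.split(condition)  # the capturing group keeps operators
    return [part.strip() for part in parts if part.strip()]


print(tokenize_condition("Address:龙岗&&Gender:Male"))
```

The tokens that are not operators are the X keywords of the query condition.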
在步骤S320中,根据S个索引分区中每个索引分区的索引信息中的第一对应关系查询满足该查询条件的目标数据的内部数据标识,该内部数据标识在该目标数据对应的索引分区中是唯一的,该S个索引分区为根据该查询条件确定的索引分区,其中,该第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,每组内部数据标识包括该多条数据中的至少一条数据的内部数据标识,该每组内部数据标识是用于标识具备对应的索引键的数据的标识。
在本申请实施例中,该查询条件中携带用于指示基于该查询条件查询的数据表的元数据的指示信息,其中,该数据表的元数据包括用于指示存储该数据表的索引信息的索引分区的信息。这样,存储设备可以基于该查询条件确定需要查询的S个索引分区,该S个索引分区与基于该查询条件查询的数据表对应。
在该S个索引分区中,每个索引分区存储对应的数据的索引信息,每个索引分区的 索引信息包括第一对应关系和第二对应关系,其中,该第一对应关系表示基于对应于该每个索引分区的多条数据生成的多个索引键和多组内部数据标识之间的一一对应关系,该第二对应关系表示基于对应于该每个索引分区的多条数据生成的多个行主键和该多条数据的多个内部数据标识之间的一一对应关系。
具体关于每个索引分区的第一对应关系和第二对应关系的描述可以参考上文针对该第一索引分区中的第一索引信息中的第一对应关系和第二对应关系的描述,此处为了简洁,不再赘述。
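In this way, after the query condition is obtained, the word extractor extracts the X keywords from the query condition. In each of the S index partitions, the X index keys corresponding to the X keywords are looked up in the first correspondence in the index information of that index partition; after the X index keys are found, the X groups of internal data identifiers corresponding to the X index keys are determined, and the X groups are combined according to the logical operators of the query condition, yielding the internal data identifiers of the target data that satisfies the query condition.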
以上文所述的表8所示的第一索引分区中的第一对应关系为例,对步骤S320中在一个索引分区的查询过程进行简单说明。
假设,该查询条件为{Address:龙岗&&Gender:Male},分解后的关键词包括:“Address:龙岗”和“Gender:Male”。从表8中可以看出,对应关键词“Address:龙岗”的索引键为“A^Address^龙岗”,对应索引键“A^Address^龙岗”的内部数据标识为{1};对应关键词“Gender:Male”的索引键为“A^Gender:Male”,对应索引键“A^Gender:Male”的内部数据标识为{1,2},那么,同时满足这两个关键词的内部数据标识为{1},即,该满足该查询条件的目标数据的内部数据标识为{1}。
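The lookup in this example can be sketched as a set intersection over the groups of internal data identifiers. Only the two entries of the first correspondence that the example quotes are included; the function name is hypothetical, and only the “&&” operator is handled.

```python
def ids_matching_all(first_correspondence, index_keys):
    """Return the internal data identifiers present under every index key."""
    result = None
    for key in index_keys:
        ids = set(first_correspondence.get(key, ()))
        result = ids if result is None else result & ids
    return result if result is not None else set()


# Excerpt of the first correspondence, as quoted in the example above.
FIRST_CORRESPONDENCE = {
    "A^Address^龙岗": [1],
    "A^Gender:Male": [1, 2],
}

print(ids_matching_all(FIRST_CORRESPONDENCE, ["A^Address^龙岗", "A^Gender:Male"]))  # {1}
```

An “||” condition would use a union instead, and “!” a difference against all identifiers of the partition.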
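Then, in S330, the row primary key of the target data is looked up according to the internal data identifier of the target data and the second correspondence in the index information of each index partition, and a query result including the target data is generated according to the row primary key of the target data.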
即,在该S个索引分区中的每个索引分区中,在该第二对应关系中查找对应于该目标数据的内部数据标识的行主键,进而在对应于数据表的数据区中查询目标数据,并生成查询结果。
其中,如前所述,该查询条件可以携带用于指示基于该查询条件查询的数据表的元数据的指示信息,其中,该数据表的元数据还包括用于指示存储该数据表的数据区的信息。这样,存储设备可以基于该查询条件确定对应于数据表的数据区,进而在数据区中查询目标数据。
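Continuing with the example in step S320, when the internal data identifier of the target data is determined to be {1}, the corresponding row primary key can be determined to be A0001 from the second correspondence in Table 7 above, and the data content of A0001 is then looked up in the data area.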
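Step S330 can be sketched as two dictionary lookups: internal data identifier to row primary key, then row primary key to the stored row. The identifier-to-row-key mapping follows the numbering described for index partition #1; the content of the data area and the function name are hypothetical.

```python
def resolve_rows(internal_ids, id_to_row_key, data_area):
    """Map internal data identifiers to row primary keys via the second
    correspondence, then fetch each row from the data area."""
    rows = []
    for internal_id in sorted(internal_ids):
        row_key = id_to_row_key[internal_id]
        rows.append((row_key, data_area[row_key]))
    return rows


# Second correspondence of index partition #1 after B0002 has been indexed.
ID_TO_ROW_KEY = {1: "A0001", 2: "A0002", 3: "B0001", 4: "B0002"}
# Hypothetical data area content, for illustration only.
DATA_AREA = {"A0001": {"Address": "龙岗", "Gender": "Male"}}

print(resolve_rows({1}, ID_TO_ROW_KEY, DATA_AREA))
```

Only the row primary key crosses from the index partition to the data area, which is why changes on the data side do not invalidate the index information.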
在S340中,反馈该查询结果。
实际实现过程中,为了提高处理速度,可以通过构建位图索引实现查询过程。
这种情况下,索引信息可以包括位图索引以及索引位置,位图索引包括索引键和位图向量之间的对应关系,位图向量包括用于表示各条数据是否满足对应的索引键的索引,索引位置包括各条数据的索引在位图向量中的位置。
其中,索引位置可以类比于表示多个行主键和多个内部数据标识之间的第二对应关系,位图索引可以类比于表示多个索引键与多组内部数据标识之间的第一对应关系。
以表7中的第二对应关系为例,表11为对应于表7的第二对应关系的索引位置,表12为对应于表8的第一对应关系的位图向量。
表11
Figure PCTCN2018100565-appb-000011
表12
Figure PCTCN2018100565-appb-000012
Figure PCTCN2018100565-appb-000013
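Continuing with the query condition {Address:龙岗&&Gender:Male} as an example, the keywords obtained by decomposition are “Address:龙岗” and “Gender:Male”. As can be seen from Table 12, the index key corresponding to the keyword “Address:龙岗” is “A^Address^龙岗”, whose bitmap vector is {1000}; the index key corresponding to the keyword “Gender:Male” is “A^Gender:Male”, whose bitmap vector is {1100}. A logical AND of the bitmap vectors {1000} and {1100} shows that the target data satisfying the query condition is the data at the 1st position of the bitmap vector; the index position in Table 11 gives its row primary key as A0001; the data content of A0001 is then looked up in the data area.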
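The bitmap variant of the same query can be sketched as follows. Bitmap vectors are written as strings of “0” and “1” with the 1st position on the left, as in the example; the function names are hypothetical, and the position-to-row-key mapping is assumed to follow the numbering used for index partition #1.

```python
def bitmap_and(*vectors):
    """Bitwise AND of equally long bitmap vectors given as '0'/'1' strings."""
    width = len(vectors[0])
    if any(len(vector) != width for vector in vectors):
        raise ValueError("bitmap vectors must have the same length")
    result = int(vectors[0], 2)
    for vector in vectors[1:]:
        result &= int(vector, 2)
    return format(result, f"0{width}b")


def set_positions(vector):
    """Return the 1-based positions whose bit is set."""
    return [index + 1 for index, bit in enumerate(vector) if bit == "1"]


# Index positions: position in the bitmap vector -> row primary key (assumed).
POSITION_TO_ROW_KEY = {1: "A0001", 2: "A0002", 3: "B0001", 4: "B0002"}

matched = bitmap_and("1000", "1100")
print(matched, [POSITION_TO_ROW_KEY[p] for p in set_positions(matched)])
```

Combining two index keys then costs one integer AND per pair of vectors, regardless of how many identifiers each group holds.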
因此,本申请实施例提供的用于查询数据的方法,由于构建的索引分区的索引信息包括第一对应关系和第二对应关系,其中,第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,第二对应关系表示基于多条数据生成的多个行主键和多个内部数据标识之间的一一对应关系,并且,数据的内部数据标识在该数据对应的索引分区中是唯一的,这样,当多个数据文件被合并为一个数据文件时,该第二对应关系不会发生变化,进而,该第一对应关系也不会发生变化,从而,当查询满足查询条件的数据时,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
以上结合图2至图4详细描述了本申请实施例中用于查询数据的方法,下面,结合图5至图8详细描述根据本申请实施例的用于查询数据的装置,方法实施例所描述的技术特征同样适用于以下装置实施例。此外,本申请实施例中的用于查询数据的装置可以部署在分布式存储系统中的至少一个节点上。
图5所示为根据本申请实施例的用于查询数据的装置的示意性框图。如图5所示,该装置包括处理单元410和存储单元420,其中,该存储单元420用于存储数据和索引信息,该处理单元410用于:
获取第一数据;
根据该第一数据中的L列数据生成P个索引键,该L为大于或等于1的整数,该P为大于1的整数;
根据该P个索引键、该第一数据的行主键和该第一数据的内部数据标识,在该第一数据对应的第一索引分区中更新第一索引信息,该第一数据的行主键用于在数据区中查找该第一数据,该第一数据的内部数据标识在该第一索引分区中是唯一的,该第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,其中,
该第一对应关系表示基于该M条数据生成的N个索引键和N组内部数据标识之间的一一对应关系,每组内部数据标识包括该M条数据中的至少一条数据的内部数据标识,该每组内部数据标识是用于标识满足对应的索引键的数据的标识,该第二对应关系表示基于该M条数据生成的M个行主键和该M条数据的M个内部数据标识之间的一一对应关系,该M和该N都为大于或等于1的整数。
因此,本申请实施例提供的用于查询数据的装置,在获取数据(例如,第一数据)后,根据基于该第一数据的至少部分数据(例如,L列数据)生成的索引键、该第一数据的行主键和该第一数据的内部数据标识更新对应该第一数据的第一索引分区中的第一索引信息,其中,该第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,该第一对应关系表示基于该M条数据生成的N个索引键与N组内部数据标识之间的关系,该第二对应关系表示基于M条数据生成的M个行主键和M个内部数据标识之间的对应关系。由于数据的内部数据标识在数据对应的索引分区中是唯一的,因此,当多个数据文件合并为一个数据文件时,数据的行主键与内部数据标识之间的对应关系不会发生变化,从而基于数据生成的索引键和内部数据标识之间的对应关系也不会发生变化,从而,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
可选地,该处理单元410具体用于:
在[1,L]范围内对i遍历取值,通过以下步骤生成该P个索引键:
从以下任意一项中,提取至少一个关键词,所述任意一项包括:所述第一数据的第i列数据中的至少一个分词,或,所述第一数据的行主键,或,所述第一数据的第i列数据的列名,其中,所述至少一个关键词与所述第i列数据中的至少一个分词一一对应,
若该任意一项包括该第一数据的第i列数据中的至少一个分词,则每个关键词包括对应于该每个关键词的分词,或,
若该任意一项包括该第一数据的第i列数据中的至少一个分词和该第一数据的行主键,则每个关键词包括对应于该每个关键词的分词和该第一数据的行主键,或,
若该任意一项包括该第一数据的第i列数据中的至少一个分词和该第一数据的第i列数据的列名,则每个关键词包括对应于该每个关键词的分词和该第一数据的第i列数据的列名;
根据该至少一个关键词中的每个关键词生成对应于该每个关键词的索引键。
因此,本申请实施例提供的用于查询数据的装置,通过从第一数据的L列数据中的任一列数据(例如,第i列数据)中的至少一个分词、该至少一个分词和该第一数据的行主键、该至少一个分词和该第i列数据的列名中的任意一项中,提取对应于该至少一个分词的至少一个关键词,可以有效地提高系统提取关键词的灵活性,进而提高数据的查询效率。
可选地,该处理单元410具体用于:
由该每个关键词、该第一数据的第i列数据的列名和用于标识该第一索引分区的第一索引分区标识生成对应于该每个关键词的索引键。
可选地,该处理单元410具体用于:
由该每个关键词和用于标识该第一索引分区的第一索引分区标识生成对应于该每个关键词的索引键。
可选地,该第一索引信息存储在第一存储区中,该M条数据存储在第二存储区中,该第一存储区与该第二存储区是隔离的。
因此,本申请实施例提供的用于查询数据的装置,通过将存储索引信息的第一存储区和存储数据的第二存储区隔离,可以使得数据表的数据分区变化并不会影响索引信息 的内容,并且,在重建索引信息时也不会影响数据表中的数据,有效地提高了数据的处理速度。
该装置400可以对应(例如,可以配置于或本身即为)上述方法200中描述的用于查询数据的设备(例如,存储设备),并且,该装置400中各模块或单元分别用于执行上述方法200中用于查询数据的设备所执行的各动作或处理过程,这里,为了避免赘述,省略其详细说明。
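In the embodiments of this application, the apparatus 400 may be a device for querying data (for example, a storage device). FIG. 7 is a schematic structural diagram of a device 600 for querying data according to an embodiment of this application. As shown in FIG. 7, the device 600 for querying data may include a processor 610 and a memory 620, which are communicatively connected. The memory 620 may be configured to store instructions, and the processor 610 is configured to execute the instructions stored in the memory 620.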
此种情况下,图5所示的装置400中的处理单元410可以对应图7所示的用于查询数据的设备600中的处理器610,图5所示的装置400中的存储单元420可以对应图7所示的用于查询数据的设备600中的存储器620。
在本申请实施例中,该装置400可以为安装在用于查询数据的设备(例如,存储设备)中的芯片(或者说,芯片系统),此情况下,该装置400可以包括:处理器和存储器,存储器与处理器通信连接。该存储器可以用于存储指令,该处理器用于执行该存储器存储的指令。
此种情况下,图5所示的装置400中的处理单元410可以对应该处理器,图5所示的装置400中的存储单元420可以对应该存储器。
因此,本申请实施例提供的用于查询数据的装置,一方面,在获取数据(例如,第一数据)后,根据基于该第一数据的至少部分数据(例如,L列数据)生成的索引键、该第一数据的行主键和该第一数据的内部数据标识更新对应该第一数据的第一索引分区中的第一索引信息,其中,该第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,该第一对应关系表示基于该M条数据生成的N个索引键与N组内部数据标识之间的关系,该第二对应关系表示基于M条数据生成的M个行主键和M个内部数据标识之间的对应关系。由于数据的内部数据标识在数据对应的索引分区中是唯一的,因此,当多个数据文件合并为一个数据文件时,数据的行主键与内部数据标识之间的对应关系不会发生变化,从而基于数据生成的索引键和内部数据标识之间的对应关系也不会发生变化,从而,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
另一方面,通过从第一数据的L列数据中的任一列数据(例如,第i列数据)中的至少一个分词、该至少一个分词和该第一数据的行主键、该至少一个分词和该第i列数据的列名中的任意一项中,提取对应于该至少一个分词的至少一个关键词,可以有效地提高系统提取关键词的灵活性,进而提高数据的查询效率。
再一方面,通过将存储索引信息的第一存储区和存储数据的第二存储区隔离,可以使得数据表的数据分区变化并不会影响索引信息的内容,并且,在重建索引信息时也不会影响数据表中的数据,有效地提高了数据的处理速度。
图6所示为根据本申请实施例的用于查询数据的装置的示意性框图。如图6所示, 该装置包括处理单元510和存储单元520,其中,该存储单元520用于存储数据和索引信息,该处理单元510用于:
获取查询条件;
根据S个索引分区中每个索引分区的索引信息中的第一对应关系查询满足该查询条件的目标数据的内部数据标识,该内部数据标识在该目标数据对应的索引分区中是唯一的,该S个索引分区为根据该查询条件确定的索引分区,其中,该第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,每组内部数据标识包括该多条数据中的至少一条数据的内部数据标识,该每组内部数据标识是用于标识满足对应的索引键的数据的标识;
根据该目标数据的内部数据标识和该每个索引分区的索引信息中的第二对应关系,查询满足该目标数据的行主键,并根据该目标数据的行主键生成包括该目标数据的查询结果,其中,该第二对应关系表示基于该多条数据生成的多个行主键和该多条数据的多个内部数据标识之间的一一对应关系,该行主键用于在数据区中查找数据;
反馈该查询结果。
因此,本申请实施例提供的用于查询数据的装置,由于构建的索引分区的索引信息包括第一对应关系和第二对应关系,其中,第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,第二对应关系表示基于多条数据生成的多个行主键和多个内部数据标识之间的一一对应关系,并且,数据的内部数据标识在该数据对应的索引分区中是唯一的,这样,当多个数据文件被合并为一个数据文件时,该第二对应关系不会发生变化,进而,该第一对应关系也不会发生变化,从而,当查询满足查询条件的数据时,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
可选地,该S个索引分区的索引信息存储在第一存储区中,该S个索引分区对应的数据存储在第二存储区中,该第一存储区与该第二存储区是隔离的。
因此,本申请实施例提供的用于查询数据的装置,通过将存储索引信息的第一存储区和存储数据的第二存储区隔离,可以使得数据表的数据分区变化并不会影响索引信息的内容,并且,在重建索引信息时也不会影响数据表中的数据,有效地提高了数据的处理速度。
该装置500可以对应(例如,可以配置于或本身即为)上述方法300中描述的用于查询数据的设备(例如,存储设备),并且,该装置500中各模块或单元分别用于执行上述方法300中用于查询数据的设备所执行的各动作或处理过程,这里,为了避免赘述,省略其详细说明。
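In the embodiments of this application, the apparatus 500 may be a device for querying data (for example, a storage device). FIG. 8 is a schematic structural diagram of a device 700 for querying data according to an embodiment of this application. As shown in FIG. 8, the device 700 for querying data may include a processor 710 and a memory 720, which are communicatively connected. The memory 720 may be configured to store instructions, and the processor 710 is configured to execute the instructions stored in the memory 720.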
此种情况下,图6所示的装置500中的处理单元510可以对应图8所示的用于查询数据的设备700中的处理器710,图6所示的装置500中的存储单元520可以对应图8 所示的用于查询数据的设备700中的存储器720。
在本申请实施例中,该装置500可以为安装在用于查询数据的设备(例如,存储设备)中的芯片(或者说,芯片系统),此情况下,该装置500可以包括:处理器和存储器,存储器与处理器通信连接。该存储器可以用于存储指令,该处理器用于执行该存储器存储的指令。
此种情况下,图6所示的装置500中的处理单元510可以对应该处理器,图6所示的装置500中的存储单元520可以对应该存储器。
因此,本申请实施例提供的用于查询数据的装置,一方面,由于构建的索引分区的索引信息包括第一对应关系和第二对应关系,其中,第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,第二对应关系表示基于多条数据生成的多个行主键和多个内部数据标识之间的一一对应关系,并且,数据的内部数据标识在该数据对应的索引分区中是唯一的,这样,当多个数据文件被合并为一个数据文件时,该第二对应关系不会发生变化,进而,该第一对应关系也不会发生变化,从而,当查询满足查询条件的数据时,可以快速地从原先缓存在内存中的索引信息中读取数据,而不需要从底层的数据库中的索引信息中重新读取数据,提高了数据的查询效率。
再一方面,通过将存储索引信息的第一存储区和存储数据的第二存储区隔离,可以使得数据表的数据分区变化并不会影响索引信息的内容,并且,在重建索引信息时也不会影响数据表中的数据,有效地提高了数据的处理速度。
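It should be noted that the foregoing method embodiments of this application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed with reference to the embodiments of this application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware.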
可以理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、 同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。应注意,本文描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (18)

  1. 一种用于查询数据的方法,其特征在于,所述方法包括:
    获取第一数据;
    根据所述第一数据中的L列数据生成P个索引键,所述L为大于或等于1的整数,所述P为大于1的整数;
    根据所述P个索引键、所述第一数据的行主键和所述第一数据的内部数据标识,在所述第一数据对应的第一索引分区中更新第一索引信息,所述第一数据的行主键用于在数据区中查找所述第一数据,所述第一数据的内部数据标识在所述第一索引分区中是唯一的,所述第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,其中,
    所述第一对应关系表示基于所述M条数据生成的N个索引键与N组内部数据标识之间的一一对应关系,每组内部数据标识包括所述M条数据中的至少一条数据的内部数据标识,所述每组内部数据标识是用于标识满足对应的索引键的数据的标识,所述第二对应关系表示基于所述M条数据生成的M个行主键和所述M条数据的M个内部数据标识之间的一一对应关系,所述M和所述N都为大于或等于1的整数。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述第一数据中的L列数据生成P个索引键Key,包括:
    在[1,L]范围内对i遍历取值,通过以下步骤生成所述P个索引键:
    从以下任意一项中,提取至少一个关键词,所述任意一项包括:所述第一数据的第i列数据中的至少一个分词,或,所述第一数据的行主键,或,所述第一数据的第i列数据的列名,其中,所述至少一个关键词与所述第i列数据中的至少一个分词一一对应,
    若所述任意一项包括所述第一数据的第i列数据中的至少一个分词,则每个关键词包括对应于所述每个关键词的分词,或,
    若所述任意一项包括所述第一数据的第i列数据中的至少一个分词和所述第一数据的行主键,则每个关键词包括对应于所述每个关键词的分词和所述第一数据的行主键中的关键词,或,
    若所述任意一项包括所述第一数据的第i列数据中的至少一个分词和所述第一数据的第i列数据的列名,则每个关键词包括对应于所述每个关键词的分词和所述第一数据的第i列数据的列名;
    根据所述至少一个关键词中的每个关键词生成对应于所述每个关键词的索引键。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述至少一个关键词中的每个关键词生成对应于所述每个关键词的索引键,包括:
    由所述每个关键词、所述第一数据的第i列数据的列名和用于标识所述第一索引分区的第一索引分区标识生成对应于所述每个关键词的索引键。
  4. 根据权利要求2所述的方法,其特征在于,所述根据所述至少一个关键词中的每个关键词生成对应于所述每个关键词的索引键,包括:
    由所述每个关键词和用于标识所述第一索引分区的第一索引分区标识生成对应于所述每个关键词的索引键。
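  5. The method according to any one of claims 1 to 4, wherein the first index information is stored in a first storage area, the M pieces of data are stored in a second storage area, and the first storage area is isolated from the second storage area.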
  6. 一种用于查询数据的方法,其特征在于,所述方法包括:
    获取查询条件;
    根据S个索引分区中每个索引分区的索引信息中的第一对应关系查询满足所述查询条件的目标数据的内部数据标识,所述内部数据标识在所述目标数据对应的索引分区中是唯一的,所述S个索引分区为根据所述查询条件确定的索引分区,其中,所述第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,每组内部数据标识包括所述多条数据中的至少一条数据的内部数据标识,所述每组内部数据标识是用于标识满足对应的索引键的数据的标识;
    根据所述目标数据的内部数据标识和所述每个索引分区的索引信息中的第二对应关系,查询满足所述目标数据的行主键,并根据所述目标数据的行主键生成包括所述目标数据的查询结果,其中,所述第二对应关系表示基于所述多条数据生成的多个行主键和所述多条数据的多个内部数据标识之间的一一对应关系,所述行主键用于在数据区中查找数据;
    反馈所述查询结果。
  7. 根据权利要求6所述的方法,其特征在于,所述S个索引分区的索引信息存储在第一存储区中,所述S个索引分区对应的数据存储在第二存储区中,所述第一存储区与所述第二存储区是隔离的。
  8. 一种用于查询数据的装置,其特征在于,所述装置包括处理单元,所述处理单元用于:
    获取第一数据;
    根据所述第一数据中的L列数据生成P个索引键,所述L为大于或等于1的整数,所述P为大于1的整数;
    根据所述P个索引键、所述第一数据的行主键和所述第一数据的内部数据标识,在所述第一数据对应的第一索引分区中更新第一索引信息,所述第一数据的行主键用于在数据区中查找所述第一数据,所述第一数据的内部数据标识在所述第一索引分区中是唯一的,所述第一索引信息包括针对已存储的M条数据的第一对应关系和第二对应关系,其中,
    所述第一对应关系表示基于所述M条数据生成的N个索引键和N组内部数据标识之间的一一对应关系,每组内部数据标识包括所述M条数据中的至少一条数据的内部数据标识,所述每组内部数据标识是用于标识满足对应的索引键的数据的标识,所述第二对应关系表示基于所述M条数据生成的M个行主键和所述M条数据的M个内部数据标识之间的一一对应关系,所述M和所述N都为大于或等于1的整数。
  9. 根据权利要求8所述的装置,其特征在于,所述处理单元具体用于:
    在[1,L]范围内对i遍历取值,通过以下步骤生成所述P个索引键:
    从以下任意一项中,提取至少一个关键词,所述任意一项包括:所述第一数据的第i列数据中的至少一个分词,或,所述第一数据的行主键,或,所述第一数据的第i列数据的列名,其中,所述至少一个关键词与所述第i列数据中的至少一个分词一一对应,
    若所述任意一项包括所述第一数据的第i列数据中的至少一个分词,则每个关键词包括对应于所述每个关键词的分词,或,
    若所述任意一项包括所述第一数据的第i列数据中的至少一个分词和所述第一数据的行主键,则每个关键词包括对应于所述每个关键词的分词和所述第一数据的行主键中的关键词,或,
    若所述任意一项包括所述第一数据的第i列数据中的至少一个分词和所述第一数据的第i列数据的列名,则每个关键词包括对应于所述每个关键词的分词和所述第一数据的第i列数据的列名;
    根据所述至少一个关键词中的每个关键词生成对应于所述每个关键词的索引键。
  10. 根据权利要求9所述的装置,其特征在于,所述处理单元具体用于:
    由所述每个关键词、所述第一数据的第i列数据的列名和用于标识所述第一索引分区的第一索引分区标识生成对应于所述每个关键词的索引键。
  11. 根据权利要求9所述的装置,其特征在于,所述处理单元具体用于:
    由所述每个关键词和用于标识所述第一索引分区的第一索引分区标识生成对应于所述每个关键词的索引键。
  12. 根据权利要求8至11中任一项所述的装置,其特征在于,所述第一索引信息存储在第一存储区中,所述M条数据存储在第二存储区中,所述第一存储区与所述第二存储区是隔离的。
  13. 一种用于查询数据的装置,其特征在于,所述装置包括处理单元,所述处理单元用于:
    获取查询条件;
    根据S个索引分区中每个索引分区的索引信息中的第一对应关系查询满足所述查询条件的目标数据的内部数据标识,所述内部数据标识在所述目标数据对应的索引分区中是唯一的,所述S个索引分区为根据所述查询条件确定的索引分区,其中,所述第一对应关系表示基于多条数据生成的多个索引键与多组内部数据标识之间的一一对应关系,每组内部数据标识包括所述多条数据中的至少一条数据的内部数据标识,所述每组内部数据标识是用于标识满足对应的索引键的数据的标识;
    根据所述目标数据的内部数据标识和所述每个索引分区的索引信息中的第二对应关系,查询满足所述目标数据的行主键,并根据所述目标数据的行主键生成包括所述目标数据的查询结果,其中,所述第二对应关系表示基于所述多条数据生成的多个行主键和所述多条数据的多个内部数据标识之间的一一对应关系,所述行主键用于在数据区中查找数据;
    反馈所述查询结果。
  14. 根据权利要求13所述的装置,其特征在于,所述S个索引分区的索引信息存储在第一存储区中,所述S个索引分区对应的数据存储在第二存储区中,所述第一存储区与所述第二存储区是隔离的。
  15. 一种用于查询数据的设备,其特征在于,所述设备包括:
    存储器,用于存储指令;
    处理器,用于执行所述存储器存储的指令,并且,当所述处理器执行所述存储器存储的指令时,使得所述设备执行如权利要求1至5中任一项所述的方法。
  16. 一种用于查询数据的设备,其特征在于,所述设备包括:
    存储器,用于存储指令;
    处理器,用于执行所述存储器存储的指令,并且,当所述处理器执行所述存储器存储的指令时,使得所述设备执行如权利要求6或7所述的方法。
  17. 一种计算机存储介质,其特征在于,包括计算机执行指令,当计算机的处理器执行所述计算机执行指令时,所述计算机执行权利要求1至5中任一项所述的方法。
  18. 一种计算机存储介质,其特征在于,包括计算机执行指令,当计算机的处理器执行所述计算机执行指令时,所述计算机执行权利要求6或7所述的方法。
PCT/CN2018/100565 2018-02-28 2018-08-15 一种用于查询数据的方法 WO2019165763A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810167679.1 2018-02-28
CN201810167679.1A CN108427736B (zh) 2018-02-28 2018-02-28 一种用于查询数据的方法

Publications (1)

Publication Number Publication Date
WO2019165763A1 true WO2019165763A1 (zh) 2019-09-06

Family

ID=63157297

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100565 WO2019165763A1 (zh) 2018-02-28 2018-08-15 一种用于查询数据的方法

Country Status (2)

Country Link
CN (1) CN108427736B (zh)
WO (1) WO2019165763A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299101B (zh) * 2018-10-15 2020-12-01 上海达梦数据库有限公司 数据检索方法、装置、服务器和存储介质
CN109299106B (zh) * 2018-10-31 2020-09-22 中国联合网络通信集团有限公司 数据查询方法和装置
CN112131226A (zh) * 2020-09-28 2020-12-25 联想(北京)有限公司 索引获得方法、数据查询方法、及相关装置
CN114385620A (zh) * 2020-10-19 2022-04-22 腾讯科技(深圳)有限公司 数据处理方法、装置、设备及可读存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101120337A (zh) * 2004-04-02 2008-02-06 易享信息技术(上海)有限公司 多租户数据库系统中的自定义实体和字段
CN104794123A (zh) * 2014-01-20 2015-07-22 阿里巴巴集团控股有限公司 一种为半结构化数据构建NoSQL数据库索引的方法及装置
CN105354255A (zh) * 2015-10-21 2016-02-24 华为技术有限公司 数据查询方法和装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609154A (zh) * 2017-09-23 2018-01-19 浪潮软件集团有限公司 一种多源异构数据的处理方法及装置

Also Published As

Publication number Publication date
CN108427736B (zh) 2020-01-17
CN108427736A (zh) 2018-08-21

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18907859

Country of ref document: EP

Kind code of ref document: A1