CN113626446B

CN113626446B - Data storage and search method, device, electronic equipment and medium

Info

Publication number: CN113626446B
Application number: CN202111174094.0A
Authority: CN
Inventors: 单强
Original assignee: Alibaba China Co Ltd; Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2021-10-09
Filing date: 2021-10-09
Publication date: 2022-09-20
Anticipated expiration: 2041-10-09
Also published as: CN113626446A

Abstract

The embodiment of the disclosure discloses a data storage and search method, a data storage and search device, electronic equipment and a medium. The data storage method comprises the following steps: acquiring data to be stored through a virtual machine; acquiring an index of the data to be stored through the virtual machine; and storing the data to be stored and the index in an off-heap storage space of a host running the virtual machine through the virtual machine.

Description

Data storage and search method, device, electronic equipment and medium

Technical Field

The present disclosure relates to the field of database technologies, and in particular, to a data storage and search method, apparatus, electronic device, and medium.

Background

With the rapid development of big data technology, the amount of data to be stored is increasing rapidly, and the real-time requirements on data storage and lookup are becoming ever higher. However, existing data storage and lookup technologies struggle to store and look up large data volumes quickly, which has become a bottleneck for the further adoption of big data technology.

Disclosure of Invention

In order to solve the problems in the related art, embodiments of the present disclosure provide a data storage and search method, apparatus, electronic device, and medium.

In a first aspect, an embodiment of the present disclosure provides a data storage method.

Specifically, the data storage method includes:

acquiring data to be stored through a virtual machine;

acquiring an index of the data to be stored through the virtual machine;

and storing the data to be stored and the index in an off-heap storage space of a host running the virtual machine through the virtual machine.

With reference to the first aspect, in a first implementation manner of the first aspect, the index includes a range index and/or a hash index.

With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, if the index is a range index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine includes:

if the index contains only one column of index data, storing the index in an off-heap skip list, wherein the keys of the off-heap skip list are the index data of that column, and the values of the off-heap skip list are the identification information of the corresponding data to be stored;

if the index contains a plurality of columns of index data, storing the index in a multi-level off-heap skip list, wherein each level of the off-heap skip list stores the index data of a corresponding column, and wherein: the key of each level of the off-heap skip list except the last level is the index data of the corresponding column, and the value is the index data of the next column corresponding to that key; and the key of the last-level off-heap skip list is the index data of the corresponding column, and the value is the identification information of the corresponding data to be stored.

With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, if the index is a hash index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine includes:

generating a hash value according to the indexed columns;

and taking the hash value as a key, taking the identification information of the data to be stored corresponding to the hash value as a value, and storing the value in an off-heap hash table.

In a second aspect, a data searching method is provided in the embodiments of the present disclosure.

Specifically, the data searching method includes:

acquiring a search condition through a virtual machine;

according to the search condition, searching an index stored in an off-heap storage space of a host running the virtual machine through the virtual machine, and determining identification information of data meeting the search condition;

and acquiring the data stored in the off-heap storage space according to the identification information.

With reference to the second aspect, in a first implementation manner of the second aspect, the searching, by the virtual machine, an index stored in an off-heap storage space of a host running the virtual machine according to the search condition includes:

analyzing the search condition to obtain one or more hit indexes;

searching, among the indexes, the hit index with the largest number of levels;

and if the hit indexes with the largest number of levels include both a range index and a hash index, searching the hash index with the largest number of levels.
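One possible reading of this selection rule is sketched below in Python. The sketch assumes that the "number of levels" of an index equals the number of columns it covers. The `(kind, columns)` pairs and the `choose_index` function are hypothetical names introduced only for illustration, and the column names are borrowed from the example of fig. 3.

```python
def choose_index(hit_indexes):
    """Pick the index to search from a non-empty list of hit indexes.

    Each hit index is a hypothetical (kind, columns) pair, where kind is
    "range" or "hash". The number of levels is assumed to equal len(columns).
    """
    max_levels = max(len(columns) for _, columns in hit_indexes)
    deepest = [ix for ix in hit_indexes if len(ix[1]) == max_levels]
    # Among the deepest hit indexes, prefer a hash index over a range index.
    for ix in deepest:
        if ix[0] == "hash":
            return ix
    return deepest[0]


hits = [
    ("range", ("acquisition time",)),
    ("range", ("acquisition time", "traffic")),
    ("hash", ("region", "service type")),
]
chosen = choose_index(hits)
```

Here the two-column range index and the two-column hash index tie on the number of levels, so the hash index is chosen.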

With reference to the first implementation manner of the second aspect, the present disclosure provides in a second implementation manner of the second aspect:

if the hit index is a range index, the searching, by the virtual machine, of an index stored in the off-heap storage space of the host running the virtual machine according to the search condition includes: searching the off-heap skip list storing the index according to the range of the hit index, to obtain identification information satisfying the search condition;

if the hit index is a hash index, searching, by the virtual machine, an index stored in the off-heap storage space of the host running the virtual machine according to the search condition, including: and generating a hash value according to the hit index, and searching an off-heap hash table according to the hash value to obtain identification information meeting the searching condition.

In a third aspect, a data updating method is provided in the embodiments of the present disclosure.

Specifically, the data updating method includes:

acquiring update data through a virtual machine;

acquiring an index of the update data through the virtual machine;

determining, by the virtual machine, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine according to the index of the update data;

and updating data corresponding to the target data identification information in the off-heap storage space by using the updating data through the virtual machine.

In a fourth aspect, a data deleting method is provided in the embodiments of the present disclosure.

Specifically, the data deleting method includes:

acquiring an index of data to be deleted through a virtual machine;

determining, by the virtual machine, target data identification information corresponding to the index of the data to be deleted in an off-heap storage space of a host running the virtual machine according to the index of the data to be deleted;

and deleting the data corresponding to the target data identification information in the off-heap storage space through the virtual machine.
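The update and delete flows of the third and fourth aspects can be sketched together in Python as follows. This is a minimal illustration: plain dictionaries stand in for the off-heap storage space, the sample keys and values are hypothetical, and removing the index entry on deletion is an assumption that the text does not state.

```python
# Stand-ins for the off-heap structures (plain Python objects here).
data = {1: "old-1", 2: "old-2", 3: "old-3"}   # identification info (row number) -> data
index = {"18:00": [1, 2], "18:01": [3]}       # index data -> row numbers


def update(index_key, new_value):
    """Overwrite every row the index maps index_key to; return the rows touched."""
    targets = index.get(index_key, [])
    for row_no in targets:
        data[row_no] = new_value
    return list(targets)


def delete(index_key):
    """Delete every row the index maps index_key to; return the rows removed."""
    targets = index.pop(index_key, [])  # dropping the index entry is an assumption
    for row_no in targets:
        del data[row_no]
    return targets


updated = update("18:01", "new-3")
deleted = delete("18:00")
```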

In a fifth aspect, a data storage device is provided in embodiments of the present disclosure.

Specifically, the data storage device includes:

a first acquisition module configured to acquire data to be stored through a virtual machine;

a second acquisition module configured to acquire the index of the data to be stored through the virtual machine;

a storage module configured to store, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine.

With reference to the fifth aspect, in a first implementation manner of the fifth aspect, the index includes a range index and/or a hash index.

With reference to the first implementation manner of the fifth aspect, in a second implementation manner of the fifth aspect, if the index is a range index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine includes:

if the index contains a plurality of columns of index data, storing the index in a multi-level off-heap skip list, wherein each level of the off-heap skip list stores the index data of a corresponding column, and wherein: the key of each level of the off-heap skip list except the last level is the index data of the corresponding column, and the value is the index data of the next column corresponding to that key; and the key of the last-level off-heap skip list is the index data of the corresponding column, and the value is the identification information of the corresponding data to be stored.

With reference to the second implementation manner of the fifth aspect, in a third implementation manner of the fifth aspect, if the index is a hash index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine includes:

generating a hash value according to the indexed columns;

In a sixth aspect, an embodiment of the present disclosure provides a data searching apparatus.

Specifically, the data search apparatus includes:

the third acquisition module is configured to acquire the search condition through the virtual machine;

a first determining module configured to search, by the virtual machine and according to the search condition, an index stored in an off-heap storage space of a host running the virtual machine, and to determine identification information of data that satisfies the search condition;

a fourth obtaining module configured to obtain the data stored in the off-heap storage space according to the identification information.

With reference to the sixth aspect, in a first implementation manner of the sixth aspect, the searching, by the virtual machine, an index stored in an off-heap storage space of a host running the virtual machine according to the search condition includes:

analyzing the searching condition to obtain one or more hit indexes;

searching the hit index with the highest level number in the indexes,

With reference to the first implementation manner of the sixth aspect, the present disclosure provides in a second implementation manner of the sixth aspect:

In a seventh aspect, an embodiment of the present disclosure provides a data updating apparatus.

Specifically, the data updating apparatus includes:

a fifth obtaining module configured to obtain the update data by the virtual machine;

a sixth obtaining module configured to obtain, by the virtual machine, an index of the update data;

a second determining module configured to determine, by the virtual machine, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine according to the index of the update data;

an updating module configured to update, by the virtual machine, data in the off-heap storage space corresponding to target data identification information using the update data.

In an eighth aspect, an embodiment of the present disclosure provides a data deleting device.

Specifically, the data deleting apparatus includes:

a seventh obtaining module configured to obtain, by the virtual machine, an index of the data to be deleted;

a third determining module configured to determine, by the virtual machine, target data identification information corresponding to an index of the data to be deleted in an off-heap storage space of a host running the virtual machine according to the index of the data to be deleted;

and the deleting module is configured to delete the data corresponding to the target data identification information in the off-heap storage space through the virtual machine.

In a ninth aspect, the present disclosure provides an electronic device, including a memory and a processor, wherein the memory is configured to store one or more computer instructions, and wherein the one or more computer instructions are executed by the processor to implement the method according to any one of the first to fourth aspects.

In a tenth aspect, an embodiment of the present disclosure provides a computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, implement the method according to any one of the first to fourth aspects.

In an eleventh aspect, the present disclosure provides, in an embodiment, a computer program product comprising computer instructions that, when executed by a processor, implement the method steps according to any one of the first to fourth aspects.

According to the technical solution provided by the embodiments of the present disclosure, storing the data to be stored in the off-heap storage space greatly increases the order of magnitude of data that the virtual machine can read and write. Storing the index in the off-heap storage space enables fast index-based data lookup; compared with looking up data by a full scan of heap memory, this significantly improves lookup efficiency and meets the requirement of fast lookup over large data volumes.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. The following is a description of the drawings.

Fig. 1 shows a block diagram of a Presto system according to an embodiment of the present disclosure.

FIG. 2 shows a flow diagram of a data storage method according to an embodiment of the present disclosure.

FIG. 3 illustrates an exemplary data table of data to be stored and its index according to an embodiment of the disclosure.

Fig. 4A shows a data structure diagram of a skip list.

Fig. 4B and 4C illustrate an off-heap skip list example according to an embodiment of the present disclosure.

Fig. 4D and 4E illustrate off-heap hash table examples in accordance with embodiments of the present disclosure.

FIG. 5 shows a flow diagram of a data lookup method according to an embodiment of the present disclosure.

FIG. 6 shows a flow chart of a data update method according to an embodiment of the present disclosure.

FIG. 7 shows a flow diagram of a data deletion method according to an embodiment of the present disclosure.

Fig. 8A illustrates a block diagram of a data storage device according to an embodiment of the present disclosure.

Fig. 8B shows a block diagram of a data lookup apparatus according to an embodiment of the present disclosure.

Fig. 8C illustrates a block diagram of a data update apparatus according to an embodiment of the present disclosure.

Fig. 8D illustrates a block diagram of a data deletion apparatus according to an embodiment of the present disclosure.

Fig. 9 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

FIG. 10 shows a schematic block diagram of a computer system suitable for use in implementing a method according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Furthermore, parts that are not relevant to the description of the exemplary embodiments have been omitted from the drawings for the sake of clarity.

In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.

It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

In the present disclosure, the acquisition of the user information or the user data is an operation that is authorized, confirmed, or actively selected by the user.

As mentioned above, with the rapid development of big data technology, the amount of data to be stored is increasing rapidly, and the real-time requirements on data storage and lookup are becoming ever higher. However, existing data storage and lookup technologies struggle to store and look up large data volumes quickly, which has become a bottleneck for the further adoption of big data technology.

A virtual machine (VM) is a computer system, simulated by software, that has the functionality of a complete hardware system. When a virtual machine is created on a computer, part of the physical machine's hard disk and memory capacity is usually allocated as the hard disk and memory capacity of the virtual machine.

A Java Virtual Machine (JVM) is a virtual machine for the Java platform and is widely used to implement the data storage and lookup functions of databases. For example, Presto, a distributed SQL (Structured Query Language) query engine suited to interactive analytic queries over data volumes ranging from gigabytes to petabytes, may be implemented on a virtual machine.

Fig. 1 shows a block diagram of the Presto system according to an embodiment of the present disclosure.

As shown in fig. 1, the Presto system adopts a master-slave model, in which the coordinator is the master device and each working device (worker) is a slave device. The coordinator is responsible for metadata management and for scheduling the working devices, and the working devices are responsible for computation and for reading and writing data. Both the coordinator and the working devices can be implemented using JVMs.

A Presto client sends a service request to the coordinator, the coordinator determines, according to a preset scheduling rule, the working device that will serve the client, and the client then communicates with that working device to perform operations such as data storage, lookup, update, and deletion.

If the working device stores the data in the heap memory of the JVM, then looking up data requires a full scan of the data stored in the heap memory, according to the search condition, to find the target data requested by the client. Generally, a JVM instance has only one heap, typically tens of gigabytes in size, which severely limits the amount of data that can be stored and makes it difficult to meet the storage and lookup requirements of large data volumes. In addition, the garbage collector (GC) of the JVM reclaims heap memory, and during reclamation the application running on the JVM may be paused, which seriously affects the speed of operations such as data storage, lookup, update, and deletion. In application scenarios with high real-time requirements, such as network fault alerting based on data analysis, slow data lookup means that the system takes longer to detect a fault, which increases the loss caused by the fault.

The host (i.e., physical machine) running the JVM typically has a much larger memory space than the heap memory, with one portion being used as heap memory and the other portion being used as off-heap memory, the size of which can reach the TB level. Therefore, in the embodiment of the present disclosure, operations such as storage, search, update, and deletion of data are implemented by reading and writing the off-heap memory through the virtual machine, so as to provide functions such as fast storage, search, update, and deletion of a large data volume.

As shown in fig. 2, the data storage method according to an embodiment of the present disclosure includes steps S201 to S203.

In step S201, data to be stored is acquired by a virtual machine;

in step S202, an index of the data to be stored is obtained by the virtual machine;

in step S203, the data to be stored and the index are stored in the off-heap storage space of the host running the virtual machine by the virtual machine.

According to the embodiment of the disclosure, the data to be stored is obtained by the virtual machine, for example, the data to be stored may be obtained from the client by the working device shown in fig. 1. For example, in fig. 1, a client transmits a request for storing data to a coordinator, and the coordinator determines that the working device 1 can provide a service to the client according to a preset scheduling rule and notifies the client. The client sends the data to be stored to the working device 1, the working device 1 executes the data storage method according to the embodiment of the present disclosure, and the data to be stored received from the client is stored in the off-heap storage space of the host running the virtual machine of the working device 1.

According to embodiments of the present disclosure, the off-heap storage space may be off-heap memory. Alternatively, according to an embodiment of the present disclosure, the off-heap storage space may be any other storage space besides the heap memory.

According to an embodiment of the present disclosure, data to be stored and an index of the data to be stored may be stored in a data table. The data table comprises one or more rows, each row stores corresponding data to be stored and a corresponding index, and the index comprises one or more columns of index data.

As shown in fig. 1, after the working device 1 receives the data table from the client, the virtual machine of the working device 1 analyzes the data table according to the table structure, obtains data to be stored and an index, and stores the data to be stored and the index in the off-heap storage space of the host running the virtual machine. If the data to be stored has a plurality of indexes, the indexes are all stored in the off-heap storage space.
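Steps S201 to S203 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Store` class and `store_table` function are hypothetical names, plain Python objects stand in for the off-heap storage space, a sorted list stands in for the range index, and the network addresses are placeholder values.

```python
from bisect import insort


class Store:
    """Stand-in for the off-heap storage space (plain Python objects here)."""

    def __init__(self):
        self.rows = {}         # row number -> data to be stored
        self.range_index = []  # sorted (index value, row number) pairs
        self.hash_index = {}   # hash of index value -> row numbers


def store_table(store, table):
    """table: iterable of (row_no, data, range_value, hash_value) tuples."""
    for row_no, data, range_value, hash_value in table:  # S201/S202: data and its index
        store.rows[row_no] = data                         # S203: store the data
        insort(store.range_index, (range_value, row_no))  # S203: store the range index
        # S203: store the hash index (hash value -> row numbers)
        store.hash_index.setdefault(hash(hash_value), []).append(row_no)


store = Store()
store_table(store, [
    (1, {"network_address": "addr-1"}, "18:00", "Beijing"),
    (4, {"network_address": "addr-4"}, "18:01", "Hangzhou"),
    (2, {"network_address": "addr-2"}, "18:00", "Shanghai"),
])
```

Both indexes hold only row numbers, so the data itself is stored once and reached through its identification information.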

According to the embodiments of the present disclosure, the data to be stored can be stored in the off-heap storage space in column-oriented or row-oriented form, depending on the application scenario. Column storage means that data to be stored in the same column is stored in a contiguous region of off-heap storage space, which suits data analysis and statistics operations such as counting, averaging, and summing. Row storage means that data to be stored in the same row is stored in a contiguous region of off-heap storage space, which suits operations such as data scanning.
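The difference between the two layouts can be illustrated with a small Python sketch. A `bytearray` stands in for a contiguous region of off-heap memory (on a JVM, a direct buffer obtained from `java.nio.ByteBuffer.allocateDirect` could play that role). The encoding of each row as two 32-bit integers is an assumption made only for this example.

```python
import struct

# Rows 1-3 of the example: (minutes since midnight for 18:00, traffic in GB).
rows = [(1080, 3), (1080, 5), (1080, 8)]

# Row storage: the fields of one row sit next to each other.
row_buf = bytearray()
for minutes, traffic_gb in rows:
    row_buf += struct.pack("<ii", minutes, traffic_gb)

# Column storage: the values of one column sit next to each other.
col_buf = bytearray()
col_buf += struct.pack("<3i", *(minutes for minutes, _ in rows))
col_buf += struct.pack("<3i", *(traffic_gb for _, traffic_gb in rows))

# Summing a column reads one contiguous slice of the column layout.
traffic_column = struct.unpack_from("<3i", col_buf, 3 * 4)
total_traffic = sum(traffic_column)

# Fetching a whole row reads one contiguous slice of the row layout.
second_row = struct.unpack_from("<ii", row_buf, 1 * 8)
```

Both buffers hold the same 24 bytes of payload; only the order differs, which determines whether an aggregation or a row fetch touches contiguous memory.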

According to the embodiments of the present disclosure, storing the data to be stored in the off-heap storage space greatly increases the order of magnitude of data that the virtual machine can read and write, for example from GB to TB. Storing the index in the off-heap storage space enables fast index-based data lookup; compared with looking up data by a full scan of heap memory, this significantly improves lookup efficiency and meets the requirement of fast lookup over large data volumes. In addition, when a JVM is used to implement data storage and lookup, the off-heap storage space is not affected by the GC's heap reclamation, so the data storage speed is not affected by the GC. The embodiments of the present disclosure make full use of the off-heap storage space of the host running the virtual machine and provide fast storage and lookup of large data volumes without changing the overall architecture of an existing query engine (for example, the Presto system). They therefore integrate well with existing systems, keep the cost of upgrading and retrofitting low, and are easy to deploy widely.

According to an embodiment of the present disclosure, the index includes a range index and/or a hash index. The range index is typically used for data lookup conditional on a range of values, and the hash index is typically used for data lookup conditional on a specified value.

For example, as shown in fig. 3, the data to be stored may include "network address" and "service class", the range index may include "acquisition time" and "traffic", and may further include a combination of "acquisition time" and "traffic", and the hash index may include "region" and "service type", and may further include a combination of "region" and "service type".

According to an embodiment of the present disclosure, if the index is a range index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine includes:

Fig. 4A shows a data structure diagram of a skip list.

A skip list is an ordered linked list that supports a search similar to binary search. As shown in fig. 4A, the skip list adds multiple levels of index on top of the ordered linked list to enable fast lookup. A search first finds, in the highest-level index, the last position whose element is smaller than the element being searched for, then drops to the next-level index and continues, until it reaches the lowest-level index, at which point the current position is very close to that of the element being searched for (if the element exists). Because the index allows multiple elements to be skipped at a time, the search is fast.
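The search procedure described above can be sketched as a small skip list in Python. This is a generic textbook-style skip list that lives in ordinary heap objects, not the off-heap structure of the disclosure. The class names, the maximum height of 16, and the promotion probability of 0.5 are assumptions chosen for illustration.

```python
import random


class Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height  # forward[i] is the next node on level i


class SkipList:
    MAX_HEIGHT = 16

    def __init__(self, seed=0):
        self.head = Node(None, None, self.MAX_HEIGHT)  # sentinel before all keys
        self.height = 1
        self.rng = random.Random(seed)

    def _random_height(self):
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        """On every level, find the last node whose key is smaller than key."""
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.height - 1, -1, -1):  # start at the highest level
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node  # drop down one level from here
        return update

    def insert(self, key, value):
        update = self._predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # key already present: overwrite
            return
        height = self._random_height()
        self.height = max(self.height, height)
        node = Node(key, value, height)
        for level in range(height):
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node

    def get(self, key):
        candidate = self._predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return None


sl = SkipList()
for key in [5, 1, 9, 3, 7]:
    sl.insert(key, key * 10)
```

After the search reaches the lowest level, the node immediately following the predecessor is the only candidate for the requested key, which matches the "very close" position described in the text.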

By using the off-heap skip list to build the range index, data within a given range can be looked up with $O(\log N)$ complexity, where $N$ is the number of rows in the data table. For example, for $N = 2^{32}$, a lookup using the off-heap skip list takes on the order of $\log_2 2^{32} = 32$ steps, whereas data in heap memory can only be fully scanned, with lookup complexity $O(2^{32})$. Building the range index with an off-heap skip list therefore significantly reduces the complexity of data lookup.

According to an embodiment of the present disclosure, the identification information of the data to be stored may be a pointer to the data to be stored. Alternatively, when the data to be stored includes one or more lines of data, the identification information of the data to be stored may be a line number of the data to be stored.

Taking the table in fig. 3 as an example, assuming that the index includes only the "acquisition time" column, the keys of the off-heap skip list are "18:00", "18:01", and "18:02"; the value corresponding to the key "18:00" is row numbers 1, 2, and 3, the value corresponding to the key "18:01" is row numbers 4 and 5, and the value corresponding to the key "18:02" is row numbers 6, 7, and 8, as shown in fig. 4B.
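A range lookup over this single-column index can be sketched in Python as follows. A sorted key list searched with `bisect` stands in for the lowest level of the off-heap skip list, and the `range_lookup` function is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

# Keys and values as listed for fig. 4B: acquisition time -> row numbers.
keys = ["18:00", "18:01", "18:02"]        # kept sorted, like a skip list's lowest level
values = [[1, 2, 3], [4, 5], [6, 7, 8]]


def range_lookup(low, high):
    """Return the row numbers whose key lies in the closed range [low, high]."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return [row_no for rows in values[start:end] for row_no in rows]


late_rows = range_lookup("18:01", "18:02")
```

Zero-padded "HH:MM" strings sort in chronological order, so plain string comparison is enough for this example.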

Assuming that the index includes the "acquisition time" column and the "traffic" column, as shown in fig. 4C, the keys of the first-level off-heap skip list are the acquisition times "18:00", "18:01", and "18:02"; the value corresponding to the key "18:00" is second-level off-heap skip list 1, the value corresponding to the key "18:01" is second-level off-heap skip list 2, and the value corresponding to the key "18:02" is second-level off-heap skip list 3. The keys of second-level off-heap skip list 1 are the traffic values "3G", "5G", and "8G"; the row number corresponding to the key "3G" is 1, the row number corresponding to the key "5G" is 2, and the row number corresponding to the key "8G" is 3. The keys of second-level off-heap skip list 2 are the traffic values "2G" and "5G"; the row number corresponding to the key "2G" is 4, and the row number corresponding to the key "5G" is 5. The keys of second-level off-heap skip list 3 are the traffic values "1G", "4G", and "6G"; the row number corresponding to the key "1G" is 6, the row number corresponding to the key "4G" is 7, and the row number corresponding to the key "6G" is 8.
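The two-level structure can be sketched in Python as nested sorted collections. Sorted lists searched with `bisect` stand in for the off-heap skip lists at both levels, traffic is encoded as an integer number of GB, and `range_lookup` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

# Level 1: acquisition time -> level-2 list.
# Level 2: sorted (traffic in GB, row number) pairs.
# Row 5 for ("18:01", 5) is an assumption: 18:01 covers rows 4 and 5 in the
# single-column example.
two_level = {
    "18:00": [(3, 1), (5, 2), (8, 3)],
    "18:01": [(2, 4), (5, 5)],
    "18:02": [(1, 6), (4, 7), (6, 8)],
}
level1_keys = sorted(two_level)


def range_lookup(time_low, time_high, gb_low, gb_high):
    """Rows with time in [time_low, time_high] and traffic in [gb_low, gb_high]."""
    rows = []
    first = bisect_left(level1_keys, time_low)
    last = bisect_right(level1_keys, time_high)
    for time_key in level1_keys[first:last]:
        level2 = two_level[time_key]
        start = bisect_left(level2, (gb_low,))
        end = bisect_right(level2, (gb_high, float("inf")))
        rows.extend(row_no for _, row_no in level2[start:end])
    return rows


rows = range_lookup("18:00", "18:01", 4, 8)
```

The first level narrows the search to the matching acquisition times, and only the second-level lists under those keys are searched for the traffic range.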

According to an embodiment of the present disclosure, if the index is a hash index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine includes:

generating a hash value according to the indexed column;

The hash index is implemented based on a hash table. For each row of data, a hash value is calculated over the index data of all columns in the hash index, and different index data yield different hash values. The hash value is used as the key, the identification information of the data to be stored corresponding to the hash value is used as the value, and both are stored in an off-heap hash table.

Taking the table in fig. 3 as an example, assuming that the index includes only the "region" column, hash values h1, h2, and h3 of "Beijing", "Shanghai", and "Hangzhou" are calculated respectively and stored as keys in the off-heap hash table, as shown in fig. 4D, where the value corresponding to the key "h1" is row numbers 1, 3, 5, and 6, the value corresponding to the key "h2" is row numbers 2 and 7, and the value corresponding to the key "h3" is row numbers 4 and 8.

Assuming that the index includes the "region" column and the "service type" column, hash values h4, h5, h6, h7, h8, and h9 of "Beijing short video", "Shanghai short video", "Beijing live", "Hangzhou short video", "Beijing video conference", and "Shanghai live" are calculated respectively and stored as keys in an off-heap hash table, as shown in fig. 4E, where the value corresponding to the key "h4" is row number 1, the value corresponding to the key "h5" is row number 2, the value corresponding to the key "h6" is row numbers 3 and 6, the value corresponding to the key "h7" is row numbers 4 and 8, the value corresponding to the key "h8" is row number 5, and the value corresponding to the key "h9" is row number 7.
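Both hash indexes can be sketched in Python as follows. A `dict` stands in for the off-heap hash table, and the `hash_key` helper, which joins the column values and takes a truncated SHA-256 digest, is a hypothetical choice of hash function rather than the one used by the disclosure. Collisions are ignored, in line with the statement that different index data yield different hash values.

```python
import hashlib


def hash_key(*columns):
    """Hash one or more column values into a single key (hypothetical helper)."""
    joined = "\x1f".join(columns)  # separator keeps ("ab", "c") distinct from ("a", "bc")
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


# (region, service type) per row number, as listed in the examples above.
table = {
    1: ("Beijing", "short video"), 2: ("Shanghai", "short video"),
    3: ("Beijing", "live"), 4: ("Hangzhou", "short video"),
    5: ("Beijing", "video conference"), 6: ("Beijing", "live"),
    7: ("Shanghai", "live"), 8: ("Hangzhou", "short video"),
}

region_index = {}     # single-column hash index (fig. 4D)
composite_index = {}  # two-column hash index (fig. 4E)
for row_no, (region, service) in table.items():
    region_index.setdefault(hash_key(region), []).append(row_no)
    composite_index.setdefault(hash_key(region, service), []).append(row_no)


def lookup(index, *columns):
    """Return the row numbers stored under the hash of the given column values."""
    return index.get(hash_key(*columns), [])


beijing_rows = lookup(region_index, "Beijing")
beijing_live_rows = lookup(composite_index, "Beijing", "live")
```

A lookup hashes the requested values once and reads a single table entry, regardless of how many rows the table holds.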

By building a direct lookup index with the off-heap hash table, data equal to a given value can be looked up with $O(1)$ complexity. For example, for a data table with $2^{32}$ rows, a lookup using the off-heap hash table has complexity $O(1)$, whereas data in heap memory can only be fully scanned, with lookup complexity $O(2^{32})$. Building the direct lookup index with an off-heap hash table therefore significantly reduces the complexity of data lookup.

As shown in fig. 5, the data search method according to the embodiment of the present disclosure includes steps S501 to S503.

In step S501, a search condition is obtained by the virtual machine;

in step S502, according to the search condition, searching, by the virtual machine, an index stored in an off-heap storage space of a host running the virtual machine, and determining identification information of data that satisfies the search condition;

in step S503, the data stored in the off-heap storage space is acquired according to the identification information.

According to the embodiment of the present disclosure, the search condition is obtained by the virtual machine; for example, the search condition may be obtained from the client by the working device shown in fig. 1. For example, in fig. 1, the client sends a request for finding data to the coordinator, and the coordinator determines, according to a preset scheduling rule, that the working device 2 can provide service to the client and notifies the client. The client sends the search condition to the working device 2, and the working device 2 executes the data search method according to the embodiment of the disclosure: it searches the index in the off-heap storage space of the host running the virtual machine of the working device 2 according to the search condition received from the client, determines the identification information of the data meeting the search condition, and then obtains the data stored in the off-heap storage space according to the identification information.

According to the embodiment of the disclosure, by storing data in the off-heap storage space, the volume of data that the virtual machine can read and write is greatly increased, from the GB level to the TB level. By storing the index, fast index-based data search can be realized; compared with searching data in heap memory by full scan, the data search efficiency is significantly improved, meeting the requirement of fast search over large data volumes. In addition, the off-heap storage space is not affected by GC reclamation of heap memory, so the data search speed is not affected by GC. The embodiment of the disclosure makes full use of the off-heap storage space of the host running the virtual machine and provides fast storage and search of large data volumes without changing the overall architecture of the existing query engine (for example, the Presto system); it integrates well with existing systems, keeps the cost of upgrading and modifying the system low, and is convenient for wide popularization and application.

According to embodiments of the present disclosure, data and an index of the data may be stored in a data table. The data table comprises one or more rows, each row stores corresponding data and a corresponding index, and the index comprises one or more columns of index data.

According to an embodiment of the present disclosure, the identification information of the data may be a pointer to the data. Alternatively, when the data includes one or more rows of data, the identification information of the data may be a row number of the data.

As shown in fig. 1, after the working device 2 receives the search condition from the client, the virtual machine that implements the working device 2 parses the search condition and obtains an index (hereinafter referred to as a "hit index") that is stored in the off-heap storage space of the host running the virtual machine of the working device 2 and that matches the search condition. For example, for the table shown in fig. 3, if the index "acquisition time" is stored in the off-heap storage space and the search condition includes "acquisition time is 18:00 to 18:01", the index "acquisition time" is hit. If the number of hit indexes is zero, no index matching the search condition is stored in the off-heap storage space; in this case, the data stored in the off-heap storage space is fully scanned to obtain the data meeting the search condition.
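The hit-index check and the full-scan fallback can be sketched as follows. This is a minimal sketch under stated assumptions: an index is treated as "hit" when every one of its columns is constrained by the search condition, which is one possible reading of "matches the search condition"; the rows, the function names `hit_indexes` and `full_scan`, and the predicate representation are all hypothetical.

```python
def hit_indexes(stored_indexes, condition_columns):
    """Return the stored indexes whose columns are all constrained by the condition."""
    wanted = set(condition_columns)
    return [columns for columns in stored_indexes if set(columns) <= wanted]


def full_scan(rows, condition):
    """Check every stored row against every predicate of the search condition."""
    return [
        row_number
        for row_number in sorted(rows)
        if all(predicate(rows[row_number][column])
               for column, predicate in condition.items())
    ]


# Hypothetical stored data: row number -> column values.
ROWS = {
    1: {"acquisition time": "18:00", "flow": 3},
    2: {"acquisition time": "18:00", "flow": 5},
    3: {"acquisition time": "18:02", "flow": 4},
    4: {"acquisition time": "18:03", "flow": 7},
    5: {"acquisition time": "18:01", "flow": 4},
}
STORED_INDEXES = [("acquisition time",)]

by_time = {"acquisition time": lambda t: "18:00" <= t <= "18:01"}
by_flow = {"flow": lambda g: 3 <= g <= 5}

print(hit_indexes(STORED_INDEXES, by_time))  # the "acquisition time" index is hit
print(hit_indexes(STORED_INDEXES, by_flow))  # no index is hit
print(full_scan(ROWS, by_flow))              # fallback when no index is hit
```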

According to an embodiment of the present disclosure, the searching, by the virtual machine, an index stored in an off-heap storage space of a host running the virtual machine according to the search condition includes:

analyzing the search condition to obtain one or more hit indexes;

searching the hit index with the highest level number in the indexes,

Since the search complexity is lower the more stages an index has, searching the hit index with the largest number of stages obtains the data to be searched with the lowest search complexity. In addition, since the lookup complexity of a hash index is lower than that of a range index, when the number of stages is the same the hash index is used preferentially, so that the data to be searched is obtained with a lower lookup complexity than with the range index.
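The selection rule in the preceding paragraph (most stages first, hash index preferred on a tie) can be sketched as below. The `HitIndex` type and `choose_index` function are hypothetical names, and counting one stage per indexed column is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitIndex:
    kind: str       # "hash" or "range"
    columns: tuple  # assumed: one stage per indexed column


def choose_index(hits):
    """Pick the hit index with the most stages; on a tie, prefer a hash index."""
    return max(hits, key=lambda h: (len(h.columns), h.kind == "hash"))


hits = [
    HitIndex("range", ("acquisition time",)),
    HitIndex("range", ("acquisition time", "flow")),
    HitIndex("hash", ("region", "service type")),
]
print(choose_index(hits).kind)
```

The sort key is a tuple, so the stage count is compared first and the hash preference only breaks ties.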

According to an embodiment of the present disclosure, if the hit index is a range index, the searching, by the virtual machine, an index stored in an off-heap storage space of a host running the virtual machine according to the search condition includes: searching an off-heap skip list storing the index according to the range of the hit index to obtain identification information meeting the search condition;

For example, for the table shown in fig. 3, assuming that the search condition is "acquisition time is 18:00 to 18:01, and flow is 3G to 5G", the hit index is the range index stored in the multi-stage off-heap skip table shown in fig. 4C, and the index is searched stage by stage in the multi-stage off-heap skip table according to the search condition until the row numbers satisfying the search condition are obtained. For example, second-stage skip table 1 and second-stage skip table 2 are obtained from the first-stage off-heap skip table according to the acquisition time of 18:00 to 18:01, row numbers 1, 2, and 5 are obtained from them according to the flow of 3G to 5G, and then the corresponding data stored in the off-heap storage space is obtained according to the row numbers.

By using the off-heap skip table to establish the range lookup index, data in a certain range can be looked up with O(log N) complexity, where N is the number of rows of the data table. For example, when N equals 2³², the complexity of a lookup using the off-heap skip table is 32, while the complexity of a full scan of in-heap storage is O(2³²). Therefore, the complexity of data search can be significantly reduced by establishing the range index using the off-heap skip table.
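The stage-by-stage range lookup can be illustrated with a small Python sketch. A sorted key list searched with `bisect` is used here as a stand-in for a skip list, since both locate range bounds in O(log N) steps; it is not a skip list implementation and it does not live off-heap. The rows are hypothetical, chosen so that the query from the example returns row numbers 1, 2, and 5; `SortedMap`, `build_two_stage_index`, and `range_lookup` are names invented for this sketch.

```python
import bisect


class SortedMap:
    """Stand-in for a skip list: sorted keys, range bounds found by binary search."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def setdefault(self, key, default):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
            self._values[key] = default
        return self._values[key]

    def range(self, low, high):
        """Values whose keys fall in the closed interval [low, high]."""
        start = bisect.bisect_left(self._keys, low)
        stop = bisect.bisect_right(self._keys, high)
        return [self._values[k] for k in self._keys[start:stop]]


def build_two_stage_index(rows):
    """First stage: acquisition time -> second-stage map; second stage: flow -> row numbers."""
    first = SortedMap()
    for row_number, (time, flow) in sorted(rows.items()):
        second = first.setdefault(time, SortedMap())
        second.setdefault(flow, []).append(row_number)
    return first


def range_lookup(first, time_range, flow_range):
    result = []
    for second in first.range(*time_range):           # stage 1: acquisition time
        for row_numbers in second.range(*flow_range):  # stage 2: flow
            result.extend(row_numbers)
    return sorted(result)


# Hypothetical rows: row number -> (acquisition time, flow in G).
ROWS = {1: ("18:00", 3), 2: ("18:00", 5), 3: ("18:00", 8), 4: ("18:02", 4), 5: ("18:01", 4)}
index = build_two_stage_index(ROWS)
print(range_lookup(index, ("18:00", "18:01"), (3, 5)))
```

Times are compared as "HH:MM" strings, which order correctly within a single day.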

For another example, for the table shown in fig. 3, assuming that the search condition is "the region is beijing and the service type is live broadcast", the hit index is the hash index stored in the off-heap hash table shown in fig. 4E, the hash value calculated according to the search condition is h6, row numbers 3 and 6 are obtained in the hash table, and then the corresponding data stored in the off-heap storage space is obtained according to the row numbers.

By building a direct lookup index using the off-heap hash table, data equal to a certain value can be looked up with O(1) complexity. For example, for a data table with 2³² rows, the complexity of a lookup using the off-heap hash table is O(1), while the complexity of a full scan of in-heap storage is O(2³²). Therefore, the complexity of data search can be significantly reduced by establishing the direct lookup index using the off-heap hash table.

As shown in fig. 6, the data updating method according to the embodiment of the present disclosure includes steps S601 to S604.

In step S601, update data is acquired by the virtual machine;

in step S602, an index of the update data is obtained by the virtual machine;

in step S603, determining, by the virtual machine, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine according to the index of the update data;

in step S604, the virtual machine updates data corresponding to the target data identification information in the off-heap storage space by using the update data.

According to the embodiment of the present disclosure, the obtaining of the update data by the virtual machine may be, for example, obtaining the update data and an index of the update data from the client by the working device shown in fig. 1. For example, in fig. 1, the client sends a request for updating data to the coordinator, and the coordinator determines that the working device 2 can provide a service to the client according to a preset scheduling rule and notifies the client. The client sends the update data and the index of the update data to the working device 2, the working device 2 executes the data update method according to the embodiment of the disclosure, searches in the off-heap storage space of the host running the virtual machine of the working device 2 according to the index of the update data received from the client, determines the target data identification information corresponding to the index of the update data, and then updates the data corresponding to the target data identification information in the off-heap storage space by using the update data.

According to the embodiment of the disclosure, the index is stored in the off-heap storage space, so that fast index-based data updating can be realized; compared with updating data in heap memory by full scan, the data updating efficiency is significantly improved, meeting the requirement of fast updating of large data volumes. In addition, the off-heap storage space is not affected by GC reclamation of heap memory, so the data updating speed is not affected by GC. The embodiment of the disclosure makes full use of the off-heap storage space of the host running the virtual machine and provides fast updating of large data volumes without changing the overall architecture of the existing query engine (for example, the Presto system); it integrates well with existing systems, keeps the cost of upgrading and modifying the system low, and is convenient for wide popularization and application.

According to an embodiment of the present disclosure, if the index of the update data is a range index, the determining, according to the index of the update data, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine includes: searching an off-heap skip list storing the index according to the index of the update data to obtain the target data identification information corresponding to the index of the update data.

For example, for the table shown in fig. 3, assuming that the index of the update data is "acquisition time is 18:00 and flow is 3G", in the first-stage off-heap skip table shown in fig. 4C, the second-stage skip table 1 is obtained according to "acquisition time is 18:00", the row number 1 is obtained in the second-stage skip table 1 according to "flow is 3G", and then the corresponding data is updated in the off-heap storage space according to the row number.

According to an embodiment of the present disclosure, if the index of the update data is a hash index, the determining, according to the index of the update data, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine includes: and generating a hash value according to the index of the updated data, and searching an off-heap hash table according to the hash value to obtain target data identification information corresponding to the index of the updated data.

For example, for the table shown in fig. 3, assuming that the index of the update data is "the region is beijing and the service type is live broadcast", the hash value calculated according to the index of the update data is h6, row numbers 3 and 6 are obtained in the hash table, and then the corresponding data is updated in the off-heap storage space according to the row numbers.
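The hash-index update path can be sketched as follows: compute the hash of the index values, look up the target row numbers, and overwrite the corresponding data. The payloads, the `update_rows` and `hash_key` names, the SHA-1 hash, and the `|` separator are assumptions of this sketch; plain Python dicts stand in for the off-heap hash table and the stored data.

```python
import hashlib


def hash_key(values):
    """Hash the indexed column values (assumes values never contain "|")."""
    return hashlib.sha1("|".join(values).encode("utf-8")).hexdigest()


# Hash index on (region, service type): hash value -> row numbers.
HASH_INDEX = {
    hash_key(["beijing", "live"]): [3, 6],
    hash_key(["shanghai", "live"]): [7],
}
# Hypothetical stored data: row number -> non-index columns.
DATA = {3: {"flow": "3G"}, 6: {"flow": "4G"}, 7: {"flow": "2G"}}


def update_rows(data, hash_index, index_values, update):
    """Find target row numbers via the hash index, then overwrite their data."""
    targets = hash_index.get(hash_key(index_values), [])
    for row_number in targets:
        data[row_number] = dict(update)
    return targets


updated = update_rows(DATA, HASH_INDEX, ["beijing", "live"], {"flow": "9G"})
print(updated)
```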

As shown in fig. 7, the data deletion method according to the embodiment of the present disclosure includes steps S701 to S703.

In step S701, an index of data to be deleted is obtained by the virtual machine;

in step S702, determining, by the virtual machine, target data identification information corresponding to the index of the data to be deleted in an off-heap storage space of a host running the virtual machine according to the index of the data to be deleted;

in step S703, deleting, by the virtual machine, data in the off-heap storage space corresponding to the target data identification information.

According to the embodiment of the disclosure, the index of the data to be deleted is obtained by the virtual machine; for example, the index of the data to be deleted may be obtained from the client by the working device shown in fig. 1. For example, in fig. 1, the client sends a request for deleting data to the coordinator, and the coordinator determines, according to a preset scheduling rule, that the working device 2 can provide a service to the client and notifies the client. The client sends the index of the data to be deleted to the working device 2, and the working device 2 executes the data deletion method according to the embodiment of the disclosure: it searches in the off-heap storage space of the host running the virtual machine of the working device 2 according to the index of the data to be deleted received from the client, determines the target data identification information corresponding to the index of the data to be deleted, and then deletes the data corresponding to the target data identification information in the off-heap storage space through the virtual machine.

According to the embodiment of the disclosure, the index is stored in the off-heap storage space, so that fast index-based data deletion can be realized; compared with deleting data in heap memory by full scan, the data deletion efficiency is significantly improved, meeting the requirement of fast deletion of large data volumes. In addition, the off-heap storage space is not affected by GC reclamation of heap memory, so the data deletion speed is not affected by GC. The embodiment of the disclosure makes full use of the off-heap storage space of the host running the virtual machine and provides fast deletion of large data volumes without changing the overall architecture of the existing query engine (for example, the Presto system); it integrates well with existing systems, keeps the cost of upgrading and modifying the system low, and is convenient for wide popularization and application.

According to an embodiment of the present disclosure, if the index of the data to be deleted is a range index, determining, according to the index of the data to be deleted, target data identification information corresponding to the index of the data to be deleted in an off-heap storage space of a host running the virtual machine, includes: searching an off-heap skip list storing the index according to the index of the data to be deleted to obtain the target data identification information corresponding to the index of the data to be deleted.

For example, for the table shown in fig. 3, assuming that the index of the data to be deleted is "acquisition time is 18:00 and flow is 3G", the second-stage skip table 1 is obtained in the first-stage off-heap skip table shown in fig. 4C according to "acquisition time is 18:00", the row number 1 is obtained in the second-stage skip table 1 according to "flow is 3G", and then the corresponding data is deleted in the off-heap storage space according to the row number.
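The two-stage point lookup followed by deletion can be sketched as below. Nested Python dicts stand in for the first-stage and second-stage skip tables, and the index contents and payloads are hypothetical. The passage does not describe how index entries are maintained after deletion, so this sketch leaves the index untouched; `delete_rows` is a name invented here.

```python
# Two-stage index: acquisition time -> (flow -> row numbers).
RANGE_INDEX = {
    "18:00": {"3G": [1], "5G": [2]},
    "18:01": {"4G": [5]},
}
# Hypothetical stored data: row number -> payload.
DATA = {1: "row 1 payload", 2: "row 2 payload", 5: "row 5 payload"}


def delete_rows(data, index, time, flow):
    """Resolve row numbers stage by stage, then delete the matching data."""
    second_stage = index.get(time, {})      # stage 1: acquisition time
    targets = second_stage.get(flow, [])    # stage 2: flow
    for row_number in targets:
        data.pop(row_number, None)
    return list(targets)


deleted = delete_rows(DATA, RANGE_INDEX, "18:00", "3G")
print(deleted, sorted(DATA))
```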

According to an embodiment of the present disclosure, if the index of the data to be deleted is a hash index, determining, according to the index of the data to be deleted, target data identification information corresponding to the index of the data to be deleted in an off-heap storage space of a host running the virtual machine, includes: and generating a hash value according to the index of the data to be deleted, and searching an off-heap hash table according to the hash value to obtain target data identification information corresponding to the index of the data to be deleted.

For example, for the table shown in fig. 3, assuming that the index of the data to be deleted is "the region is beijing and the service type is live broadcast", the hash value calculated according to the index of the data to be deleted is h6, row numbers 3 and 6 are obtained in the hash table, and then the corresponding data is deleted in the off-heap storage space according to the row numbers.

Fig. 8A illustrates a block diagram of a data storage device according to an embodiment of the present disclosure. The apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of both.

As shown in fig. 8A, the data storage device 810 includes a first obtaining module 811, a second obtaining module 812, and a storage module 813.

The first obtaining module 811 is configured to obtain data to be stored by the virtual machine;

the second obtaining module 812 is configured to obtain, by the virtual machine, an index of the data to be stored;

the storage module 813 is configured to store the data to be stored and the index in the off-heap storage space of the host running the virtual machine through the virtual machine.

According to an embodiment of the present disclosure, the index includes a range index and/or a hash index.

if the index only contains one column of index data, storing the index data in an off-heap skip list, wherein a key of the off-heap skip list is the index data of the column, and a value of the off-heap skip list is identification information of corresponding data to be stored;

generating a hash value according to the indexed columns;

Fig. 8B shows a block diagram of a data lookup apparatus according to an embodiment of the present disclosure. The apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of both.

As shown in fig. 8B, the data searching apparatus 820 includes a third obtaining module 821, a first determining module 822, and a fourth obtaining module 823.

The third obtaining module 821 is configured to obtain the search condition through the virtual machine;

the first determining module 822 is configured to determine, according to the search condition, identification information of data that meets the search condition by the virtual machine searching for an index stored in an off-heap storage space of a host running the virtual machine;

the fourth obtaining module 823 is configured to obtain the data stored in the off-heap storage space according to the identification information.

analyzing the search condition to obtain one or more hit indexes;

searching the hit index with the highest level number in the indexes,

Fig. 8C illustrates a block diagram of a data update apparatus according to an embodiment of the present disclosure. The apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of both.

As shown in fig. 8C, the data updating apparatus 830 includes a fifth obtaining module 831, a sixth obtaining module 832, a second determining module 833, and an updating module 834.

The fifth acquiring module 831 is configured to acquire update data through the virtual machine;

the sixth obtaining module 832 is configured to obtain an index of the update data through the virtual machine;

the second determining module 833 is configured to determine, by the virtual machine, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine according to the index of the update data;

the update module 834 is configured to update, by the virtual machine, data in the off-heap storage space corresponding to target data identification information using the update data.

Fig. 8D illustrates a block diagram of a data deletion apparatus according to an embodiment of the present disclosure. The apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of both.

As shown in fig. 8D, the data deleting device 840 includes a seventh obtaining module 841, a third determining module 842, and a deleting module 843.

The seventh obtaining module 841 is configured to obtain, by the virtual machine, an index of the data to be deleted;

the third determining module 842 is configured to determine, by the virtual machine, target data identification information corresponding to the index of the data to be deleted in the off-heap storage space of the host running the virtual machine according to the index of the data to be deleted;

the deletion module 843 is configured to delete, by the virtual machine, data in the off-heap storage space corresponding to the target data identification information.

The present disclosure also discloses an electronic device, and fig. 9 shows a block diagram of the electronic device according to an embodiment of the present disclosure.

As shown in fig. 9, the electronic device 900 includes a memory 901 and a processor 902, wherein the memory 901 is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor 902 to implement a method according to an embodiment of the disclosure.

The embodiment of the disclosure provides a data storage method, which includes:

acquiring data to be stored through a virtual machine;

acquiring an index of the data to be stored through the virtual machine;

generating a hash value according to the indexed columns;

The embodiment of the disclosure provides a data searching method, which comprises the following steps:

acquiring a search condition through a virtual machine;

analyzing the search condition to obtain one or more hit indexes;

searching the hit index with the highest level number in the indexes,

The embodiment of the present disclosure provides a data updating method, including:

acquiring update data through a virtual machine;

acquiring an index of the update data through the virtual machine;

and updating data corresponding to the identification information of the target data in the off-heap storage space by using the updating data through the virtual machine.

The embodiment of the disclosure provides a data deleting method, which includes:

acquiring an index of data to be deleted through a virtual machine;

and deleting data corresponding to the target data identification information in the off-heap storage space through the virtual machine.

FIG. 10 shows a schematic block diagram of a computer system suitable for use in implementing methods according to embodiments of the present disclosure.

As shown in fig. 10, the computer system 1000 includes a processing unit 1001 that can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the system 1000 are also stored. The processing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is installed into the storage section 1008 as necessary. The processing unit 1001 may be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or other processing units.

In particular, the above described methods may be implemented as computer software programs according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising computer instructions that, when executed by a processor, implement the method steps described above. In such embodiments, the computer program product may be downloaded and installed from a network via the communication section 1009 and/or installed from the removable medium 1011.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules described in the embodiments of the present disclosure may be implemented by software or by programmable hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.

As another aspect, the present disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the electronic device or the computer system in the above embodiments; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is possible without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Claims

1. A method of data storage, comprising:

acquiring data to be stored through a virtual machine;

acquiring an index of the data to be stored through the virtual machine;

storing the data to be stored and the index in an off-heap storage space of a host running the virtual machine through the virtual machine, wherein the off-heap storage space comprises any other storage space except a heap memory;

wherein: the index comprises a range index and/or a hash index;

if the index is a range index and the index contains only one column of index data, storing the index data in an off-heap skip table; if the index is a range index and contains multi-column index data, storing the index in a multi-stage off-heap skip table, and storing the index data of a corresponding column in each stage of the off-heap skip table;

and if the index is the hash index, storing the index in an off-heap hash table.

2. The method of claim 1, wherein:

the off-heap storage space is an off-heap memory.

3. The method of claim 1, wherein if the index is a range index, the storing, by the virtual machine, the data to be stored and the index in off-heap storage space of a host running the virtual machine comprises:

if the index only contains one column of index data, the key of the off-heap skip list is the index data of the column, and the value of the off-heap skip list is the identification information of the corresponding data to be stored;

if the index contains a plurality of columns of index data, the key of each stage of the off-heap skip list except the last stage is the index data of the corresponding column, and the value is the off-heap skip list of the index data, in the next column, corresponding to the key; the key of the last-stage off-heap skip list is the index data of the corresponding column, and the value is the identification information of the corresponding data to be stored.

4. The method of claim 1, wherein if the index is a hash index, the storing, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine comprises:

generating a hash value according to the indexed columns;

5. A method of data lookup comprising:

acquiring a search condition through a virtual machine;

according to the search condition, searching an index stored in an off-heap storage space of a host running the virtual machine through the virtual machine, and determining identification information of data meeting the search condition, wherein the off-heap storage space comprises any other storage space except a heap memory;

acquiring data stored in the off-heap storage space according to the identification information;

wherein the index comprises a range index and/or a hash index; if the index is a range index and the index contains only one column of index data, the index data is stored in an off-heap skip table; if the index is a range index and contains multi-column index data, the index is stored in a multi-stage off-heap skip table, and each stage of the off-heap skip table stores the index data of a corresponding column; if the index is a hash index, the index is stored in an off-heap hash table.

6. The method of claim 5, wherein the searching, by the virtual machine, for the index stored in the off-heap storage space of the host running the virtual machine according to the search condition comprises:

analyzing the search condition to obtain one or more hit indexes;

searching the hit index with the highest level number in the indexes,

7. The method of claim 6, wherein:

8. A data update method, comprising:

acquiring update data through a virtual machine;

acquiring an index of the update data through the virtual machine;

determining, by the virtual machine, target data identification information corresponding to the index of the update data in an off-heap storage space of a host running the virtual machine according to the index of the update data, where the off-heap storage space includes any other storage space except a heap memory;

updating, by the virtual machine, data in the off-heap storage space corresponding to the target data identification information using the update data;

wherein the index comprises a range index and/or a hash index; if the index is a range index and the index contains only one column of index data, storing the index data in an off-heap skip table; if the index is a range index and contains multi-column index data, storing the index in a multi-stage off-heap skip table, and storing the index data of a corresponding column in each stage of the off-heap skip table; and if the index is the hash index, storing the index in an off-heap hash table.

9. A data deletion method, comprising:

acquiring an index of data to be deleted through a virtual machine;

determining, by the virtual machine and according to the index of the data to be deleted, target data identification information corresponding to that index in an off-heap storage space of a host running the virtual machine, wherein the off-heap storage space comprises any storage space other than heap memory;

deleting data corresponding to the target data identification information in the off-heap storage space through the virtual machine;

wherein the index comprises a range index and/or a hash index; if the index is a range index containing only one column of index data, the index data is stored in an off-heap skip list; if the index is a range index containing multiple columns of index data, the index is stored in a multi-level off-heap skip list, each level of which stores the index data of a corresponding column; and if the index is a hash index, the index is stored in an off-heap hash table.
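The update and deletion flows of claims 8 and 9 share one pattern: resolve the index to target data identification information, then act on the data that the identification information addresses. The following is a minimal sketch of that pattern under stated assumptions, not the claimed implementation. An anonymous `mmap` region stands in for the off-heap storage space, a `dict` stands in for the off-heap hash index, and the identification information is assumed to be a fixed-width slot number. The class name `SlotStore`, the slot width, and the sample keys are hypothetical.

```python
import mmap


class SlotStore:
    """Fixed-width slots in a raw byte region (stand-in for off-heap storage)."""

    def __init__(self, slots=8, width=16):
        self.width = width
        self.region = mmap.mmap(-1, slots * width)
        self.index = {}                 # hash index: key -> slot number
        self.free = list(range(slots))  # slot numbers available for reuse

    def _write(self, slot, payload: bytes):
        if len(payload) > self.width:
            raise ValueError("payload wider than slot")
        start = slot * self.width
        # Pad with zero bytes so the whole slot is overwritten.
        self.region[start:start + self.width] = payload.ljust(self.width, b"\x00")

    def insert(self, key, payload: bytes):
        slot = self.free.pop(0)
        self._write(slot, payload)
        self.index[key] = slot

    def update(self, key, payload: bytes):
        slot = self.index[key]          # target data identification information
        self._write(slot, payload)      # overwrite the data in place

    def delete(self, key):
        slot = self.index.pop(key)      # drop the index entry
        self._write(slot, b"")          # clear the data
        self.free.append(slot)          # make the slot reusable

    def read(self, key):
        slot = self.index.get(key)
        if slot is None:
            return None
        start = slot * self.width
        # Trailing zero padding is stripped; payloads ending in zero bytes
        # would need an explicit length field instead.
        return bytes(self.region[start:start + self.width]).rstrip(b"\x00")


slots = SlotStore()
slots.insert("order-1", b"pending")
slots.update("order-1", b"shipped")
after_update = slots.read("order-1")
slots.delete("order-1")
after_delete = slots.read("order-1")
print(after_update, after_delete)
```

Fixed-width slots keep the update an in-place overwrite; variable-length records would instead need reallocation or a length prefix, which this sketch leaves out.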

10. A data storage device comprising:

a first acquisition module configured to acquire data to be stored through a virtual machine;

a storage module configured to store, by the virtual machine, the data to be stored and the index in an off-heap storage space of a host running the virtual machine, wherein the off-heap storage space comprises any storage space other than heap memory;

11. A data search apparatus, comprising:

a first determining module configured to search, by the virtual machine, an index stored in an off-heap storage space of a host running the virtual machine according to the search condition, and to determine identification information of data that meets the search condition, wherein the off-heap storage space comprises any storage space other than heap memory;

a fourth obtaining module configured to obtain the data stored in the off-heap storage space according to the identification information;

12. A data update apparatus comprising:

a second determining module configured to determine, by the virtual machine and according to the index of the update data, target data identification information corresponding to that index in an off-heap storage space of a host running the virtual machine, wherein the off-heap storage space comprises any storage space other than heap memory;

an updating module configured to update, by the virtual machine, data corresponding to target data identification information in the off-heap storage space using the update data;

wherein the index comprises a range index and/or a hash index; if the index is a range index containing only one column of index data, the index data is stored in an off-heap skip list; if the index is a range index containing multiple columns of index data, the index is stored in a multi-level off-heap skip list, each level of which stores the index data of a corresponding column; and if the index is a hash index, the index is stored in an off-heap hash table.
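The multi-column range index, in which each level stores the index data of one column, can be pictured as nested ordered structures. The sketch below is one possible reading, not the claimed implementation: each `Level` is a sorted list (standing in for one level of the off-heap skip list) whose entries point either to the next level or, at the last level, to a list of identification information. The names `Level`, `add`, and `search`, the column values, and the integer identifiers are assumptions for illustration.

```python
import bisect


class Level:
    """One index level ordered by one column (sorted list as skip-list stand-in)."""

    def __init__(self):
        self.keys = []
        self.children = []  # next Level, or a list of identifiers at the last level

    def child(self, key, factory):
        """Return the child for key, creating it with factory() if absent."""
        pos = bisect.bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:
            self.keys.insert(pos, key)
            self.children.insert(pos, factory())
        return self.children[pos]

    def between(self, low, high):
        """Return the children whose keys fall in [low, high]."""
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        return self.children[lo:hi]


def add(root, columns, ident):
    """Insert ident under one value per column, one level per column."""
    node = root
    for value in columns[:-1]:
        node = node.child(value, Level)
    node.child(columns[-1], list).append(ident)


def search(root, ranges):
    """Apply one (low, high) range per column, descending level by level."""
    nodes = [root]
    for low, high in ranges:
        nodes = [c for n in nodes for c in n.between(low, high)]
    return [ident for leaf in nodes for ident in leaf]


# Hypothetical two-column index on (city, age) with integer identifiers.
root = Level()
add(root, ("beijing", 30), 101)
add(root, ("beijing", 41), 102)
add(root, ("hangzhou", 25), 103)
add(root, ("hangzhou", 35), 104)

one_city = search(root, [("hangzhou", "hangzhou"), (20, 30)])
all_cities = search(root, [("a", "z"), (30, 40)])
print(one_city, all_cities)
```

The returned identifiers would then be used to read the data from the off-heap storage space, as in the single-column case.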

13. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method steps of any one of claims 1-9.

14. A readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method steps of any one of claims 1-9.