CN107577436B - Data storage method and device - Google Patents

Info

Publication number
CN107577436B
CN107577436B CN201710842915.0A CN201710842915A CN107577436B CN 107577436 B CN107577436 B CN 107577436B CN 201710842915 A CN201710842915 A CN 201710842915A CN 107577436 B CN107577436 B CN 107577436B
Authority
CN
China
Prior art keywords
data
current
partition
query
storage file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710842915.0A
Other languages
Chinese (zh)
Other versions
CN107577436A (en
Inventor
王旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Shiqu Information Technology Co ltd
Original Assignee
Hangzhou Shiqu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Shiqu Information Technology Co ltd
Priority to CN201710842915.0A
Publication of CN107577436A
Application granted
Publication of CN107577436B

Abstract

The invention discloses a data storage method. When data is stored, the partition index records only the maximum data and the minimum data in the designated partition of a table partition, and the block index records only the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in a column storage file. The storage space occupied by the partition index and the block index is therefore very small, so that even when mass data is stored, both indexes can essentially be kept in memory, which avoids the frequent memory replacement caused by a huge index file. The data storage method can thus keep the partition indexes and the block indexes entirely in memory when mass data is stored, which improves the data query speed. In addition, the invention also discloses a data storage device having the same effects.

Description

Data storage method and device
Technical Field
The present invention relates to the field of data storage, and in particular, to a data storage method and apparatus.
Background
With the rapid development of the internet, data volumes keep growing. Non-core data in particular, such as event-tracking logs and monitoring logs, has become enormous. If such massive data continues to be stored in the traditional way, a large amount of storage resources is wasted and data queries become very slow.
For example, the widely used MySQL database has a relatively low compression ratio and uses a B+ tree index that occupies a relatively large amount of storage space. Because the compression ratio is low, stored data of size a occupies roughly a of storage space, and the corresponding B+ tree index also occupies roughly a. In other words, storing data of size a in a MySQL database requires about 2a of storage space. When a small amount of data is stored, the B+ tree index is large relative to the data but small in absolute terms, so the complete index fits in memory, no memory replacement occurs, and queries are fast. When mass data is stored, however, the complete B+ tree index is too large to fit in memory, which causes memory replacement and slows queries down.
Therefore, how to improve the data query speed when storing mass data is a technical problem that those skilled in the art currently need to solve.
Disclosure of Invention
The invention aims to provide a data storage method and a data storage device, which can improve the query speed of data when mass data is stored.
In order to solve the above technical problem, the present invention provides a data storage method, including:
before current data is written, when no unfilled block exists, creating a block containing column storage files in a table partition;
splitting the current data by column, and writing the current data column by column into the current column storage files;
after the current data has been written, creating in the current block a block index corresponding to each current column storage file, and recording in the block index the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in the current column storage file;
after all the write data has been written, creating in the current table partition a partition index corresponding to a designated partition, and recording in the partition index the maximum data and the minimum data in the designated partition.
Preferably, creating a block containing column storage files in the table partition when no unfilled block exists specifically includes:
when no unfilled table partition exists, creating a new current table partition, and creating the block containing the column storage files in the current table partition.
Preferably, the data storage method further comprises:
after the write data has been written, scanning the written file data in the current storage directory;
when a column storage file in a full block is found to store uncompressed data, compressing the uncompressed data.
Preferably, the data storage method further comprises:
after the current data is written, calculating the difference value between the maximum data and the minimum data in the current column storage file, and uniformly dividing the difference value into N range segments;
establishing a range segment index corresponding to each current column storage file in the current block, and marking the distribution condition of numerical data in the current column storage file in the range segment index;
wherein N is a positive integer.
Preferably, the data storage method further comprises:
after the current data has been written, creating in the current block a character bit index corresponding to each current column storage file, and marking in the character bit index the distribution of character-type data in the current column storage file.
Preferably, the data storage method further comprises:
after the current data has been written, loading each index corresponding to the current column storage files into JAVA virtual machine memory;
when a query request is received, if the indexes corresponding to the column storage files where the current query data is located are stored in the JAVA virtual machine memory, retrieving the current query data according to the indexes stored in the JAVA virtual machine memory.
Preferably, the data storage method further comprises:
after query data are obtained, storing all decompressed data in the column storage file where the query data are located into a JAVA off-heap memory;
when a query request is received, if the current query data are stored in the JAVA off-heap memory, the current query data are obtained from the JAVA off-heap memory.
Preferably, storing all the decompressed data in the column storage file where the query data is located into the JAVA off-heap memory specifically includes:
storing the decompressed data into the JAVA off-heap memory as a byte stream;
correspondingly, obtaining the current query data from the JAVA off-heap memory specifically includes:
retrieving the current query data from the JAVA off-heap memory in byte-stream form;
converting the retrieved current query data into character-string form, and obtaining the current query data in character-string form.
Preferably, the data storage method further comprises:
after query data is obtained, writing the query data into JAVA virtual machine memory, and clearing the query data stored in the JAVA virtual machine memory if it is not used within a preset time;
correspondingly, when a query request for the query data is received within the preset time, the query data is obtained directly from the JAVA off-heap memory.
In order to solve the above technical problem, the present invention further provides a data storage device, including:
a creating module, configured to create a block containing column storage files in a table partition before current data is written, when no unfilled block exists;
a writing module, configured to split the current data by column and write the current data column by column into the current column storage files;
a block index creating module, configured to create in the current block, after the current data has been written, a block index corresponding to each current column storage file, and to record in the block index the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in the current column storage file;
a partition index creating module, configured to create in the current table partition, after all the write data has been written, a partition index corresponding to a designated partition, and to record in the partition index the maximum data and the minimum data in the designated partition.
It can be seen that, when data is stored with the data storage method provided by the invention, the partition index records only the maximum data and the minimum data in the designated partition of a table partition, and the block index records only the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in a column storage file. The storage space occupied by the partition index and the block index is therefore very small, so that even when mass data is stored the index files can essentially be kept in memory, which avoids the frequent memory replacement caused by a huge index file. When a user queries data, the table partition satisfying the query condition can be located quickly using the partition index in memory; once the table partition is located, the column storage files satisfying the query condition are located quickly using the block indexes in memory, and the data satisfying the query condition is then retrieved only from the relevant column storage files. Therefore, when mass data is stored, the data storage method provided by the invention improves the data query speed by keeping the index files entirely in memory, and improves it further by narrowing the retrieval range to the relevant column storage files. In addition, the invention also provides a data storage device having the same effects.
Drawings
In order to illustrate the embodiments of the present invention more clearly, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a data storage method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating a correspondence relationship between table partitions, blocks, column storage files, and indexes in the data storage method according to the embodiment of the present invention;
Fig. 3 is a flowchart of another data storage method according to an embodiment of the present invention;
Fig. 4 is a flowchart of another data storage method according to an embodiment of the present invention;
Fig. 5 is a flowchart of another data storage method according to an embodiment of the present invention;
Fig. 6 is a structural diagram of a data storage device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The invention aims to provide a data storage method and a data storage device, which can improve the query speed of data when mass data is stored.
In order to make the technical solutions of the present invention better understood, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a data storage method according to an embodiment of the present invention. As shown in fig. 1, the data storage method includes:
S10: before the current data is written, when no unfilled block exists, create a block containing column storage files in the table partition.
S11: split the current data by column, and write the current data column by column into the current column storage files.
S12: after the current data has been written, create in the current block a block index corresponding to each current column storage file, and record in the block index the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in the current column storage file.
S13: after all the write data has been written, create in the current table partition a partition index corresponding to a designated partition, and record in the partition index the maximum data and the minimum data in the designated partition.
It should be noted that the present invention is preferably used for storing substantially ordered data. The current data is the data to be written in one pass; it has a fixed data amount that is less than or equal to the maximum number of rows one column storage file can store. For example, if one column storage file can store at most 100,000 rows and the write data amount is greater than 100,000 rows, the fixed data amount is 100,000 rows; if the write data amount is less than or equal to 100,000 rows, the fixed data amount equals the write data amount. The write data amount is the total data amount of all the write data. When the write data amount is larger than the fixed data amount, the write data is divided into several portions of the fixed data amount, and each portion is written in turn as the current data.
Fig. 2 is a schematic diagram illustrating a correspondence relationship between table partitions, blocks, column storage files, and indexes in the data storage method according to the embodiment of the present invention. As shown in Fig. 2, a block 20 consists of column storage files and the indexes corresponding to those column storage files, and a table partition 2 consists of a fixed number of blocks 20.
In addition, the current block is the block into which the current data is written, and the column storage files in the current block are the current column storage files.
In a specific implementation of step S10, after a write request is received, the write data amount carried in the request is first obtained, and it is then determined whether an unfilled block exists in the storage space. If so, the method proceeds to step S11; otherwise, a new block containing column storage files is created in a table partition as the current block, so that the current data can be stored in the current column storage files. There are two cases in which the storage space has no unfilled block. In the first case, the storage space has no table partition at all (and therefore no block), or it has table partitions but all of them are full. In the second case, the storage space has a table partition that is not full, but all the blocks in that table partition are full. In the first case, a new table partition must be created as the current table partition, and a block is then created in it; in the second case, no new table partition is needed, and a new block is simply created in the existing unfilled table partition. That is, after the write request is received and before the current data is written, the table partition and block to which the current data belongs are determined, and a block containing column storage files is newly created in a table partition only when no block exists or all blocks are full. The current table partition is the table partition that contains the current block.
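The two cases above can be sketched as a small decision routine. This is only an illustration under stated assumptions: the class names `Block` and `TablePartition`, the function `current_block`, and the two capacity constants (100,000 rows per block, 100 blocks per table partition) are hypothetical and are not taken from the patent's implementation.

```python
from dataclasses import dataclass, field

ROWS_PER_BLOCK = 100_000      # assumed capacity of one column storage file
BLOCKS_PER_PARTITION = 100    # assumed maximum number of blocks per table partition


@dataclass
class Block:
    rows: int = 0

    def is_full(self) -> bool:
        return self.rows >= ROWS_PER_BLOCK


@dataclass
class TablePartition:
    blocks: list = field(default_factory=list)


def current_block(storage: list) -> Block:
    """Return the block that should receive the current data (step S10)."""
    # An unfilled block already exists: reuse it and go straight to S11.
    for partition in storage:
        for block in partition.blocks:
            if not block.is_full():
                return block
    # Second case: a table partition still has room for one more block.
    for partition in storage:
        if len(partition.blocks) < BLOCKS_PER_PARTITION:
            block = Block()
            partition.blocks.append(block)
            return block
    # First case: no table partition exists, or every table partition is full.
    partition = TablePartition(blocks=[Block()])
    storage.append(partition)
    return partition.blocks[0]
```

Calling `current_block([])` on an empty storage list creates one table partition with one block, which corresponds to the first case.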
So that the write data can be stored in an orderly manner when the storage space has no usable table partition, as a preferred embodiment, creating a block containing column storage files in the table partition when no unfilled block exists specifically includes: when no unfilled table partition exists, creating a new current table partition, and creating the block containing the column storage files in the current table partition.
For step S11, after the write data passes verification, the current data is split by column (that is, each row of the current data is split into different columns according to data type, and each column contains data of a single data type), and the columns are then written concurrently into the current column storage files. After the write, each column storage file stores one kind of data. For example, if the current data is split into three columns, time, age and house number, the column storage file for time stores only time data, the column storage file for age stores only age data, and the column storage file for house number stores only house numbers, and data belonging to the same row remains in correspondence across the files. Verifying the write data means checking whether the table exists, whether the fields exist, whether the columns exist, whether the data types are consistent, whether the data exceeds the permitted length, and so on.
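A minimal sketch of the column split may help here. The function name `split_by_columns` and the sample rows are illustrative assumptions; the sketch keeps each column in an in-memory list instead of writing real column storage files, and it omits the verification and concurrent-write steps.

```python
def split_by_columns(rows, column_names):
    """Turn row-oriented records into one list of values per column."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name in column_names:
            # Missing values are kept as None so that row positions stay aligned.
            columns[name].append(row.get(name))
    return columns


# Illustrative rows only; the values are made up for this sketch.
rows = [
    {"time": "2017-09-18 10:00:00", "age": 31, "house_number": 12},
    {"time": "2017-09-18 10:00:05", "age": 27, "house_number": 7},
]
columns = split_by_columns(rows, ["time", "age", "house_number"])

# The same list position in every column belongs to the same original row.
row_1 = {name: values[1] for name, values in columns.items()}
```

Because every column list has the same length and order, a row can be reassembled by reading the same position from each column, which is what "data belonging to the same row remains in correspondence" amounts to in this sketch.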
Suppose one column storage file can store 100,000 rows of data and one table partition can contain at most 100 blocks, where the column storage files within a block are distinct from one another and correspond to one another row by row. In a specific implementation, if the storage space has no table partition and a write request for 1,010,000 rows is received, 100,000 rows of the write data are taken as the current data. Before the current data is written, one table partition is created as the current table partition and a block is created in it as the current block; the current data is then written concurrently, column by column, into the current column storage files of the current block. After that, another block is created in the current table partition as the current block, and the next 100,000 unwritten rows are written into its column storage files in the same way, and so on, until 1,000,000 rows have been written into 10 blocks and 10,000 rows remain. One more block is then created in the current table partition, the remaining 10,000 rows are written column by column into its column storage files as the current data, and all the write data has been written to the storage space, which completes the data storage.
In another embodiment, one table partition containing 11 blocks already exists, and the 11th block stores only 10,000 rows. When a write request for 2,060,000 rows is received, 90,000 rows of the write data are written concurrently, column by column, as the current data into the current column storage files of the 11th block. A new block is then created in the current table partition as the current block, the next 100,000 unwritten rows are written into its column storage files as the current data, and so on, until 1,990,000 rows have been written into 20 blocks and 70,000 rows remain. One more block is then created in the current table partition, the remaining 70,000 rows are written column by column into its column storage files as the current data, and all the write data has been written to the storage space, which completes the data storage.
In another embodiment, a table partition containing 31 blocks already exists in the storage space, and the 31st block stores only 70,000 rows. When a write request for 9,000,000 rows is received, 30,000 rows of the write data are written concurrently, column by column, as the current data into the column storage files of the 31st block. A new block is then created in the current table partition as the current block, the next 100,000 unwritten rows are written into its column storage files as the current data, and so on, until 6,930,000 rows have been written into 70 blocks, at which point the current table partition is full and 2,070,000 rows remain. A new table partition is then created as the current table partition, and a block is created in it as the current block. 100,000 of the remaining 2,070,000 rows are written column by column into the current column storage files of that block; another block is then created in the current table partition as the current block, and 100,000 of the remaining 1,970,000 rows are written into it, and so on, until 2,000,000 rows have been written into 20 blocks and 70,000 rows remain. One more block is then created in the current table partition, the remaining 70,000 rows are written column by column into its column storage files as the current data, and all the write data has been written to the storage space, which completes the data storage.
Of course, when other write requests are received, the same rules as in the three embodiments above apply: before the current data is written, when no unfilled block exists, a block containing column storage files is created in a table partition.
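The row counts in the three embodiments can be checked with a short simulation. This is a sketch under the embodiments' stated limits (100,000 rows per column storage file, at most 100 blocks per table partition); the function `plan_batches` and its tuple layout are hypothetical names introduced only for this check.

```python
ROWS_PER_BLOCK = 100_000      # rows one column storage file can hold
BLOCKS_PER_PARTITION = 100    # blocks one table partition can hold


def plan_batches(blocks_in_partition, rows_in_last_block, rows_to_write):
    """Return (partition_no, block_no, batch_rows) for every batch of current data.

    partition_no 0 is the existing (or first) table partition; block_no is 1-based.
    """
    plan = []
    partition_no = 0
    block_no = blocks_in_partition
    used = rows_in_last_block
    if block_no == 0:
        used = ROWS_PER_BLOCK  # no block exists yet, so force a new one
    while rows_to_write > 0:
        if used >= ROWS_PER_BLOCK:
            if block_no >= BLOCKS_PER_PARTITION:
                # The table partition is full: start a new table partition.
                partition_no += 1
                block_no = 0
            block_no += 1
            used = 0
        batch = min(ROWS_PER_BLOCK - used, rows_to_write)
        plan.append((partition_no, block_no, batch))
        used += batch
        rows_to_write -= batch
    return plan


first = plan_batches(0, 0, 1_010_000)         # empty storage space
second = plan_batches(11, 10_000, 2_060_000)  # 11 blocks, last holds 10,000 rows
third = plan_batches(31, 70_000, 9_000_000)   # 31 blocks, last holds 70,000 rows
```

In this simulation the first request ends with 10,000 rows in block 11, the second starts with 90,000 rows in block 11 and ends with 70,000 rows in block 31, and the third fills the existing table partition with 6,930,000 rows before opening a second table partition for the remaining 2,070,000 rows.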
In addition, it should be noted that the designated partition mentioned in step S13 must correspond to a data column that stores substantially ordered data.
For step S12, the current data having been written means either that the current block is full or that all the write data has been written to the storage space. So that query data can be retrieved quickly once a query request is received, a block index is created immediately for each current column storage file in the current block; that is, each column storage file has its own block index, which records the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in that column storage file. After the table partition containing the query data has been determined, the block index is used to decide whether the query data may be in the corresponding column storage file: the query data is checked against the maximum and minimum recorded in the block index, and the other information recorded in the block index is used to check whether the data in the column storage file can satisfy the other query conditions. The column storage file containing the query data is thus located quickly from the block index, and the query data is retrieved from that file only, which narrows the query range and improves the data query speed. Here, the query data is the data that satisfies the query condition.
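As a rough illustration, the six statistics of a block index can be computed in one pass over a numeric column. The names `BlockIndex`, `build_block_index` and `may_contain` are hypothetical, and the sketch assumes that "total amount of stored data" means the count of non-empty values and that empty values are represented as `None`; the patent text does not fix either detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockIndex:
    minimum: Optional[float]
    maximum: Optional[float]
    stored_count: int          # assumed: number of non-empty values
    empty_count: int           # number of empty (None) values
    total: float               # sum of the stored values
    average: Optional[float]   # average of the stored values


def build_block_index(values):
    """Summarise one column storage file (here: a list) into a block index."""
    present = [v for v in values if v is not None]
    total = sum(present)
    return BlockIndex(
        minimum=min(present) if present else None,
        maximum=max(present) if present else None,
        stored_count=len(present),
        empty_count=len(values) - len(present),
        total=total,
        average=total / len(present) if present else None,
    )


def may_contain(index, value):
    """Cheap pre-check: can this column storage file hold the queried value?"""
    if index.minimum is None:
        return False
    return index.minimum <= value <= index.maximum


index = build_block_index([3, None, 9, 6])
```

A `False` result from `may_contain` lets the query skip the file entirely; a `True` result only means the file has to be read, since the value may still be absent.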
Similarly, for step S13, after all the write data has been written, a partition index corresponding to the designated partition of the table partition is created immediately, so that query data can be retrieved quickly once a query request is received. Because the designated partition corresponds to a data column that stores substantially ordered data, recording the maximum data and the minimum data of the designated partition in the partition index is enough to indicate roughly what data the table partition stores. By checking whether the query data lies between the minimum data and the maximum data recorded in the partition index, it can be determined whether the query data may be in the corresponding table partition. After a query request is received, the table partition containing the query data can therefore be located quickly from the partition index, and the query data is searched for only in that table partition, which narrows the query range and improves the data query speed. The maximum data and the minimum data are both index-type data, and it is understood that the sum of the stored data and the average value of the stored data are also index-type data.
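The two-level lookup, partition index first and block index second, can be sketched as follows. The dictionary layout, the file names and the function names `overlaps` and `files_to_scan` are assumptions made for illustration; the sketch handles a range query on the designated column and keeps only the minimum and maximum of each index.

```python
def overlaps(bounds, low, high):
    """True if the (minimum, maximum) pair intersects the query range [low, high]."""
    minimum, maximum = bounds
    return minimum <= high and low <= maximum


def files_to_scan(partitions, low, high):
    """Return the column storage files that may hold values in [low, high]."""
    hits = []
    for partition in partitions:
        # Level 1: skip whole table partitions using the partition index.
        if not overlaps(partition["partition_index"], low, high):
            continue
        # Level 2: skip column storage files using their block indexes.
        for file_name, bounds in partition["block_indexes"].items():
            if overlaps(bounds, low, high):
                hits.append(file_name)
    return hits


# Illustrative layout: two table partitions over a substantially ordered column.
partitions = [
    {"partition_index": (0, 999),
     "block_indexes": {"tp0/b0/time.col": (0, 499), "tp0/b1/time.col": (500, 999)}},
    {"partition_index": (1000, 1999),
     "block_indexes": {"tp1/b0/time.col": (1000, 1499), "tp1/b1/time.col": (1500, 1999)}},
]
narrow = files_to_scan(partitions, 1200, 1300)
```

Because the designated column is substantially ordered, the ranges of different table partitions barely overlap, so most queries touch only one or two table partitions in a layout like this.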
In addition, it should be noted that when the data in the column storage file is updated, the partition index and the block index corresponding to the column storage file are updated in real time.
It can be seen that, when data is stored with the data storage method provided by this embodiment, the partition index records only the maximum data and the minimum data in the designated partition of a table partition, and the block index records only the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in a column storage file. The storage space occupied by the partition index and the block index is therefore very small, so that even when mass data is stored the index files can essentially be kept in memory, which avoids the frequent memory replacement caused by a huge index file. When a user queries data, the table partition satisfying the query condition can be located quickly using the partition index in memory; once the table partition is located, the column storage files satisfying the query condition are located quickly using the block indexes in memory, and the data satisfying the query condition is then retrieved only from the relevant column storage files. Therefore, when mass data is stored, the data storage method provided by this embodiment improves the data query speed by keeping the index files entirely in memory, and improves it further by narrowing the retrieval range to the relevant column storage files.
In a specific implementation, if the data in a column storage file is compressed synchronously while it is being written, the write speed drops and CPU spikes are likely; when write concurrency is high in particular, CPU usage often surges, which in turn slows data queries. Therefore, in order to save storage space while increasing both the write speed and the query speed, this embodiment further improves on the above embodiment by compressing the data in full column storage files asynchronously.
Fig. 3 is a flowchart of another data storage method according to an embodiment of the present invention. As shown in fig. 3, as a preferred embodiment, after the step S13 is executed on the basis of fig. 1, the method further includes:
S30: after the write data has been written, scan the written file data in the current storage directory.
S31: when a column storage file in a full block is found to store uncompressed data, compress the uncompressed data.
It should be noted that, for step S30, a daemon thread may be used to scan the folders in the current storage directory and filter them by last modification time. When the last modification time of the folder being scanned is less than a preset time (e.g., 1 day) ago, the daemon thread goes on to scan the compression identifiers of the column storage files in the full blocks in that folder. For step S31, if the compression identifier of a column storage file in a full block is found to be the uncompressed identifier, the uncompressed data stored in the column storage file is compressed; after the compression is completed, the compression identifier of the column storage file is changed to the compressed identifier, and the uncompressed data stored in the column storage file is deleted. The preset time is set in advance according to actual needs.
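One pass of such a background scan might look like the sketch below. Several details are assumptions made only for illustration and are not described in the patent: each block is a folder, a full block is marked by an empty file named `FULL`, column storage files end in `.col`, the `.gz` suffix serves as the compressed identifier, and gzip is the compression algorithm.

```python
import gzip
import os
import shutil
import time


def compress_full_blocks(storage_dir, max_age_seconds=24 * 3600):
    """Run one scan pass (S30/S31) and return the paths of newly compressed files."""
    compressed = []
    now = time.time()
    for folder in sorted(os.listdir(storage_dir)):
        folder_path = os.path.join(storage_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        # Only folders modified within the preset time are examined further.
        if now - os.path.getmtime(folder_path) >= max_age_seconds:
            continue
        # Hypothetical convention: a FULL marker file flags a full block.
        if not os.path.exists(os.path.join(folder_path, "FULL")):
            continue
        for name in sorted(os.listdir(folder_path)):
            if not name.endswith(".col"):
                continue  # already compressed, or not a column storage file
            source = os.path.join(folder_path, name)
            target = source + ".gz"
            with open(source, "rb") as fin, gzip.open(target, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            os.remove(source)  # delete the uncompressed data after compression
            compressed.append(target)
    return compressed
```

To keep compression off the write path, a function like this could be called periodically from a thread started with `threading.Thread(target=..., daemon=True)`, which is the Python counterpart of the daemon thread mentioned in the text.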
Moreover, it is worth noting that when data query is performed, if only a certain data needs to be acquired, only a certain column of storage files need to be decompressed, and all the column of storage files do not need to be decompressed, so that the speed of data query can be further increased.
Therefore, by adopting asynchronous compression, the data storage method provided by this embodiment saves storage space without affecting data writing, and avoids the CPU peaks that would slow down data queries, so the data query speed is improved.
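The asynchronous compression pass of steps S30 and S31 can be sketched as follows. This is only an illustration in Python under stated assumptions, not the implementation described by the invention: the file-name convention (`*.col` for uncompressed data, `*.col.gz` standing in for the compressed identifier), the `FULL` marker file for a full block, and the function name `compress_full_blocks` are all hypothetical.

```python
import gzip
import os
import shutil
import tempfile
import time

PRESET_SECONDS = 24 * 60 * 60  # preset time from the text (e.g., 1 day)

def compress_full_blocks(storage_dir, now=None):
    """Compress uncompressed column files in full, recently modified block folders."""
    now = time.time() if now is None else now
    compressed = []
    for folder in sorted(os.listdir(storage_dir)):
        block_dir = os.path.join(storage_dir, folder)
        if not os.path.isdir(block_dir):
            continue
        # S30: filter folders by last modification time.
        if now - os.path.getmtime(block_dir) >= PRESET_SECONDS:
            continue
        # Hypothetical convention: a "FULL" marker file means the block is full.
        if not os.path.exists(os.path.join(block_dir, "FULL")):
            continue
        for name in sorted(os.listdir(block_dir)):
            # Hypothetical convention: "*.col" means "not yet compressed".
            if not name.endswith(".col"):
                continue
            src = os.path.join(block_dir, name)
            dst = src + ".gz"
            # S31: compress, then delete the uncompressed copy.
            with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            os.remove(src)
            compressed.append(dst)
    return compressed

# Usage: a throwaway storage directory with one full block.
root = tempfile.mkdtemp()
block = os.path.join(root, "block_0")
os.mkdir(block)
open(os.path.join(block, "FULL"), "w").close()
with open(os.path.join(block, "age.col"), "wb") as f:
    f.write(b"1\n2\n3\n")
result = compress_full_blocks(root)
```

In a long-running service such a pass could be called periodically from a background thread created with `threading.Thread(..., daemon=True)`, so that compression never runs on the write path.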
In a specific implementation, in order to increase the query speed of data, in addition to locating the search range to a table partition through the partition index and narrowing it to a column storage file through the block index, a more detailed index can be established to narrow the search range further. Therefore, this embodiment further improves on the above embodiment by establishing a range segment index corresponding to each column storage file. The range segment index marks which range segments of data are contained in the corresponding column storage file, so that when a user queries data, the column storage file where the query data is located can be determined through the range segment index. For example, suppose the minimum data stored in a column storage file is 0 and the maximum data is 749, but the data 400 is not stored. When the user queries the data 400 and the query data is retrieved according to the block index alone, the column storage file is considered to possibly contain the data 400, and all the data stored in the entire column storage file is retrieved, yet the data 400 is not found in the end.
If a range segment index is established, it marks which range segments have data in the column storage file. If the range segment index marks that the column storage file holds no data in the range segment 250-499, then when the user queries the data 400, although 400 lies between the maximum data and the minimum data stored in the column storage file, the column storage file is deemed not to contain the data 400 and no retrieval over its data is needed. Instead, only the data stored in those other column storage files whose range segment indexes mark the range segment 250-499 as 1 is retrieved, which narrows the retrieval range of the data and further improves the query speed of the data.
Fig. 4 is a flowchart of another data storage method according to an embodiment of the present invention. As shown in fig. 4, as a preferred embodiment, on the basis of fig. 1, the method further includes:
S40: after the current data is written, calculating the difference value between the maximum data and the minimum data in the current column storage file, and uniformly dividing the difference value into N range segments.
S41: and establishing a range segment index corresponding to each current column storage file in the current block, and marking the distribution condition of the numerical data in the current column storage file in the range segment index.
It should be noted that, in this embodiment, step S40 is arranged to be executed after step S11 and step S41 is arranged to be executed after step S12; in a specific implementation, however, step S40 may be executed at any point after step S11, and step S41 may be executed at any point after step S40. The distribution of the numerical data refers to the range segments over which the data in the column storage file are distributed. Moreover, when the data in a column storage file is updated, the range segment index corresponding to that column storage file is also updated in real time.
In addition, N mentioned in step S40 is a positive integer. The finer the desired range segment index, the larger N can be set; however, the larger N is, the more storage space the range segment index occupies, so the value of N can be set in advance according to actual needs. Generally, when one table partition includes 100 blocks and a column storage file in a block can store 100,000 rows of data, N may be set to 1024, that is, the difference between the maximum data and the minimum data stored in a column storage file is uniformly divided into 1024 range segments. The range segment index, which records for each of the 1024 range segments whether the column storage file holds data in it, then occupies only about one percent of the storage space occupied by all the data in the column storage file.
In a specific implementation, for step S41, a range segment in which the column storage file holds data may be marked with 1, and a range segment in which it holds no data may be marked with 0; for example, the range segment of 0-249 is marked with 1, which indicates that the column storage file includes data of 0-249, the range segment of 250-.
Therefore, with a range segment index that occupies very little storage space, the data storage method provided by this embodiment can further narrow the data retrieval range during a data query and improve the data query speed.
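The range segment index of steps S40 and S41 can be illustrated with the 0 to 749 example from the text. The following Python sketch is a minimal illustration under assumptions: the function names and the list-of-bits representation are hypothetical, and N is set to 3 only so that the segments match 0-249, 250-499 and 500-749.

```python
def segment_of(value, lo, hi, n):
    """Return the number (0-based) of the range segment that value falls into."""
    if hi == lo:
        return 0
    # Divide the difference (hi - lo) evenly into n segments; the maximum
    # itself is clamped into the last segment.
    return min(int((value - lo) * n / (hi - lo)), n - 1)

def build_range_segment_index(values, n):
    """S40/S41: mark with 1 every range segment that holds at least one value."""
    lo, hi = min(values), max(values)
    bits = [0] * n
    for v in values:
        bits[segment_of(v, lo, hi, n)] = 1
    return lo, hi, bits

def may_contain(index, value):
    """False means the column storage file certainly does not hold the value."""
    lo, hi, bits = index
    if value < lo or value > hi:
        return False
    return bits[segment_of(value, lo, hi, len(bits))] == 1

# Column storage file with minimum 0, maximum 749 and nothing in 250-499.
index = build_range_segment_index([0, 100, 249, 500, 749], 3)
```

Here `may_contain(index, 400)` is false, so the file can be skipped without scanning it, whereas a check against the minimum and maximum alone would have forced a full retrieval. A true result only means that the segment is occupied, so the file still has to be searched.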
In a specific implementation, the data storage method provided by the present invention is also suitable for storing character string data. This embodiment further improves on the above embodiment by establishing a character bit index corresponding to each column storage file, which marks the positions at which a given character appears in the character string data stored in that column storage file, so that when a user queries character string data, the column storage file where the queried character string is located can be quickly located through the character bit index.
Fig. 5 is a flowchart of another data storage method according to an embodiment of the present invention. As shown in fig. 5, as a preferred embodiment, in addition to fig. 4, the method further includes:
S50: after the current data is written, establishing character bit indexes corresponding to the current column storage files in the current block, and marking the distribution condition of character type data in the current column storage files in the character bit indexes.
It should be noted that, in this embodiment, step S50 is arranged to be executed after step S41; in a specific implementation, however, step S50 may be executed at any point after step S11. The distribution of character type data refers to the positions at which a given character appears in the character string data stored in the column storage file. When the data in a column storage file is updated, the character bit index corresponding to that column storage file is also updated in real time.
In a specific implementation, for step S50, 0 and 1 may be used to mark whether a given character in the data stored in the column storage file appears at a given position. For example, if character A has 64 possible positions, the positions are numbered from 1 to 64, that is, the M-th position (1 ≤ M ≤ 64) is numbered M; each position at which A appears is marked 1, and each position at which A does not appear is marked 0. The other characters are marked by the same rule as character A, and the storage space occupied by a character bit index built on this marking rule is only one percent of the storage space occupied by the data in the column storage file. Specifically, when a user needs to query the character string ABC, it is only necessary to determine whether A appears at position 1, B appears at position 2, and C appears at position 3, that is, whether the marks for A at position 1, B at position 2, and C at position 3 are all 1. If any one of these marks is 0, it is determined that the data in the currently searched column storage file does not contain ABC. Whether the query data may exist in the currently searched column storage file can therefore be determined from the character bit index alone, without retrieving all the data in every column storage file, which further narrows the retrieval range of the data and improves the query speed of the data.
Therefore, with a character bit index that occupies very little storage space, the data storage method provided by this embodiment can further narrow the data retrieval range during a data query and improve the data query speed.
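The character bit index of step S50 can be sketched as one 64-bit mask per character. The Python below is an illustration under assumptions: the dictionary-of-integers layout and the function names are hypothetical, and positions are 0-based in the code while the text numbers them from 1.

```python
MAX_POSITIONS = 64  # number of indexed positions, as in the example in the text

def build_char_bit_index(strings):
    """Map each character to a bit mask; bit p is 1 if the character
    occurs at position p + 1 in at least one stored string."""
    index = {}
    for s in strings:
        for pos, ch in enumerate(s[:MAX_POSITIONS]):
            index[ch] = index.get(ch, 0) | (1 << pos)
    return index

def may_contain_string(index, query):
    """False means the column storage file certainly does not hold the query string."""
    for pos, ch in enumerate(query[:MAX_POSITIONS]):
        if not (index.get(ch, 0) >> pos) & 1:
            return False  # one mark is 0, so the string cannot be present
    return True

# Column storage file holding two strings.
char_index = build_char_bit_index(["ABD", "XBC"])
```

With this index, `may_contain_string(char_index, "CBA")` is false because no stored string has C at position 1. Note that `may_contain_string(char_index, "ABC")` is true even though ABC is not stored: the marks are combined across all strings in the file, so a positive answer only means the file must still be searched.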
Moreover, in a specific implementation, when the user specifies multiple query conditions, the partition index may first be used to determine the table partitions whose stored data may satisfy the first query condition. The block index, the range segment index and the character bit index are then used to continue the search within those table partitions and determine the first strongly relevant column storage files, in which all of the stored data satisfies the first query condition, and the first relevant column storage files, in which part of the stored data satisfies the first query condition. Next, the block index, the range segment index and the character bit index are used to continue the search within the first strongly relevant column storage files or the first relevant column storage files and determine the second strongly relevant column storage files, in which all of the stored data satisfies the second query condition, and the second relevant column storage files, in which part of the stored data satisfies the second query condition, and so on. Then, when only the number of data items satisfying the query conditions is queried, only the data stored in the relevant column storage files needs to be decompressed and retrieved. The data stored in the strongly relevant column storage files needs no decompression or retrieval, because the number of data items satisfying the query conditions can be determined from the indexes corresponding to those files alone. Unnecessary decompression and retrieval are thus avoided, the retrieval range is narrowed, and the query speed of the data is improved.
Moreover, even when the query data itself is requested, often only a certain number of data items needs to be obtained, so not every column storage file has to be decompressed and retrieved. If the query data contained in the strongly relevant column storage files already meets the query requirement, decompression and retrieval of the data stored in the relevant column storage files can be avoided, which further narrows the data retrieval range and improves the data retrieval speed.
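The distinction between strongly relevant and relevant column storage files can be sketched for a count query. The Python below is a simplified illustration under assumptions: it handles a single numeric range condition rather than several conditions, and the dictionary layout, the `count` field (taken to be the total of stored, non-empty data) and the `load_values` callback are hypothetical.

```python
def count_in_range(blocks, lo, hi, load_values):
    """Count stored values v with lo <= v <= hi.

    Returns (count, number of column storage files that had to be loaded).
    """
    total = 0
    loaded = 0
    for block in blocks:
        idx = block["index"]
        if idx["max"] < lo or idx["min"] > hi:
            continue  # not relevant: no stored value can match
        if lo <= idx["min"] and idx["max"] <= hi:
            # Strongly relevant: every stored value matches, so the count
            # comes from the index without decompressing the file.
            total += idx["count"]
        else:
            # Relevant: only part of the data matches, so the file is loaded.
            loaded += 1
            total += sum(1 for v in load_values(block["name"]) if lo <= v <= hi)
    return total, loaded

# Three column storage files and their (hypothetical) block indexes.
files = {"a": [1, 2, 3], "b": [4, 5, 6, 7], "c": [20, 30]}
blocks = [
    {"name": name, "index": {"min": min(v), "max": max(v), "count": len(v)}}
    for name, v in files.items()
]
answer = count_in_range(blocks, 2, 10, files.__getitem__)
```

For the range 2 to 10, file `b` is strongly relevant and is counted from its index, file `a` is merely relevant and is the only one loaded, and file `c` is skipped.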
In order to further increase the query speed of the data, the indexes corresponding to updated data in a column storage file can be loaded directly into the JAVA virtual machine memory, so that when the updated data needs to be queried, the indexes corresponding to the column storage file where the updated data is located do not have to be fetched from the memory. As a preferred implementation, the data storage method provided in the foregoing embodiment further includes: after the current data is written, loading each index corresponding to the current column storage file into the JAVA virtual machine memory; when a query request is received, if the JAVA virtual machine memory stores the index corresponding to the column storage file where the current query data is located, retrieving the current query data according to the index stored in the JAVA virtual machine memory; and if the JAVA virtual machine memory does not store that index, retrieving the current query data according to the index stored in the memory. Specifically, whenever the data in a column storage file is updated, the index corresponding to the column storage file in which the updated data is located may be loaded into the JAVA virtual machine memory, and the amount of index data that can be loaded into the JAVA virtual machine memory may be changed through the configuration file. In addition, when the amount of index data loaded in the JAVA virtual machine memory reaches the maximum value, entries can be replaced using the least recently used (LRU) algorithm.
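The bounded index cache with LRU replacement can be sketched as follows. The sketch is written in Python purely for illustration, whereas the text places the cache in the JAVA virtual machine memory; the class name, the `capacity` parameter (standing in for the value read from the configuration file) and the `load_fallback` callback are hypothetical.

```python
from collections import OrderedDict

class LruIndexCache:
    """Keeps at most `capacity` indexes, evicting the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed to come from a configuration file
        self._entries = OrderedDict()     # file name -> index, oldest first

    def put(self, file_name, index):
        """Called after data is written or updated in a column storage file."""
        self._entries[file_name] = index
        self._entries.move_to_end(file_name)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def get(self, file_name, load_fallback):
        """Return the cached index, or fall back to the slower copy on a miss."""
        if file_name in self._entries:
            self._entries.move_to_end(file_name)  # mark as recently used
            return self._entries[file_name]
        return load_fallback(file_name)

cache = LruIndexCache(capacity=2)
cache.put("a.col", {"min": 0, "max": 9})
cache.put("b.col", {"min": 10, "max": 19})
hit = cache.get("a.col", lambda name: None)   # "a.col" becomes most recently used
cache.put("c.col", {"min": 20, "max": 29})    # evicts "b.col"
miss = cache.get("b.col", lambda name: "from slower storage")
```

On a miss the sketch only returns the fallback copy and does not insert it, which follows the text in loading indexes on write or update; caching on read as well would be an equally plausible design.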
In order to further improve the query speed of data, the data stored in recently queried column storage files can be kept in the JAVA off-heap memory. When a user again requests data stored in a recently queried column storage file, the decompressed data can be obtained directly from the JAVA off-heap memory, without fetching the column storage file from the hardware storage device or the memory and decompressing the compressed data in it to obtain the query data, which saves data query time and improves the query speed of the data. As a preferred implementation, the data storage method provided in the foregoing embodiment further includes: after the query data is obtained, storing all decompressed data in the column storage file where the query data is located in the JAVA off-heap memory; when a query request is received, if the current query data is stored in the JAVA off-heap memory, obtaining the current query data from the JAVA off-heap memory; and if the current query data is not stored in the JAVA off-heap memory, acquiring the column storage file from the hardware storage device or the memory and decompressing the compressed data in the column storage file to obtain the current query data. The amount of data stored in the JAVA off-heap memory can also be modified through the configuration file, and when the amount of data stored in the JAVA off-heap memory reaches the maximum value, entries can be replaced using the LRU algorithm.
In addition, it is noted that the maximum data amount of the cache data in the JAVA off-heap memory may be dynamically adjusted according to the total amount and the used amount of the JAVA off-heap memory, for example, the usage of the JAVA off-heap memory is detected every ten seconds, and when the usage of the JAVA off-heap memory is greater than eighty percent, the maximum data amount is reduced by ten percent; when the used amount of the JAVA off-heap memory is less than seventy percent, the maximum data amount is increased by ten percent.
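The adjustment rule above can be written as a small function. This Python sketch is only illustrative: the function name and the choice of integer arithmetic are assumptions, and in practice it would be called from whatever periodic check (every ten seconds in the example) the system uses.

```python
def adjust_cache_limit(limit, used, total):
    """Return the new maximum cache size given off-heap usage.

    limit: current maximum amount of cached data
    used, total: used amount and total amount of off-heap memory
    """
    if used * 100 > total * 80:      # usage above eighty percent
        return limit * 9 // 10       # shrink the limit by ten percent
    if used * 100 < total * 70:      # usage below seventy percent
        return limit * 11 // 10      # grow the limit by ten percent
    return limit                     # between 70% and 80%: leave unchanged

shrunk = adjust_cache_limit(1000, used=850, total=1000)
grown = adjust_cache_limit(1000, used=600, total=1000)
```

The band between seventy and eighty percent acts as a dead zone, so the limit does not oscillate when usage hovers around a single threshold.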
In a specific implementation, character strings in JAVA are stored in Unicode (UTF-16) encoding, which uses 2 or 4 bytes to represent one character, so storing the decompressed data of a column storage file in the JAVA off-heap memory as JAVA character strings would waste a lot of storage space. For this reason, as a preferred embodiment, storing all the decompressed data in the column storage file where the query data is located in the JAVA off-heap memory specifically includes: storing the decompressed data in the JAVA off-heap memory in the form of a byte stream. Correspondingly, obtaining the current query data from the JAVA off-heap memory specifically includes: retrieving the current query data from the JAVA off-heap memory in the form of the byte stream; and converting the retrieved data into character string form to obtain the current query data in the form of a character string.
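Storing strings as a byte stream and converting them back on retrieval can be sketched as below. This is a Python illustration under assumptions: the text does not name a byte encoding, so UTF-8 and the 4-byte length prefix are choices made here, and a plain `bytes` object stands in for the JAVA off-heap buffer.

```python
import struct

def pack_strings(strings):
    """Serialize strings into one byte stream: 4-byte length, then UTF-8 bytes."""
    buf = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        buf += struct.pack(">I", len(data))  # big-endian unsigned length prefix
        buf += data
    return bytes(buf)

def unpack_strings(buf):
    """Read the byte stream back and convert each entry to a character string."""
    out = []
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        out.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return out

column = ["alice", "bob", "数据"]
stream = pack_strings(column)
restored = unpack_strings(stream)
```

For ASCII text the saving is easy to see: `"alice"` takes 5 bytes in UTF-8 but 10 bytes in UTF-16, which is the kind of overhead the byte stream form avoids.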
In order to further increase the data query speed, recent query data may be stored in the JAVA virtual machine memory. Similarly, the amount of query data that may be stored in the JAVA virtual machine memory can be modified through the configuration file, and when the amount of query data stored in the JAVA virtual machine memory reaches the maximum amount, entries can be replaced using the LRU algorithm. In addition, it should be noted that query data stored in the JAVA virtual machine memory that is not reused within a certain time is cleared from the JAVA virtual machine memory. Specifically, as a preferred embodiment, the data storage method further includes: each time query data is obtained, writing the query data into the JAVA virtual machine memory, and clearing the query data stored in the JAVA virtual machine memory if it is not used within a preset time.
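Clearing query data that has not been used within a preset time can be sketched as a small expiring cache. The Python below is illustrative only: the class name, the `ttl_seconds` parameter (the preset time) and the injectable `clock` are hypothetical, and the size limit with LRU replacement mentioned in the text is left out to keep the sketch short.

```python
import time

class ExpiringQueryCache:
    """Caches query results and drops entries unused for ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # query key -> (result, time of last use)

    def put(self, key, result):
        """Called each time query data is obtained."""
        self._entries[key] = (result, self.clock())

    def get(self, key):
        """Return the cached result, or None if absent or expired."""
        self.evict_expired()
        if key not in self._entries:
            return None
        result, _ = self._entries[key]
        self._entries[key] = (result, self.clock())  # using an entry refreshes it
        return result

    def evict_expired(self):
        now = self.clock()
        stale = [k for k, (_, used) in self._entries.items()
                 if now - used >= self.ttl]
        for k in stale:
            del self._entries[k]

# Usage with a fake clock so the behaviour is easy to follow.
fake_now = [0.0]
cache = ExpiringQueryCache(ttl_seconds=10, clock=lambda: fake_now[0])
cache.put("age>30", [31, 45])
fake_now[0] = 5.0
first = cache.get("age>30")    # used at t=5, so the entry is refreshed
fake_now[0] = 14.0
second = cache.get("age>30")   # 9 seconds since last use: still cached
fake_now[0] = 30.0
third = cache.get("age>30")    # 16 seconds since last use: cleared
```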
The embodiments of the data storage method provided by the present invention have been described in detail above; the present invention also provides a data storage device corresponding to the method.
Fig. 6 is a structural diagram of a data storage device according to an embodiment of the present invention. As shown in fig. 6, the data storage device includes:
an establishing module 60, configured to establish a block including a column storage file in a table partition before current data is written in and when there is no unfilled block;
the writing module 61 is configured to split the current data according to columns, and write the current data into a current column storage file according to columns;
a block index creating module 62, configured to create block indexes corresponding to the current column storage files in the current block after the current data is completely written in, and record, in the block indexes, the maximum data, the minimum data, the total stored data, the total empty data, the sum of the stored data, and the average value of the stored data in the current column storage file;
and a partition index creating module 63, configured to create, after all the written data are written, a partition index corresponding to the specified partition in the current table partition, and record, in the partition index, maximum data and minimum data in the specified partition.
Therefore, in the data storage device provided by this embodiment, when data is stored, the partition index stores only the maximum data and the minimum data in the specified partition of a table partition, and the block index stores only the maximum data, the minimum data, the total amount of stored data, the total amount of empty data, the sum of the stored data and the average value of the stored data in a column storage file. The storage space occupied by the partition index and the block index is therefore very small, and even when mass data is stored, the index files can essentially be kept in the memory, which avoids the frequent memory replacement caused by huge index files. When a user queries data, the table partition satisfying the query condition can be quickly located by directly using the partition index in the memory; after the table partition is located, the column storage files satisfying the query condition are quickly located using the block index stored in the memory, and the data satisfying the query condition is then searched only in the relevant column storage files. Therefore, when mass data is stored, the data storage device provided by this embodiment not only improves the data query speed by keeping the index files entirely in the memory, but also narrows the data search range by searching only the relevant column storage files, thereby further improving the data query speed.
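The contents of the block index and the partition index, and the way they prune a lookup, can be sketched as follows. This Python sketch is an illustration under assumptions: the dictionary field names and function names are hypothetical, `None` stands for empty data, and the partition index is derived here from the block indexes of one partition, which is one plausible way to obtain its maximum and minimum.

```python
def build_block_index(column_values):
    """Summarize one column storage file; None marks empty data.
    Assumes the file holds at least one non-empty value."""
    stored = [v for v in column_values if v is not None]
    total = sum(stored)
    return {
        "max": max(stored),
        "min": min(stored),
        "stored_count": len(stored),
        "empty_count": len(column_values) - len(stored),
        "sum": total,
        "avg": total / len(stored),
    }

def build_partition_index(block_indexes):
    """The partition index keeps only the maximum and minimum data."""
    return {
        "max": max(b["max"] for b in block_indexes),
        "min": min(b["min"] for b in block_indexes),
    }

def may_hold(index, value):
    """True if value lies between the recorded minimum and maximum."""
    return index["min"] <= value <= index["max"]

block_a = build_block_index([3, None, 9, 6])
block_b = build_block_index([20, 40])
partition = build_partition_index([block_a, block_b])
```

A lookup for the value 50 stops at the partition index, since `may_hold(partition, 50)` is false. A lookup for 30 passes the partition index, then `may_hold(block_a, 30)` is false and only the column storage file behind `block_b` needs to be searched.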
The data storage method and device provided by the invention are described in detail above. The embodiments are described in a progressive mode in the specification, the emphasis of each embodiment is different from that of other embodiments, and the same and similar parts among the embodiments are referred to each other.
It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. A method of storing data, comprising:
before current data is written in, when there is no unfilled block, establishing a block containing column storage files in a table partition;
splitting the current data according to columns, and writing the current data into a current column storage file according to columns;
after the current data is written, establishing a block index corresponding to each current column storage file in a current block, and recording the maximum data, the minimum data, the total stored data, the total empty data, the sum of each stored data and the average value of the stored data in the current column storage file in the block index;
after all written data are written, establishing a partition index corresponding to a specified partition in a current table partition, and recording maximum data and minimum data in the specified partition in the partition index;
after the current data is written, calculating the difference value between the maximum data and the minimum data in the current column storage file, and uniformly dividing the difference value into N range segments;
establishing a range segment index corresponding to each current column storage file in the current block, and marking the distribution condition of numerical data in the current column storage file in the range segment index;
wherein N is a positive integer.
2. The data storage method of claim 1, wherein when there is no unfilled block, establishing a block containing column storage files in a table partition specifically comprises:
when the table partition is not full, newly building a current table partition, and establishing the block containing the column storage files in the current table partition.
3. The data storage method of claim 1, further comprising:
after the written data is written, scanning written file data in the current storage directory;
when a scanned column storage file in a full block stores uncompressed data, compressing the uncompressed data.
4. The data storage method of claim 1, further comprising:
after the current data is written, establishing character bit indexes corresponding to the current column storage files in the current block, and marking the distribution condition of character type data in the current column storage files in the character bit indexes.
5. The data storage method of claim 1, further comprising:
after the current data is written, loading each index corresponding to the current column storage file into a JAVA virtual machine memory;
when a query request is received, if the indexes corresponding to the column storage file where the current query data is located are stored in the JAVA virtual machine memory, retrieving the current query data according to the indexes stored in the JAVA virtual machine memory.
6. The data storage method of claim 5, further comprising:
after the current query data are obtained, storing all decompressed data in the column storage file in which the current query data are located in a JAVA off-heap memory;
when a new query request is received, if the current query data corresponding to the new query request is stored in the JAVA off-heap memory, the current query data corresponding to the new query request is acquired from the JAVA off-heap memory.
7. The data storage method according to claim 6, wherein the storing all the decompressed data in the column storage file in which the query data is located into the JAVA off-heap memory specifically comprises:
storing the decompressed data to the JAVA off-heap memory in a byte stream mode;
correspondingly, the obtaining the current query data from the JAVA off-heap memory specifically includes:
retrieving the current query data from the JAVA off-heap memory in the form of the byte stream;
and converting the current query data into current query data in a character string form, and acquiring the current query data in the character string form.
8. The data storage method of claim 5, further comprising:
after the current query data are obtained, writing the current query data into a JAVA virtual machine memory, and if the current query data stored in the JAVA virtual machine memory are not used within a preset time, clearing the current query data stored in the JAVA virtual machine memory;
correspondingly, when a query request for acquiring the current query data is received within the preset time, the current query data is directly acquired from the JAVA virtual machine memory.
9. A data storage device, comprising:
the establishing module is used for establishing a block containing column storage files in a table partition before the current data is written in and when there is no unfilled block;
the writing module is used for splitting the current data according to columns and writing the current data into a current column storage file according to columns;
a block index creating module, configured to create a block index corresponding to each current column storage file in a current block after the current data is completely written in, and record, in the block index, the maximum data, the minimum data, the total stored data, the total empty data, the sum of each stored data, and the average value of the stored data in the current column storage file;
the partition index creating module is used for creating a partition index corresponding to a specified partition section in a current table partition after all write-in data are written in, and recording maximum data and minimum data in the specified partition section in the partition index;
the data storage device is further configured to:
after the current data is written, calculating the difference value between the maximum data and the minimum data in the current column storage file, and uniformly dividing the difference value into N range segments;
establishing a range segment index corresponding to each current column storage file in the current block, and marking the distribution condition of numerical data in the current column storage file in the range segment index;
wherein N is a positive integer.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710842915.0A CN107577436B (en) 2017-09-18 2017-09-18 Data storage method and device

Publications (2)

Publication Number Publication Date
CN107577436A CN107577436A (en) 2018-01-12
CN107577436B true CN107577436B (en) 2020-07-07

Country Status (1)

Country Link
CN (1) CN107577436B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant