CN108197324B - Method and apparatus for storing data

Info

Publication number: CN108197324B
Application number: CN201810117471.9A
Authority: CN
Inventors: 陈浩; 牟宇航; 马如悦
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2018-02-06
Filing date: 2018-02-06
Publication date: 2021-07-16
Anticipated expiration: 2038-02-06
Also published as: CN108197324A

Abstract

Embodiments of the present application disclose a method and an apparatus for storing data. One embodiment of the method comprises: reading a plurality of sub data sets included in a target data set, wherein the sub data sets are obtained by dividing the target data set; storing each empty set among the plurality of sub data sets to a first storage space according to a predetermined first format; determining a hash value of each piece of data in the plurality of sub data sets to obtain a plurality of hash value sets; and, for each hash value set among the plurality of hash value sets that meets a first preset storage condition, storing each hash value in that set, together with the number of hash values in that set, to a second storage space according to a predetermined second format.

Description

Method and apparatus for storing data

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for storing data.

Background

Cardinality refers to the number of distinct data items in a data set. At present, cardinality statistics over big data generally require processing the original data. For different sets, each time the cardinality of a set is calculated, it is often necessary to run the same algorithm over the original data again, traversing it from the start.
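As a minimal illustration of the definition above, the exact cardinality of a small data set can be obtained by traversing the raw data and counting distinct values. The function name `exact_cardinality` and the sample data are hypothetical and are not part of the application.

```python
def exact_cardinality(data):
    """Count distinct values by traversing the raw data once."""
    seen = set()
    for item in data:
        seen.add(item)
    return len(seen)


# Five visits by three distinct users.
visits = ["u1", "u2", "u1", "u3", "u2"]
print(exact_cardinality(visits))  # prints 3
```

Every new query repeats this traversal over the original data, which is the cost the stored intermediate results described later are meant to avoid.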

Currently, cardinality calculation algorithms are widely applied in fields such as UV statistics in website traffic analysis (UV, the number of unique visitors, refers to the number of distinct users visiting a website in a period of time).

Disclosure of Invention

The embodiment of the application provides a method and a device for storing data.

In a first aspect, an embodiment of the present application provides a method for storing data, where the method includes: reading a plurality of sub data sets included in a target data set, wherein the sub data sets are obtained by dividing the target data set; storing each empty set among the plurality of sub data sets to a first storage space according to a predetermined first format; determining a hash value of each piece of data in the plurality of sub data sets to obtain a plurality of hash value sets; and, for each hash value set among the plurality of hash value sets that meets a first preset storage condition, storing each hash value in that set, together with the number of hash values in that set, to a second storage space according to a predetermined second format.

In some embodiments, the above method further comprises: for each hash value set in the plurality of hash value sets, determining a byte array corresponding to the hash value set, wherein the position of an element in the byte array is determined based on data on a predetermined number of bits in each hash value in the hash value set, the element in the byte array is determined based on the position of the first 1 in the binary data represented by the hash value in the hash value set, and the length of the byte array is predetermined.

In some embodiments, the above method further comprises: storing, according to a predetermined third format, the non-0 elements of each determined byte array that meets a second preset storage condition, together with the positions of those non-0 elements in the byte array, to a third storage space.

In some embodiments, the above method further comprises: storing each determined byte array that meets a third preset storage condition to a fourth storage space according to a predetermined fourth format.

In some embodiments, the above method further comprises: creating a materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space.

In some embodiments, the above method further comprises: determining the cardinality of the target data set according to the largest byte array among the determined byte arrays.

In a second aspect, an embodiment of the present application provides an apparatus for storing data, the apparatus including: the reading unit is configured to read a plurality of sub-data sets included in a target data set, wherein the sub-data sets are obtained by dividing the target data set; the first storage unit is configured to store an empty set in the plurality of sub-data sets into a first storage space according to a predetermined first format; the first determining unit is configured to determine a hash value of each data in the plurality of sub-data sets to obtain a plurality of hash value sets; and the second storage unit is configured to store each hash value in the hash value sets meeting the first preset storage condition and the number of the hash values in the hash value sets to a second storage space according to a predetermined second format.

In some embodiments, the above apparatus further comprises: and a second determining unit, configured to determine, for each of the plurality of sets of hash values, a byte array corresponding to the set of hash values, where a position of an element in the byte array is determined based on data on a predetermined number of bits in each of the hash values in the set of hash values, the element in the byte array is determined based on a position of a first 1 in the binary data represented by the hash value in the set of hash values, and a length of the byte array is predetermined.

In some embodiments, the above apparatus further comprises: a third storage unit configured to store, according to a predetermined third format, the non-0 elements of each determined byte array that meets the second preset storage condition, together with the positions of those non-0 elements in the byte array, to a third storage space.

In some embodiments, the above apparatus further comprises: a fourth storage unit configured to store each determined byte array that meets the third preset storage condition to a fourth storage space according to a predetermined fourth format.

In some embodiments, the above apparatus further comprises: a creating unit configured to create a materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space.

In some embodiments, the above apparatus further comprises: a third determining unit configured to determine the cardinality of the target data set according to the largest byte array among the determined byte arrays.

In a third aspect, an embodiment of the present application provides a server for storing data, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the embodiments of the method for storing data described above.

In a fourth aspect, the present application provides a computer-readable medium for storing data, on which a computer program is stored, which when executed by a processor implements the method of any one of the embodiments of the method for storing data as described above.

According to the method and the apparatus for storing data provided by the embodiments of the application, a plurality of sub data sets included in a target data set are read; empty sets among the sub data sets are then stored in a first storage space according to a predetermined first format; a hash value of each piece of data in the sub data sets is then determined to obtain a plurality of hash value sets; and finally, for each hash value set that meets a first preset storage condition, each hash value in the set and the number of hash values in the set are stored in a second storage space according to a predetermined second format, so that the flexibility of data processing is improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram to which embodiments of the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for storing data according to the present application;

FIG. 3A is a schematic illustration of an application scenario of a method for storing data according to the present application;

FIG. 3B is a schematic diagram of data storage in the application scenario of FIG. 3A;

FIG. 3C is a schematic illustration of yet another application scenario of a method for storing data according to the present application;

FIG. 3D is a schematic diagram of data storage in the application scenario of FIG. 3C;

FIG. 4 is a flow diagram of yet another embodiment of a method for storing data according to the present application;

FIG. 5 is a data storage schematic of a method for storing data according to the present application;

FIG. 6 is yet another data storage schematic of a method for storing data according to the present application;

FIG. 7 is a schematic block diagram illustrating one embodiment of an apparatus for storing data according to the present application;

FIG. 8 is a schematic block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for storing data or the apparatus for storing data of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, fiber optic cables, and the like.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a data search application, a data statistics application, a web browser application, a shopping application, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting data processing, including but not limited to smart phones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The server 105 may be a server that provides various services, such as a background data processing server that provides support for multiple sub data sets uploaded by the terminal devices 101, 102, 103. The background data processing server may analyze the received data processing request and feed back the processing result (e.g., the obtained hash value sets, the number of hash values in each hash value set, etc.) to the terminal device.

It should be noted that the method for storing data provided by the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for storing data is generally disposed in the server 105.

It should be further noted that the method for storing data provided by the embodiment of the present application may be applied to a distributed server, and accordingly, the apparatus for storing data may be disposed in the distributed server.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the information processing method operates does not need to perform data transmission with the terminal device, the system architecture may not include a network and the terminal device.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for storing data in accordance with the present application is shown. The method for storing data comprises the following steps:

step 201, reading a plurality of sub data sets included in a target data set.

In this embodiment, an electronic device on which the method for storing data runs (for example, the server shown in fig. 1) may read a plurality of sub data sets included in a target data set from other electronic devices (for example, terminal devices) through a wired or wireless connection. The sub data sets are obtained by dividing the target data set. The target data may be any type of data, including but not limited to at least one of: numbers, characters, structures, and arrays. The target data may be data that meets a specific condition; for example, the specific condition may be that a value is greater than 10, that the gender is male, or the like. It should be noted that the target data set may include identical elements (i.e., data). The number of elements included in the target data set may be 1 billion, 10 billion (on the order of big data), and so on. A sub data set may be, but is not limited to, a data set obtained in one of the following ways: dividing the target data based on the maximum amount of data that the electronic device can read at a single time; or dividing the target data based on distributed streaming data statistics. It will be appreciated that the target data set is composed of the plurality of sub data sets obtained by the division.

For example, please refer to fig. 3A. The server (i.e., the electronic device) reads a plurality of sub data sets 301 included in a target data set (e.g., {1, 24, 67, 24, 1900, 2634, 4571, 264, …}) stored in a data storage server.
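One of the division strategies mentioned above, splitting by the maximum amount of data read at a single time, can be sketched as follows. This is only an illustration: the function name `split_into_sub_data_sets` and the parameter `max_items` are hypothetical, and the application does not prescribe any particular chunking rule.

```python
def split_into_sub_data_sets(target_data_set, max_items):
    """Divide the target data set into consecutive sub data sets
    holding at most `max_items` elements each."""
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    return [
        target_data_set[start:start + max_items]
        for start in range(0, len(target_data_set), max_items)
    ]


target = [1, 24, 67, 24, 1900, 2634, 4571, 264]
print(split_into_sub_data_sets(target, 3))
# prints [[1, 24, 67], [24, 1900, 2634], [4571, 264]]
```

Concatenating the sub data sets reproduces the target data set, duplicates included.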

Step 202, storing an empty set of the plurality of sub-data sets to a first storage space according to a predetermined first format.

In this embodiment, based on the plurality of sub data sets obtained in step 201, the electronic device may store an empty set of the plurality of sub data sets in a first storage space according to a predetermined first format. The first format may be various formats determined in advance. For example, the first format described above may be characterized by a particular character (e.g., null, empty, etc.). The size of the first storage space may be 1 byte, 2 bytes, etc. Here, the electronic device may store each empty set (i.e., the specific character corresponding to the empty set) in a first storage space.

It should be noted that, in practice, the plurality of sub data sets may include an empty set, for example when data in a form has not been filled in; storing the empty set helps record the state of the original form.

For example, please refer to fig. 3B. The electronic device stores a specific character (e.g., empty) corresponding to an empty set in the multiple sub-data sets into a first storage space a according to a predetermined first format (e.g., empty) (as shown by reference numeral 302).
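A minimal sketch of the first format, assuming the specific character is encoded as a single tag byte. The tag value `EMPTY_TAG` and the function names are hypothetical; the application only requires that some predetermined marker identify an empty set.

```python
EMPTY_TAG = 0  # hypothetical 1-byte marker standing in for the "empty" character


def encode_empty():
    """First-format record: just the 1-byte marker."""
    return bytes([EMPTY_TAG])


def encode_empty_sub_data_sets(sub_data_sets):
    """Return one first-format record for every empty sub data set."""
    return [encode_empty() for sub in sub_data_sets if len(sub) == 0]


records = encode_empty_sub_data_sets([[1, 24], [], [67], []])
print(len(records))  # prints 2
```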

Step 203, determining a hash value of each data in the plurality of sub data sets to obtain a plurality of hash value sets.

In this embodiment, the electronic device may determine a hash value of each piece of data in the read sub data sets to obtain a plurality of hash value sets. The number of hash value sets obtained is equal to the number of sub data sets read. It is to be understood that the hash value of the data may be determined by, but is not limited to, the following algorithms: the MurmurHash algorithm (a non-cryptographic hash function suitable for general hash-based lookup; compared with other popular hash functions, it distributes hash values more randomly for keys with strong regularity), the FNV (Fowler-Noll-Vo) hash algorithm, and the CRC (Cyclic Redundancy Check) algorithm.

By way of example, please continue to refer to fig. 3C. The electronic device determines a hash value of each data in the sub-data sets 301 by using a hash algorithm, so as to obtain a plurality of hash value sets 303.
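The hashing step can be sketched with the 64-bit FNV-1a variant, one of the algorithm families named above. The choice of FNV-1a, the conversion of each item to UTF-8 text before hashing, and the use of a Python `set` (which collapses duplicate hash values) are assumptions made for this illustration only.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def hash_sub_data_sets(sub_data_sets):
    """Return one hash value set per sub data set."""
    return [
        {fnv1a_64(str(item).encode("utf-8")) for item in sub}
        for sub in sub_data_sets
    ]


hash_value_sets = hash_sub_data_sets([[1, 24, 67, 24], [1900, 2634]])
print([len(s) for s in hash_value_sets])
```

The number of hash value sets equals the number of sub data sets, as the step requires.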

Step 204, storing each hash value in the hash value sets meeting the first preset storage condition and the number of the hash values in the hash value sets to a second storage space according to a predetermined second format.

In this embodiment, the electronic device may further store, in the second storage space, each hash value in the hash value set meeting the first preset storage condition in the hash value set obtained in step 203 and the number of the hash values according to a predetermined second format. The first preset storage condition may be that the size of the storage space occupied by the hash value set is smaller than a predetermined storage space size (for example, 1024 bits). The second format may be a predetermined variety of formats. For example, the number may be stored in a storage space of a certain size (e.g., 4 bytes, 8 bytes, etc.), and the hash values may be stored in storage spaces consecutive to the storage space of the certain size. Here, the size of the storage space for storing the respective hash values may be the same. The number of the hash value sets meeting the first preset storage condition and the number of the second storage spaces may be the same.

By way of example, please continue to refer to fig. 3D. For a hash value set (e.g., {0X356EF34E, 0XA96B3452, …}) that meets the first preset storage condition (e.g., the storage space occupied by the hash value set is smaller than 1024 bits), the electronic device stores each hash value (including 0X356EF34E and 0XA96B3452) and the number of hash values (e.g., 10000) in the second storage space 304 according to the following format: storage space a of the second storage space 304 may store a specific character (e.g., explicit), which may be used to indicate that the data in the second storage space 304 is stored according to the second format; storage space b stores the number of hash values (e.g., 10000); each storage space c stores one hash value. In this example, the above storage spaces may be contiguous. The size of storage space a may be 1 byte (or another size), the size of storage space b may be 4 bytes (or another size), and the size of each storage space c may be 8 bytes (or another size).

It should be noted that the storage format of the data can be identified by storing the specific characters.
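The layout described for the second format (a 1-byte marker, a 4-byte count, then one 8-byte slot per hash value, stored contiguously) can be sketched with Python's `struct` module. The tag value `EXPLICIT_TAG`, the little-endian byte order, and the sorting of hash values are assumptions for this sketch; the application leaves those details open.

```python
import struct

EXPLICIT_TAG = 1  # hypothetical 1-byte marker standing in for "explicit"


def encode_explicit(hash_values):
    """Pack tag (1 byte), count (4 bytes), then each hash value (8 bytes)."""
    hashes = sorted(hash_values)
    return struct.pack(f"<BI{len(hashes)}Q", EXPLICIT_TAG, len(hashes), *hashes)


def decode_explicit(record):
    """Read the count from storage space b, then that many hash values."""
    tag, count = struct.unpack_from("<BI", record, 0)
    if tag != EXPLICIT_TAG:
        raise ValueError("not a second-format record")
    return list(struct.unpack_from(f"<{count}Q", record, 5))


record = encode_explicit({0x356EF34E, 0xA96B3452})
print(len(record))  # prints 21 (1 + 4 + 2 * 8 bytes)
print([hex(h) for h in decode_explicit(record)])
```

Reading the leading byte first is what lets a decoder tell which of the formats a record uses.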

In some usage cases, the electronic device may further determine, based on the HyperLogLog algorithm, a byte array corresponding to each hash value set. The non-0 elements of each determined byte array that meets the second preset storage condition, together with the positions of those non-0 elements in the byte array, are then stored in a third storage space according to a predetermined third format. The second preset storage condition may be that the storage space occupied by the byte array is smaller than the storage space occupied by the hash value set corresponding to the byte array. The third format may be any of various predetermined formats. For example, the positions of the non-0 elements in the byte array may be stored in a storage space of a specific size (e.g., 2048 bytes), and each non-0 element may be stored in storage space contiguous with the storage space of the specific size. Here, the storage spaces used for the individual non-0 elements may all be of the same size. The number of such byte arrays and the number of third storage spaces may be the same.

In some usage cases, the electronic device may further store, in a fourth predetermined format, a byte array that meets a third preset storage condition in the determined byte arrays to a fourth storage space. Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. The number of the byte arrays and the number of the fourth storage spaces may be the same.

In the method provided by the above embodiment of the application, a plurality of sub data sets included in a target data set are read; empty sets among the sub data sets are then stored in a first storage space according to a predetermined first format; a hash value of each piece of data in the sub data sets is then determined to obtain a plurality of hash value sets; and finally, for each hash value set that meets a first preset storage condition, each hash value in the set and the number of hash values in the set are stored in a second storage space according to a predetermined second format. Intermediate results of the cardinality calculation are thus put to effective use: storing them amounts to pre-processing the original data, which helps determine the cardinality of the sets quickly and improves the flexibility of data processing.

With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for storing data is shown. The process 400 of the method for storing data includes the steps of:

step 401, reading a plurality of sub data sets included in the target data set.

In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.

Step 402, storing an empty set of the plurality of sub-data sets in a first storage space according to a predetermined first format.

In this embodiment, step 402 is substantially the same as step 202 in the corresponding embodiment of fig. 2, and is not described herein again.

Step 403, determining a hash value of each data in the plurality of sub-data sets to obtain a plurality of hash value sets.

In this embodiment, step 403 is substantially the same as step 203 in the corresponding embodiment of fig. 2, and is not described herein again.

Step 404, storing each hash value in the hash value sets meeting the first preset storage condition and the number of the hash values in the plurality of hash value sets to a second storage space according to a predetermined second format.

In this embodiment, step 404 is substantially the same as step 204 in the corresponding embodiment of fig. 2, and is not described herein again.

Step 405, for each hash value set of the plurality of hash value sets, determining a byte array corresponding to the hash value set.

In this embodiment, the electronic device may further determine, for each hash value set of the multiple hash value sets, a byte array corresponding to the hash value set. Wherein the position of the element in the byte array is determined based on the data on the first predetermined number of bits in each hash value in the hash value set, the element in the byte array is determined based on the position of the first 1 in the binary data characterized by the hash value in the hash value set, and the length of the byte array is predetermined, for example, the length of the byte array may be 64 bits (bit).

Illustratively, if the length of the byte array is N (e.g., 64) bits, and the hash value of each piece of data in the sub data set obtained in step 403 is L bits (e.g., 32 bits), then the first log₂N bits (e.g., 6 bits) of the hash value are needed to determine the position of an element in the byte array. This leaves L - log₂N bits (e.g., 26 bits), so each element in the byte array needs about log₂(L - log₂N) bits, rounded up (e.g., 5 bits), to record the position of the first 1 in the binary data characterized by the hash value. For example, if the first 6 bits of the hash value are "000001", which is 1 in decimal, then the first element in the byte array may be used to record the position of the first 1 in the binary data characterized by the hash value.

In the above example, when the data on the first predetermined number of bits is the same in two (or more) hash values (i.e., the element at the same position in the byte array would have to record the position of the first 1 for different hash values), the positions of the first 1 in the binary data characterized by those hash values may be compared to decide which hash value determines the element; for example, the element may be determined by the hash value whose first 1 is located further back. Here, "located further back" can be determined as follows: if the position of the first 1 in the binary data characterized by one hash value is 5 (i.e., the first 4 bits are all 0 and the 5th bit is 1), and the position of the first 1 in the binary data characterized by another hash value is 3 (i.e., the first 2 bits are all 0 and the 3rd bit is 1), then the hash value whose first 1 is located further back is the one whose first 1 is at position 5.

According to the above steps, the electronic device may determine the byte array corresponding to each hash value set.
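Step 405 can be sketched as follows, using the example figures from the text (32-bit hash values, 6 index bits, 64 elements). Several points are assumptions of this sketch rather than statements of the application: the byte array length is treated as a number of elements, the position of the first 1 is counted within the bits that remain after the index bits, an all-zero remainder is recorded as one more than the number of remaining bits (the usual HyperLogLog convention), and the names `first_one_position` and `build_byte_array` are hypothetical.

```python
def first_one_position(value, width):
    """1-based position of the first 1 bit, scanning `width` bits from the
    most significant end. Returns width + 1 if all bits are 0 (assumption)."""
    for pos in range(1, width + 1):
        if (value >> (width - pos)) & 1:
            return pos
    return width + 1


def build_byte_array(hash_values, hash_bits=32, index_bits=6):
    """Build the byte array for one hash value set."""
    rest_bits = hash_bits - index_bits
    byte_array = bytearray(1 << index_bits)
    for h in hash_values:
        h &= (1 << hash_bits) - 1
        index = h >> rest_bits               # first index_bits bits
        rest = h & ((1 << rest_bits) - 1)    # remaining bits
        position = first_one_position(rest, rest_bits)
        # On a collision, keep the first-1 position located further back.
        if position > byte_array[index]:
            byte_array[index] = position
    return byte_array


# Both hash values start with "000001"; their first 1 is at position 5 and 3.
h1 = (1 << 26) | (1 << 21)
h2 = (1 << 26) | (1 << 23)
print(build_byte_array([h1, h2])[1])  # prints 5
```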

Step 406, storing the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array to a third storage space according to a predetermined third format.

In this embodiment, the electronic device may further store, in the determined byte array, a non-0 element in the byte array that meets the second preset storage condition and a position of the non-0 element in the byte array according to a predetermined third format, to a third storage space.

The second preset storage condition may be that the size of the storage space occupied by the byte array is smaller than the size of the storage space occupied by the hash value set corresponding to the byte array. The third format may be a predetermined variety of formats. For example, the position of the non-0 element in the byte array where the non-0 element is located may be stored in a storage space of a specific size (e.g., 2048 bytes, etc.), and each non-0 element may be stored in a storage space consecutive to the storage space of the specific size. Here, the size of the storage space for storing the respective non-0 elements may be the same. The number of the byte arrays and the number of the third storage spaces may be the same.

As an example, please continue to refer to fig. 5. The storage space a of the third storage space 501 may be used to store a specific character (e.g., sparse) representing the third format; the storage space d may be used to store the positions of the non-0 elements in the byte array; each storage space e may be used to store one non-0 element. These storage spaces may be contiguous. The size of the storage space a may be 1 byte (or another size), the size of the storage space d may be 2048 bytes (or another size), and the size of each storage space e may be 1 byte (or another size).
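A sketch of the third format under one possible reading of the sizes above: if the byte array has 16384 elements, a 2048-byte position field is exactly one bit per element, so storage space d is modeled here as a bitmap. That bitmap interpretation, the tag value `SPARSE_TAG`, and the function names are assumptions and are not stated in the application.

```python
SPARSE_TAG = 2  # hypothetical 1-byte marker standing in for "sparse"


def encode_sparse(byte_array):
    """Tag, then a position bitmap (1 bit per element), then non-0 elements."""
    bitmap = bytearray((len(byte_array) + 7) // 8)
    non_zero = bytearray()
    for i, value in enumerate(byte_array):
        if value != 0:
            bitmap[i // 8] |= 1 << (i % 8)
            non_zero.append(value)
    return bytes([SPARSE_TAG]) + bytes(bitmap) + bytes(non_zero)


def decode_sparse(record, length):
    """Rebuild a byte array of `length` elements from a third-format record."""
    if record[0] != SPARSE_TAG:
        raise ValueError("not a third-format record")
    bitmap_size = (length + 7) // 8
    bitmap = record[1:1 + bitmap_size]
    values = iter(record[1 + bitmap_size:])
    byte_array = bytearray(length)
    for i in range(length):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            byte_array[i] = next(values)
    return byte_array


registers = bytearray(16384)
registers[1], registers[100] = 5, 3
record = encode_sparse(registers)
print(len(record))  # prints 2051 (1 + 2048 + 2 bytes)
```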

Step 407, storing the byte array meeting the third preset storage condition in the determined byte arrays to a fourth storage space according to a predetermined fourth format.

In this embodiment, the electronic device may further store, in a fourth storage space, a byte array that meets a third preset storage condition in the determined byte arrays according to a predetermined fourth format.

Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. Here, the number of the above byte arrays and the number of the fourth storage space may be the same.

By way of example, please continue to refer to fig. 6. The storage space a of the fourth storage space 601 may be used to store a specific character (e.g., full) representing the fourth format; each storage space e may be used to store one element of the byte array. These storage spaces may be contiguous. The size of the storage space a may be 1 byte (or another size), and the total size of all the storage spaces e may be 16384 bytes (or another size). If the byte array meets the third preset storage condition, the electronic device may store the byte array into the fourth storage space according to this format.

It should be noted that the storage format of the data (i.e., one of the first format, the second format, the third format, and the fourth format) may be identified by storing the specific character (e.g., one of empty, explicit, sparse, full).
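The fourth format and the example third preset storage condition (share of non-0 elements above a threshold such as 80%) can be sketched as below. The tag value `FULL_TAG` and the function names are hypothetical.

```python
FULL_TAG = 3  # hypothetical 1-byte marker standing in for "full"


def meets_third_condition(byte_array, threshold=0.8):
    """True when the share of non-0 elements exceeds the threshold."""
    if len(byte_array) == 0:
        return False
    non_zero = sum(1 for value in byte_array if value != 0)
    return non_zero / len(byte_array) > threshold


def encode_full(byte_array):
    """Tag followed by every element of the byte array, stored contiguously."""
    return bytes([FULL_TAG]) + bytes(byte_array)


def decode_full(record):
    if record[0] != FULL_TAG:
        raise ValueError("not a fourth-format record")
    return bytearray(record[1:])


dense = bytearray([1] * 16384)
if meets_third_condition(dense):
    print(len(encode_full(dense)))  # prints 16385 (1 + 16384 bytes)
```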

Optionally, the preset storage condition may be further set as follows:

the first preset storage condition may also be: the storage space needed to store each hash value in the hash value set and the number of hash values is smaller than the storage space needed to store the non-0 elements in the byte array corresponding to the hash value set and their positions in the byte array;

the second preset storage condition may also be: the storage space needed to store each hash value in the hash value set and the number of hash values is greater than or equal to the storage space needed to store the non-0 elements in the byte array corresponding to the hash value set and their positions in the byte array, and the storage space needed to store the non-0 elements and their positions is smaller than or equal to the storage space needed to store every element in the byte array;

the third preset storage condition may also be: the storage space needed to store the non-0 elements in the byte array and their positions in the byte array is greater than the storage space needed to store every element in the byte array.

It can be understood that, by setting the first preset storage condition, the second preset storage condition and the third preset storage condition in the above manner, the following effects can be achieved:

for each sub data set, a related result of the sub data set (for example, an empty set; each hash value and the number of hash values in the hash value set corresponding to the sub data set; the non-0 elements in the byte array corresponding to the sub data set and their positions in the byte array; or each element in the byte array corresponding to the sub data set) is stored in one of the four storage spaces (the first, second, third, or fourth storage space) according to one of the four formats (the first, second, third, or fourth format) described above, so that the related result of the sub data set is stored with relatively less storage space. When determining which of the four formats is used for storing the data, it is first judged whether storing the data in the latter format saves storage space compared with storing it in the former format; if not, the data is stored using the former format. It can be understood that, compared with calculating directly on the original data, the technical solution provided in the embodiment of the present application helps to accelerate the processing of the sub data sets by calculating and storing the intermediate result (i.e., the above-mentioned related result) in advance.
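The selection among the four formats under the optional size-based conditions can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `choose_format` and its parameters are hypothetical, and the caller is assumed to have already computed the number of bytes each candidate representation would occupy.

```python
def choose_format(is_empty, explicit_size, sparse_size, full_size):
    """Pick a storage format name for one sub data set.

    explicit_size: bytes needed for every hash value plus their count
    sparse_size:   bytes needed for the non-0 elements plus their positions
    full_size:     bytes needed for every element of the byte array
    (all three sizes are assumed to be computed by the caller)
    """
    if is_empty:
        return "empty"       # first format
    if explicit_size < sparse_size:
        return "explicit"    # first preset storage condition
    if sparse_size <= full_size:
        return "sparse"      # second preset storage condition
    return "full"            # third preset storage condition


if __name__ == "__main__":
    print(choose_format(False, 300, 30, 16384))
```

The checks are ordered so that each later format is chosen only when the earlier one would not be smaller, which mirrors the comparison order described above.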

In some optional implementations of this embodiment, the electronic device may further create a materialized view based on the data stored in the first storage space, the second storage space, the third storage space, and the fourth storage space. The materialized view can be used for pre-calculating and storing the results of time-consuming operations such as table joins or aggregations. Thus, when a query operation is executed, these time-consuming operations can be avoided and the result can be obtained quickly. It is understood that the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space may be the operation results that the materialized view needs to save. It should be noted that techniques for creating a materialized view are well known to persons skilled in the related art (for example, database development engineers), and are not described herein again. In some use cases, the materialized view described above may be an aggregated materialized view.

In some optional implementations of this embodiment, the method further includes: and determining the cardinality of the target data set according to the maximum byte array in the determined byte arrays.

It will be appreciated that the byte array has recorded therein the position of the first 1 in the binary data represented by each hash value. This position may be used to estimate the cardinality of the target data set. Wherein the cardinality of the target data set is the number of distinct elements (data) in the target data set.

By way of example, the electronic device may determine the cardinality of the target data set according to the maximum byte array among the determined byte arrays, based on the HyperLogLog algorithm.

Here, the cardinality of the target data set may be determined according to the position of the first 1 recorded by an element in the byte array. For example, if the position of the first 1 recorded by the element is 1000 (that is, the first 999 bits of the binary data represented by the hash value are 0 and the 1000th bit is 1), the cardinality of the target data set may be estimated as 2¹⁰⁰⁰.

Optionally, the electronic device may further compute the average or the harmonic mean of the positions of the first 1 (represented by numbers, for example, 1000) recorded by the elements in the byte array, and determine the cardinality of the target data set accordingly. For example, if the position of the first 1 given by the harmonic mean is 1000 (that is, the first 999 bits of the binary data represented by the hash value are 0 and the 1000th bit is 1), the cardinality of the target data set may be estimated as 2¹⁰⁰⁰.
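As an illustration of how a cardinality may be derived from the byte arrays, the sketch below merges several byte arrays by taking the element-wise maximum and then applies the standard raw HyperLogLog estimator, which uses a harmonic mean over the elements. This is the textbook estimator without its small-range and large-range corrections, and is not necessarily the exact computation of the embodiment; the function names are hypothetical.

```python
def merge_byte_arrays(byte_arrays):
    """Element-wise maximum of equally long byte arrays."""
    return [max(column) for column in zip(*byte_arrays)]


def estimate_cardinality(byte_array):
    """Raw HyperLogLog estimate for a byte array of m >= 128 elements."""
    m = len(byte_array)
    alpha = 0.7213 / (1 + 1.079 / m)  # standard bias constant for m >= 128
    harmonic_sum = sum(2.0 ** -element for element in byte_array)
    return alpha * m * m / harmonic_sum


if __name__ == "__main__":
    merged = merge_byte_arrays([[1] * 128, [0] * 64 + [3] * 64])
    print(estimate_cardinality(merged))
```

Merging by maximum means the per-sub-data-set byte arrays can be stored in advance and combined at query time without revisiting the original data.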

As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for storing data in the present embodiment highlights the step of determining the byte array corresponding to the hash value set. Therefore, the scheme described in this embodiment may compare the size of the second storage space, which stores each hash value in the hash value set corresponding to the byte array and the number of the hash values, with the size of the storage space for storing each non-0 element in the byte array and its position, so as to store the intermediate result generated in the cardinality solving process in whichever storage manner occupies less storage space. This reduces the occupation of storage space, enables more flexible data processing, and helps further increase the speed of determining the cardinality of the set.

With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for storing data, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.

As shown in fig. 7, the apparatus 700 for storing data of the present embodiment includes: a reading unit 701, a first storage unit 702, a first determination unit 703, and a second storage unit 704. The reading unit 701 is configured to read a plurality of sub data sets included in a target data set, where the sub data sets are obtained by dividing the target data set; the first storage unit 702 is configured to store an empty set of the plurality of sub data sets into a first storage space according to a predetermined first format; the first determining unit 703 is configured to determine a hash value of each data in the multiple sub-data sets, so as to obtain multiple hash value sets; the second storage unit 704 is configured to store, in a second storage space, each hash value in a hash value set that meets a first preset storage condition in the plurality of hash value sets and the number of the hash values according to a predetermined second format.

In this embodiment, the reading unit 701 of the apparatus 700 for storing data may read a plurality of sub data sets included in a target data set from other electronic devices (e.g., terminal devices) through a wired or wireless connection. The sub data sets are obtained by dividing the target data set. The target data may be any type of data, including but not limited to at least one of: numbers, characters, structures, and arrays. The target data may be data that meets a specific condition. For example, the specific condition may be that a value is greater than 10, that the gender is male, or the like. It should be noted that the target data set may include identical elements (i.e., data). The number of elements included in the target data set may be 1 billion, 10 billion (on the order of big data), and so on. A sub data set may be, but is not limited to being, obtained by partitioning as follows: dividing the target data based on the maximum data size that the reading unit 701 can read at a time; or dividing the target data based on distributed streaming data statistics. It will be appreciated that the target data set may be composed of the plurality of sub data sets obtained by the division.
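Dividing the target data by the maximum amount that can be read at a time can be sketched as below. The function name and the `max_items` parameter are illustrative assumptions; the embodiment does not prescribe a particular partitioning routine.

```python
def split_into_sub_data_sets(target_data, max_items):
    """Divide target_data into consecutive sub data sets of at most max_items."""
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    return [
        target_data[start:start + max_items]
        for start in range(0, len(target_data), max_items)
    ]


if __name__ == "__main__":
    # Duplicate elements are allowed in the target data set.
    print(split_into_sub_data_sets([1, 2, 2, 3, 4], 2))
```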

In this embodiment, based on the plurality of sub data sets obtained by the reading unit 701, the first storage unit 702 may store an empty set of the plurality of sub data sets into the first storage space according to a predetermined first format. The first format may be various formats determined in advance. For example, the first format described above may be characterized by a particular character (e.g., null, etc.). The size of the first storage space may be 1 byte, 2 bytes, etc. Here, the first storage unit 702 may store each empty set to one first storage space.

In this embodiment, the first determining unit 703 may determine a hash value of each data item in each of the read sub data sets to obtain a plurality of hash value sets. The number of the obtained hash value sets is equal to the number of the read sub data sets. It is to be understood that the hash value of the data may be determined by, but is not limited to, the following algorithms: the Secure Hash Algorithm (SHA) and the Message Digest (MD) algorithm.
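A minimal sketch of this step, assuming SHA-256 from Python's `hashlib` and truncation of each digest to 64 bits (both choices are assumptions made for illustration; the embodiment only names the SHA and MD families):

```python
import hashlib


def hash_value(datum):
    """64-bit hash of one data item, taken from the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(repr(datum).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_value_sets(sub_data_sets):
    """One hash value set per sub data set; an empty sub data set gives an empty set."""
    return [{hash_value(datum) for datum in sub} for sub in sub_data_sets]


if __name__ == "__main__":
    sets = hash_value_sets([["a", "a", "b"], []])
    print(len(sets), len(sets[0]), len(sets[1]))
```

Because each result is a set, repeated data items in a sub data set contribute a single hash value.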

In this embodiment, the second storage unit 704 may store, in the second storage space, each hash value in the hash value set meeting the first preset storage condition and the number of the hash values in the hash value set obtained by the first determining unit 703 according to a predetermined second format. The first preset storage condition may be that the size of the storage space occupied by the hash value set is smaller than a predetermined storage space size (for example, 1024 bits). The second format may be a predetermined variety of formats. For example, the number may be stored in a storage space of a certain size (e.g., 4 bytes, 8 bytes, etc.), and the hash values may be stored in storage spaces consecutive to the storage space of the certain size. Here, the size of the storage space for storing the respective hash values may be the same. The number of the hash value sets meeting the first preset storage condition and the number of the second storage spaces may be the same.
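One possible byte layout for the second format is sketched below: the number of hash values in a 4-byte field, followed by the hash values in consecutive fields of equal size. The little-endian byte order, the 8-byte hash width, and the function names are assumptions for illustration only.

```python
import struct


def pack_explicit(hash_values):
    """Serialize a hash value set as a 4-byte count followed by 8-byte hash values."""
    values = sorted(hash_values)
    return struct.pack("<I", len(values)) + b"".join(
        struct.pack("<Q", value) for value in values
    )


def unpack_explicit(blob):
    """Read back the hash values written by pack_explicit."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from("<%dQ" % count, blob, 4))


if __name__ == "__main__":
    blob = pack_explicit({3, 1, 2})
    print(len(blob), unpack_explicit(blob))
```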

In some use cases, the apparatus may further determine a byte array corresponding to each hash value set based on the HyperLogLog algorithm, and then store, according to a predetermined third format, the non-0 elements in each determined byte array that meets the second preset storage condition, together with the positions of those non-0 elements in the byte array, into a third storage space. The second preset storage condition may be that the size of the storage space occupied by the byte array is smaller than the size of the storage space occupied by the hash value set corresponding to the byte array. The third format may be any of various predetermined formats. For example, the positions of the non-0 elements in the byte array may be stored in a storage space of a specific size (e.g., 2048 bytes, etc.), and each non-0 element may be stored in a storage space consecutive to the storage space of the specific size. Here, the sizes of the storage spaces for storing the respective non-0 elements may be the same. The number of the byte arrays and the number of the third storage spaces may be the same.
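The third format can be illustrated with the following sketch, which writes the positions of the non-0 elements first and the non-0 elements themselves in the bytes that follow. The leading 2-byte entry count, the 2-byte position width, and the function names are assumptions added so the example can be decoded again; they are not taken from the embodiment.

```python
import struct


def pack_sparse(byte_array):
    """Serialize only the non-0 elements of a byte array and their positions."""
    entries = [(pos, elem) for pos, elem in enumerate(byte_array) if elem != 0]
    positions = [pos for pos, _ in entries]
    elements = bytes(elem for _, elem in entries)
    header = struct.pack("<H", len(entries))  # assumed 2-byte entry count
    return header + struct.pack("<%dH" % len(positions), *positions) + elements


def unpack_sparse(blob, length):
    """Rebuild the byte array of the given length from pack_sparse output."""
    (count,) = struct.unpack_from("<H", blob, 0)
    positions = struct.unpack_from("<%dH" % count, blob, 2)
    elements = blob[2 + 2 * count:2 + 3 * count]
    byte_array = [0] * length
    for pos, elem in zip(positions, elements):
        byte_array[pos] = elem
    return byte_array


if __name__ == "__main__":
    blob = pack_sparse([0, 5, 0, 0, 2, 0, 0, 0])
    print(len(blob), unpack_sparse(blob, 8))
```

With these assumed widths each non-0 element costs 3 bytes, so the sparse layout is smaller than storing every element only while fewer than roughly one third of the elements are non-0.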

In some usage cases, the apparatus may further store, in a fourth format determined in advance, a byte array that meets a third preset storage condition in the determined byte arrays to a fourth storage space. Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. The number of the byte arrays and the number of the fourth storage spaces may be the same.
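The percentage-based third preset storage condition and the fourth format can be sketched as follows. The 80% default comes from the example threshold above; the function names and the one-byte-per-element layout are illustrative assumptions.

```python
def meets_third_condition(byte_array, threshold=0.8):
    """True when the share of non-0 elements exceeds the threshold."""
    nonzero = sum(1 for element in byte_array if element != 0)
    return nonzero / len(byte_array) > threshold


def pack_full(byte_array):
    """Store every element in contiguous storage, one byte per element."""
    return bytes(byte_array)


if __name__ == "__main__":
    dense = [1] * 9 + [0]
    if meets_third_condition(dense):
        print(len(pack_full(dense)))
```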

In some optional implementations of this embodiment, the apparatus further includes: the second determining unit (not shown in the figure) is configured to determine, for each hash value set of the plurality of hash value sets, a byte array corresponding to the hash value set, where a position of an element in the byte array is determined based on data on a first predetermined number of bits in each hash value in the hash value set, an element in the byte array is determined based on a position of a first 1 in the binary data represented by the hash value in the hash value set, and a length of the byte array is predetermined, for example, the length of the byte array may be 64 bits (bit).

In the above example, when the data on the first predetermined number of bits is the same in two (or more) hash values (i.e., the element at the same position in the byte array would have to record the position of the first 1 in the binary data represented by different hash values), the element at that position may be determined, according to the above steps, by the hash value whose first 1 is located furthest back among these hash values (or by any one of these hash values). Here, "the position of the first 1 is located further back" can be determined as follows: for example, if the position of the first 1 in the binary data represented by one hash value is 5 (i.e., the first 4 bits of the binary data are all 0 and the 5th bit is 1), and the position of the first 1 in the binary data represented by another hash value is 3 (i.e., the first 2 bits of the binary data are all 0 and the 3rd bit is 1), then the hash value whose first 1 is located further back is the one whose first 1 is at position 5.

According to the steps, the device can determine the byte array corresponding to each hash value set.
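The construction of a byte array from a hash value set can be sketched as below. The sketch assumes 64-bit hash values, uses the first `index_bits` bits as the element position, and records the position of the first 1 in the remaining bits, keeping the larger position when several hash values map to the same element. The parameter defaults and the function name are assumptions; in particular, whether the position of the first 1 is counted over the remaining bits or over the whole hash value is not fixed by the text.

```python
def build_byte_array(hash_values, index_bits=14, hash_bits=64):
    """Build a byte array of 2**index_bits elements from a hash value set."""
    byte_array = [0] * (1 << index_bits)
    rest_bits = hash_bits - index_bits
    for value in hash_values:
        position = value >> rest_bits              # first index_bits bits
        rest = value & ((1 << rest_bits) - 1)      # remaining bits
        # 1-based position of the first 1, counted from the most significant
        # remaining bit; rest_bits + 1 when all remaining bits are 0.
        first_one = rest_bits - rest.bit_length() + 1
        byte_array[position] = max(byte_array[position], first_one)
    return byte_array


if __name__ == "__main__":
    # Tiny 8-bit hashes with a 2-bit position, purely for readability.
    print(build_byte_array([0b01000100, 0b01010000, 0b11100000], 2, 8))
```

In the usage example, the first two hash values share position 1 and have their first 1 at positions 4 and 2 respectively, so the element records 4.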

In some optional implementations of this embodiment, the apparatus further includes: the third storage unit (not shown in the figure) is configured to store the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array in a third predetermined format into a third storage space.

In some optional implementations of this embodiment, the apparatus further includes: the fourth storage unit (not shown in the figure) is configured to store the byte arrays meeting the third preset storage condition in the determined byte arrays to the fourth storage space according to a predetermined fourth format.

Optionally, the preset storage conditions may also be set in the same manner as described for the method embodiment above. With the first, second and third preset storage conditions set in that manner, the following effect can be achieved:

for each sub data set, a related result of the sub data set (for example, an empty set; each hash value and the number of hash values in the hash value set corresponding to the sub data set; the non-0 elements in the byte array corresponding to the sub data set and their positions in the byte array; or each element in the byte array corresponding to the sub data set) is stored in one of the four storage spaces (the first, second, third, or fourth storage space) according to one of the four formats (the first, second, third, or fourth format) described above, so that the related result of the sub data set is stored with relatively less storage space. When determining which of the four formats is used for storing the data, it is first judged whether storing the data in the latter format saves storage space compared with storing it in the former format; if not, the data is stored using the former format. It can be understood that, compared with calculating directly on the original data, the technical solution provided in the embodiment of the present application helps to accelerate the processing of the sub data sets by calculating and storing the intermediate result (i.e., the above-mentioned related result) in advance.

In some optional implementations of this embodiment, the apparatus further includes: a creating unit (not shown in the figure) configured to create a materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space. The materialized view can be used for pre-calculating and storing the results of time-consuming operations such as table joins or aggregations. Thus, when a query operation is executed, these time-consuming operations can be avoided and the result can be obtained quickly. It is understood that the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space may be the operation results that the materialized view needs to save. It should be noted that techniques for creating a materialized view are well known to persons skilled in the related art (for example, database development engineers), and are not described herein again. In some use cases, the materialized view described above may be an aggregated materialized view.

In some optional implementations of this embodiment, the apparatus further includes: a third determining unit (not shown in the figure) is configured to determine a cardinality of the target data set according to a largest byte array among the determined byte arrays.

By way of example, the apparatus may determine the cardinality of the target data set according to the maximum byte array among the determined byte arrays, based on the HyperLogLog algorithm.

In the apparatus provided by the above-mentioned embodiment of the present application, the reading unit 701 reads a plurality of sub data sets included in a target data set; the first storage unit 702 then stores the empty sets among the plurality of sub data sets to a first storage space according to a predetermined first format; the first determination unit 703 then determines the hash value of each data item in the plurality of sub data sets to obtain a plurality of hash value sets; finally, the second storage unit 704 stores, according to a predetermined second format, each hash value and the number of hash values in each hash value set that meets a first preset storage condition to a second storage space. In this way, the intermediate results of the solving process are effectively utilized, and the original data is processed in advance based on the stored intermediate results, which helps determine the cardinality of the set quickly and improves the flexibility of data processing.

Referring now to fig. 8, shown is a block diagram of a computer system 800 suitable for use in implementing a terminal device/server according to embodiments of the present application. The terminal device/server shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801.

It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a reading unit, a first storage unit, a first determination unit, and a second storage unit. Here, the names of the units do not constitute a limitation to the unit itself in some cases, and for example, the reading unit may also be described as "a unit that reads a plurality of sub data sets included in the target data set".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the server described in the above embodiments; or may exist separately and not be assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: reading a plurality of subdata sets included by a target data set, wherein the subdata sets are obtained by dividing the target data set; storing an empty set in the plurality of sub-data sets to a first storage space according to a predetermined first format; determining a hash value of each data in the plurality of subdata sets to obtain a plurality of hash value sets; and storing each hash value in the hash value sets which meet the first preset storage condition and the number of the hash values in the hash value sets to a second storage space according to a predetermined second format.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for storing data, comprising:

reading a plurality of subdata sets included by a target data set, wherein the subdata sets are obtained by dividing the target data set;

storing empty sets in the plurality of sub-data sets to a first storage space according to a predetermined first format;

determining a hash value of each data in the plurality of subdata sets to obtain a plurality of hash value sets;

and storing each hash value in the hash value sets which meet the first preset storage condition and the number of the hash values in the plurality of hash value sets to a second storage space according to a predetermined second format.

2. The method of claim 1, wherein the method further comprises:

and determining a byte array corresponding to the hash value set for each hash value set in the plurality of hash value sets, wherein the positions of elements in the byte array are determined based on data on the first predetermined number of bits in each hash value in the hash value set, the elements in the byte array are determined based on the position of the first 1 in the binary data represented by the hash value in the hash value set, and the length of the byte array is predetermined.

3. The method of claim 2, wherein the method further comprises:

and storing the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array to a third storage space according to a predetermined third format.

4. The method of claim 3, wherein the method further comprises:

and storing the byte arrays which accord with the third preset storage condition in the determined byte arrays into a fourth storage space according to a predetermined fourth format.

5. The method of claim 4, wherein the method further comprises:

and creating a materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space.

6. The method according to one of claims 2-5, wherein the method further comprises:

and determining the cardinality of the target data set according to the largest byte array in the determined byte arrays.

7. An apparatus for storing data, comprising:

the reading unit is configured to read a plurality of sub-data sets included in a target data set, wherein the sub-data sets are obtained by dividing the target data set;

the first storage unit is configured to store an empty set in the plurality of sub data sets into a first storage space according to a predetermined first format;

a first determining unit, configured to determine a hash value of each data in the plurality of sub-data sets, so as to obtain a plurality of hash value sets;

and the second storage unit is configured to store each hash value in the hash value sets meeting the first preset storage condition and the number of the hash values in the hash value sets to a second storage space according to a predetermined second format.

8. The apparatus of claim 7, wherein the apparatus further comprises:

and a second determining unit, configured to determine, for each of the plurality of sets of hash values, a byte array corresponding to the set of hash values, where a position of an element in the byte array is determined based on data on a predetermined number of bits in each of the hash values in the set of hash values, the element in the byte array is determined based on a position of a first 1 in binary data represented by the hash value in the set of hash values, and a length of the byte array is predetermined.

9. The apparatus of claim 8, wherein the apparatus further comprises:

and the third storage unit is configured to store the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array to a third storage space according to a predetermined third format.

10. The apparatus of claim 9, wherein the apparatus further comprises:

and the fourth storage unit is configured to store the byte arrays meeting the third preset storage condition in the determined byte arrays to a fourth storage space according to a predetermined fourth format.

11. The apparatus of claim 10, wherein the apparatus further comprises:

and the creating unit is configured to create the materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space.

12. The apparatus according to one of claims 8-11, wherein the apparatus further comprises:

and the third determining unit is configured to determine the cardinality of the target data set according to the largest byte array in the determined byte arrays.

13. A server, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

14. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.