CN108197324B - Method and apparatus for storing data - Google Patents

Method and apparatus for storing data

Info

Publication number
CN108197324B
Authority
CN
China
Prior art keywords
hash value
storage space
data
sets
byte array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810117471.9A
Other languages
Chinese (zh)
Other versions
CN108197324A (en)
Inventor
陈浩
牟宇航
马如悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810117471.9A priority Critical patent/CN108197324B/en
Publication of CN108197324A publication Critical patent/CN108197324A/en
Application granted granted Critical
Publication of CN108197324B publication Critical patent/CN108197324B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

The embodiments of the present application disclose a method and an apparatus for storing data. One embodiment of the method comprises: reading a plurality of sub-data sets included in a target data set, wherein the sub-data sets are obtained by dividing the target data set; storing each empty set among the plurality of sub-data sets to a first storage space according to a predetermined first format; determining a hash value of each piece of data in the plurality of sub-data sets to obtain a plurality of hash value sets; and, for each hash value set among the plurality of hash value sets that meets a first preset storage condition, storing each hash value in that set and the number of hash values in that set to a second storage space according to a predetermined second format.

Description

Method and apparatus for storing data
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for storing data.
Background
Cardinality refers to the number of distinct data items in a data set. At present, cardinality statistics over big data generally require processing the original data. For different sets, each time the cardinality of a set is calculated, the same algorithm often has to be run again from the original data, traversing it to find the cardinality of that set.
Currently, cardinality estimation algorithms are widely applied in fields such as UV (unique visitors, i.e., the number of distinct users visiting a website within a period of time) statistics in website traffic analysis.
Disclosure of Invention
The embodiment of the application provides a method and a device for storing data.
In a first aspect, an embodiment of the present application provides a method for storing data, where the method includes: reading a plurality of sub-data sets included in a target data set, wherein the sub-data sets are obtained by dividing the target data set; storing each empty set among the plurality of sub-data sets to a first storage space according to a predetermined first format; determining a hash value of each piece of data in the plurality of sub-data sets to obtain a plurality of hash value sets; and, for each hash value set among the plurality of hash value sets that meets a first preset storage condition, storing each hash value in that set and the number of hash values in that set to a second storage space according to a predetermined second format.
In some embodiments, the above method further comprises: for each hash value set in the plurality of hash value sets, determining a byte array corresponding to the hash value set, wherein the position of an element in the byte array is determined based on data on a predetermined number of bits in each hash value in the hash value set, the element in the byte array is determined based on the position of the first 1 in the binary data represented by the hash value in the hash value set, and the length of the byte array is predetermined.
In some embodiments, the above method further comprises: and storing the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array to a third storage space according to a predetermined third format.
In some embodiments, the above method further comprises: and storing the byte arrays which accord with the third preset storage condition in the determined byte arrays into a fourth storage space according to a predetermined fourth format.
In some embodiments, the above method further comprises: and creating a materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space.
In some embodiments, the above method further comprises: and determining the cardinality of the target data set according to the largest byte array in the determined byte arrays.
In a second aspect, an embodiment of the present application provides an apparatus for storing data, the apparatus including: the reading unit is configured to read a plurality of sub-data sets included in a target data set, wherein the sub-data sets are obtained by dividing the target data set; the first storage unit is configured to store an empty set in the plurality of sub-data sets into a first storage space according to a predetermined first format; the first determining unit is configured to determine a hash value of each data in the plurality of sub-data sets to obtain a plurality of hash value sets; and the second storage unit is configured to store each hash value in the hash value sets meeting the first preset storage condition and the number of the hash values in the hash value sets to a second storage space according to a predetermined second format.
In some embodiments, the above apparatus further comprises: and a second determining unit, configured to determine, for each of the plurality of sets of hash values, a byte array corresponding to the set of hash values, where a position of an element in the byte array is determined based on data on a predetermined number of bits in each of the hash values in the set of hash values, the element in the byte array is determined based on a position of a first 1 in the binary data represented by the hash value in the set of hash values, and a length of the byte array is predetermined.
In some embodiments, the above apparatus further comprises: and the third storage unit is configured to store the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array to a third storage space according to a predetermined third format.
In some embodiments, the above apparatus further comprises: and the fourth storage unit is configured to store the byte arrays meeting the third preset storage condition in the determined byte arrays to a fourth storage space according to a predetermined fourth format.
In some embodiments, the above apparatus further comprises: and the creating unit is configured to create the materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space.
In some embodiments, the above apparatus further comprises: and the third determining unit is configured to determine the cardinality of the target data set according to the largest byte array in the determined byte arrays.
In a third aspect, an embodiment of the present application provides a server for storing data, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the embodiments of the method for storing data described above.
In a fourth aspect, the present application provides a computer-readable medium for storing data, on which a computer program is stored, which when executed by a processor implements the method of any one of the embodiments of the method for storing data as described above.
According to the method and the apparatus for storing data provided by the embodiments of the present application, a plurality of sub-data sets included in a target data set are read; each empty set among the sub-data sets is then stored in a first storage space according to a predetermined first format; a hash value of each piece of data in the sub-data sets is then determined to obtain a plurality of hash value sets; and finally, for each hash value set that meets a first preset storage condition, each hash value in that set and the number of hash values in that set are stored in a second storage space according to a predetermined second format. The flexibility of data processing is thereby improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for storing data according to the present application;
FIG. 3A is a schematic illustration of an application scenario of a method for storing data according to the present application;
FIG. 3B is a schematic diagram of data storage in the application scenario of FIG. 3A;
FIG. 3C is a schematic illustration of yet another application scenario of a method for storing data according to the present application;
FIG. 3D is a schematic diagram of data storage in the application scenario of FIG. 3C;
FIG. 4 is a flow diagram of yet another embodiment of a method for storing data according to the present application;
FIG. 5 is a data storage schematic of a method for storing data according to the present application;
FIG. 6 is yet another data storage schematic of a method for storing data according to the present application;
FIG. 7 is a schematic block diagram illustrating one embodiment of an apparatus for storing data according to the present application;
FIG. 8 is a schematic block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for storing data or the apparatus for storing data of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a data search application, a data statistics application, a web browser application, a shopping application, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting data processing, including but not limited to smart phones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background data processing server that provides support for multiple sub data sets uploaded by the terminal devices 101, 102, 103. The background data processing server may analyze the received data processing request and feed back the processing result (e.g., the obtained hash value sets, the number of hash values in each hash value set, etc.) to the terminal device.
It should be noted that the method for storing data provided by the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for storing data is generally disposed in the server 105.
It should be further noted that the method for storing data provided by the embodiment of the present application may be applied to a distributed server, and accordingly, the apparatus for storing data may be disposed in the distributed server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the information processing method operates does not need to perform data transmission with the terminal device, the system architecture may not include a network and the terminal device.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for storing data in accordance with the present application is shown. The method for storing data comprises the following steps:
step 201, reading a plurality of sub data sets included in a target data set.
In this embodiment, an electronic device on which the method for storing data runs (for example, the server shown in fig. 1) may read a plurality of sub-data sets included in a target data set from other electronic devices (for example, terminal devices) through a wired or wireless connection. The sub-data sets are obtained by dividing the target data set. The target data may be data of any type, including but not limited to at least one of: numbers, characters, structures, and arrays. The target data may also be data that meets a specific condition; for example, the specific condition may be that a value is greater than 10, that the gender is male, or the like. It should be noted that the target data set may include identical elements (i.e., duplicate data). The number of elements included in the target data set may be 1 billion, 10 billion (i.e., on the order of big data), and so on. A sub-data set may be, but is not limited to, a data set obtained by dividing the target data in either of the following ways: based on the maximum amount of data that the electronic device can read at a single time; or based on distributed streaming data statistics. It will be appreciated that the target data set is composed of the plurality of sub-data sets obtained by the division.
For example, please refer to fig. 3A. The server (i.e., the electronic device) reads a plurality of sub-data sets 301 included in a target data set (e.g., {1, 24, 67, 24, 1900, 2634, 4571, 264 … … }) stored in a data storage server.
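As a minimal sketch of the first division strategy mentioned above (dividing by the maximum amount of data readable at a single time), the following Python fragment splits the example target data set into sub-data sets. The function name `split_into_sub_data_sets` and the parameter `max_batch` are hypothetical and do not come from the application.

```python
from typing import Iterator, List, Sequence


def split_into_sub_data_sets(target: Sequence[int], max_batch: int) -> Iterator[List[int]]:
    """Yield consecutive sub-data sets holding at most max_batch elements each.

    max_batch stands in for "the maximum data volume readable at a single
    time"; it is an illustrative assumption, not a value from the application.
    """
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")
    for start in range(0, len(target), max_batch):
        yield list(target[start:start + max_batch])


# The example target data set from fig. 3A (duplicates such as 24 are allowed).
target_data_set = [1, 24, 67, 24, 1900, 2634, 4571, 264]
sub_data_sets = list(split_into_sub_data_sets(target_data_set, max_batch=3))
```

Concatenating the sub-data sets in order reproduces the target data set, which matches the statement that the target data set is composed of the divided sub-data sets.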
Step 202, storing an empty set of the plurality of sub-data sets to a first storage space according to a predetermined first format.
In this embodiment, based on the plurality of sub data sets obtained in step 201, the electronic device may store an empty set of the plurality of sub data sets in a first storage space according to a predetermined first format. The first format may be various formats determined in advance. For example, the first format described above may be characterized by a particular character (e.g., null, empty, etc.). The size of the first storage space may be 1 byte, 2 bytes, etc. Here, the electronic device may store each empty set (i.e., the specific character corresponding to the empty set) in a first storage space.
It should be noted that, in practice, the plurality of sub-data sets may include an empty set, for example when data in a form has not been filled in; storing the empty set helps record the state of the original form.
For example, please refer to fig. 3B. The electronic device stores a specific character (e.g., empty) corresponding to an empty set in the multiple sub-data sets into a first storage space a according to a predetermined first format (e.g., empty) (as shown by reference numeral 302).
Step 203, determining a hash value of each data in the plurality of sub data sets to obtain a plurality of hash value sets.
In this embodiment, the electronic device may determine a hash value of each piece of data in the read sub-data sets to obtain a plurality of hash value sets. The number of hash value sets obtained is equal to the number of sub-data sets read. It is to be understood that the hash value of the data may be determined by, but is not limited to, the following algorithms: the MurmurHash algorithm (a non-cryptographic hash function suited to general hash-based lookup; compared with other popular hash functions, the hash values it produces are better randomly distributed for keys with strong regularity), the FNV (Fowler-Noll-Vo) hash algorithm, and the CRC (Cyclic Redundancy Check) algorithm.
By way of example, please continue to refer to fig. 3C. The electronic device determines a hash value of each data in the sub-data sets 301 by using a hash algorithm, so as to obtain a plurality of hash value sets 303.
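The hashing step can be sketched as follows. This is only an illustration: it uses the CRC-32 function from Python's standard `zlib` module as the hash algorithm (CRC is one of the algorithms named above), hashes the decimal text form of each datum, and collects distinct hash values per sub-data set. These choices, and the names `hash_value` and `hash_value_sets`, are assumptions made for the example.

```python
import zlib
from typing import Iterable, List, Set


def hash_value(datum: object) -> int:
    """Return a 32-bit CRC of the datum's UTF-8 text form (illustrative choice)."""
    return zlib.crc32(str(datum).encode("utf-8")) & 0xFFFFFFFF


def hash_value_sets(sub_data_sets: Iterable[Iterable[object]]) -> List[Set[int]]:
    """Produce one hash value set per sub-data set, in the same order."""
    return [{hash_value(d) for d in sub} for sub in sub_data_sets]


# Three sub-data sets in, three hash value sets out; the empty set stays empty.
sets = hash_value_sets([[1, 24, 67], [24, 1900], []])
```

Because equal data always hash to the same value, a duplicate such as 24 contributes the same hash value wherever it appears, which is what makes the stored hash values usable for later cardinality calculation.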
Step 204, storing each hash value in the hash value sets meeting the first preset storage condition, together with the number of the hash values in each such hash value set, to a second storage space according to a predetermined second format.
In this embodiment, the electronic device may further store, in the second storage space, each hash value in the hash value set meeting the first preset storage condition in the hash value set obtained in step 203 and the number of the hash values according to a predetermined second format. The first preset storage condition may be that the size of the storage space occupied by the hash value set is smaller than a predetermined storage space size (for example, 1024 bits). The second format may be a predetermined variety of formats. For example, the number may be stored in a storage space of a certain size (e.g., 4 bytes, 8 bytes, etc.), and the hash values may be stored in storage spaces consecutive to the storage space of the certain size. Here, the size of the storage space for storing the respective hash values may be the same. The number of the hash value sets meeting the first preset storage condition and the number of the second storage spaces may be the same.
By way of example, please continue to refer to fig. 3D. The electronic device stores, in the second storage space 304, each hash value (including 0X356EF34E and 0XA96B3452) and the number of hash values (e.g., 10000) of a hash value set (e.g., {0X356EF34E,0XA96B3452, … … }) that meets the first preset storage condition (e.g., the storage space occupied by the hash value set is smaller than 1024 bits), using the following layout: storage space a of the second storage space 304 may store a specific character (e.g., explicit), which indicates that the data stored in the second storage space 304 is stored according to the second format; storage space b stores the number of hash values (e.g., 10000); and each storage space c stores one hash value. In this example, these storage spaces may be contiguous. The size of storage space a may be 1 byte (or another size), the size of storage space b may be 4 bytes (or another size), and the size of each storage space c may be 8 bytes (or another size).
It should be noted that the storage format of the data can be identified by storing the specific characters.
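The layout described for the second format (a 1-byte marker, a 4-byte count, then 8 bytes per hash value, stored contiguously) can be sketched with Python's `struct` module. The marker value `0x01`, the little-endian byte order, and the function names are assumptions for illustration; the application only says that a specific character such as "explicit" identifies the format.

```python
import struct
from typing import Iterable, List

EXPLICIT_TAG = b"\x01"  # hypothetical 1-byte marker standing in for "explicit"


def encode_explicit(hashes: Iterable[int]) -> bytes:
    """Serialize as: 1-byte tag | 4-byte count | 8 bytes per hash value."""
    values = sorted(set(hashes))
    header = EXPLICIT_TAG + struct.pack("<I", len(values))
    return header + struct.pack(f"<{len(values)}Q", *values)


def decode_explicit(record: bytes) -> List[int]:
    """Read back the hash values from a record produced by encode_explicit."""
    if record[:1] != EXPLICIT_TAG:
        raise ValueError("record is not in the explicit format")
    (count,) = struct.unpack_from("<I", record, 1)
    return list(struct.unpack_from(f"<{count}Q", record, 5))


record = encode_explicit([0x356EF34E, 0xA96B3452])
```

With two hash values the record occupies 1 + 4 + 2 × 8 = 21 bytes, and the leading marker lets a reader recognize the format before parsing the rest.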
In some usage cases, the electronic device may further determine, based on the HyperLogLog algorithm, a byte array corresponding to each hash value set. The non-0 elements in each determined byte array that meets the second preset storage condition, together with the positions of those non-0 elements in the byte array, are then stored into a third storage space according to a predetermined third format. The second preset storage condition may be that the size of the storage space occupied by the byte array is smaller than the size of the storage space occupied by the hash value set corresponding to the byte array. The third format may be any of various predetermined formats. For example, the positions of the non-0 elements in the byte array may be stored in a storage space of a specific size (e.g., 2048 bytes, etc.), and each non-0 element may be stored in a storage space contiguous with the storage space of the specific size. Here, the sizes of the storage spaces for storing the respective non-0 elements may be the same. The number of the byte arrays and the number of the third storage spaces may be the same.
In some usage cases, the electronic device may further store, in a fourth predetermined format, a byte array that meets a third preset storage condition in the determined byte arrays to a fourth storage space. Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. The number of the byte arrays and the number of the fourth storage spaces may be the same.
In the method provided by the above embodiment of the application, a plurality of sub-data sets included in a target data set are read; each empty set among the sub-data sets is then stored in a first storage space according to a predetermined first format; a hash value of each piece of data in the sub-data sets is then determined to obtain a plurality of hash value sets; and finally, for each hash value set that meets a first preset storage condition, each hash value in that set and the number of hash values in that set are stored in a second storage space according to a predetermined second format. The intermediate results of the cardinality-solving process are thus effectively utilized. Storing these intermediate results amounts to processing the original data in advance, which helps determine the cardinality of the sets quickly and improves the flexibility of data processing.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for storing data is shown. The process 400 of the method for storing data includes the steps of:
step 401, reading a plurality of sub data sets included in the target data set.
In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.
Step 402, storing an empty set of the plurality of sub-data sets in a first storage space according to a predetermined first format.
In this embodiment, step 402 is substantially the same as step 202 in the corresponding embodiment of fig. 2, and is not described herein again.
Step 403, determining a hash value of each data in the plurality of sub-data sets to obtain a plurality of hash value sets.
In this embodiment, step 403 is substantially the same as step 203 in the corresponding embodiment of fig. 2, and is not described herein again.
Step 404, storing each hash value in the hash value sets meeting the first preset storage condition and the number of the hash values in the plurality of hash value sets to a second storage space according to a predetermined second format.
In this embodiment, step 404 is substantially the same as step 204 in the corresponding embodiment of fig. 2, and is not described herein again.
Step 405, for each hash value set of the plurality of hash value sets, determining a byte array corresponding to the hash value set.
In this embodiment, the electronic device may further determine, for each hash value set of the multiple hash value sets, a byte array corresponding to the hash value set. Wherein the position of the element in the byte array is determined based on the data on the first predetermined number of bits in each hash value in the hash value set, the element in the byte array is determined based on the position of the first 1 in the binary data characterized by the hash value in the hash value set, and the length of the byte array is predetermined, for example, the length of the byte array may be 64 bits (bit).
Illustratively, if the length of the byte array is $N$ (e.g., 64) bits and the hash value of each piece of data in the sub-data set obtained in step 403 has $L$ bits (e.g., 32 bits), the data on the first $\log_2 N$ bits (e.g., 6 bits) of the hash value is needed to determine the position of an element in the byte array. This leaves $L - \log_2 N$ bits (e.g., 26 bits), so each element in the byte array needs $\log_2(L - \log_2 N)$ bits, rounded up (e.g., 5 bits), to record the position of the first 1 in the binary data characterized by the hash value. For instance, if the data on the first 6 bits of the hash value is "000001", which is 1 in decimal, the first element in the byte array can be used to record the position of the first 1 in the binary data characterized by that hash value.
In the above example, when the data on the first predetermined number of bits of two (or more) hash values is the same (i.e., the element at the same position in the byte array would have to record the position of the first 1 for different hash values), the element at that position may be determined by one of those hash values, chosen according to the position of the first 1 in the binary data it characterizes, for example by the hash value whose first 1 is located furthest back (or by a hash value chosen according to another position rule). Here, "the first 1 is located further back" can be understood as follows: if the position of the first 1 in the binary data characterized by one hash value is 5 (i.e., the first 4 bits of the binary data are all 0 and the 5th bit is 1), and the position of the first 1 in the binary data characterized by another hash value is 3 (i.e., the first 2 bits of the binary data are all 0 and the 3rd bit is 1), then the hash value whose first 1 is located further back is the one whose first-1 position is 5.
According to the above steps, the electronic device may determine the byte array corresponding to each hash value set.
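The construction of a byte array from a hash value set can be sketched as below, using the example figures from this step: 32-bit hash values, 64 elements, and therefore 6 leading bits for the element position and 26 remaining bits to scan for the first 1. Several details are assumptions of this sketch rather than statements of the application: element positions are 0-based, the first 1 is searched in the remaining 26 bits only, an all-zero remainder is recorded as 27, and on a position clash the larger first-1 position is kept.

```python
from typing import Iterable

HASH_BITS = 32                       # L in the text
INDEX_BITS = 6                       # log2(N) with N = 64 elements
NUM_ELEMENTS = 1 << INDEX_BITS       # N
REST_BITS = HASH_BITS - INDEX_BITS   # 26 bits left after the index bits


def first_one_position(rest: int, width: int) -> int:
    """1-based position of the first 1 bit, scanning from the most significant
    of `width` bits; returns width + 1 when no bit is set (an assumption)."""
    return width - rest.bit_length() + 1


def build_byte_array(hashes: Iterable[int]) -> bytearray:
    """Build the byte array for one hash value set."""
    elements = bytearray(NUM_ELEMENTS)
    for h in hashes:
        h &= (1 << HASH_BITS) - 1
        index = h >> REST_BITS                # leading 6 bits select the element
        rest = h & ((1 << REST_BITS) - 1)     # remaining 26 bits
        position = first_one_position(rest, REST_BITS)
        if position > elements[index]:        # keep the first 1 located further back
            elements[index] = position
    return elements


# Two hash values with the same leading bits "000001" and first-1 positions 5 and 3.
h_a = (1 << 26) | (1 << 21)
h_b = (1 << 26) | (1 << 23)
byte_array = build_byte_array([h_a, h_b])
```

Both hash values map to the same element, and the element ends up holding 5, mirroring the example in which the hash value whose first 1 lies further back determines the element.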
Step 406, storing the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array to a third storage space according to a predetermined third format.
In this embodiment, the electronic device may further store, in the determined byte array, a non-0 element in the byte array that meets the second preset storage condition and a position of the non-0 element in the byte array according to a predetermined third format, to a third storage space.
The second preset storage condition may be that the size of the storage space occupied by the byte array is smaller than the size of the storage space occupied by the hash value set corresponding to the byte array. The third format may be a predetermined variety of formats. For example, the position of the non-0 element in the byte array where the non-0 element is located may be stored in a storage space of a specific size (e.g., 2048 bytes, etc.), and each non-0 element may be stored in a storage space consecutive to the storage space of the specific size. Here, the size of the storage space for storing the respective non-0 elements may be the same. The number of the byte arrays and the number of the third storage spaces may be the same.
As an example, please continue to refer to fig. 5. The storage space a of the third storage space 501 may be used to store a specific character (e.g., sparse, etc.) representing the third format; the storage space d may be used to store the positions of the non-0 elements in the byte array; and each storage space e may be used to store one non-0 element. These storage spaces may be contiguous. The size of the storage space a may be 1 byte (or another size), the size of the storage space d may be 2048 bytes (or another size), and the size of each storage space e may be 1 byte (or another size).
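The application gives the sizes of the position area (2048 bytes) and of each non-0 element (1 byte) but does not spell out how positions are encoded. One reading that is consistent with a 16384-element byte array is a bitmap with one bit per element (16384 / 8 = 2048 bytes). The sketch below follows that reading; the bitmap encoding, the marker value `0x02`, and the function names are assumptions for illustration only.

```python
SPARSE_TAG = b"\x02"  # hypothetical 1-byte marker standing in for "sparse"


def encode_sparse(elements: bytes) -> bytes:
    """Serialize as: 1-byte tag | position bitmap | the non-0 elements in order."""
    bitmap = bytearray((len(elements) + 7) // 8)
    values = bytearray()
    for i, value in enumerate(elements):
        if value:
            bitmap[i // 8] |= 1 << (i % 8)  # mark position i as non-0
            values.append(value)
    return SPARSE_TAG + bytes(bitmap) + bytes(values)


def decode_sparse(record: bytes, length: int) -> bytearray:
    """Rebuild a byte array of the given length from a sparse record."""
    if record[:1] != SPARSE_TAG:
        raise ValueError("record is not in the sparse format")
    bitmap_size = (length + 7) // 8
    bitmap = record[1:1 + bitmap_size]
    values = iter(record[1 + bitmap_size:])
    elements = bytearray(length)
    for i in range(length):
        if bitmap[i // 8] & (1 << (i % 8)):
            elements[i] = next(values)
    return elements


elements = bytearray(16384)
elements[1] = 5
elements[300] = 3
sparse_record = encode_sparse(elements)
```

For a byte array with two non-0 elements the record takes 1 + 2048 + 2 = 2051 bytes, compared with 16384 bytes for the elements stored in full.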
Step 407, storing the byte array meeting the third preset storage condition in the determined byte arrays to a fourth storage space according to a predetermined fourth format.
In this embodiment, the electronic device may further store, in a fourth storage space, a byte array that meets a third preset storage condition in the determined byte arrays according to a predetermined fourth format.
Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. Here, the number of the above byte arrays and the number of the fourth storage space may be the same.
By way of example, please continue to refer to fig. 6. Wherein, the storage space a of the third storage space 601 can be used for storing specific characters (e.g. full, etc.) representing the fourth format; each memory space e may be used to store one element of the byte array. The storage space may be a continuous storage space. The size of the storage space a may be 1 byte (or other sizes), and the size of all the storage spaces e may be 16384 bytes (or other sizes). If the byte array meets the third preset storage condition, the electronic device may store the byte array into a fourth storage space according to the format.
It should be noted that the storage format of the data (i.e., one of the first format, the second format, the third format, and the fourth format) may be identified by storing the specific character (e.g., one of empty, explicit, sparse, full).
Optionally, the preset storage condition may be further set as follows:
the first preset storage condition may also be a condition: the storage space for storing each hash value in the hash value set and the number of the hash values is less than the storage space for storing the non-0 element in the byte array corresponding to the hash value set and the position of the non-0 element in the byte array;
the second preset storage condition may also be a condition: the storage space for storing each hash value in the hash value set and the number of the hash values is larger than or equal to the storage space for storing the non-0 elements in the byte array corresponding to the hash value set and the positions of the non-0 elements in the byte array, and the storage space for storing the non-0 elements in the byte array and the positions of the non-0 elements in the byte array is smaller than or equal to the storage space for storing each element in the byte array;
the third preset storage condition may also be a condition: the storage space for storing the non-0 elements in the byte array and the positions of the non-0 elements in the byte array is larger than the storage space for storing the individual elements in the byte array.
It can be understood that, by setting the first preset storage condition, the second preset storage condition and the third preset storage condition in the above manner, the following effects can be achieved:
for each sub data set, a related result of the sub data set (for example, an empty set; each hash value and the number of hash values in the hash value set corresponding to the sub data set; the non-0 elements in the byte array corresponding to the sub data set and their positions in the byte array; or each element in the byte array corresponding to the sub data set) is stored in one of the first, second, third, or fourth storage space according to the corresponding one of the first, second, third, or fourth format described above, so that the related result of the sub data set is stored with relatively little storage space. When determining which of the four formats to use, the formats are considered in order: it is first judged whether storing the data in the latter format would save storage space compared with storing it in the former format; if not, the data is stored using the former format. It can be understood that, compared with calculating directly on the original data, the technical solution provided in the embodiment of the present application helps to accelerate the processing of the sub data sets by calculating and storing the intermediate result (i.e., the above-mentioned related result) in advance.
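The size comparisons in the optional conditions above can be sketched as a single selection function. This is only an illustration under stated assumptions: the format names follow the example characters (empty, explicit, sparse, full), while the function name, the 4-byte count field, the 8-byte hash size, the one-bit-per-position bitmap, and the one-byte elements are hypothetical choices.

```python
def choose_format(num_hashes: int, num_nonzero: int, array_len: int,
                  hash_size: int = 8, count_size: int = 4) -> str:
    """Pick a storage format by comparing the space each one would need."""
    if num_hashes == 0:
        return "empty"  # first format: the sub data set is an empty set
    explicit_size = count_size + num_hashes * hash_size  # count + hash values
    sparse_size = (array_len + 7) // 8 + num_nonzero     # bitmap + non-0 elements
    full_size = array_len                                # every element, 1 byte each
    if explicit_size < sparse_size:
        return "explicit"  # first preset storage condition
    if sparse_size <= full_size:
        return "sparse"    # second preset storage condition
    return "full"          # third preset storage condition
```

With a 16384-element byte array, 10 hash values would be stored explicitly, 1000 hash values filling 900 elements would use the sparse form, and an array with 15000 non-0 elements would be stored in full.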
In some optional implementations of this embodiment, the electronic device may further create a materialized view based on data stored in the first storage space, the second storage space, the third storage space, and the fourth storage space. The materialized view can be used for pre-calculating and storing the results of time-consuming operations, such as table joins or aggregations. Thus, when a query operation is executed, these time-consuming operations can be avoided, and the result can be obtained quickly. It is understood that the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space may be the result of operations that the materialized view needs to save. It should be noted that the technology for creating a materialized view is widely studied and well known to persons skilled in the related art (for example, database development engineers), and is not described herein again. In some use cases, the materialized view described above may be an aggregated materialized view.
In some optional implementations of this embodiment, the method further includes: and determining the cardinality of the target data set according to the maximum byte array in the determined byte arrays.
It will be appreciated that the byte array has recorded therein the position of the first 1 in the binary data represented by each hash value. This position may be used to estimate the cardinality of the target data set. Wherein the cardinality of the target data set is the number of distinct elements (data) in the target data set.
By way of example, the electronic device may determine a cardinality of the target data set according to a maximum byte array of the determined byte arrays based on a HyperLogLog algorithm.
Here, the cardinality of the target data set may be determined according to the position of the first 1 recorded by the elements in the byte array. For example, if the position of the first 1 recorded by an element is 1000 (that is, the first 999 bits of the binary data represented by the hash value are 0, and the 1000th bit is 1), the cardinality of the target data set may be $2^{1000}$.
Optionally, the electronic device may further determine an average or a harmonic mean of the positions (represented by numbers, for example, 1000) of the first 1 recorded by each element in the byte array, and determine the cardinality of the target data set accordingly. For example, if the position of the first 1 represented by the harmonic mean is 1000 (that is, the first 999 bits of the binary data represented by the hash value are 0, and the 1000th bit is 1), the cardinality of the target data set may be $2^{1000}$.
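The harmonic-mean variant described above can be sketched as follows. This is a simplified illustration of the text only: it raises 2 to the harmonic mean of the recorded first-1 positions and omits the bias correction and scaling used by the standard HyperLogLog estimator. The function name is hypothetical, and skipping elements equal to 0 (treated here as "nothing recorded") is an assumption.

```python
from statistics import harmonic_mean


def estimate_cardinality(registers) -> float:
    """Estimate cardinality as 2 ** (harmonic mean of first-1 positions)."""
    positions = [r for r in registers if r > 0]  # ignore empty elements
    if not positions:
        return 0.0
    return 2.0 ** harmonic_mean(positions)
```

For instance, recorded positions 2 and 6 have a harmonic mean of 3, so the sketch returns an estimate of about 8.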
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for storing data in the present embodiment highlights the step of determining the byte array corresponding to the hash value set. Therefore, the scheme described in this embodiment can compare the storage space required for each hash value in a hash value set and the number of the hash values with the storage space required for the non-0 elements of the corresponding byte array and their positions, and store the intermediate results generated while solving for the cardinality in whichever form occupies less storage space. This reduces the occupation of storage space, enables more flexible data processing, and helps further increase the speed of determining the cardinality of the set.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for storing data, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the apparatus 700 for storing data of the present embodiment includes: a reading unit 701, a first storage unit 702, a first determination unit 703, and a second storage unit 704. The reading unit 701 is configured to read a plurality of sub data sets included in a target data set, where the sub data sets are obtained by dividing the target data set; the first storage unit 702 is configured to store an empty set of the plurality of sub data sets into a first storage space according to a predetermined first format; the first determining unit 703 is configured to determine a hash value of each data in the multiple sub-data sets, so as to obtain multiple hash value sets; the second storage unit 704 is configured to store, in a second storage space, each hash value in a hash value set that meets a first preset storage condition in the plurality of hash value sets and the number of the hash values according to a predetermined second format.
In this embodiment, the reading unit 701 of the apparatus 700 for storing data may read a plurality of sub data sets included in a target data set from other electronic devices (e.g., terminal devices) through a wired or wireless connection. The sub data sets are obtained by dividing the target data set. The target data may be any type of data, including but not limited to at least one of: numbers, characters, structures, and arrays. The target data may be data that meets a specific condition, for example, a value greater than 10 or a gender of male. It should be noted that the target data set may include identical elements (i.e., duplicate data). The number of elements included in the target data set may be 1 billion, 10 billion (on the order of big data), and so on. The sub data sets may be, but are not limited to being, obtained by division as follows: dividing the target data based on the maximum amount of data that the reading unit 701 can read at a time; or dividing the target data based on distributed streaming data statistics. It will be appreciated that the target data set may be composed of the plurality of sub data sets obtained by division.
In this embodiment, based on the plurality of sub data sets obtained by the reading unit 701, the first storage unit 702 may store an empty set of the plurality of sub data sets into the first storage space according to a predetermined first format. The first format may be various formats determined in advance. For example, the first format described above may be characterized by a particular character (e.g., null, etc.). The size of the first storage space may be 1 byte, 2 bytes, etc. Here, the first storage unit 702 may store each empty set to one first storage space.
In this embodiment, the first determining unit 703 may determine a hash value of each data item in the read sub data sets to obtain a plurality of hash value sets. The number of the obtained hash value sets is equal to the number of the read sub data sets. It is to be understood that the hash value of the data may be determined by, but is not limited to, the following algorithms: Secure Hash Algorithm (SHA), Message Digest (MD) algorithm.
In this embodiment, the second storage unit 704 may store, in the second storage space, each hash value in the hash value set meeting the first preset storage condition and the number of the hash values in the hash value set obtained by the first determining unit 703 according to a predetermined second format. The first preset storage condition may be that the size of the storage space occupied by the hash value set is smaller than a predetermined storage space size (for example, 1024 bits). The second format may be a predetermined variety of formats. For example, the number may be stored in a storage space of a certain size (e.g., 4 bytes, 8 bytes, etc.), and the hash values may be stored in storage spaces consecutive to the storage space of the certain size. Here, the size of the storage space for storing the respective hash values may be the same. The number of the hash value sets meeting the first preset storage condition and the number of the second storage spaces may be the same.
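The second format described above (a fixed-size count followed by equally sized hash values in contiguous storage) can be sketched as follows. This is a minimal illustration: the 4-byte little-endian count, the 8-byte hash size, the removal of duplicate hash values, and both function names are assumptions rather than details taken from the embodiment.

```python
import struct


def encode_explicit(hash_values, hash_size: int = 8) -> bytes:
    """Return a 4-byte count followed by each distinct hash value."""
    unique = sorted(set(hash_values))        # a set holds each hash value once
    header = struct.pack("<I", len(unique))  # number of hash values
    body = b"".join(h.to_bytes(hash_size, "little") for h in unique)
    return header + body


def decode_explicit(blob: bytes, hash_size: int = 8) -> list:
    """Read the count, then that many fixed-size hash values."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return [
        int.from_bytes(blob[4 + i * hash_size:4 + (i + 1) * hash_size], "little")
        for i in range(count)
    ]
```

Because every hash value occupies the same number of bytes, the count alone is enough to locate each value when reading the storage space back.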
In some use cases, the apparatus may further determine a byte array corresponding to each hash value set based on a HyperLogLog algorithm. It may then store the non-0 elements in each determined byte array that meets the second preset storage condition, together with the positions of the non-0 elements in the byte array, into a third storage space according to a predetermined third format. The second preset storage condition may be that the size of the storage space occupied by the byte array is smaller than the size of the storage space occupied by the hash value set corresponding to the byte array. The third format may be any of various predetermined formats. For example, the positions of the non-0 elements in the byte array may be stored in a storage space of a specific size (e.g., 2048 bytes, etc.), and each non-0 element may be stored in a storage space contiguous with the storage space of the specific size. Here, the size of the storage space for storing the respective non-0 elements may be the same. The number of the byte arrays and the number of the third storage spaces may be the same.
In some usage cases, the apparatus may further store, in a fourth format determined in advance, a byte array that meets a third preset storage condition in the determined byte arrays to a fourth storage space. Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. The number of the byte arrays and the number of the fourth storage spaces may be the same.
In some optional implementations of this embodiment, the apparatus further includes: the second determining unit (not shown in the figure) is configured to determine, for each hash value set of the plurality of hash value sets, a byte array corresponding to the hash value set, where a position of an element in the byte array is determined based on data on a first predetermined number of bits in each hash value in the hash value set, an element in the byte array is determined based on a position of a first 1 in the binary data represented by the hash value in the hash value set, and a length of the byte array is predetermined, for example, the length of the byte array may be 64 bits (bit).
Illustratively, if the length of the byte array is N (e.g., 64) bits, and the hash value of each data item in the sub data set obtained in step 403 is L bits (e.g., 32 bits), the data on the first $\log_2 N$ bits (e.g., 6 bits) of the hash value is required to determine the position of an element in the byte array. This leaves $L - \log_2 N$ bits (e.g., 26 bits), so each element in the byte array requires $\lceil \log_2 (L - \log_2 N) \rceil$ bits (e.g., 5 bits) to record the position of the first 1 in the binary data characterized by the hash value. For example, if the data on the first 6 bits of the hash value is "000001", which is 1 in decimal, the first element in the byte array can be used to record the position of the first 1 in the binary data characterized by the hash value.
In the above example, when the data on the first predetermined number of bits is the same in two (or more) hash values (i.e., the element at the same position in the byte array would have to record the position of the first 1 in the binary data represented by different hash values), the element at that position may be determined, according to the above steps, from the hash value whose first 1 is located furthest back (or from any one of those hash values). Here, "the first 1 is located furthest back" can be determined as follows: for example, if the position of the first 1 in the binary data represented by one hash value is 5 (i.e., the first 4 bits of the binary data are all 0, and the 5th bit is 1), and the position of the first 1 in the binary data represented by another hash value is 3 (i.e., the first 2 bits of the binary data are all 0, and the 3rd bit is 1), the hash value whose first 1 is located further back is the one whose first-1 position is 5.
According to the steps, the device can determine the byte array corresponding to each hash value set.
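The steps above can be sketched as follows, using the example values of a 32-bit hash value and a 64-element byte array. This is an illustration under stated assumptions, not the patented implementation: the function name is hypothetical, the array length is assumed to be a power of two, element positions are zero-based (so the prefix "000001" selects index 1), and a hash value whose remaining bits are all 0 is recorded as one position past the end.

```python
def build_byte_array(hash_values, hash_bits: int = 32, array_len: int = 64) -> bytearray:
    """Record, per leading-bit prefix, the furthest-back first-1 position."""
    index_bits = array_len.bit_length() - 1  # log2(array_len), e.g. 6
    rest_bits = hash_bits - index_bits       # remaining bits, e.g. 26
    registers = bytearray(array_len)
    for h in hash_values:
        index = h >> rest_bits               # first index_bits bits pick the element
        rest = h & ((1 << rest_bits) - 1)    # remaining bits of the hash value
        # 1-based position of the first 1, counted from the most significant bit
        position = rest_bits - rest.bit_length() + 1
        # when several hash values share a prefix, keep the furthest-back position
        registers[index] = max(registers[index], position)
    return registers
```

Two hash values that share the prefix "000001" and have first-1 positions 5 and 3 leave the value 5 in the element at index 1, matching the example above.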
In some optional implementations of this embodiment, the apparatus further includes: the third storage unit (not shown in the figure) is configured to store the non-0 element in the byte array meeting the second preset storage condition in the determined byte array and the position of the non-0 element in the byte array in a third predetermined format into a third storage space.
The second preset storage condition may be that the size of the storage space occupied by the byte array is smaller than the size of the storage space occupied by the hash value set corresponding to the byte array. The third format may be a predetermined variety of formats. For example, the position of the non-0 element in the byte array where the non-0 element is located may be stored in a storage space of a specific size (e.g., 2048 bytes, etc.), and each non-0 element may be stored in a storage space consecutive to the storage space of the specific size. Here, the size of the storage space for storing the respective non-0 elements may be the same. The number of the byte arrays and the number of the third storage spaces may be the same.
In some optional implementations of this embodiment, the apparatus further includes: the fourth storage unit (not shown in the figure) is configured to store the byte arrays meeting the third preset storage condition in the determined byte arrays to the fourth storage space according to a predetermined fourth format.
Wherein the third preset storage condition may be that the ratio of the number of non-0 elements in the byte array to the number of total elements is greater than a preset percentage threshold (e.g., 80%). The fourth format may be a predetermined variety of formats. For example, each element in the byte array may be stored with contiguous storage space. Here, the size of the storage space for storing the respective elements may be the same. Here, the number of the above byte arrays and the number of the fourth storage space may be the same.
Optionally, the preset storage condition may be further set as follows:
the first preset storage condition may also be a condition: the storage space for storing each hash value in the hash value set and the number of the hash values is less than the storage space for storing the non-0 element in the byte array corresponding to the hash value set and the position of the non-0 element in the byte array;
the second preset storage condition may also be a condition: the storage space for storing each hash value in the hash value set and the number of the hash values is larger than or equal to the storage space for storing the non-0 elements in the byte array corresponding to the hash value set and the positions of the non-0 elements in the byte array, and the storage space for storing the non-0 elements in the byte array and the positions of the non-0 elements in the byte array is smaller than or equal to the storage space for storing each element in the byte array;
the third preset storage condition may also be a condition: the storage space for storing the non-0 elements in the byte array and the positions of the non-0 elements in the byte array is larger than the storage space for storing the individual elements in the byte array.
It can be understood that, by setting the first preset storage condition, the second preset storage condition and the third preset storage condition in the above manner, the following effects can be achieved:
for each sub data set, a related result of the sub data set (for example, an empty set; each hash value and the number of hash values in the hash value set corresponding to the sub data set; the non-0 elements in the byte array corresponding to the sub data set and their positions in the byte array; or each element in the byte array corresponding to the sub data set) is stored in one of the first, second, third, or fourth storage space according to the corresponding one of the first, second, third, or fourth format described above, so that the related result of the sub data set is stored with relatively little storage space. When determining which of the four formats to use, the formats are considered in order: it is first judged whether storing the data in the latter format would save storage space compared with storing it in the former format; if not, the data is stored using the former format. It can be understood that, compared with calculating directly on the original data, the technical solution provided in the embodiment of the present application helps to accelerate the processing of the sub data sets by calculating and storing the intermediate result (i.e., the above-mentioned related result) in advance.
In some optional implementations of this embodiment, the apparatus further includes: a creating unit (not shown in the figure) configured to create a materialized view based on the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space. The materialized view can be used for pre-calculating and storing the results of time-consuming operations, such as table joins or aggregations. Thus, when a query operation is executed, these time-consuming operations can be avoided, and the result can be obtained quickly. It is understood that the data stored in the first storage space, the second storage space, the third storage space and the fourth storage space may be the result of operations that the materialized view needs to save. It should be noted that the technology for creating a materialized view is widely studied and well known to persons skilled in the related art (for example, database development engineers), and is not described herein again. In some use cases, the materialized view described above may be an aggregated materialized view.
In some optional implementations of this embodiment, the apparatus further includes: a third determining unit (not shown in the figure) is configured to determine a cardinality of the target data set according to a largest byte array among the determined byte arrays.
By way of example, the apparatus may determine a cardinality of the target data set according to a maximum byte array among the determined byte arrays based on a HyperLogLog algorithm.
It will be appreciated that the byte array has recorded therein the position of the first 1 in the binary data represented by each hash value. This position may be used to estimate the cardinality of the target data set. Wherein the cardinality of the target data set is the number of distinct elements (data) in the target data set.
In the apparatus provided by the above embodiment of the present application, the reading unit 701 reads a plurality of sub data sets included in a target data set; the first storage unit 702 then stores the empty sets among the plurality of sub data sets to a first storage space according to a predetermined first format; the first determining unit 703 then determines the hash value of each data item in the plurality of sub data sets to obtain a plurality of hash value sets; and finally the second storage unit 704 stores, to a second storage space according to a predetermined second format, each hash value and the number of hash values in the hash value sets that meet a first preset storage condition among the plurality of hash value sets. The intermediate results of the solving process are thereby used effectively, and the original data is processed in advance based on the stored intermediate results, which helps determine the cardinality of the set quickly and improves the flexibility of data processing.
Referring now to fig. 8, shown is a block diagram of a computer system 800 suitable for use in implementing a terminal device/server according to embodiments of the present application. The terminal device/server shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read therefrom is installed into the storage section 808 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a reading unit, a first storage unit, a first determination unit, and a second storage unit. Here, the names of the units do not constitute a limitation to the unit itself in some cases, and for example, the reading unit may also be described as "a unit that reads a plurality of sub data sets included in the target data set".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the server described in the above embodiments; or may exist separately and not be assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: reading a plurality of subdata sets included by a target data set, wherein the subdata sets are obtained by dividing the target data set; storing an empty set in the plurality of sub-data sets to a first storage space according to a predetermined first format; determining a hash value of each data in the plurality of subdata sets to obtain a plurality of hash value sets; and storing each hash value in the hash value sets which meet the first preset storage condition and the number of the hash values in the hash value sets to a second storage space according to a predetermined second format.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. A method for storing data, comprising:
reading a plurality of sub data sets included in a target data set, wherein the sub data sets are obtained by dividing the target data set;
storing the empty sets among the plurality of sub data sets in a first storage space according to a predetermined first format;
determining a hash value of each piece of data in the plurality of sub data sets to obtain a plurality of hash value sets; and
for each hash value set among the plurality of hash value sets that meets a first preset storage condition, storing each hash value in that set and the number of hash values in that set in a second storage space according to a predetermined second format.
2. The method of claim 1, wherein the method further comprises:
determining, for each hash value set in the plurality of hash value sets, a byte array corresponding to the hash value set, wherein the position of an element in the byte array is determined based on the data in the first predetermined number of bits of each hash value in the hash value set, the element in the byte array is determined based on the position of the first 1 in the binary data represented by the hash value in the hash value set, and the length of the byte array is predetermined.
3. The method of claim 2, wherein the method further comprises:
for each byte array among the determined byte arrays that meets a second preset storage condition, storing the non-zero elements of the byte array and the positions of the non-zero elements in the byte array in a third storage space according to a predetermined third format.
4. The method of claim 3, wherein the method further comprises:
storing the byte arrays, among the determined byte arrays, that meet a third preset storage condition in a fourth storage space according to a predetermined fourth format.
5. The method of claim 4, wherein the method further comprises:
creating a materialized view based on the data stored in the first storage space, the second storage space, the third storage space, and the fourth storage space.
6. The method according to any one of claims 2-5, wherein the method further comprises:
determining the cardinality of the target data set according to the largest byte array among the determined byte arrays.
7. An apparatus for storing data, comprising:
a reading unit, configured to read a plurality of sub data sets included in a target data set, wherein the sub data sets are obtained by dividing the target data set;
a first storage unit, configured to store the empty sets among the plurality of sub data sets in a first storage space according to a predetermined first format;
a first determining unit, configured to determine a hash value of each piece of data in the plurality of sub data sets to obtain a plurality of hash value sets; and
a second storage unit, configured to store, for each hash value set among the plurality of hash value sets that meets a first preset storage condition, each hash value in that set and the number of hash values in that set in a second storage space according to a predetermined second format.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a second determining unit, configured to determine, for each hash value set in the plurality of hash value sets, a byte array corresponding to the hash value set, wherein the position of an element in the byte array is determined based on the data in a predetermined number of bits of each hash value in the hash value set, the element in the byte array is determined based on the position of the first 1 in the binary data represented by the hash value in the hash value set, and the length of the byte array is predetermined.
9. The apparatus of claim 8, wherein the apparatus further comprises:
a third storage unit, configured to store, for each byte array among the determined byte arrays that meets a second preset storage condition, the non-zero elements of the byte array and the positions of the non-zero elements in the byte array in a third storage space according to a predetermined third format.
10. The apparatus of claim 9, wherein the apparatus further comprises:
a fourth storage unit, configured to store the byte arrays, among the determined byte arrays, that meet a third preset storage condition in a fourth storage space according to a predetermined fourth format.
11. The apparatus of claim 10, wherein the apparatus further comprises:
a creating unit, configured to create a materialized view based on the data stored in the first storage space, the second storage space, the third storage space, and the fourth storage space.
12. The apparatus according to any one of claims 8-11, wherein the apparatus further comprises:
a third determining unit, configured to determine the cardinality of the target data set according to the largest byte array among the determined byte arrays.
13. A server, comprising:
one or more processors; and
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
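
The byte array recited in claims 2 to 6 resembles the register array of a HyperLogLog-style sketch. The following Python sketch illustrates that reading only; it is not taken from the patent. The index width `P`, the 64-bit hash width, the truncated SHA-256 hash, the element-wise-maximum merge used to interpret "the largest byte array", and the standard HyperLogLog estimator with its small-range correction are all assumptions.

```python
import hashlib
import math

P = 12                     # hypothetical number of leading bits used as the position
M = 1 << P                 # predetermined byte-array length (4096 under this assumption)
HASH_BITS = 64             # hypothetical hash width
REST_BITS = HASH_BITS - P  # bits left after the position bits


def hash64(item: bytes) -> int:
    """Return a 64-bit hash of one data item (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(item).digest()[:8], "big")


def to_byte_array(hashes):
    """Build the byte array for one hash value set."""
    regs = bytearray(M)
    for h in hashes:
        idx = h >> REST_BITS                # leading P bits select the position
        rest = h & ((1 << REST_BITS) - 1)   # remaining bits
        # 1-based position of the first 1 bit in the remaining bits;
        # REST_BITS + 1 when the remaining bits are all zero.
        rank = REST_BITS - rest.bit_length() + 1
        if rank > regs[idx]:
            regs[idx] = rank
    return regs


def to_sparse(regs):
    """Keep only (position, value) pairs of non-zero elements, as in claim 3."""
    return [(i, v) for i, v in enumerate(regs) if v != 0]


def merge(arrays):
    """Combine byte arrays by element-wise maximum (an assumed reading of claim 6)."""
    merged = bytearray(M)
    for regs in arrays:
        for i, v in enumerate(regs):
            if v > merged[i]:
                merged[i] = v
    return merged


def estimate(regs):
    """Standard HyperLogLog estimate; the patent's own formula is not given here."""
    alpha = 0.7213 / (1 + 1.079 / M)        # bias constant, valid for M >= 128
    raw = alpha * M * M / sum(2.0 ** -v for v in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)      # small-range (linear counting) correction
    return raw
```

Under these assumptions, a sparse `(position, value)` list suits byte arrays with few non-zero elements, while a full `M`-byte array suits dense ones, which is one plausible motivation for the separate third and fourth storage formats.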

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810117471.9A CN108197324B (en) 2018-02-06 2018-02-06 Method and apparatus for storing data

Publications (2)

Publication Number Publication Date
CN108197324A CN108197324A (en) 2018-06-22
CN108197324B CN108197324B (en) 2021-07-16

Family

ID=62592561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810117471.9A Active CN108197324B (en) 2018-02-06 2018-02-06 Method and apparatus for storing data

Country Status (1)

Country Link
CN (1) CN108197324B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737691B (en) * 2018-07-03 2022-11-04 百度在线网络技术(北京)有限公司 Method and apparatus for processing access behavior data
CN109299112B (en) * 2018-11-15 2020-01-17 北京百度网讯科技有限公司 Method and apparatus for processing data
CN111435939B (en) * 2019-01-14 2023-05-05 百度在线网络技术(北京)有限公司 Method and device for dividing storage space of node
CN110955685A (en) * 2019-11-29 2020-04-03 北京锐安科技有限公司 Big data base estimation method, system, server and storage medium
CN111523072B (en) * 2020-04-20 2023-08-15 咪咕文化科技有限公司 Page access data statistics method and device, electronic equipment and storage medium
CN113064555A (en) * 2021-04-21 2021-07-02 山东英信计算机技术有限公司 BIOS data storage method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609441A (en) * 2011-12-27 2012-07-25 中国科学院计算技术研究所 Local-sensitive hash high-dimensional indexing method based on distribution entropy
CN103389992A (en) * 2012-05-09 2013-11-13 北京百度网讯科技有限公司 Structured data storage method and device
CN103473334A (en) * 2013-09-18 2013-12-25 浙江中控技术股份有限公司 Data storage method, inquiry method and system
CN106484691A (en) * 2015-08-24 2017-03-08 阿里巴巴集团控股有限公司 Data storage method and device for mobile terminal
CN106682147A (en) * 2016-12-22 2017-05-17 北京锐安科技有限公司 Mass data based query method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7584182B2 (en) * 2005-12-19 2009-09-01 Microsoft Corporation Determining cardinality of a parameter using hash values

Also Published As

Publication number Publication date
CN108197324A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108197324B (en) Method and apparatus for storing data
CN108733317B (en) Data storage method and device
CN108628898B (en) Method, device and equipment for data storage
WO2016045641A2 (en) Data block storage method, data query method and data modification method
CN109697277B (en) Text compression method and device
US10360198B2 (en) Systems and methods for processing binary mainframe data files in a big data environment
CN108540508B (en) Method, device and equipment for pushing information
CN113010542B (en) Service data processing method, device, computer equipment and storage medium
CN110866040A (en) User portrait generation method, device and system
CN110198473B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN116049109A (en) File verification method, system, equipment and medium based on filter
CN112436943B (en) Request deduplication method, device, equipment and storage medium based on big data
CN112650804B (en) Big data access method, device, system and storage medium
CN111949678A (en) Method and device for processing non-accumulation indexes across time windows
CN114064308A (en) Multi-data sending and receiving method, device and equipment based on column type data scanning
CN113641706B (en) Data query method and device
CN110852057A (en) Method and device for calculating text similarity
CN111176641B (en) Flow node execution method, device, medium and electronic equipment
CN110377822B (en) Method and device for network characterization learning and electronic equipment
CN111949648B (en) Memory data caching system and data indexing method
WO2023061180A1 (en) Multi frequency-based data sending method and apparatus, multi frequency-based data receiving method and apparatus, and device
CN111259013A (en) Method and device for storing data
CN110019531B (en) Method and device for acquiring similar object set
CN110504973A (en) Compressing file, decompressing method and device
JP6859407B2 (en) Methods and equipment for data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant