WO2019052209A1 - Data storage method, apparatus, and storage medium - Google Patents

Data storage method, apparatus, and storage medium

Info

Publication number
WO2019052209A1
Authority
WO
WIPO (PCT)
Prior art keywords
partition
bitmap
protocol
mapping
data
Prior art date
Application number
PCT/CN2018/087377
Other languages
English (en)
French (fr)
Inventor
钟超强
毕杰山
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019052209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2237 Vectors, bitmaps or matrices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating

Definitions

  • the present application relates to the field of information processing technologies, and in particular, to a data storage method, apparatus, and storage medium.
  • Hadoop Database (HBase) is distributed, highly reliable, high-performance, and based on key-value storage. Therefore, more and more enterprises and users use HBase to build data tables.
  • the data table includes a plurality of rows of data records, and each row of data records includes an identifier of the carrier and a tag value of each tag that the carrier has.
  • for example, the row corresponding to user A in the data table includes the identifier of user A, the tag value "female", and the tag value "engineer". That is, the data table records the correspondence between the identifier of the bearer and the tag values it has.
  • with such a data table, a data query by the identifier of the bearer is efficient. However, when the query is based on a certain tag value or a combination of tag values, the related technology can only use a column value filter: the tag values of each bearer are checked row by row, one bearer identifier at a time. Because a data table usually has tens of thousands of rows, data queries based on tag values are inefficient in the related scheme.
  • the present application provides a data storage method, device and storage medium.
  • the technical solution is as follows:
  • a data storage method comprising:
  • each data record including a carrier identifier and at least one tag value
  • Each first mapping set corresponds to a first protocol partition
  • the N first protocol partitions are determined according to partition information of N bitmap index partitions included in the bitmap index, where N is a positive integer, each bitmap index partition corresponds to a first protocol partition, each bitmap index partition includes at least one bitmap, each bitmap corresponds to a label value, each bitmap includes at least one bitmap bit, and each bitmap bit is used to record whether the bearer corresponding to a bearer identifier has the tag value corresponding to the current bitmap;
  • the resulting bitmap of the tag values in each bitmap index partition is stored in the corresponding bitmap index partition.
  • when at least one data record is acquired, the at least one data record may be stored in the bitmap index based on the preset mapping/protocol model, so that after the data is stored, the bearer identifiers having a certain tag value can be looked up through the bitmap index based on that tag value.
  • the bitmap of the tag values in each bitmap index partition can be determined in parallel by the preset mapping/protocol model, which improves the efficiency of storing at least one data record to N bitmap index partitions.
  • the partition information of each first protocol partition is composed of a bitmap index table identifier and a bearer identifier of a preset interval range;
  • the first type of mapping process is performed on the at least one data record in parallel by using the preset mapping/protocol model to obtain at least one first mapping result, where each first mapping result includes the bitmap index table identifier, the bearer identifier, and at least one tag value;
  • the first type of mapping process further needs to be performed on each data record in parallel through the preset mapping/protocol model, to facilitate classifying the at least one first mapping result after the mapping.
  • Bitmap including:
  • a protocol partition processes the data belonging to it in a certain order. Therefore, for each first mapping set, the first protocol partition corresponding to the first mapping set may first sort the first mapping results in the first mapping set and then process each first mapping result in turn according to the sorting result.
  • the method before updating the bitmap of the label value according to the bitmap bit of the bearer identifier, the method further includes:
  • if the first mapping result further includes a bitmap bit of the bearer identifier, performing the operation of updating the bitmap of the tag value according to the bitmap bit of the bearer identifier;
  • if the first mapping result does not include a bitmap bit of the bearer identifier, acquiring a bitmap bit of the bearer identifier, and then performing the operation of updating the bitmap of the tag value according to the bitmap bit of the bearer identifier.
  • When updating a bitmap of a tag value, the bitmap bit of the bearer identifier needs to be determined first. The system may or may not have configured a bitmap bit for the bearer identifier in advance. Therefore, the first mapping result may or may not include the bitmap bit of the bearer identifier.
  • When the first mapping result does not include the bitmap bit of the bearer identifier, the bitmap bit of the bearer identifier needs to be acquired before the bitmap of a tag value is updated.
  • the method further includes:
  • the correspondence between the bitmap bit of the bearer identifier and the bearer identifier may also be stored, so that the bitmap bit of a bearer identifier can be queried according to the bearer identifier, or the bearer identifier corresponding to a bitmap bit can be queried according to the bitmap bit.
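  • The handling of bitmap bits described above can be sketched in Python. This is only an illustration under stated assumptions: the class name `BitAllocator`, the function `update_bitmap`, the sequential allocation of bits starting at 0, and the use of an integer as the bitmap are all hypothetical and are not taken from the application.

```python
class BitAllocator:
    """Hypothetical helper: hands out a bitmap bit to each bearer identifier."""

    def __init__(self):
        self.bit_of = {}     # bearer identifier -> bitmap bit
        self.bearer_of = {}  # bitmap bit -> bearer identifier

    def get_or_assign(self, bearer_id):
        """Return the existing bitmap bit, or acquire a new one if none is configured."""
        if bearer_id not in self.bit_of:
            bit = len(self.bit_of)  # assumption: bits are allocated sequentially from 0
            self.bit_of[bearer_id] = bit
            self.bearer_of[bit] = bearer_id  # store the correspondence in both directions
        return self.bit_of[bearer_id]


def update_bitmap(bitmap, bit):
    """Set the given bitmap bit in an integer-backed bitmap and return the result."""
    return bitmap | (1 << bit)


alloc = BitAllocator()
bitmap = 0
for bearer in ["user1", "user4", "user1"]:
    bitmap = update_bitmap(bitmap, alloc.get_or_assign(bearer))
```

  • In this sketch a repeated bearer identifier reuses its existing bit, and `alloc.bearer_of` allows the reverse lookup from a bitmap bit to a bearer identifier.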
  • because the at least one data record is classified in the first type of classification according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, the partition information of the N first protocol partitions needs to be determined according to the partition information of the bitmap index before the first type of classification is performed on the at least one data record.
  • the method further includes:
  • the M second protocol partitions are determined according to the partition information of the M data partitions included in the data table, where M is a positive integer, each data partition corresponds to a second protocol partition, and each data partition is used for recording bearers.
  • the obtained data of each data partition is stored in the corresponding data partition.
  • when at least one data record is acquired, the at least one data record may also be stored in the data table based on the preset mapping/protocol model, so that the at least one data record is stored in the bitmap index and in the data table at the same time.
  • the data in each data partition can be determined in parallel by a preset mapping/protocol model, which improves the efficiency of storing at least one data record to M data partitions.
  • the partition information of each second protocol partition is composed of a data table identifier and a bearer identifier of a preset interval range;
  • the second type of classification of the at least one data record including:
  • each second mapping result includes the data table identifier, the bearer identifier, and at least one tag value
  • the second type of mapping process needs to be performed on each data record in parallel by using the preset mapping/protocol model, to facilitate classifying the at least one second mapping result after the mapping.
  • N is less than or equal to M, and N is greater than or equal to 2.
  • Each data partition in the M data partitions belongs to a unique bitmap index partition, and each bitmap index partition in the N bitmap index partitions includes at least one data partition.
  • the M data partitions included in the data table and the N bitmap index partitions included in the bitmap index need to satisfy the above condition. That is, the partition range of each data partition in the data table can be set appropriately smaller, and the partition range of each bitmap index partition in the bitmap index can be set larger.
  • a data storage device having a function of implementing the behavior of the data storage method of the first aspect described above.
  • the data storage device includes at least one module for implementing the data storage method provided by the first aspect above.
  • another data storage device, the structure of which includes a processor and a memory. The memory is used to store a program that supports the data storage device in executing the data storage method provided by the first aspect above, and to store the data involved in implementing that method.
  • the processor is configured to execute a program stored in the memory.
  • the data storage device may further include a communication bus for establishing a connection between the processor and the memory.
  • a computer readable storage medium is provided. The computer readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the data storage method of the first aspect described above.
  • a computer program product comprising instructions for causing a computer to perform the data storage method of the first aspect described above when executed on a computer is provided.
  • a bitmap of the tag values in each bitmap index partition included in the bitmap index may be determined by using the preset mapping/protocol model based on the at least one data record, so that the at least one data record is stored in the corresponding bitmap index partition. Because the bitmap index includes at least one bitmap and each bitmap corresponds to one tag value, the bearer identifiers having a tag value can be looked up through the bitmap index based on that tag value, which improves the efficiency of data queries based on tag values. In addition, the bitmap of the tag values in each bitmap index partition can be determined in parallel by the preset mapping/protocol model, which improves the efficiency of storing data.
  • FIG. 1 is a schematic diagram of a bitmap of a tag value according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a data storage method according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of another data storage method according to an embodiment of the present invention.
  • FIG. 5A is a block diagram of a data storage device according to an embodiment of the present invention.
  • FIG. 5B is a block diagram of a first classification module according to an embodiment of the present invention.
  • FIG. 5C is a block diagram of another data storage device according to an embodiment of the present invention.
  • FIG. 5D is a block diagram of a second classification module according to an embodiment of the present invention.
  • FIG. 6 is a block diagram of another data storage device according to an embodiment of the present invention.
  • a tag is a way of organizing content to characterize a certain characteristic of a data to help people describe and classify content.
  • common labels are gender, education, occupation, color, and so on.
  • the label is artificially specified.
  • the tag can include both an enumerated tag and a boolean tag.
  • An enumeration label is a label that includes multiple enumeration values. For example, education includes specialist, undergraduate, graduate student, doctor, and so on, and gender includes male and female. A Boolean label only indicates whether the label applies, such as whether there is a room, whether the person uses drugs, whether there has been a criminal record, and so on.
  • the label value of the label refers to the specific value of the label. For example, taking the label as the academic qualification, when the degree is undergraduate, the label value is undergraduate, and when the degree is graduate, the label value is graduate student.
  • the label value of the label is the label itself. For example, when a user has a room, the tag value is a room. For example, when the user does not have a criminal record, the corresponding tag value is a criminal record.
  • Carrier is the object described by each tag.
  • the carrier may be a person, a car, a phone number or a virtual user account or the like.
  • a carrier can have one label or multiple labels.
  • the person's tag can have gender, education, whether there is a room, whether there is a criminal record, and so on.
  • the label of the car can be colored, whether there is a violation record, and the like.
  • Data table: data records created for indexing in the database.
  • Each data record in the data table records the identifier of a bearer, records all the tag values of the bearer, and records the correspondence between the identifier of the bearer and the tag value of the bearer.
  • Bitmap index: a secondary index established in the database, indexed by the tag values in the data table.
  • the bitmap index records the tag value and the bitmap, and also records a one-to-one correspondence between the tag value and the bitmap.
  • Each bitmap bit in the bitmap corresponds to the identifier of one bearer, and different bitmap bits in the bitmap correspond to the identifiers of different bearers. That is, there is a one-to-one correspondence between all bitmap bits in the bitmap and the identifiers of all bearers.
  • Each bitmap bit in the bitmap records whether the bearer corresponding to a bearer identifier has the label value corresponding to the current bitmap (the bitmap in which the bitmap bit is located). For example, if a bitmap bit in the bitmap of a label value is 1, the bearer corresponding to that bitmap bit has the label value; if the bitmap bit is 0, the bearer corresponding to that bitmap bit does not have the label value.
  • the same bitmap bits in different bitmaps correspond to the identity of the same carrier.
  • Take a virtual user account as the carrier, for example. Assume that there are eight virtual user accounts in total: user1, user2, ..., and user8. The set of accounts with the tag value "online shopper" is user1, user4, and user8, and the set with the tag value "forum active" is user1, user2, and user8.
  • the bitmap bits allocated to the eight virtual user accounts in the bitmap are 1, 2, 3, ..., 8, as shown in FIG. 1. For the tag value "online shopper", the corresponding bitmap is "10010001"; for the tag value "forum active", the corresponding bitmap is "11000001".
  • Take the bitmap "10010001" corresponding to "online shopper" as an example. The first "1" in the bitmap indicates that the virtual user account at bitmap bit 1 is an online shopper. Similarly, the second "1" indicates that the virtual user account at bitmap bit 4 is also an online shopper, and the third "1" indicates that the virtual user account at bitmap bit 8 is also an online shopper. The bitmap "11000001" corresponding to the tag value "forum active" is read in the same way. As can be seen from FIG. 1, user1 and user8 have both the "online shopper" and the "forum active" tag values.
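  • The eight-account example can be reproduced with a short Python sketch. It is illustrative only: bitmaps are modelled as strings of "0" and "1" so that they print exactly as in the text, and the function names `build_bitmap`, `carriers_with`, and `bitmap_and` are hypothetical helpers, not part of the application.

```python
# Bitmap bits 1..8 correspond to user1..user8, in that order.
accounts = [f"user{i}" for i in range(1, 9)]


def build_bitmap(members, carriers):
    """Return a '0'/'1' string with a 1 at the bitmap bit of every member."""
    return "".join("1" if c in members else "0" for c in carriers)


def carriers_with(bitmap, carriers):
    """Return the carriers whose bitmap bit is set to 1."""
    return [c for c, bit in zip(carriers, bitmap) if bit == "1"]


def bitmap_and(a, b):
    """Bitwise AND of two bitmaps of equal length."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


online_shopper = build_bitmap({"user1", "user4", "user8"}, accounts)
forum_active = build_bitmap({"user1", "user2", "user8"}, accounts)

# Carriers that have both tag values: AND the two bitmaps, then read off the set bits.
both = carriers_with(bitmap_and(online_shopper, forum_active), accounts)
```

  • A query for one tag value only needs to read one bitmap, and a combined query reduces to a bitwise operation on two bitmaps, which is why the bitmap index avoids the row-by-row scan of the data table.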
  • the application scenario of the embodiment of the present invention is introduced.
  • the client usually needs to perform data query through the server.
  • when the query is based on the identifier of a bearer, the server uses the pre-stored data table to look up the at least one label value corresponding to that identifier and determines the at least one label corresponding to the queried label values as the labels that the carrier has. In this case, the efficiency of querying data is high.
  • when the query is based on a tag value, the server checks the tag values of each bearer item by item in the data table to determine which carriers have the tag value. In this case, the efficiency of querying data is lower. It can be seen that how the server stores the correspondence between the bearer and the tag value affects the efficiency with which the client performs data query through the server.
  • the embodiment of the present invention is applied to a scenario in which a server performs data storage.
  • the server may be one or more servers; optionally, multiple servers may provide database services for the terminal in a server cluster manner.
  • a database is set in the server, and the database can be a distributed database such as HBase, MongoDB (Mongo Database), Distributed Relational Database Service (DRDS), VoltDB (Volt Database), or ScaleBase.
  • the data storage method provided by the embodiment of the present invention mainly includes two parts, one is to store at least one data record into a bitmap index, and the other is to store the at least one data record in the data table.
  • the bitmap index and data table provided by the embodiment of the present invention are first introduced.
  • the data table is used to record the correspondence between the bearer identifier and the tag value. The bitmap index includes at least one bitmap, each bitmap corresponds to one tag value, each bitmap includes at least one bitmap bit, and each bitmap bit is used to record whether the bearer corresponding to the bearer identifier has the label value corresponding to the current bitmap.
  • the data table may be divided into M data partitions, and the M data partitions of the data table are stored in a distributed manner.
  • the bitmap index is divided into N bitmap index partitions, and the N bitmap index partitions are likewise stored in a distributed manner. That is, the data table includes M data partitions, and the bitmap index includes N bitmap index partitions.
  • the first range of each data partition in the data table may be set appropriately smaller, and the second range of each bitmap index partition in the bitmap index may be set appropriately larger. That is, each of the N bitmap index partitions covers the data in at least one data partition.
  • M and N may satisfy the following relationship: N is less than or equal to M, N is greater than or equal to 2, each data partition in the M data partitions belongs to a unique bitmap index partition, and each bitmap index partition in the N bitmap index partitions contains at least one data partition.
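  • The relationship between M and N can be checked mechanically. The following Python sketch is an illustration under assumptions that are not in the application: bearer identifiers are integers, every partition is a half-open interval `(low, high)`, and `check_partition_layout` is a hypothetical function name.

```python
def check_partition_layout(data_partitions, index_partitions):
    """Return True if the layout satisfies 2 <= N <= M, every data partition lies
    inside exactly one bitmap index partition, and every bitmap index partition
    contains at least one data partition."""
    m, n = len(data_partitions), len(index_partitions)
    if not (2 <= n <= m):
        return False
    owners = set()
    for low, high in data_partitions:
        containing = [
            i for i, (ilow, ihigh) in enumerate(index_partitions)
            if ilow <= low and high <= ihigh
        ]
        if len(containing) != 1:  # must belong to a unique bitmap index partition
            return False
        owners.add(containing[0])
    return owners == set(range(n))  # no bitmap index partition may be empty


data = [(0, 10), (10, 20), (20, 30), (30, 40)]  # M = 4, smaller ranges
good_index = [(0, 20), (20, 40)]                # N = 2, larger ranges
bad_index = [(0, 15), (15, 40)]                 # splits the data partition (10, 20)
```

  • With these sample intervals, `good_index` is accepted, while `bad_index` is rejected because the data partition `(10, 20)` would straddle two bitmap index partitions.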
  • data partitions can be divided either by specifying the number of data partitions or by directly defining the partition interval of each data partition.
  • directly defining the partition interval of each data partition is taken as an example for description.
  • the range of bearer identifiers corresponding to each data partition is the same. For the convenience of description, the range of bearer identifiers corresponding to each data partition is referred to as a first range.
  • Each data partition is used to store the data records whose bearer identifier is located in its partition interval, and the partition intervals do not intersect, which prevents the same data record from being stored in two different data partitions.
  • for example, the data table includes data partition 1 [, a1), data partition 2 [a1, a2), data partition 3 [a2, a3), ..., and data partition 9 [a8, a9).
  • the data partition 1 is used to store the data records whose bearer identifier is in the partition interval [, a1), the data partition 2 is used to store the data records whose bearer identifier is in the partition interval [a1, a2), the data partition 3 is used to store the data records whose bearer identifier is in the partition interval [a2, a3), and so on; the data partition 9 is used to store the data records whose bearer identifier is in the partition interval [a8, a9).
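  • Locating the data partition for a bearer identifier amounts to a search over the sorted interval boundaries. The Python sketch below uses the standard `bisect` module. The concrete boundary values standing in for a1 to a9 are made up for illustration, as is the function name `data_partition_for`.

```python
import bisect

# Hypothetical values for the split points a1..a8; a9 is the exclusive upper bound.
BOUNDARIES = ["b", "d", "f", "h", "j", "l", "n", "p"]  # a1 .. a8
UPPER_BOUND = "r"                                       # a9


def data_partition_for(bearer_id):
    """Return the 1-based number of the data partition whose half-open
    interval [low, high) contains the bearer identifier."""
    if bearer_id >= UPPER_BOUND:
        raise ValueError("bearer identifier lies outside every partition interval")
    # bisect_right counts the boundaries <= bearer_id, so an identifier equal
    # to a1 falls into partition 2, matching the closed lower bound of [a1, a2).
    return bisect.bisect_right(BOUNDARIES, bearer_id) + 1
```

  • Because the intervals are half-open and do not intersect, every identifier below a9 maps to exactly one partition number.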
  • each data partition of the data table can be split or expanded automatically. For example, the data in a certain data partition increases over time. When the data volume of the data partition reaches the split threshold, the server can split the data partition into two data partitions, which avoids the situation in which new data can no longer be written to the data partition because its storage space is full.
  • bitmap index partitions can be divided in a way similar to data partitions.
  • the bitmap index partitions are likewise divided by defining the partition interval of each bitmap index partition. That is, the partition interval corresponding to each bitmap index partition is preset by the user, and the server divides the bitmap index partitions according to the preset partition intervals.
  • the range of bearer identifiers corresponding to each bitmap index partition is also the same. For the convenience of description, the range of bearer identifiers corresponding to each bitmap index partition is referred to as a second range.
  • for example, bitmap index partition 1 is used to store the bitmaps of the label values of the bearer identifiers in the partition interval [b0, c0), bitmap index partition 2 is used to store the bitmaps of the label values of the bearer identifiers in the partition interval [c0, d0), bitmap index partition 3 is used to store the bitmaps of the label values of the bearer identifiers in the partition interval [d0, e0), bitmap index partition 4 is used to store the bitmaps of the label values of the bearer identifiers in the partition interval [e0, f0), and bitmap index partition 5 is used to store the bitmaps of the label values of the bearer identifiers in the partition interval [f0, j0).
  • each bitmap index partition stores, for each label value, only the part of the bitmap that covers the bearer identifiers of that partition, so the bitmap of each label value is the combination of the parts stored in all bitmap index partitions.
  • for the convenience of description, the part of the bitmap of a label value in one bitmap index partition is called a sub-bitmap of the label value. Therefore, the bitmap of each label value is composed of the corresponding sub-bitmaps in all bitmap index partitions.
  • the same bitmap bit of different sub-bitmaps within one bitmap index partition corresponds to the identifier of the same bearer, whereas the same bitmap bit of sub-bitmaps in different bitmap index partitions corresponds to the identifiers of different bearers.
  • a sub-bitmap is created in each bitmap index partition for each tag value in the tag definition table.
  • the number of sub-bitmaps in each bitmap index partition is the number of all tag values in the tag definition table. For example, if the total number of tag values set in the tag definition table is 10, then after a sub-bitmap has been created in each bitmap index partition for each tag value in the tag definition table, the number of sub-bitmaps in each bitmap index partition is also 10.
  • the server sets the bitmap index partition so that it cannot be split or expanded.
  • FIG. 2 is a flowchart of a data storage method according to an embodiment of the present invention, which is applied to a scenario in which the at least one data record is stored in a bitmap index. As shown in FIG. 2, the data storage method includes the following steps:
  • Step 201 Acquire at least one data record, where each data record includes a carrier identifier and at least one tag value.
  • each piece of source data includes a carrier identifier and at least one label.
  • the at least one piece of source data may be data stored in a Hadoop Distributed File System (HDFS). That is, when the client needs to store certain data, the data is sent to the server, the server first stores the data in the HDFS, and the server then stores the data according to the source data in the HDFS.
  • the server may obtain at least one piece of source data from the HDFS according to the default path, or may obtain at least one piece of source data from the HDFS according to a preset path, which is not limited herein.
  • the tag definition table may be information that is previously acquired and stored by the server.
  • the tag definition table may be stored in the form of a separate file, such as an Extensible Markup Language (XML) file, or may be stored in a third-party distributed storage system, such as ZooKeeper.
  • the preset tag definition table records a preset plurality of tag values.
  • Optionally, the tag values contained in a tag may be set based on historical data or defined manually.
  • Table 1 shows a possible list of tag definitions. Of course, Table 1 may also include more or fewer tags, which is not limited.
  • Table 1
    Label | Tag value | Tag configuration information
    Gender | male, female | Resident memory
    Education | specialist, undergraduate, graduate student, doctor | Not resident memory
  • the label definition table may further include label configuration information, which indicates whether a label value needs resident memory. The bitmap corresponding to a label value that needs resident memory is used frequently and is kept resident in memory, whereas the bitmap corresponding to a label value that does not need resident memory does not need to be resident in memory.
  • in Table 1, the identifier "resident memory" is set for the tag values that need resident memory, and the identifier "not resident memory" is set for the tag values that do not require resident memory. It should be understood that the identifier "resident memory" may also be set only for the tag values that need resident memory, with no identifier set for the tag values that do not. Setting the identifier "not resident memory" in Table 1 for the tag values that do not require resident memory is just an example.
  • the label definition table may further include a lifetime of each label value, which is the period during which the label value is valid. At any time outside the lifetime, the label value is invalid.
  • the server may also assign a tag number to each tag value in Table 1.
  • the tag value can be replaced by the tag number, and storing the tag number saves storage space compared with storing the tag value.
  • the corresponding tag value can be queried according to the tag number, and the corresponding tag number can be queried according to the tag value.
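  • The two-way lookup between tag values and tag numbers can be sketched as a pair of dictionaries. The tag values below come from Table 1, but the numbering scheme (sequential integers starting at 1) and the names `number_of`, `value_of`, `encode`, and `decode` are assumptions made for illustration only.

```python
# Tag values from Table 1; the numbers assigned to them are an assumption.
tag_values = ["male", "female", "specialist", "undergraduate", "graduate student", "doctor"]

number_of = {value: number for number, value in enumerate(tag_values, start=1)}
value_of = {number: value for value, number in number_of.items()}


def encode(record_tags):
    """Replace each tag value in a record with its (shorter) tag number."""
    return [number_of[tag] for tag in record_tags]


def decode(tag_numbers):
    """Recover the tag values from stored tag numbers."""
    return [value_of[number] for number in tag_numbers]


stored = encode(["female", "doctor"])
```

  • Storing small integers instead of repeated strings is what saves space; the two dictionaries make the replacement reversible.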
  • Table 2 is a format of source data provided by an embodiment of the present invention.
  • Each row in Table 2 represents a piece of source data. Each piece of source data has a unique bearer identifier and further includes at least one tag corresponding to that bearer identifier.
  • Table 3 shows the data records corresponding to the respective pieces of source data determined by the server according to Table 1, wherein the contents in each [] in Table 3 represent one tag value.
  • Each row in Table 3 represents a data record, and each data record includes a carrier identifier and at least one tag value.
  • the at least one data record may be stored in the bitmap index by using a preset map/reduce model.
  • For the convenience of the description, the preset mapping/protocol model is explained here.
  • the preset mapping/protocol model (that is, the map/reduce model) is a parallel computing model that mainly includes two computing processes, a mapping process and a protocol (reduce) process. The mapping process classifies the data records according to the type of data to be stored. The protocol process stores the data records into the corresponding files according to the protocol partition corresponding to each data record.
  • the preset mapping/protocol model includes a plurality of protocol partitions. Each protocol partition corresponds to one data interval and is used to process the data belonging to that data interval, and different protocol partitions process their data in parallel. Because of this parallel processing between different protocol partitions, the bitmap of the label values in each bitmap index partition can be determined in parallel by the preset mapping/protocol model.
  • the mapping process maps each data record in parallel, so batch data can be processed in parallel by the preset mapping/protocol model, which also improves the efficiency of processing the data.
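  • The two phases can be imitated on a single machine with a thread pool. The Python sketch below is not the implementation described in the application; it only mirrors the structure of a mapping step followed by a per-partition protocol (reduce) step. The assumptions are: bearer identifiers are integers, each partition covers `PARTITION_WIDTH` consecutive identifiers, a bearer's bitmap bit is its offset within the partition, and sub-bitmaps are integers. All names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

PARTITION_WIDTH = 4  # assumption: each partition covers 4 consecutive identifiers


def map_record(record):
    """Mapping step: attach the number of the protocol partition to one record."""
    bearer_id, tags = record
    return bearer_id // PARTITION_WIDTH, bearer_id, tags


def reduce_partition(item):
    """Protocol (reduce) step: build one sub-bitmap per tag value for one partition."""
    partition, results = item
    bitmaps = defaultdict(int)
    for bearer_id, tags in sorted(results):  # process results in sorted order
        bit = bearer_id % PARTITION_WIDTH    # bitmap bit local to this partition
        for tag in tags:
            bitmaps[tag] |= 1 << bit
    return partition, dict(bitmaps)


def store(records):
    """Map all records in parallel, group them by partition, reduce partitions in parallel."""
    with ThreadPoolExecutor() as pool:
        groups = defaultdict(list)
        for partition, bearer_id, tags in pool.map(map_record, records):
            groups[partition].append((bearer_id, tags))
        return dict(pool.map(reduce_partition, groups.items()))


records = [(0, ["female", "engineer"]), (1, ["male"]), (5, ["female"]), (6, ["engineer"])]
index = store(records)
```

  • Each call to `reduce_partition` touches only the records of its own partition, so the partitions can be processed independently, which is the property the text relies on for parallelism.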
  • the at least one data record in addition to storing at least one data record in the bitmap index, the at least one data record may be stored in the data table, that is, the data record needs to be simultaneously stored in the data table.
  • the data table identifier and the bitmap index table identifier are introduced here, wherein the data table identifier is used to uniquely identify the data table, and the bitmap index table identifier is used for Uniquely identifies the bitmap index.
  • the protocol partition of the above preset mapping/protocol model can be set as the data partition of the data table and A combination of the bitmap index partitions of the bitmap index.
  • the data record can be directly stored into the corresponding data partition and the bitmap index partition by the preset mapping/protocol model.
  • The N protocol partitions that correspond one-to-one to the N bitmap index partitions of the bitmap index are referred to as first protocol partitions, and the M protocol partitions that correspond one-to-one to the M data partitions of the data table are referred to as second protocol partitions.
  • mapping process of the preset mapping/protocol model also includes two different mapping processing processes.
  • The protocol process of the preset mapping/protocol model also includes two different kinds of protocol processing.
  • The protocol processing that corresponds to storing the at least one data record into the bitmap index is called the first type of protocol processing.
  • Step 202 Determine N first protocol partitions in the preset mapping/protocol model.
  • the partition information of the bitmap index is determined, and the partition information of the bitmap index is used to describe a set of bearer identifiers corresponding to each bitmap index partition in the bitmap index. Determining N first protocol partitions in the preset mapping/protocol model according to the partition information of the bitmap index, and each first protocol partition corresponds to one bitmap index partition. That is, the N first protocol partitions are determined according to the partition information of the N bitmap index partitions included in the bitmap index.
  • Each partition interval in the data table represents a set of bearer identifiers, and each partition interval in the bitmap index also represents a set of bearer identifiers. Therefore, if the partition intervals of the data table were used directly as the partition intervals of the M second protocol partitions in the preset mapping/protocol model, and the partition intervals of the bitmap index were used directly as the partition intervals of the N first protocol partitions, the N first protocol partitions and the M second protocol partitions would intersect.
  • To avoid this, a bitmap index table identifier that identifies the bitmap index is added to each partition interval of the bitmap index, and the resulting partition intervals are determined as the partition intervals of the N first protocol partitions in the preset mapping/protocol model. That is, the partition information of each first protocol partition is composed of the bitmap index table identifier and the bearer identifiers of a preset interval range.
  • Here B is the bitmap index table identifier, that is, the identifier used to identify the bitmap index. The first protocol partitions [Bb0, Bc0), [Bc0, Bd0), [Bd0, Be0), [Be0, Bf0), and [Bf0, Bj0) are the protocol partitions that correspond one-to-one to the respective bitmap index partitions.
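  • The prefixing of partition intervals with a table identifier can be sketched as follows. This is a minimal illustration rather than the patented implementation: the helper names `prefix_partitions` and `intersects` are hypothetical, the boundaries b0 to j0 and the identifier B come from the example above, and the data-table boundaries a1, b5, c5 and the identifier A are assumed only to show that unprefixed intervals can intersect while prefixed ones do not.

```python
def prefix_partitions(table_id, boundaries):
    """Prefix each interval boundary with table_id and return [start, end) pairs."""
    keys = [table_id + b for b in boundaries]
    return list(zip(keys, keys[1:]))


def intersects(p, q):
    """True if the half-open key intervals p and q share at least one key."""
    return p[0] < q[1] and q[0] < p[1]


# Bitmap index boundaries from the text; data table boundaries are assumed.
bitmap_bounds = ["b0", "c0", "d0", "e0", "f0", "j0"]
data_bounds = ["a1", "b5", "c5"]

first_partitions = prefix_partitions("B", bitmap_bounds)
second_partitions = prefix_partitions("A", data_bounds)

# Without identifiers, [b0, c0) and [a1, b5) overlap on keys b0..b5.
clash_without_prefix = intersects(("b0", "c0"), ("a1", "b5"))

# With identifiers, no first partition overlaps any second partition.
clash_with_prefix = any(
    intersects(p, q) for p in first_partitions for q in second_partitions
)
```

  • Because every key of the bitmap index starts with B and every key of the data table starts with A, the two groups of intervals occupy disjoint key ranges under lexicographic comparison.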
  • After the N first protocol partitions are determined, the at least one data record needs to be classified according to the N first protocol partitions, so that each first protocol partition processes the data that belongs to it.
  • That is, according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, a first-type classification is performed on the at least one data record to obtain at least one first mapping set. Each first mapping set corresponds to one first protocol partition, so that the first protocol partition processes the data in its corresponding first mapping set.
  • the process can be implemented by the following steps 203 to 204.
  • Step 203 Perform a first type of mapping processing on the at least one data record in parallel by using the preset mapping/protocol model to obtain at least one first mapping result, where each first mapping result includes the bitmap index table identifier and the bearer identifier. And at least one tag value.
  • As can be seen from step 202, the partition intervals of the N first protocol partitions in the preset mapping/protocol model are not the actual partition intervals of the bitmap index partitions in the bitmap index. The first type of mapping processing therefore mainly adds the bitmap index table identifier to each data record, so that the first protocol partition corresponding to each data record can be determined subsequently.
  • That is, the bitmap index table identifier is added to each data record to obtain a first mapping result.
  • The preset mapping/protocol model adds the bitmap index table identifier to the data records in parallel, that is, it adds the identifier to all data records at the same time. Adding the identifier to n data records therefore takes about as long as adding it to one data record, which improves the efficiency of adding the bitmap index table identifier to the at least one data record.
  • the first mapping result may be recorded in a key-value format.
  • Table 4 shows a format of the first mapping result provided by an embodiment of the present invention. As shown in Table 4, the bitmap index table identifier and the bearer identifier in the first mapping result are together set as the key, and the at least one tag value in the first mapping result is set as the value of that key.
  • Remark information may also be added to the value. The remark information includes the generation time of each of the at least one tag value, or the internal identity (ID) of each tag value.
  • B is a bitmap index table identifier.
  • The preset mapping/protocol model performs the first type of mapping processing on the data record to obtain a first mapping result that includes the bitmap index table identifier B, the bearer identifier a01, and the two tag values "male" and "undergraduate".
  • The first mapping result is recorded in the format shown in Table 4, which yields the first mapping result shown in Table 5: the key is Ba01 and the value is {gender: male, education: undergraduate}.
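  • The first type of mapping processing can be sketched as a function that turns each data record into a key-value pair. This is an assumption-laden sketch: the function name `first_type_map` is hypothetical, the second record is made up for illustration, and a thread pool from the Python standard library merely stands in for the parallel mapping of the preset mapping/protocol model.

```python
from concurrent.futures import ThreadPoolExecutor

INDEX_TABLE_ID = "B"  # bitmap index table identifier used in the text


def first_type_map(record):
    """Return (key, value): key = table identifier + bearer identifier."""
    bearer_id, tags = record
    return INDEX_TABLE_ID + bearer_id, dict(tags)


records = [
    ("a01", {"gender": "male", "education": "undergraduate"}),
    ("a02", {"gender": "female", "education": "specialist"}),  # assumed record
]

# Each record is mapped independently, so the work can be spread over workers.
with ThreadPoolExecutor() as pool:
    first_mapping_results = list(pool.map(first_type_map, records))
```

  • For the record a01, the sketch produces the key Ba01 with the value {gender: male, education: undergraduate}, matching the Table 5 example described in the text.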
  • The system may already have configured a bitmap bit for the bearer identifier, in which case the first mapping result further includes the bitmap bit of the bearer identifier.
  • If no bitmap bit has been configured for the bearer identifier, the first mapping result does not include the bitmap bit of the bearer identifier.
  • When the bitmap bit is included, the mapping result shown in Table 6 or Table 7 can be obtained.
  • In Table 6, the bitmap index table identifier, the bearer identifier, and the bitmap bit of the bearer identifier are together set as the key, and the value is still the at least one tag value in the first mapping result.
  • Alternatively, the bitmap index table identifier and the bearer identifier may together be set as the key, and the at least one tag value and the bitmap bit of the bearer identifier are together set as the value corresponding to the key.
  • In this way, at least one first mapping result is obtained; that is, for each data record, a first mapping result as shown in Table 4 or Table 6 above is obtained.
  • the first type of classification is performed on the at least one first mapping result by the following step 204.
  • Step 204 Classify the at least one first mapping result according to the partition information of the N first protocol partitions, to obtain at least one first mapping set, where each first mapping set corresponds to one first protocol partition.
  • Different protocol partitions process, in parallel, the data that falls within their own partition intervals. Therefore, each of the at least one first mapping result needs to be classified into its corresponding first protocol partition.
  • For each of the at least one first mapping result, the partition interval to which the bearer identifier in the first mapping result belongs is searched for among the partition intervals of the N first protocol partitions, according to the bearer identifier and the bitmap index table identifier in the first mapping result. This classifies the at least one first mapping result. After the classification, at least one first mapping set is obtained, and each first mapping set includes at least one first mapping result.
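  • The classification in step 204 amounts to a range lookup on the key. The sketch below assumes that keys are compared lexicographically and uses a binary search over the interval start keys; the function name `classify` and the sample keys Bb01, Bb02, and Bc01 are hypothetical and chosen to fall inside the example intervals listed for step 202.

```python
from bisect import bisect_right
from collections import defaultdict


def classify(mapping_results, partitions):
    """Group (key, value) pairs by the [start, end) partition containing the key.

    partitions must be sorted by start key and must not overlap.
    """
    starts = [start for start, _ in partitions]
    mapping_sets = defaultdict(list)
    for key, value in mapping_results:
        i = bisect_right(starts, key) - 1  # last partition whose start <= key
        if i < 0 or key >= partitions[i][1]:
            raise ValueError(f"key {key!r} is outside every partition")
        mapping_sets[partitions[i]].append((key, value))
    return dict(mapping_sets)


partitions = [("Bb0", "Bc0"), ("Bc0", "Bd0")]
results = [
    ("Bb02", {"gender": "female"}),
    ("Bc01", {"gender": "male"}),
    ("Bb01", {"education": "undergraduate"}),
]
first_mapping_sets = classify(results, partitions)
```

  • Each resulting group plays the role of one first mapping set, to be handed to the first protocol partition whose interval contains its keys.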
  • Step 205 Perform a first type of protocol processing on the at least one first mapping set in parallel by using the first protocol partition corresponding to each of the at least one first mapping set to obtain a bitmap of the label values in each bitmap index partition.
  • The process of performing the first type of protocol processing on one first mapping set is explained below. Specifically, the first type of protocol processing is divided into the following two processes:
  • When the server performs the first type of protocol processing by using the preset mapping/protocol model, each data record has a corresponding first mapping result, and step 204 has determined the first mapping set to which that first mapping result belongs. Because the data records corresponding to the first mapping results in one first mapping set are stored in the same bitmap index partition, the server first sorts the first mapping results in the first mapping set by using the first protocol partition corresponding to that set, so that the data in the first mapping set is stored into the corresponding bitmap index partition sequentially, in the sorted order.
  • The first mapping results in the first mapping set are usually sorted by a default sorting method based on the lexicographic order of the bearer identifiers, for example ascending lexicographic order.
  • The embodiment of the present invention is not specifically limited herein.
  • For example, if the first mapping set includes three first mapping results whose bearer identifiers are a01, a02, and a03, the three first mapping results may be sorted in the order a01, a02, a03.
  • The first mapping result of a data record may or may not include the bitmap bit of the bearer identifier. Accordingly, the bitmap of each of the at least one tag value included in the first mapping result is updated according to the bitmap bit of the bearer identifier in one of two ways:
  • When the first mapping result includes the bitmap bit of the bearer identifier, the bitmap of the corresponding tag value is updated according to that bitmap bit.
  • When the first mapping result does not include the bitmap bit of the bearer identifier, the bitmap bit of the bearer identifier is first obtained, and the bitmap of the corresponding tag value is then updated according to that bitmap bit.
  • In either way, updating the bitmap of each of the at least one tag value included in the first mapping result requires first determining the bitmap bit of the bearer identifier. Once the bitmap bit is determined, for each of the at least one tag value included in the first mapping result, the value of that tag value's bitmap on the bitmap bit of the bearer identifier is updated.
  • The bitmap of a tag value may be stored in the manner shown in FIG. 1, in which the bitmap has a value of 0 or 1 on each bitmap bit. In this case, updating the value of the bitmap on the bitmap bit of the bearer identifier means setting that value to 1.
  • Because the bitmap of each tag value is determined by setting its value on the bitmap bit of a bearer identifier to 1, in the embodiment of the present invention the value of each tag value's bitmap on every bitmap bit is initialized in advance, that is, set to 0.
  • For each of the at least one tag value included in the first mapping result, the value of the tag value's sub-bitmap on the bitmap bit of the bearer identifier is changed to 1. This updates the sub-bitmaps of the at least one tag value, that is, it updates the bitmap index partition. Tag values in the bitmap index partition other than the at least one tag value are not processed: the values of their sub-bitmaps on the bitmap bit of the bearer identifier remain 0, indicating that the bearer does not have those tag values.
  • The second first mapping result is processed in substantially the same way, except that the bitmap index partition is now updated on the basis of the bitmap index partition as already updated according to the previous first mapping result. That is, at this point the sub-bitmaps of the at least one tag value of the previous first mapping result already have the value 1 on the bitmap bit of that result's bearer identifier.
  • In other words, the bitmap of each tag value is updated on top of the bitmaps in the bitmap index partition as updated according to the previous first mapping result.
  • Table 9 is an initialized bitmap index provided by an embodiment of the present invention.
  • The bitmap index includes a plurality of bitmap index partitions, and each bitmap index partition includes the sub-bitmaps of all tag values.
  • The initial value of the sub-bitmap of each tag value on each bitmap bit is 0.
  • Assume that the first mapping results of nine data records are in the format shown in Table 5, and that step 204 classifies these nine first mapping results into the first protocol partition corresponding to bitmap index partition 1. The nine first mapping results are then sorted according to their bearer identifiers, in the order a01, a02, b01, b02, c01, c02, d01, d02, and d03.
  • Assume further that the bitmap bits configured by the system for the bearer identifiers a01, a02, b01, b02, c01, c02, d01, d02, and d03 are 1, 2, 3, 4, 5, 6, 7, 8, and 9, respectively.
  • The first mapping result of the bearer identifier a01 includes two tag values, "gender: male" and "education: undergraduate". The values of the sub-bitmaps of these two tag values in bitmap index partition 1 on bitmap bit 1 are updated to 1, and the bitmap index partition 1 shown in Table 10 is obtained.
  • The first mapping result of the bearer identifier a02 includes four tag values: "gender: female", "education: specialist", "occupation: individual", and "online purchaser". According to Table 9, the sub-bitmaps corresponding to these four tag values in bitmap index partition 1 are the sub-bitmap of the second tag value, the sub-bitmap of the third tag value, the sub-bitmap of the sixth tag value, and the sub-bitmap of the second-to-last tag value.
  • The values of the sub-bitmaps of these four tag values on bitmap bit 2 are then updated to 1, and the bitmap index partition 1 shown in Table 11 is obtained.
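  • The incremental update of bitmap index partition 1 for the bearer identifiers a01 (bitmap bit 1) and a02 (bitmap bit 2) can be sketched as follows. The sketch assumes 1-based bitmap bits and models a partition as a dictionary from tag value to a list of 0/1 values; the function name `apply_mapping_result` is hypothetical, and the tag list is an illustrative subset rather than the full content of Table 9.

```python
NUM_BITS = 9  # nine bearers in the example, bitmap bits 1..9

# Illustrative subset of tag values; every sub-bitmap starts as all zeros.
TAGS = [
    "gender: male",
    "gender: female",
    "education: specialist",
    "education: undergraduate",
    "occupation: individual",
    "online purchaser",
]
partition_1 = {tag: [0] * NUM_BITS for tag in TAGS}


def apply_mapping_result(partition, tags, bit):
    """Set the 1-based bitmap bit to 1 in the sub-bitmap of every listed tag."""
    for tag in tags:
        partition[tag][bit - 1] = 1


# First mapping results in sorted order: (bitmap bit, tag values).
sorted_results = [
    (1, ["gender: male", "education: undergraduate"]),
    (2, ["gender: female", "education: specialist",
         "occupation: individual", "online purchaser"]),
]

# Each result updates the partition left behind by the previous one.
for bit, tags in sorted_results:
    apply_mapping_result(partition_1, tags, bit)
```

  • Tag values that a bearer does not have are never touched, so their sub-bitmaps keep the initial 0 on that bearer's bitmap bit.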
  • The bitmap of a tag value may also be represented by an array. In this case, the array of the tag value records the bitmap bits on which the bitmap of the tag value has the value 1.
  • For example, a bitmap whose only 1 is on bitmap bit 10 can be represented as the array [10].
  • Similarly, the tag value "online purchaser" corresponds to the bitmap "[0100011000000100...]", which can be represented as the array [2, 6, 7, 14].
  • Using an array to represent the bitmap of a tag value can save storage space.
  • In this case, updating the bitmap of the tag value on the bitmap bit of the bearer identifier means adding the bitmap bit of the bearer identifier to the array of the tag value.
  • For example, if the bitmap bit of the bearer identifier is 3 and the initial sub-bitmap of the tag value is [1, 7], then after the update the sub-bitmap of the tag value is [1, 3, 7].
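  • The array representation can be illustrated with two small helpers. This is a sketch under the assumption that bitmap bits are numbered from 1 and that the array is kept sorted; the function names `to_positions` and `set_bit` are hypothetical.

```python
from bisect import bisect_left


def to_positions(bits):
    """Convert a 0/1 string into the sorted list of 1-based positions holding 1."""
    return [i for i, b in enumerate(bits, start=1) if b == "1"]


def set_bit(positions, bit):
    """Insert bit into the sorted position list unless it is already present."""
    i = bisect_left(positions, bit)
    if i == len(positions) or positions[i] != bit:
        positions.insert(i, bit)
    return positions


online_purchaser = to_positions("0100011000000100")
updated = set_bit([1, 7], 3)
```

  • Storing only the positions of the 1 values is compact when a tag value applies to few bearers, which is the storage saving mentioned in the text.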
  • If the system has not configured a bitmap bit for the bearer identifier in the data record before the data record corresponding to the first mapping result is mapped, then, after the bitmap bit of the bearer identifier is obtained, the mapping between the bitmap bit and the bearer identifier may also be stored.
  • The bitmap bit of the bearer identifier and the bearer identifier are stored as a bidirectional mapping, so that the bitmap bit can be found from the bearer identifier, or the bearer identifier can be found from the bitmap bit.
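  • A bidirectional mapping between bearer identifiers and bitmap bits can be kept as two dictionaries. The sketch below is hypothetical: the class name `BitRegistry` is invented, and the policy of handing out the next unused integer as the bitmap bit is an assumption, since the text does not state how a bitmap bit is obtained for a bearer identifier that has none.

```python
class BitRegistry:
    """Two-way lookup between bearer identifiers and 1-based bitmap bits."""

    def __init__(self):
        self.bit_by_bearer = {}
        self.bearer_by_bit = {}

    def bit_for(self, bearer_id):
        """Return the bearer's bitmap bit, allocating the next free one if needed."""
        bit = self.bit_by_bearer.get(bearer_id)
        if bit is None:
            bit = len(self.bit_by_bearer) + 1  # assumed allocation policy
            self.bit_by_bearer[bearer_id] = bit
            self.bearer_by_bit[bit] = bearer_id
        return bit

    def bearer_for(self, bit):
        """Return the bearer identifier stored for a bitmap bit."""
        return self.bearer_by_bit[bit]


registry = BitRegistry()
bit_a01 = registry.bit_for("a01")
bit_a02 = registry.bit_for("a02")
bit_a01_again = registry.bit_for("a01")
```

  • The forward direction serves bitmap updates during storage, and the reverse direction serves queries that start from a bitmap and need the bearer identifiers.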
  • Step 206 Store the obtained bitmap of the tag value in each bitmap index partition into the corresponding bitmap index partition.
  • As can be seen from step 205, within one first mapping set the bitmaps for each first mapping result are determined on the basis of the bitmaps determined for the previous first mapping result. Therefore, each time the at least one bitmap is determined for a first mapping result, it is stored into the bitmap index partition corresponding to that first mapping result, so that when the next first mapping result is processed, the update continues on the basis of the updated bitmap index partition.
  • When the bitmap of each tag value in the bitmap index partition corresponding to the first protocol partition has been obtained, the bitmaps can be stored directly into that bitmap index partition.
  • the bitmap of the label value in each bitmap index partition included in the bitmap index may be determined by using the preset mapping/protocol model based on the at least one data record.
  • the bitmap of the tag values in each bitmap index partition can be determined in parallel by the preset mapping/protocol model, which improves the efficiency of storing data.
  • FIG. 3 is a flowchart of a data storage method according to an embodiment of the present invention, which is applied to a scenario in which the at least one data record is stored in a data table. As shown in FIG. 3, the data storage method includes the following steps:
  • Step 301 Acquire at least one data record, where each data record includes a carrier identifier and at least one tag value.
  • The implementation of step 301 is basically the same as that of step 201 in FIG. 2 and is not described in detail here.
  • The preset mapping/protocol model includes M second protocol partitions, which are in one-to-one correspondence with the M data partitions included in the data table.
  • Storing the at least one data record into the data table through the preset mapping/protocol model involves a second type of mapping processing and a second type of protocol processing.
  • the process can be implemented by the following step 302.
  • Step 302 Determine M second protocol partitions in the preset mapping/protocol model.
  • the partition information of the data table is determined, and the partition information of the data table is used to describe a set of bearer identifiers corresponding to each data partition in the data table. Determining M second protocol partitions in the preset mapping/protocol model according to the partition information of the data table, and each second protocol partition corresponds to one data partition.
  • A data table identifier that identifies the data table is added to each partition interval of the data table, and the resulting partition intervals are determined as the partition intervals of the M second protocol partitions in the preset mapping/protocol model. That is, the partition information of each second protocol partition is composed of the data table identifier and the bearer identifiers of a preset interval range.
  • Here A is the data table identifier, that is, the identifier used to identify the data table. The second protocol partitions [, Aa1), [Aa1, Aa2), [Aa2, Aa3), ..., [Aa8, Aa9] are the protocol partitions that correspond one-to-one to the respective data partitions.
  • After the M second protocol partitions are determined, the at least one data record needs to be classified according to the M second protocol partitions, so that each second protocol partition processes the data that belongs to it.
  • That is, according to the partition information of the M second protocol partitions included in the preset mapping/protocol model, a second-type classification is performed on the at least one data record to obtain at least one second mapping set. Each second mapping set corresponds to one second protocol partition, so that the second protocol partition processes the data in its corresponding second mapping set.
  • the process can be implemented by the following steps 303 to 304.
  • Step 303 Perform a second type of mapping processing on the at least one data record in parallel by using the preset mapping/protocol model to obtain at least one second mapping result, where each second mapping result includes a data table identifier, a carrier identifier, and at least one label. value.
  • As can be seen from step 302, the partition intervals of the M second protocol partitions in the preset mapping/protocol model are not the actual partition intervals of the data partitions in the data table. The second type of mapping processing therefore mainly adds the data table identifier to each data record, so that the second protocol partition corresponding to each data record can be determined subsequently.
  • That is, the preset mapping/protocol model adds the data table identifier to each data record to obtain a second mapping result.
  • The preset mapping/protocol model adds the data table identifier to the data records in parallel, that is, it adds the identifier to all data records at the same time. Adding the identifier to n data records therefore takes about as long as adding it to one data record, which improves the efficiency of adding the data table identifier to the at least one data record.
  • the second mapping result may be recorded in a key-value format.
  • Table 12 shows a format of the second mapping result provided by an embodiment of the present invention. As shown in Table 12, the data table identifier and the bearer identifier in the second mapping result are together set as the key, and the at least one tag value in the second mapping result is set as the value of that key.
  • Remark information may also be added to the value; it may be the same as the remark information in the first mapping result described in step 203 in FIG. 2.
  • A is a data table identifier.
  • For example, for the first data record in Table 3, "a01 -> {gender: male, education: undergraduate}", the bearer identifier is a01 and the data record includes the two tag values "gender: male" and "education: undergraduate". The preset mapping/protocol model performs the second type of mapping processing on the data record to obtain a second mapping result that includes the data table identifier A, the bearer identifier a01, and the two tag values "male" and "undergraduate".
  • The second mapping result is recorded in the format shown in Table 12, which yields the second mapping result shown in Table 13: the key is Aa01 and the value is {gender: male, education: undergraduate}.
  • In this way, at least one second mapping result is obtained; that is, for each data record, a second mapping result as shown in Table 12 above is obtained. The second-type classification is then performed on the at least one second mapping result in step 304 below.
  • Step 304 Classify the at least one second mapping result according to the partition information of the M second protocol partitions, to obtain at least one second mapping set, where each second mapping set corresponds to one second protocol partition.
  • Different protocol partitions process, in parallel, the data that falls within their own partition intervals. Therefore, each of the at least one second mapping result needs to be classified into its corresponding second protocol partition.
  • For each of the at least one second mapping result, the partition interval to which the bearer identifier in the second mapping result belongs is searched for among the partition intervals of the M second protocol partitions, according to the bearer identifier and the data table identifier in the second mapping result. This classifies the at least one second mapping result.
  • Step 305 Perform a second type of protocol processing on the at least one second mapping set in parallel by the second protocol partition corresponding to each of the at least one second mapping set to obtain data in each data partition.
  • The process of performing the second type of protocol processing on one second mapping set is explained below. Specifically, as with step 205 in FIG. 2, the second type of protocol processing is also divided into the following two processes:
  • For the implementation of sorting the second mapping results in the second mapping set by using the second protocol partition, refer to the implementation in step 205 in FIG. 2 of sorting the first mapping results in the first mapping set by using the first protocol partition; details are not repeated here.
  • At least one record is then generated for each second mapping result, in the order given by the sorting result. Each record includes the bearer identifier and one tag value, which yields the data in the data partition corresponding to the second mapping set.
  • Specifically, the second protocol partition corresponding to the second mapping set in the preset mapping/protocol model may process each second mapping result in turn, according to the sorting result.
  • If the second mapping result is in the key-value format shown in Table 12 in step 303, generating at least one record from the second mapping result means deleting the data table identifier from the key, which yields data whose key is the bearer identifier and whose value is the at least one tag value, and then converting that data into at least one record, each of which includes the bearer identifier and one tag value.
  • The at least one record may also be output in a key-value format: for each record, the bearer identifier is used as the key and the one tag value is used as the value.
  • For example, the following two records shown in Table 14 can be obtained by step 305: the key of the first record is a01 and its value is {gender: male}; the key of the second record is a01 and its value is {education: undergraduate}.
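  • The second type of protocol processing described above can be sketched as a function that sorts one second mapping set, strips the data table identifier from each key, and emits one record per tag value. The function name `second_type_reduce` is hypothetical, and ascending key order is assumed as the sorting method.

```python
DATA_TABLE_ID = "A"  # data table identifier used in the text


def second_type_reduce(second_mapping_set, table_id=DATA_TABLE_ID):
    """Turn (table_id + bearer_id, tags) pairs into (bearer_id, {tag: value}) records."""
    records = []
    # Process the mapping results in sorted key order (assumed ascending).
    for key, tags in sorted(second_mapping_set, key=lambda kv: kv[0]):
        bearer_id = key[len(table_id):]  # delete the table identifier from the key
        for name, value in tags.items():
            records.append((bearer_id, {name: value}))
    return records


second_mapping_set = [
    ("Aa01", {"gender": "male", "education": "undergraduate"}),
]
data_partition_records = second_type_reduce(second_mapping_set)
```

  • For the key Aa01 the sketch yields two records keyed by a01, one for gender and one for education, which corresponds to the two records of the Table 14 example.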
  • Step 306 Store the obtained data of each data partition into a corresponding data partition.
  • For each second mapping set, once the data in the corresponding data partition is obtained through step 305, that data can be stored directly into the corresponding data partition.
  • Step 305 is performed in parallel for the different second mapping sets. That is, in the embodiment of the present invention, the data records belonging to different data partitions can be stored into the corresponding data partitions in parallel by using the preset mapping/protocol model, which improves the efficiency of storing data.
  • In the embodiment of the present invention, the data in each data partition included in the data table may be determined based on the at least one data record by using the preset mapping/protocol model, so that the at least one data record is stored in the corresponding data partitions and the tags that a bearer has can subsequently be queried based on the data table.
  • the data in each data partition can be determined in parallel by a preset mapping/protocol model, which improves the efficiency of storing data.
  • With the N first protocol partitions and the M second protocol partitions preset in the preset mapping/protocol model, the data table and the bitmap index can be constructed simultaneously from the at least one data record. This is described in detail in the following embodiment.
  • An embodiment of the present invention provides a data storage method applied to a scenario in which at least one data record is stored into a data table and a bitmap index simultaneously. As shown in FIG. 4, the method includes the following steps:
  • Step 401 Acquire at least one data record, where each data record includes a carrier identifier and at least one tag value.
  • The implementation of step 401 is basically the same as that of step 201 in FIG. 2 and is not described in detail here.
  • bitmap index and the data table are simultaneously constructed by the following steps 402 to 406.
  • Step 402 Determine N first protocol partitions and M second protocol partitions in the preset mapping/protocol model.
  • For the implementation of step 402, refer to step 202 in FIG. 2 and step 302 in FIG. 3.
  • The N first protocol partitions corresponding to the N bitmap index partitions and the M second protocol partitions corresponding to the M data partitions may be acquired at the same time.
  • Step 403 Perform a first type mapping process on the at least one data record in parallel by using the preset mapping/protocol model to obtain at least one first mapping result, where each first mapping result includes the bitmap index table identifier and the bearer identifier. And at least one tag value; at the same time, the second type mapping process is performed on the at least one data record in parallel by using the preset mapping/protocol model to obtain at least one second mapping result, where each second mapping result includes a data table identifier and a carrier Identification and at least one tag value.
  • For the implementation of step 403, refer to step 203 in FIG. 2 and step 303 in FIG. 3.
  • The first type of mapping processing in step 203 in FIG. 2 and the second type of mapping processing in step 303 in FIG. 3 can be performed in parallel, so that the first mapping result and the second mapping result of each data record are obtained simultaneously.
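  • Obtaining both mapping results from one data record can be sketched as a single map function that emits two key-value pairs, one per table identifier. The function name `map_both` is hypothetical, and the identifiers A and B follow the examples used for the data table and the bitmap index.

```python
DATA_TABLE_ID = "A"   # data table identifier from the text
INDEX_TABLE_ID = "B"  # bitmap index table identifier from the text


def map_both(record):
    """Emit the second and first mapping results for one data record."""
    bearer_id, tags = record
    return [
        (DATA_TABLE_ID + bearer_id, dict(tags)),   # second mapping result
        (INDEX_TABLE_ID + bearer_id, dict(tags)),  # first mapping result
    ]


record = ("a01", {"gender": "male", "education": "undergraduate"})
mapping_results = map_both(record)
keys = [key for key, _ in mapping_results]
```

  • Because the two keys start with different table identifiers, the later classification routes one copy of the record to a second protocol partition and the other copy to a first protocol partition.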
  • Step 404 Classify the at least one first mapping result according to the partition information of the N first protocol partitions, to obtain at least one first mapping set, where each first mapping set corresponds to one first protocol partition; and classify the at least one second mapping result according to the partition information of the M second protocol partitions, to obtain at least one second mapping set, where each second mapping set corresponds to one second protocol partition.
  • For the implementation of step 404, refer to step 204 in FIG. 2 and step 304 in FIG. 3.
  • The first-type classification in step 204 in FIG. 2 and the second-type classification in step 304 in FIG. 3 may be performed in parallel, so that the at least one first mapping result and the at least one second mapping result are classified simultaneously.
  • Step 405: Perform first-type protocol processing on the at least one first mapping set in parallel by using the first protocol partition corresponding to each first mapping set, to obtain the bitmaps of the tag values in each bitmap index partition; and perform second-type protocol processing on the at least one second mapping set in parallel by using the second protocol partition corresponding to each second mapping set, to obtain the data in each data partition.
  • For the implementation of step 405, reference may be made to step 205 in FIG. 2 and step 305 in FIG. 3.
  • Here, each protocol partition processes its own data in parallel with the others. That is, the data processed by the different protocol partitions is mutually independent, so the data in each bitmap index partition and the data in each data partition are determined at the same time.
  • Step 406: Store the obtained bitmaps of the tag values in each bitmap index partition into the corresponding bitmap index partition; at the same time, store the obtained data of each data partition into the corresponding data partition.
  • For the implementation of step 406, reference may be made to step 206 in FIG. 2 and step 306 in FIG. 3.
  • In this embodiment of the present invention, when at least one data record is acquired, the at least one data record may be stored in the corresponding bitmap index partition and data partition at the same time by using the preset mapping/protocol model.
  • This makes it possible to later query the bearer identifiers that correspond to a given tag value based on the bitmap index partitions, or to query the tags of a given bearer based on the data partitions, and it improves the efficiency of storing data.
  • An embodiment of the present invention further provides a data storage device.
  • The data storage device 500 includes an obtaining module 501, a first classification module 502, a first protocol module 503, and a first storage module 504.
  • the obtaining module 501 is configured to perform step 201 in FIG. 2 or step 301 in FIG. 3;
  • the first classification module 502 is configured to perform, based on the bearer identifier included in each data record, first-type classification on the at least one data record according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, to obtain at least one first mapping set, each first mapping set corresponding to one first protocol partition;
  • the N first protocol partitions are determined according to partition information of the N bitmap index partitions included in the bitmap index, where N is a positive integer, each bitmap index partition corresponds to one first protocol partition, each bitmap index partition includes at least one bitmap, each bitmap corresponds to one tag value, each bitmap includes at least one bitmap bit, and each bitmap bit is used to record whether the bearer corresponding to one bearer identifier has the tag value corresponding to the current bitmap;
  • the first protocol module 503 is configured to perform step 205 in FIG. 2 above;
  • the first storage module 504 is configured to perform step 206 in FIG. 2 above.
  • the partition information of each first protocol partition is composed of a bitmap index table identifier and a bearer identifier of a preset interval range;
  • the first classification module 502 includes a first mapping unit 5021 and a first classification unit 5022:
  • the first mapping unit 5021 is configured to perform step 203 in FIG. 2 above;
  • the first classification unit 5022 is configured to perform step 204 in FIG. 2 above.
  • the first protocol module 503 includes:
  • a determining unit, configured to determine, for each first mapping set, the first protocol partition corresponding to the first mapping set;
  • a sorting unit, configured to sort, by using the first protocol partition, the first mapping results in the first mapping set according to the bearer identifier in each first mapping result in the first mapping set;
  • an update unit, configured to acquire, for each sorted first mapping result and in sorted order, the bitmap of each of the at least one tag value included in the first mapping result from the bitmap index partition corresponding to the first protocol partition, and to update the bitmap of the tag value according to the bitmap bit of the bearer identifier.
  • the first protocol module 503 further includes:
  • a first execution unit, configured to, when the first mapping result further includes the bitmap bit of the bearer identifier, perform the operation of updating the bitmap of the tag value according to the bitmap bit of the bearer identifier;
  • a second execution unit, configured to, when the first mapping result does not include the bitmap bit of the bearer identifier, acquire the bitmap bit of the bearer identifier and perform the operation of updating the bitmap of the tag value according to the bitmap bit of the bearer identifier.
  • the second execution unit is further configured to store the correspondence between the bitmap bit of the bearer identifier and the bearer identifier.
  • the apparatus 500 further includes:
  • a first determining module configured to determine partition information of the bitmap index, where the partition information of the bitmap index is used to describe a set of bearer identifiers corresponding to each bitmap index partition in the bitmap index;
  • a second determining module configured to determine, according to the partition information of the bitmap index, N first protocol partitions in the preset mapping/protocol model.
  • the apparatus 500 further includes a second classification module 505, a second protocol module 506, and a second storage module 507:
  • the second classification module 505 is configured to perform, based on the bearer identifier included in each data record, second-type classification on the at least one data record according to the partition information of the M second protocol partitions included in the preset mapping/protocol model, to obtain at least one second mapping set, each second mapping set corresponding to one second protocol partition;
  • the M second protocol partitions are determined according to the partition information of the M data partitions included in the data table, where M is a positive integer, each data partition corresponds to one second protocol partition, and each data partition is used to record the correspondence between bearer identifiers and tag values;
  • the second protocol module 506 is configured to perform step 305 in FIG. 3;
  • the second storage module 507 is configured to perform step 306 in FIG. 3.
  • the partition information of each second protocol partition is composed of a bearer data table identifier and a bearer identifier of a preset interval range;
  • the second classification module 505 includes a second mapping unit 5051 and a second classification unit 5052:
  • the second mapping unit 5051 is configured to perform step 303 in FIG. 3;
  • the second classification unit 5052 is configured to perform step 304 in FIG. 3.
  • N is less than or equal to M, and N is greater than or equal to 2.
  • Each of the M data partitions belongs to a unique bitmap index partition, and each of the N bitmap index partitions includes at least one data partition.
  • In this embodiment of the present invention, the bitmaps of the tag values in each bitmap index partition included in the bitmap index may be determined by using the preset mapping/protocol model based on the at least one data record.
  • The bitmaps of the tag values in the bitmap index partitions can be determined in parallel by the preset mapping/protocol model, which improves the efficiency of storing data.
  • FIG. 6 is a schematic diagram of another data storage device according to an embodiment of the present invention.
  • the data storage device 600 can be a computer device, which can be the server described above, and the data storage device 600 includes at least one processor 601, a communication bus 602, a memory 603, and at least one communication interface 604.
  • the processor 601 can be a general purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present invention.
  • Communication bus 602 can include a path for communicating information between the components described above.
  • the communication interface 604 uses any transceiver-type device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • the memory 603 can be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions.
  • It can also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but it is not limited thereto.
  • the memory can exist independently and be connected to the processor via a bus.
  • the memory can also be integrated with the processor.
  • the memory 603 is used to store program code for executing the solution of the present invention, and is controlled by the processor 601 for execution.
  • the processor 601 is configured to execute program code stored in the memory 603.
  • the processor 601 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 6.
  • the data storage device 600 can include multiple processors, such as the processor 601 and the processor 608 in FIG. 6. Each of these processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • processors herein may refer to one or more devices, circuits, and/or processing cores for processing data, such as computer program instructions.
  • data storage device 600 may also include an output device 605 and an input device 606.
  • Output device 605 is in communication with processor 601 and can display information in a variety of ways.
  • the output device 605 can be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • Input device 606 is in communication with processor 601 and can accept user input in a variety of ways.
  • input device 606 can be a mouse, keyboard, touch screen device, or sensing device, and the like.
  • the data storage device 600 described above can be a general purpose computer device or a special purpose computer device.
  • the data storage device 600 can be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet, a wireless terminal device, a communication device, an embedded device, or a device with a structure similar to that in FIG. 6.
  • the embodiment of the present invention does not limit the type of the data storage device 600.
  • One or more software modules are stored in the memory of the data storage device.
  • the data storage device can run the software modules by using the processor and the program code in the memory, to implement the data storage method in the above embodiments.
  • An embodiment of the present application further provides a computer storage medium having instructions stored therein; a data storage device (which may be a computer device, such as a server) executing the instructions, such as a processor in the computer device executing the instructions, The data storage device is caused to implement the data storage method described in the above embodiments.
  • the embodiment of the present application provides a computer program product, the computer program product includes instructions, and the data storage device (which may be a computer device, such as a server) executes the instruction, so that the data storage device executes the data storage method of the foregoing method embodiment.

Abstract

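A data storage method, device, and storage medium, belonging to the field of information processing technologies. The method includes: when at least one data record is obtained, determining, by using a preset mapping/protocol (map/reduce) model and based on the at least one data record, the bitmaps of the tag values in each bitmap index partition included in a bitmap index, so as to store the at least one data record in the corresponding bitmap index partition. Because the bitmap index includes at least one bitmap and each bitmap corresponds to one tag value, the bearer identifiers that have a given tag value can be looked up through the bitmap index, which improves the efficiency of data queries based on tag values. In addition, the preset mapping/protocol model can determine the bitmaps of the tag values in the bitmap index partitions in parallel, which improves the efficiency of storing data.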

Description

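Data storage method, device, and storage medium
Technical Field
This application relates to the field of information processing technologies, and in particular to a data storage method, device, and storage medium.
Background
The Hadoop Database (HBase) is distributed, highly reliable, high-performance, and based on key-value storage, so more and more enterprises and users use HBase to build data tables.
Generally, a data table includes multiple rows of data records, and each row includes the identifier of a bearer and the tag value of each tag that the bearer has. For example, if user A has the two tag values "female" (gender) and "engineer" (occupation), the row for user A in the data table includes the identifier of user A, the tag value "female", and the tag value "engineer". In other words, the data table records the correspondence between the identifier of a bearer and the tag values it has.
With this way of storing the data table, queries by bearer identifier are efficient. For queries by a tag value or a combination of tag values, however, the related art can only use a column value filter to check the tag values of each bearer row by row, in order of bearer identifier. Because a data table usually has thousands or tens of thousands of rows, data queries based on tag values are inefficient in the related solutions.
Summary
To solve the problem in the related art that data queries based on tag values are inefficient, this application provides a data storage method, device, and storage medium. The technical solutions are as follows: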
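According to a first aspect, a data storage method is provided, and the method includes:
obtaining at least one data record, where each data record includes one bearer identifier and at least one tag value;
performing, based on the bearer identifier included in each data record, first-type classification on the at least one data record according to partition information of N first protocol partitions included in a preset mapping/protocol model, to obtain at least one first mapping set, where each first mapping set corresponds to one first protocol partition;
where the N first protocol partitions are determined according to partition information of N bitmap index partitions included in a bitmap index, N is a positive integer, each bitmap index partition corresponds to one first protocol partition, each bitmap index partition includes at least one bitmap, each bitmap corresponds to one tag value, each bitmap includes at least one bitmap bit, and each bitmap bit is used to record whether the bearer corresponding to one bearer identifier has the tag value corresponding to the current bitmap;
performing first-type protocol processing on the at least one first mapping set in parallel by using the first protocol partition corresponding to each first mapping set, to obtain the bitmaps of the tag values in each bitmap index partition; and
storing the obtained bitmaps of the tag values in each bitmap index partition into the corresponding bitmap index partition.
In the embodiments of the present invention, when at least one data record is obtained, the at least one data record can be stored in the bitmap index based on the preset mapping/protocol model, so that after the data is stored, the bearer identifiers that have a given tag value can be looked up through the bitmap index. In addition, the preset mapping/protocol model can determine the bitmaps of the tag values in the bitmap index partitions in parallel, which improves the efficiency of storing the at least one data record into the N bitmap index partitions.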
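Optionally, the partition information of each first protocol partition consists of a bitmap index table identifier and bearer identifiers within a preset interval range;
and the performing first-type classification on the at least one data record according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, to obtain at least one first mapping set, includes:
performing first-type mapping processing on the at least one data record in parallel by using the preset mapping/protocol model, to obtain at least one first mapping result, where each first mapping result includes the bitmap index table identifier, a bearer identifier, and at least one tag value; and
classifying the at least one first mapping result according to the partition information of the N first protocol partitions, to obtain at least one first mapping set.
Further, before the at least one data record is classified according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, each data record first needs to undergo first-type mapping processing in parallel through the preset mapping/protocol model, so that the resulting first mapping results can then be classified.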
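Optionally, the performing first-type protocol processing on the at least one first mapping set in parallel by using the first protocol partition corresponding to each first mapping set, to obtain the bitmaps of the tag values in each bitmap index partition, includes:
for each first mapping set, determining the first protocol partition corresponding to the first mapping set;
sorting, by using the first protocol partition, the first mapping results in the first mapping set according to the bearer identifier in each first mapping result in the first mapping set; and
for each sorted first mapping result, in sorted order, obtaining, from the bitmap index partition corresponding to the first protocol partition, the bitmap of each of the at least one tag value included in the first mapping result, and updating the bitmap of the tag value according to the bitmap bit of the bearer identifier.
Each protocol partition processes the data that belongs to it in a certain order. Therefore, for each first mapping set, the corresponding first protocol partition can first sort the first mapping results in the set and then process them one by one in sorted order.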
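Optionally, before the updating the bitmap of the tag value according to the bitmap bit of the bearer identifier, the method further includes:
when the first mapping result also includes the bitmap bit of the bearer identifier, performing the operation of updating the bitmap of the tag value according to the bitmap bit of the bearer identifier; or
when the first mapping result does not include the bitmap bit of the bearer identifier, obtaining the bitmap bit of the bearer identifier, and performing the operation of updating the bitmap of the tag value according to the bitmap bit of the bearer identifier.
Updating the bitmap of a tag value requires first determining the bitmap bit of the bearer identifier. The system may or may not have configured a bitmap bit for the bearer identifier in advance, so the first mapping result may or may not include the bitmap bit of the bearer identifier. When the first mapping result does not include it, the bitmap bit of the bearer identifier needs to be obtained before the bitmap of a tag value is updated.
Optionally, after the obtaining the bitmap bit of the bearer identifier, the method further includes:
storing the correspondence between the bitmap bit of the bearer identifier and the bearer identifier.
Further, when the first mapping result does not include the bitmap bit of the bearer identifier, after the bitmap bit is obtained, the correspondence between the bitmap bit and the bearer identifier can also be stored, so that the bitmap bit can later be looked up by bearer identifier, or the bearer identifier looked up by bitmap bit.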
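Optionally, before the performing, based on the bearer identifier included in each data record, first-type classification on the at least one data record according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, the method further includes:
determining partition information of the bitmap index, where the partition information of the bitmap index describes the set of bearer identifiers corresponding to each bitmap index partition in the bitmap index; and
determining the N first protocol partitions in the preset mapping/protocol model according to the partition information of the bitmap index.
Because the first-type classification of the at least one data record is performed according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, that partition information can be determined from the partition information of the bitmap index before the classification is performed.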
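Optionally, after the obtaining at least one data record, the method further includes:
performing, based on the bearer identifier included in each data record, second-type classification on the at least one data record according to partition information of M second protocol partitions included in the preset mapping/protocol model, to obtain at least one second mapping set, where each second mapping set corresponds to one second protocol partition;
where the M second protocol partitions are determined according to partition information of M data partitions included in a data table, M is a positive integer, each data partition corresponds to one second protocol partition, and each data partition is used to record the correspondence between bearer identifiers and tag values;
performing second-type protocol processing on the at least one second mapping set in parallel by using the second protocol partition corresponding to each second mapping set, to obtain the data in each data partition; and
storing the obtained data of each data partition into the corresponding data partition.
Further, in the embodiments of the present invention, when at least one data record is obtained, the at least one data record can also be stored in the data table based on the preset mapping/protocol model, so that the at least one data record is stored in both the bitmap index and the data table at the same time. In addition, the preset mapping/protocol model can determine the data in the data partitions in parallel, which improves the efficiency of storing the at least one data record into the M data partitions.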
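Optionally, the partition information of each second protocol partition consists of a bearer data table identifier and bearer identifiers within a preset interval range;
and the performing second-type classification on the at least one data record according to the partition information of the M second protocol partitions included in the preset mapping/protocol model includes:
performing second-type mapping processing on the at least one data record in parallel by using the preset mapping/protocol model, to obtain at least one second mapping result, where each second mapping result includes the data table identifier, a bearer identifier, and at least one tag value; and
classifying the at least one second mapping result according to the partition information of the M second protocol partitions, to obtain at least one second mapping set.
Further, before the at least one data record is classified according to the partition information of the M second protocol partitions included in the preset mapping/protocol model, each data record first needs to undergo second-type mapping processing in parallel through the preset mapping/protocol model, so that the resulting second mapping results can then be classified.
Optionally, N is less than or equal to M, N is greater than or equal to 2, each of the M data partitions belongs to a unique bitmap index partition, and each of the N bitmap index partitions contains at least one data partition.
In addition, to improve the efficiency of querying data from the data table and from the bitmap index partitions, the M data partitions of the data table and the N bitmap index partitions of the bitmap index can satisfy the above conditions. That is, the range of each data partition in the data table can be set somewhat smaller, and the range of each bitmap index partition in the bitmap index somewhat larger.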
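According to a second aspect, a data storage device is provided. The data storage device has the function of implementing the behavior of the data storage method in the first aspect. The data storage device includes at least one module, and the at least one module is configured to implement the data storage method provided in the first aspect.
According to a third aspect, another data storage device is provided. The structure of the data storage device includes a processor and a memory. The memory is configured to store a program that supports the data storage device in executing the data storage method provided in the first aspect, and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory. The data storage device may further include a communication bus, which is used to establish a connection between the processor and the memory.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the data storage method according to the first aspect.
According to a fifth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer executes the data storage method according to the first aspect.
The beneficial effects of the technical solutions provided in this application are as follows:
In this application, when at least one data record is obtained, the bitmaps of the tag values in each bitmap index partition included in the bitmap index can be determined by using the preset mapping/protocol model based on the at least one data record, so that the at least one data record is stored in the corresponding bitmap index partition. Because the bitmap index includes at least one bitmap and each bitmap corresponds to one tag value, the bearer identifiers that have a given tag value can be looked up through the bitmap index, which improves the efficiency of data queries based on tag values. In addition, the preset mapping/protocol model can determine the bitmaps of the tag values in the bitmap index partitions in parallel, which improves the efficiency of storing data.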
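Brief Description of the Drawings
FIG. 1 is a schematic diagram of the bitmap of a tag value according to an embodiment of the present invention;
FIG. 2 is a flowchart of a data storage method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data storage method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another data storage method according to an embodiment of the present invention;
FIG. 5A is a block diagram of a data storage device according to an embodiment of the present invention;
FIG. 5B is a block diagram of a first classification module according to an embodiment of the present invention;
FIG. 5C is a block diagram of another data storage device according to an embodiment of the present invention;
FIG. 5D is a block diagram of a second classification module according to an embodiment of the present invention;
FIG. 6 is a block diagram of another data storage device according to an embodiment of the present invention.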
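Detailed Description
For ease of understanding, the terms used in the embodiments of the present invention are briefly introduced first.
A tag is a way of organizing content. It characterizes a feature of the data and thereby helps people describe and classify content. Common tags include gender, education, occupation, color, and so on. Optionally, tags are defined manually.
In one possible implementation, tags are of two kinds: enumeration tags and Boolean tags. An enumeration tag has multiple enumerated values. For example, education includes junior college, bachelor's, postgraduate, doctorate, and so on, and gender includes male or female. A Boolean tag only indicates whether the tag applies, for example, whether the person owns a house, uses drugs, or has a criminal record.
For an enumeration tag, the tag value is the specific value that the tag takes. Taking education as an example, the tag value is "bachelor's" when the education is a bachelor's degree and "postgraduate" when it is a postgraduate degree. For a Boolean tag, the tag value is the tag itself. For example, when a user owns a house the tag value is "owns a house", and when a user has no criminal record the corresponding tag value is "no criminal record".
Bearer: the object that the tags describe. Optionally, a bearer can be a person, a vehicle, a telephone number, a virtual user account, and so on. A bearer can have one tag or multiple tags. For example, when the bearer is a person, the tags describing the person can include gender, education, whether the person owns a house, whether the person has a criminal record, and so on. When the bearer is a vehicle, the tags describing the vehicle can include color, whether it has violation records, and so on.
Data table: the data records built in the database with the bearer as the index. Each data record in the data table records the identifier of one bearer, all the tag values that the bearer has, and the correspondence between the identifier of the bearer and those tag values.
Bitmap index: a secondary index built in the database with the tag values in the data table as the index. Optionally, the bitmap index records tag values and bitmaps, and also the one-to-one correspondence between tag values and bitmaps. Each bitmap bit in a bitmap corresponds to the identifier of one bearer, and different bitmap bits in a bitmap correspond to the identifiers of different bearers; that is, the bitmap bits in a bitmap are in one-to-one correspondence with the bearer identifiers. Each bitmap bit records whether the bearer corresponding to one bearer identifier has the tag value corresponding to the current bitmap (the bitmap in which the bit is located). For example, if a bitmap bit in the bitmap of a tag value is 1, the bearer corresponding to that bit has the tag value; if the bit is 0, the bearer does not have the tag value. The same bitmap bit in different bitmaps corresponds to the same bearer identifier.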
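Take virtual user accounts as the bearers and assume there are eight of them: user1, user2, ..., user8. The accounts with the tag value "online shopping expert" are user1, user4, and user8; the accounts with the tag value "active forum member" are user1, user2, and user8. The bitmap bits assigned to the eight virtual user accounts are 1, 2, 3, ..., 8 in order, as shown in FIG. 1. The bitmap for the tag value "online shopping expert" is "10010001", and the bitmap for the tag value "active forum member" is "11000001". In the bitmap "10010001", the first "1" indicates that the account at bitmap bit 1 is an online shopping expert; similarly, the second "1" indicates that the account at bitmap bit 4 is an online shopping expert, and the third "1" indicates that the account at bitmap bit 8 is an online shopping expert. The bitmap "11000001" for "active forum member" is read in the same way. As FIG. 1 shows, user1 and user8 have both the "online shopping expert" and the "active forum member" tag values.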
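The eight-account example can be reproduced with a short sketch. Bitmaps are modeled as strings for readability, which is an assumption made for this illustration; the helper names `build_bitmap`, `members_of`, and `bitmap_and` are hypothetical and do not come from the patent.

```python
# Bitmap bits 1..8 are assigned to user1..user8, as in the example above.
USERS = [f"user{i}" for i in range(1, 9)]

def build_bitmap(members, users=USERS):
    """Return a bitmap string with '1' at the bit of every member."""
    return "".join("1" if u in members else "0" for u in users)

def members_of(bitmap, users=USERS):
    """Return the users whose bitmap bit is set."""
    return [u for u, bit in zip(users, bitmap) if bit == "1"]

def bitmap_and(a, b):
    """Bitwise AND of two equal-length bitmap strings."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))

shopping = build_bitmap({"user1", "user4", "user8"})  # "online shopping expert"
forum = build_bitmap({"user1", "user2", "user8"})     # "active forum member"
both = members_of(bitmap_and(shopping, forum))        # accounts with both tag values
```

Querying a combination of tag values reduces to a bitwise AND of the corresponding bitmaps, which is why the same bitmap bit must denote the same bearer in every bitmap.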
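The application scenario of the embodiments of the present invention is introduced next. In practice, a client usually queries data through a server. For example, when the client sends the server a tag query request for a bearer, the server looks up, in a pre-stored data table, the at least one tag value corresponding to the identifier of the bearer, and determines the at least one tag corresponding to the found tag values as the tags that the bearer has; in this case the query is efficient. As another example, when the client sends the server a query request for a tag value, the server checks the tag values of every bearer in the data table one by one to determine which bearers have the tag value; in this case the query is inefficient. How the server stores the correspondence between bearers and tag values therefore affects the efficiency of later queries made by clients through the server. The embodiments of the present invention apply to the scenario of how the server stores data.
That is, the data storage method provided in the embodiments of the present invention is applied to a server. The server can be one or more servers; optionally, multiple servers can provide database services to terminals as a server cluster. In one possible implementation, a database is deployed on the server, and the database can be a distributed database such as HBase, the Mongo Database (MongoDB), the Distributed Relational Database Service (DRDS), the Volt Database (VoltDB), or ScaleBase.
Note that the data storage method provided in the embodiments of the present invention mainly includes two parts: storing at least one data record in the bitmap index, and storing the at least one data record in the data table. For ease of later description, the bitmap index and the data table provided in the embodiments of the present invention are introduced here first.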
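The data table is used to record the correspondence between bearer identifiers and tag values. The bitmap index includes at least one bitmap, each bitmap corresponds to one tag value, each bitmap includes at least one bitmap bit, and each bitmap bit is used to record whether the bearer corresponding to one bearer identifier has the tag value corresponding to the current bitmap.
Further, to improve the efficiency of querying data from the data table and the bitmap index, the data table can be divided into M data partitions, and different data is stored in a distributed manner across the M data partitions of the data table. Likewise, the bitmap index is divided into N bitmap index partitions, and different data is stored in a distributed manner across the N bitmap index partitions of the bitmap index. That is, the data table includes M data partitions, and the bitmap index includes N bitmap index partitions.
Note that data is stored in the data table so that the tag values corresponding to a bearer identifier can later be looked up by that bearer identifier. Therefore, to improve the efficiency of querying data from the data table, the first range of each data partition in the data table can be set somewhat smaller. Data is stored in the bitmap index so that the bearer identifiers corresponding to a tag value can later be looked up by that tag value. Because each bitmap index partition includes all the tag values in the tag definition table, the second range of each bitmap index partition can be set somewhat larger to improve the efficiency of querying data from the bitmap index partitions. That is, each of the N bitmap index partitions includes the data of at least one data partition.
In other words, in the embodiments of the present invention, M and N can satisfy the following relationship: N is less than or equal to M, N is greater than or equal to 2, each of the M data partitions belongs to a unique bitmap index partition, and each of the N bitmap index partitions contains at least one data partition.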
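The relationship between the M data partitions and the N bitmap index partitions can be checked mechanically. The sketch below assumes half-open intervals of bearer identifiers compared as plain strings; `owning_index_partition` and `is_valid_layout` are hypothetical helper names introduced for this illustration.

```python
def owning_index_partition(data_interval, index_intervals):
    """Return the position of the single bitmap index partition that fully
    contains the data partition interval, or None if there is not exactly one."""
    lo, hi = data_interval
    owners = [i for i, (ilo, ihi) in enumerate(index_intervals)
              if ilo <= lo and hi <= ihi]
    return owners[0] if len(owners) == 1 else None

def is_valid_layout(data_intervals, index_intervals):
    """Check 2 <= N <= M, unique ownership of every data partition, and that
    every bitmap index partition contains at least one data partition."""
    owners = [owning_index_partition(d, index_intervals) for d in data_intervals]
    if any(owner is None for owner in owners):
        return False
    n, m = len(index_intervals), len(data_intervals)
    return 2 <= n <= m and set(owners) == set(range(n))

data_intervals = [("a0", "a5"), ("a5", "b0"), ("b0", "c0")]     # M = 3
good_index = [("a0", "b0"), ("b0", "c0")]                       # N = 2
bad_index = [("a0", "a7"), ("a7", "c0")]                        # splits ("a5", "b0")

good = is_valid_layout(data_intervals, good_index)
bad = is_valid_layout(data_intervals, bad_index)
```

In the second layout the data partition [a5, b0) straddles two bitmap index partitions, so it does not belong to a unique one and the layout is rejected.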
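As optional ways of dividing data partitions, the data partitions can be divided by specifying the number of data partitions, or the partition interval of each data partition can be defined directly. The embodiments of the present invention use directly defined partition intervals as an example. In that case, for ease of later description, the set of bearer identifiers corresponding to each data partition is called the first range; that is, the range of bearer identifiers covered by each data partition is the same size. Each data partition stores the data records whose bearer identifiers fall within its partition interval, and the partition intervals do not intersect, so that the same data record is never stored in two different data partitions.
For example, the following partition intervals are set for the data table: data partition 1: [, a1), data partition 2: [a1, a2), data partition 3: [a2, a3), ..., data partition 9: [a8, a9). Data partition 1 stores the data records whose bearer identifiers are in the interval [, a1), data partition 2 stores those in [a1, a2), data partition 3 stores those in [a2, a3), ..., and data partition 9 stores those in [a8, a9). The partition intervals [, a1), [a1, a2), [a2, a3), ..., and [a8, a9) are pairwise disjoint.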
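Looking up the data partition for a bearer identifier is a search over the sorted interval boundaries. The following is a minimal sketch, assuming bearer identifiers are compared as plain strings in lexicographic order and that identifiers at or beyond a9 belong to no partition; `data_partition` is a hypothetical helper.

```python
from bisect import bisect_right

# Upper bounds of data partitions 1..9: [, a1), [a1, a2), ..., [a8, a9).
BOUNDARIES = [f"a{i}" for i in range(1, 10)]

def data_partition(bearer_id, boundaries=BOUNDARIES):
    """Return the 1-based number of the data partition whose half-open
    interval contains bearer_id, or None if it lies at or beyond a9."""
    idx = bisect_right(boundaries, bearer_id)
    if idx == len(boundaries):
        return None
    return idx + 1

p_low = data_partition("a0")    # below a1, so partition 1
p_edge = data_partition("a1")   # lower bounds are inclusive, so partition 2
p_mid = data_partition("a25")   # "a2" <= "a25" < "a3", so partition 3
p_out = data_partition("a9")    # upper bound a9 is exclusive
```

Because the intervals are disjoint and sorted, a binary search finds the partition in logarithmic time and each identifier maps to at most one partition.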
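In addition, each data partition of the data table can split or expand automatically. For example, as time passes, a data partition holds more and more data. When the amount of data in the data partition reaches a split threshold, the server can split the data partition into two data partitions, which avoids the situation in which new data can no longer be written because the storage space of the data partition is full.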
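The text does not say where a full data partition is split. The sketch below assumes, purely for illustration, that the split point is the median bearer identifier stored in the partition; `split_partition` and its parameters are hypothetical.

```python
def split_partition(interval, bearer_ids, threshold):
    """Split a half-open interval (lo, hi) into two intervals once the number
    of stored bearer identifiers reaches the threshold.

    The median stored identifier is used as the split point, which is an
    assumption of this sketch rather than something the description states.
    """
    lo, hi = interval
    if len(bearer_ids) < threshold:
        return [interval]
    mid = sorted(bearer_ids)[len(bearer_ids) // 2]
    if mid == lo:  # splitting here would leave an empty first half
        return [interval]
    return [(lo, mid), (mid, hi)]

ids = ["a11", "a12", "a13", "a14"]
after_split = split_partition(("a1", "a2"), ids, threshold=4)
unchanged = split_partition(("a1", "a2"), ids, threshold=5)
```

The two resulting intervals remain disjoint and together cover the original interval, so the rule that a record belongs to exactly one data partition still holds.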
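The bitmap index partitions can be divided in a way similar to the data partitions. For example, when the data partitions are divided by defining the partition interval of each data partition, the bitmap index partitions are also divided by defining the partition interval of each bitmap index partition. That is, the partition interval of each bitmap index partition is set in advance by the user, and the server divides the bitmap index partitions according to the preset partition intervals. The ranges of bearer identifiers corresponding to the bitmap index partitions are also the same. For ease of later description, the range of bearer identifiers corresponding to each bitmap index partition is called the second range.
For example, the non-overlapping partition intervals [b0, c0), [c0, d0), [d0, e0), [e0, f0), and [f0, j0) are set in advance. Based on these intervals, the bitmap index can be divided into bitmap index partition 1, bitmap index partition 2, bitmap index partition 3, bitmap index partition 4, and bitmap index partition 5. Bitmap index partition 1 stores the bitmaps of the tag values for bearer identifiers in the interval [b0, c0), bitmap index partition 2 stores those for [c0, d0), bitmap index partition 3 stores those for [d0, e0), bitmap index partition 4 stores those for [e0, f0), and bitmap index partition 5 stores those for [f0, j0).
Note that each bitmap index partition stores the bitmaps of the tag values for only a portion of the bearer identifiers. For the bitmap index as a whole, the bitmap of each tag value is therefore made up of the partial bitmaps of that tag value in the individual bitmap index partitions. For ease of description, the partial bitmap of a tag value in one bitmap index partition is called the sub-bitmap of the tag value, so the bitmap of each tag value is the combination of the corresponding sub-bitmaps in all bitmap index partitions.
That is, within one bitmap index partition, the same bitmap bit in different sub-bitmaps corresponds to the same bearer identifier, whereas the same bitmap bit in sub-bitmaps of different bitmap index partitions corresponds to different bearer identifiers.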
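To make the role of sub-bitmaps concrete, the sketch below answers a tag-value query by scanning the sub-bitmap of that tag value in every bitmap index partition and translating each set bit through that partition's own bit-to-bearer table. The dictionary layout (`sub_bitmaps`, `bearer_by_bit`) and the function `bearers_with_tag` are assumptions made for this illustration.

```python
def bearers_with_tag(tag_value, index_partitions):
    """Collect the bearer identifiers that have tag_value across all
    bitmap index partitions."""
    found = []
    for part in index_partitions:
        sub_bitmap = part["sub_bitmaps"].get(tag_value, "")
        for pos, bit in enumerate(sub_bitmap, start=1):
            if bit == "1":
                # The same bit position means a different bearer in each partition.
                found.append(part["bearer_by_bit"][pos])
    return found

partition_1 = {
    "sub_bitmaps": {"online shopping expert": "010"},
    "bearer_by_bit": {1: "b01", 2: "b02", 3: "b03"},
}
partition_2 = {
    "sub_bitmaps": {"online shopping expert": "101"},
    "bearer_by_bit": {1: "c01", 2: "c02", 3: "c03"},
}
experts = bearers_with_tag("online shopping expert", [partition_1, partition_2])
```

Bit 1 resolves to b01 in the first partition and to c01 in the second, which is the per-partition meaning of bitmap bits described above.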
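Because a sub-bitmap is created in each bitmap index partition for every tag value in the tag definition table, the number of sub-bitmaps in each bitmap index partition equals the total number of tag values in the tag definition table. For example, if the tag definition table defines 10 tag values in total, then after sub-bitmaps have been created for all of them in each bitmap index partition, each bitmap index partition also has 10 sub-bitmaps.
Optionally, because expanding or splitting a bitmap index partition would require all bitmap index partitions in the bitmap index to be rebuilt, which is costly, the server in this embodiment sets the bitmap index partitions so that they cannot be split or expanded.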
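The following two embodiments describe in detail how at least one data record is stored in the bitmap index and in the data table, respectively.
FIG. 2 is a flowchart of a data storage method according to an embodiment of the present invention, applied to the scenario of storing the at least one data record in the bitmap index. As shown in FIG. 2, the data storage method includes the following steps:
Step 201: Obtain at least one data record, where each data record includes one bearer identifier and at least one tag value.
Specifically, at least one piece of source data is obtained, where each piece of source data includes a bearer identifier and at least one tag. For each piece of source data, the tag value of each of the at least one tag is determined according to a preset tag definition table, to obtain at least one tag value.
The at least one piece of source data may be data stored in the Hadoop Distributed File System (HDFS). That is, when a client needs to store data, it sends the data to the server; the server first stores the data in HDFS and then stores the data based on the source data in HDFS. Note that the server may obtain the at least one piece of source data from HDFS through a default path or through a preset path; this is not specifically limited in the embodiments of the present invention.
In addition, the tag definition table may be information that the server obtains and stores in advance. Optionally, the tag definition table can be stored as a standalone file, for example as an Extensible Markup Language (XML) file, or it can be stored in a third-party distributed storage system, for example in ZooKeeper.
The preset tag definition table records multiple preset tag values. In one optional way of presetting them, the tag values of a tag are set based on historical data or defined manually.
Table 1 shows one possible tag definition table. Table 1 may of course include more or fewer tags; this is not limited here.
Table 1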
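Tag | Tag values | Tag configuration information
Gender | male, female | memory-resident
Education | junior college, bachelor's, postgraduate, doctorate | not memory-resident
Occupation | student, teacher, self-employed, company employee | memory-resident
Online shopping fanatic | online shopping fanatic | memory-resident
Drug user | drug user | not memory-resident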
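Optionally, as shown in Table 1, the tag definition table can also include tag configuration information, which indicates whether a tag value needs to be memory-resident. The bitmap corresponding to a tag value that needs to be memory-resident also needs to be memory-resident, and the bitmap corresponding to a tag value that does not need to be memory-resident does not need to be either.
In Table 1, tag values that need to be memory-resident are marked "memory-resident", and tag values that do not are marked "not memory-resident". It should be understood that it is also possible to mark only the tag values that need to be memory-resident and to leave the others unmarked; marking them "not memory-resident" in Table 1 is only an example.
Optionally, the tag definition table can also include the life cycle of each tag value. The life cycle is the period during which the tag value is valid; at any other time the tag value is invalid.
Optionally, the server can also assign a tag number to each tag value in Table 1. When the mapping between tag values and bitmaps is stored, the tag number can be used in place of the tag value; storing tag numbers saves storage space compared with storing tag values. In addition, the tag value can be looked up from the tag number, and the tag number from the tag value.
For example, Table 2 shows a format of source data according to an embodiment of the present invention. Each row in Table 2 represents one piece of source data. Each piece of source data has a unique bearer identifier and also includes at least one tag corresponding to the bearer identifier.
Table 2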
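Bearer identifier | Tags
a01 | gender: male, education: bachelor's
a02 | gender: female, education: junior college, occupation: self-employed, online shopping expert
b01 | gender: male, education: junior college, occupation: company employee
b02 | gender: female, education: bachelor's, occupation: student
c01 | gender: male, education: postgraduate
c02 | gender: female, education: junior college, occupation: company employee, online shopping expert
d01 | gender: male, education: postgraduate, occupation: company employee, online shopping expert
d02 | gender: female, education: bachelor's, occupation: student
d03 | gender: male, education: junior college, occupation: company employee
e01 | gender: male, education: postgraduate, occupation: self-employed, drug user
e02 | gender: female, education: junior college, occupation: self-employed
e03 | gender: male, education: bachelor's, occupation: student
f01 | gender: male, education: postgraduate, occupation: teacher
f02 | gender: female, education: junior college, occupation: company employee, online shopping expert
f03 | gender: male, education: bachelor's, occupation: student
f04 | gender: female, education: postgraduate, occupation: self-employed, drug user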
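For the source data shown in Table 2, Table 3 shows the data records that the server determines for each piece of source data according to Table 1, where the content of each [] in Table 3 represents one tag value. Each row in Table 3 represents one data record, and each data record includes one bearer identifier and at least one tag value.
Table 3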
Figure PCTCN2018087377-appb-000001
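In the embodiments of the present invention, after the server obtains at least one data record, it can store the at least one data record in the bitmap index by using a preset mapping/protocol (map/reduce) model. For ease of later description, the preset mapping/protocol model is explained here.
The preset mapping/protocol model is a parallel computing model that mainly includes two computing processes, a mapping (map) process and a protocol (reduce) process. The mapping process classifies data records according to the type of data to be stored, and the protocol process stores the data records into the corresponding files according to the protocol partitions that the data records correspond to.
The preset mapping/protocol model includes multiple protocol partitions. Each protocol partition corresponds to one data interval and processes the data that belongs to that interval, and different protocol partitions work in parallel. It is precisely because different protocol partitions work in parallel that the preset mapping/protocol model can determine the bitmaps of the tag values in the bitmap index partitions in parallel.
In addition, the mapping process maps the data records in parallel, so the preset mapping/protocol model can process data in batches in parallel, which likewise improves the efficiency of data processing.
Note that in the embodiments of the present invention, in addition to being stored in the bitmap index, the at least one data record can also be stored in the data table; that is, the data records need to be stored in both the data table and the bitmap index. To distinguish the data table from the bitmap index, a data table identifier and a bitmap index table identifier are introduced here. The data table identifier uniquely identifies the data table, and the bitmap index table identifier uniquely identifies the bitmap index.
Because the data records need to be stored in the corresponding data partitions of the data table and in the corresponding bitmap index partitions of the bitmap index, the protocol partitions of the preset mapping/protocol model can be set to the combination of the data partitions of the data table and the bitmap index partitions of the bitmap index. The preset mapping/protocol model can then store the data records directly into the corresponding data partitions and bitmap index partitions. For ease of later description, the N protocol partitions in one-to-one correspondence with the N bitmap index partitions of the bitmap index are called first protocol partitions, and the M protocol partitions in one-to-one correspondence with the M data partitions of the data table are called second protocol partitions.
Accordingly, the mapping process of the preset mapping/protocol model also includes two different kinds of mapping processing: the mapping performed when the at least one data record is stored in the bitmap index, called first-type mapping processing, and the mapping performed when the at least one data record is stored in the data table, called second-type mapping processing.
Similarly, the protocol process of the preset mapping/protocol model also includes two different kinds of protocol processing: the protocol processing performed when the at least one data record is stored in the bitmap index, called first-type protocol processing, and the protocol processing performed when the at least one data record is stored in the data table, called second-type protocol processing.
It follows that when at least one data record is obtained, in order to store the at least one data record in the corresponding bitmap index partitions through the preset mapping/protocol model, the N first protocol partitions included in the preset mapping/protocol model need to be determined first. Specifically, this can be done through the following step 202.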
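Step 202: Determine the N first protocol partitions in the preset mapping/protocol model.
Specifically, the partition information of the bitmap index is determined. The partition information of the bitmap index describes the set of bearer identifiers corresponding to each bitmap index partition in the bitmap index. The N first protocol partitions in the preset mapping/protocol model are determined according to the partition information of the bitmap index, and each first protocol partition corresponds to one bitmap index partition. That is, the N first protocol partitions are determined according to the partition information of the N bitmap index partitions included in the bitmap index.
Note that each partition interval of the data table represents a set of bearer identifiers, and so does each partition interval of the bitmap index. If the partition intervals of the data table were used directly as the partition intervals of the M second protocol partitions in the preset mapping/protocol model, and the partition intervals of the bitmap index were used directly as the partition intervals of the N first protocol partitions, the N first protocol partitions and the M second protocol partitions could intersect.
Therefore, to prevent different protocol partitions from intersecting, the bitmap index table identifier that identifies the bitmap index is added to the partition intervals of the bitmap index, and the partition intervals of the bitmap index with the identifier added are used as the partition intervals of the N first protocol partitions in the preset mapping/protocol model. That is, the partition information of each first protocol partition consists of the bitmap index table identifier and bearer identifiers within a preset interval range.
For example, the following partition intervals are set for the bitmap index in advance:
[b0, c0), [c0, d0), [d0, e0), [e0, f0), and [f0, j0).
The following first protocol partitions can then be set for the preset mapping/protocol model:
[Bb0, Bc0), [Bc0, Bd0), [Bd0, Be0), [Be0, Bf0), and [Bf0, Bj0).
Here, B is the identifier that identifies the bitmap index, that is, the bitmap index table identifier. The first protocol partitions [Bb0, Bc0), [Bc0, Bd0), [Bd0, Be0), [Be0, Bf0), and [Bf0, Bj0) are the protocol partitions in one-to-one correspondence with the bitmap index partitions.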
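The effect of prefixing the intervals with the table identifier can be sketched as follows. The intervals and the identifiers "B" and "A" follow the examples in the description; `prefix_intervals` and `find_partition` are hypothetical helpers, and keys are assumed to compare as plain strings.

```python
def prefix_intervals(table_id, intervals):
    """Prepend the table identifier to both bounds of every interval."""
    return [(table_id + lo, table_id + hi) for lo, hi in intervals]

def find_partition(key, partitions):
    """Return the position of the half-open interval containing key, or None."""
    for idx, (lo, hi) in enumerate(partitions):
        if lo <= key < hi:
            return idx
    return None

BITMAP_INDEX_INTERVALS = [("b0", "c0"), ("c0", "d0"), ("d0", "e0"),
                          ("e0", "f0"), ("f0", "j0")]
FIRST_PROTOCOL_PARTITIONS = prefix_intervals("B", BITMAP_INDEX_INTERVALS)

hit = find_partition("Bc01", FIRST_PROTOCOL_PARTITIONS)   # key of a first mapping result
miss = find_partition("Ac01", FIRST_PROTOCOL_PARTITIONS)  # key of a second mapping result
```

A key that starts with the data table identifier never falls into a first protocol partition, which is the separation the prefix is meant to provide.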
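Note that in the preset mapping/protocol model, different first protocol partitions can process in parallel the data that belongs to their own partition intervals. The at least one data record therefore needs to be classified according to the N first protocol partitions, so that each first protocol partition processes the data that belongs to it.
That is, based on the bearer identifier included in each data record, first-type classification is performed on the at least one data record according to the partition information of the N first protocol partitions included in the preset mapping/protocol model, to obtain at least one first mapping set. Each first mapping set corresponds to one first protocol partition, so that the first protocol partition can process the data in its first mapping set. Specifically, this can be done through the following steps 203 and 204.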
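Step 203: Perform first-type mapping processing on the at least one data record in parallel by using the preset mapping/protocol model, to obtain at least one first mapping result, where each first mapping result includes the bitmap index table identifier, a bearer identifier, and at least one tag value.
As step 202 shows, the partition intervals of the N first protocol partitions in the preset mapping/protocol model are not actually the partition intervals of the bitmap index partitions in the bitmap index. The first-type mapping processing therefore mainly adds the bitmap index table identifier to each data record, so that the first protocol partition corresponding to each data record can be determined later.
That is, the bitmap index table identifier is added to each data record to obtain a first mapping result.
Note that the preset mapping/protocol model adds the bitmap index table identifier to the data records in parallel; that is, it adds the identifier to all data records at the same time. The time the preset mapping/protocol model needs to add the bitmap index table identifier to one data record is therefore the same as the time it needs for n data records, which improves the efficiency of adding the bitmap index table identifier to the at least one data record.
In addition, a first mapping result can be recorded in key-value format. Specifically, Table 4 shows a format of the first mapping result according to an embodiment of the present invention. As shown in Table 4, the bitmap index table identifier and the bearer identifier in the first mapping result together form the key, and the at least one tag value in the first mapping result is the value of that key.
Table 4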
Mapping result | Key | Value
First mapping result | bitmap index table identifier + bearer identifier | list of tag values
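When a first mapping result is recorded in key-value format, remark information can also be added to the value. The remark information includes the generation time of each of the at least one tag value, or the internal identification (ID) of each tag value. When the remark information includes the internal ID of each tag value, the internal ID can be used in place of the tag value to reduce the amount of data transmitted.
For example, let B be the bitmap index table identifier. For the first data record in Table 3, "a01->{gender: male, education: bachelor's}", the bearer identifier is a01 and the data record includes the two tag values "gender: male" and "education: bachelor's". The preset mapping/protocol model performs first-type mapping processing on the data record to obtain a first mapping result, which includes the bitmap index table identifier B, the bearer identifier a01, and the two tag values "male" and "bachelor's". Recording the first mapping result in the format shown in Table 4 gives the first mapping result shown in Table 5; that is, the first mapping result is recorded as data whose key is Ba01 and whose value is {gender: male, education: bachelor's}.
Table 5
Ba01 | {gender: male, education: bachelor's}
For the first mapping result shown in Table 5, if the system has configured internal ID 1 for the tag value "gender: male", internal ID 2 for "gender: female", internal ID 3 for "education: bachelor's", and internal ID 4 for "education: junior college", the value for the key Ba01 in Table 5 can also be recorded as {1, 3}.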
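The first-type mapping of a single record, with and without internal IDs, might look like the sketch below. The internal IDs 1 to 4 follow the example above; `first_type_map`, its parameters, and the record layout are assumptions of this illustration.

```python
# Internal IDs configured for tag values, as in the example above.
TAG_IDS = {
    "gender: male": 1,
    "gender: female": 2,
    "education: bachelor's": 3,
    "education: junior college": 4,
}

def first_type_map(record, table_id="B", tag_ids=None):
    """Turn (bearer_id, tag_values) into a key-value first mapping result.

    The key is the bitmap index table identifier followed by the bearer
    identifier. If tag_ids is given, tag values are replaced by internal IDs.
    """
    bearer_id, tag_values = record
    if tag_ids is None:
        value = list(tag_values)
    else:
        value = [tag_ids[tag] for tag in tag_values]
    return table_id + bearer_id, value

record = ("a01", ["gender: male", "education: bachelor's"])
plain = first_type_map(record)
compact = first_type_map(record, tag_ids=TAG_IDS)
```

The compact form carries the same information in fewer bytes, which is the transmission saving the description attributes to internal IDs.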
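Note that for a given bearer identifier, the system may already have configured a corresponding bitmap bit, in which case the first mapping result also includes the bitmap bit of the bearer identifier. When the system has not configured a bitmap bit for the bearer identifier, the first mapping result does not include one.
When the first mapping result includes the bitmap bit of the bearer identifier and is still recorded in key-value format, the mapping result can take the format shown in Table 6 or Table 7. As shown in Table 6, the bitmap index table identifier, the bearer identifier, and the bitmap bit of the bearer identifier together form the key, and the value is still the at least one tag value in the first mapping result.
As shown in Table 7, the bitmap index table identifier and the bearer identifier can also together form the key, with the at least one tag value and the bitmap bit of the bearer identifier together forming the value of that key.
For example, for the first data record in Table 3, "a01->{gender: male, education: bachelor's}", if the system has already configured a bitmap bit for the bearer identifier a01 and that bitmap bit is 5, recording the first mapping result in the format shown in Table 6 gives the first mapping result shown in Table 8: data whose key is (Ba01, 5) and whose value is {gender: male, education: bachelor's}.
Table 6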
Figure PCTCN2018087377-appb-000002
Table 7
Figure PCTCN2018087377-appb-000003
Table 8
Ba01, 5 | {gender: male, education: bachelor's}
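After the first-type mapping processing is performed on the at least one data record through the preset mapping/protocol model, at least one first mapping result is obtained; that is, each data record yields a first mapping result as shown in Table 4 or Table 6. The at least one first mapping result then needs to undergo first-type classification through the following step 204.
Step 204: Classify the at least one first mapping result according to the partition information of the N first protocol partitions, to obtain at least one first mapping set, where each first mapping set corresponds to one first protocol partition.
In the preset mapping/protocol model, different protocol partitions can process in parallel the data that belongs to their own partition intervals. The at least one first mapping result therefore needs to be assigned to the corresponding first protocol partitions.
For each of the at least one first mapping result, the partition interval to which the bearer identifier in the first mapping result belongs is looked up among the partition intervals of the N first protocol partitions, based on the bearer identifier and the bitmap index table identifier in the first mapping result. This classifies the at least one first mapping result. The classification produces at least one first mapping set, and each first mapping set includes at least one first mapping result.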
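Step 204 amounts to grouping key-value pairs by the interval that contains their key. The sketch below is one way to do that, assuming string comparison of keys and the first protocol partitions from the earlier example; `classify` is a hypothetical helper, and results that fall into no partition are simply skipped here.

```python
from collections import defaultdict

FIRST_PROTOCOL_PARTITIONS = [("Bb0", "Bc0"), ("Bc0", "Bd0"), ("Bd0", "Be0"),
                             ("Be0", "Bf0"), ("Bf0", "Bj0")]

def classify(mapping_results, partitions=FIRST_PROTOCOL_PARTITIONS):
    """Group (key, value) mapping results into mapping sets, one per
    protocol partition, keyed by the position of the partition."""
    mapping_sets = defaultdict(list)
    for key, value in mapping_results:
        for idx, (lo, hi) in enumerate(partitions):
            if lo <= key < hi:
                mapping_sets[idx].append((key, value))
                break  # intervals are disjoint, so at most one can match
    return dict(mapping_sets)

results = [
    ("Bb01", ["gender: male"]),
    ("Bc02", ["gender: female"]),
    ("Bb02", ["gender: female"]),
]
mapping_sets = classify(results)
```

Only partitions that actually receive results appear in the output, which matches the wording that at least one, not necessarily N, first mapping sets are obtained.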
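Step 205: Perform first-type protocol processing on the at least one first mapping set in parallel by using the first protocol partition corresponding to each first mapping set, to obtain the bitmaps of the tag values in each bitmap index partition.
Because the first protocol partitions process their first mapping sets in parallel, the following explanation describes the first-type protocol processing of a single first mapping set. Specifically, the first-type protocol processing consists of the following two processes:
(1) For each first mapping set, determine the first protocol partition corresponding to the first mapping set, and sort, by using the first protocol partition, the first mapping results in the first mapping set according to the bearer identifier in each first mapping result.
When the server performs the first-type protocol processing through the preset mapping/protocol model, each data record has a corresponding first mapping result, and the first mapping set to which that first mapping result belongs has already been determined in step 204. Because the data records corresponding to the first mapping results in one first mapping set will all be stored in the same bitmap index partition, the server first sorts the first mapping results in the first mapping set through the corresponding first protocol partition, and then stores the data in the first mapping set into the corresponding bitmap index partition one by one in sorted order.
The first mapping results in a first mapping set are usually sorted with a default sorting method, namely in ascending or descending lexicographic order of the bearer identifiers; this is not specifically limited in the embodiments of the present invention.
For example, if a first mapping set includes three first mapping results whose bearer identifiers are a01, a02, and a03, the three first mapping results can be sorted in the order a01, a02, a03.
(2) For each sorted first mapping result, in sorted order, obtain from the bitmap index partition corresponding to the first protocol partition the bitmap of each of the at least one tag value included in the first mapping result, and update the bitmap of the tag value according to the bitmap bit of the bearer identifier, to obtain the bitmaps of the tag values in the bitmap index partition corresponding to the first protocol partition.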
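As step 203 shows, the first mapping result of a data record may or may not include the bitmap bit of the bearer identifier. There are therefore two ways to update, according to the bitmap bit of the bearer identifier, the bitmap of each of the at least one tag value included in the first mapping result:
In the first way, when the first mapping result also includes the bitmap bit of the bearer identifier, the bitmap of the corresponding tag value is updated according to that bitmap bit.
In the second way, when the first mapping result does not include the bitmap bit of the bearer identifier, the bitmap bit of the bearer identifier is obtained, and the bitmap of the corresponding tag value is updated according to that bitmap bit.
In either way, updating the bitmap of each of the at least one tag value included in the first mapping result requires first determining the bitmap bit of the bearer identifier. Once the bitmap bit has been determined, the value at that bitmap bit is updated in the bitmap of each of the at least one tag value included in the first mapping result.
In the embodiments of the present invention, the bitmap of a tag value can be stored in the manner shown in FIG. 1; that is, the value of the bitmap at each bitmap bit is 0 or 1. In this case, updating the value of the bitmap of the tag value at the bitmap bit of the bearer identifier means setting that value to 1.
Note that the bitmap of each tag value is determined by setting the value of the bitmap at the bitmap bit of the bearer identifier to 1. In the embodiments of the present invention, the value of each tag value's bitmap at every bitmap bit is therefore initialized in advance, that is, set to 0. Then, for the tag values in a given bitmap index partition, when the first of the first mapping results is being processed, the bitmap bit of the bearer identifier included in that first mapping result is determined, and for each of the at least one tag value included in that first mapping result, the value of the sub-bitmap of the tag value at the bitmap bit of the bearer identifier is changed to 1. This updates the sub-bitmaps of the at least one tag value, and therefore the bitmap index partition. The other tag values in the bitmap index partition are left untouched; that is, the values of their sub-bitmaps at the bitmap bit of the bearer identifier remain 0, indicating that the bearer identifier does not have those tag values.
After the first of the first mapping results has been processed, the second one is processed in essentially the same way. The difference is that the bitmap index partition is now updated further, starting from the state it was in after the update for the first of the first mapping results. That is, when the sub-bitmaps of the at least one tag value in the second of the first mapping results are updated, the sub-bitmaps of the at least one tag value in the first of the first mapping results already have the value 1 at the bitmap bit of that result's bearer identifier.
In other words, while the bitmaps of the tag values included in the first mapping results are determined one after another, each first mapping result updates the bitmaps of the tag values on top of the state left by the update for the previous first mapping result.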
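The two parts of the first-type protocol processing, sorting and incremental bitmap updates, can be sketched for one first mapping set as follows. Sub-bitmaps are modeled as lists of 0 and 1, and `first_type_reduce`, `bit_of`, and the sample tag list are assumptions of this illustration; a single-string table identifier prefix on the keys is assumed as well.

```python
def first_type_reduce(mapping_set, bit_of, all_tags, width, table_id="B"):
    """Sort one first mapping set by key, then set the bearer's bitmap bit
    in the sub-bitmap of every tag value that the bearer has."""
    # Every tag value starts with an all-zero sub-bitmap.
    sub_bitmaps = {tag: [0] * width for tag in all_tags}
    for key, tag_values in sorted(mapping_set, key=lambda kv: kv[0]):
        bearer_id = key[len(table_id):]  # strip the table identifier prefix
        bit = bit_of[bearer_id]          # 1-based bitmap bit of the bearer
        for tag in tag_values:
            sub_bitmaps[tag][bit - 1] = 1
    return sub_bitmaps

ALL_TAGS = ["gender: male", "gender: female",
            "education: junior college", "education: bachelor's"]
mapping_set = [
    ("Ba02", ["gender: female", "education: junior college"]),
    ("Ba01", ["gender: male", "education: bachelor's"]),
]
sub_bitmaps = first_type_reduce(mapping_set, {"a01": 1, "a02": 2}, ALL_TAGS, width=4)
```

Each iteration works on the state left by the previous one, so after a01 and a02 are processed the partition holds the bits of both bearers, and tag values a bearer lacks keep a 0 at that bearer's bit.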
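For example, Table 9 shows an initialized bitmap index according to an embodiment of the present invention. As shown in Table 9, the bitmap index includes multiple bitmap index partitions, each bitmap index partition includes the sub-bitmaps of all tag values, and the initial value of every sub-bitmap at every bitmap bit is 0.
Table 9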
Figure PCTCN2018087377-appb-000004
Figure PCTCN2018087377-appb-000005
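For the first data record in Table 3, "a01->{gender: male, education: bachelor's}", through the ninth data record, "d03->{gender: male, education: junior college, occupation: company employee}", the first mapping results of these nine data records are all in the format shown in Table 5, and step 204 classifies them into the first protocol partition corresponding to bitmap index partition 1. The nine first mapping results are sorted by their bearer identifiers a01, a02, b01, b02, c01, c02, d01, d02, and d03, giving the first mapping result of the first data record, the first mapping result of the second data record, ..., and the first mapping result of the ninth data record, in that order.
Suppose the system has configured the bitmap bits 1, 2, 3, 4, 5, 6, 7, 8, and 9 for the bearer identifiers a01, a02, b01, b02, c01, c02, d01, d02, and d03, respectively. The first mapping result with bearer identifier a01 includes two tag values, "gender: male" and "education: bachelor's". As Table 9 shows, these two tag values correspond to the sub-bitmap of the first tag value and the sub-bitmap of the fourth tag value in bitmap index partition 1. The values of these two sub-bitmaps at bitmap bit 1 are updated to 1, giving bitmap index partition 1 as shown in Table 10.
Table 10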
Figure PCTCN2018087377-appb-000006
Figure PCTCN2018087377-appb-000007
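The first mapping result with bearer identifier a02 includes four tag values: "gender: female", "education: junior college", "occupation: self-employed", and "online shopping expert". As Table 9 shows, these four tag values correspond in bitmap index partition 1 to the sub-bitmap of the second tag value, the sub-bitmap of the third tag value, the sub-bitmap of the sixth tag value, and the sub-bitmap of the second-to-last tag value. Starting from Table 10, the values of these four sub-bitmaps at bitmap bit 2 are updated to 1, giving bitmap index partition 1 as shown in Table 11.
Table 11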
Figure PCTCN2018087377-appb-000008
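The process continues in the same way until the nine first mapping results corresponding to these nine data records have all been processed. The nine data records are then all stored in bitmap index partition 1; that is, the bitmaps of the tag values in bitmap index partition 1 are obtained.
Optionally, the bitmap of a tag value can also be represented as an array, in which case the array lists the bitmap bits that are "1" in the bitmap of the tag value. For example, if the tag value "drug user" corresponds to the bitmap "[0000000001000000....]", the bitmap can also be represented as the array [10]; if the tag value "online shopping expert" corresponds to the bitmap "[0100011000000100....]", the bitmap can also be represented as the array [2, 6, 7, 14]. Representing the bitmap of a tag value as an array can save storage space.
In this case, updating the value of the bitmap of the tag value at the bitmap bit of the bearer identifier means adding the bitmap bit of the bearer identifier to the array of the tag value. For example, if the bitmap bit of the bearer identifier in the sub-bitmap is 3 and the initial sub-bitmap of the tag value is [1, 7], then after the update the sub-bitmap of the tag value is [1, 3, 7].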
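The array representation and its update rule can be sketched as follows, using the bitmaps from the example above without their trailing dots. `to_array` and `set_bit` are hypothetical helper names; keeping the array sorted is an assumption that matches the [1, 3, 7] example.

```python
from bisect import insort

def to_array(bitmap):
    """Return the 1-based positions of the '1' bits in a bitmap string."""
    return [pos for pos, bit in enumerate(bitmap, start=1) if bit == "1"]

def set_bit(array, bit):
    """Record a bitmap bit in the array form, keeping the array sorted and
    free of duplicates. The list is modified in place and returned."""
    if bit not in array:
        insort(array, bit)
    return array

drug_user = to_array("0000000001000000")
shopping_expert = to_array("0100011000000100")
updated = set_bit([1, 7], 3)
```

The array form stores one entry per set bit, so it is compact when only a few bearers in a partition have the tag value.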
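In addition, when a first mapping result does not include the bitmap bit of the bearer identifier, the system had not yet configured a bitmap bit for the bearer identifier in the corresponding data record before the data record was mapped. In this case, after the bitmap bit of the bearer identifier is obtained, the correspondence between the bitmap bit and the bearer identifier can also be stored.
Specifically, based on the bitmap bit of the bearer identifier and the bearer identifier, a first key-value pair is determined that indicates the mapping from the bearer identifier to the bitmap bit, where the key is the bearer identifier and the value is the bitmap bit of the bearer identifier. A second key-value pair is determined that indicates the mapping from the bitmap bit to the bearer identifier, where the key is the bitmap bit of the bearer identifier and the value is the bearer identifier. The first key-value pair and the second key-value pair are then stored.
That is, in the embodiments of the present invention, the bitmap bit of a bearer identifier and the bearer identifier are stored as a bidirectional mapping, so that the bitmap bit can later be looked up by bearer identifier, or the bearer identifier looked up by bitmap bit.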
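The bidirectional mapping can be kept in two dictionaries, one per direction. The text does not say how a new bitmap bit is chosen, so the sketch below assumes the next unused position is assigned; the class `BitAllocator` and its method names are hypothetical.

```python
class BitAllocator:
    """Keeps the two key-value mappings between bearer identifiers and
    bitmap bits for one bitmap index partition."""

    def __init__(self):
        self.bit_by_bearer = {}  # first key-value pairs: bearer identifier -> bit
        self.bearer_by_bit = {}  # second key-value pairs: bit -> bearer identifier

    def get_or_assign(self, bearer_id):
        """Return the bitmap bit of bearer_id, assigning the next free
        position (an assumption of this sketch) if none is configured."""
        bit = self.bit_by_bearer.get(bearer_id)
        if bit is None:
            bit = len(self.bit_by_bearer) + 1
            self.bit_by_bearer[bearer_id] = bit
            self.bearer_by_bit[bit] = bearer_id
        return bit

allocator = BitAllocator()
bit_a01 = allocator.get_or_assign("a01")
bit_a02 = allocator.get_or_assign("a02")
bit_a01_again = allocator.get_or_assign("a01")
```

The forward dictionary is what the update step needs when a first mapping result lacks a bitmap bit, and the reverse dictionary is what a tag-value query needs to turn set bits back into bearer identifiers.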
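Step 206: Store the obtained bitmaps of the tag values in each bitmap index partition into the corresponding bitmap index partition.
As step 205 shows, within one first mapping set, the at least one bitmap determined for each first mapping result is determined on the basis of the at least one bitmap determined for the previous first mapping result. Therefore, whenever the at least one bitmap has been determined for a first mapping result, it first needs to be stored in the bitmap index partition corresponding to that first mapping result, so that the next first mapping result can be processed on the basis of the updated target bitmap index partition.
Accordingly, once the first protocol partition corresponding to a first mapping set has performed the first-type protocol processing on all first mapping results in the set, the bitmaps of the tag values in the bitmap index partition corresponding to that first protocol partition are obtained, and they can be stored directly in the bitmap index partition.
In the embodiments of the present invention, when at least one data record is obtained, the bitmaps of the tag values in each bitmap index partition included in the bitmap index can be determined by using the preset mapping/protocol model based on the at least one data record, so that the at least one data record is stored in the corresponding bitmap index partition. Because the bitmap index includes at least one bitmap and each bitmap corresponds to one tag value, the bearer identifiers that have a given tag value can be looked up through the bitmap index, which improves the efficiency of data queries based on tag values. In addition, the preset mapping/protocol model can determine the bitmaps of the tag values in the bitmap index partitions in parallel, which improves the efficiency of storing data.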
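FIG. 3 is a flowchart of a data storage method provided by an embodiment of the present invention, applied to the scenario of storing the at least one data record into a data table. As shown in FIG. 3, the data storage method includes the following steps:
Step 301: Obtain at least one data record, where each data record includes one carrier identifier and at least one tag value.
Step 301 is implemented in essentially the same way as step 201 in FIG. 2 and is not described in detail again here.
It should be noted that, as described for step 201 in FIG. 2, the preset map/reduce model includes M second reduce partitions, which correspond one-to-one to the M data partitions of the data table. Storing the at least one data record into the data table requires the second-type map processing and the second-type reduce processing of the preset map/reduce model.
That is, when at least one data record is obtained, in order to store it into the corresponding data partitions through the preset map/reduce model, the M second reduce partitions included in the preset map/reduce model must first be determined. Specifically, this can be done through step 302 below.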
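Step 302: Determine the M second reduce partitions in the preset map/reduce model.
Specifically, the partition information of the data table is determined; it describes the set of carrier identifiers corresponding to each data partition in the data table. Based on this partition information, the M second reduce partitions in the preset map/reduce model are determined, each corresponding to one data partition.
As described for step 202 in FIG. 2, to prevent different reduce partitions from overlapping, a data table identifier that identifies the data table is added to the partition intervals of the data table, and the resulting intervals are used as the partition intervals of the M second reduce partitions in the preset map/reduce model. In other words, the partition information of each second reduce partition consists of the data table identifier and carrier identifiers within a preset interval range.
For example, the following partition intervals are preset for the data table:
[, a1), [a1, a2), [a2, a3), …, [a8, a9).
The following second reduce partitions can then be set for the preset map/reduce model:
[, Aa1), [Aa1, Aa2), [Aa2, Aa3), …, [Aa8, Aa9).
Here, A is the identifier used to identify the data table, that is, the data table identifier. The second reduce partitions [, Aa1), [Aa1, Aa2), [Aa2, Aa3), …, [Aa8, Aa9) are thus the reduce partitions in one-to-one correspondence with the data partitions.
Likewise, because different second reduce partitions can process the data that falls in their own partition intervals in parallel, the at least one data record must be classified according to the M second reduce partitions, so that each second reduce partition handles the data that belongs to it.
That is, based on the carrier identifier included in each data record, and according to the partition information of the M second reduce partitions included in the preset map/reduce model, second-type classification is performed on the at least one data record to obtain at least one second map set. Each second map set corresponds to one second reduce partition, so that the second reduce partition can process the data in its second map set. Specifically, this can be done through steps 303 and 304 below.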
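Step 303: Perform second-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one second map result, where each second map result includes the data table identifier, a carrier identifier and at least one tag value.
As can be seen from step 302, the partition intervals of the M second reduce partitions in the preset map/reduce model are not literally the partition intervals of the data partitions in the data table. The second-type map processing therefore mainly consists of adding the data table identifier to each data record, so that the second reduce partition corresponding to each data record can be determined later.
That is, the preset map/reduce model adds the data table identifier to each data record to obtain a second map result.
It should be noted that the preset map/reduce model adds the data table identifier to all data records in parallel, that is, at the same time. Adding the data table identifier to n data records therefore takes, in principle, the same time as adding it to one data record, which improves the efficiency of adding the data table identifier to the at least one data record.
In addition, a second map result can be recorded in key-value format. Table 12 shows a format of the second map result provided by this embodiment of the present invention: the data table identifier and the carrier identifier together form the key, and the at least one tag value in the second map result forms the value of that key.
Table 12
Map result          Key                                          Value
Second map result   data table identifier + carrier identifier   list of tag values
Likewise, when a second map result is recorded in key-value format, remark information can be added to the value of each second map result; this remark information can be the remark information of the first map result in step 203 in FIG. 2.
For example, let A be the data table identifier. The first data record in Table 3, "a01->{gender: male, education: bachelor's degree}", has the carrier identifier a01 and two tag values, "gender: male" and "education: bachelor's degree". The preset map/reduce model performs second-type map processing on this data record and obtains a second map result that includes the data table identifier A, the carrier identifier a01 and the two tag values. Recording this second map result in the format of Table 12 gives the second map result shown in Table 13, that is, data whose key is Aa01 and whose value is {gender: male, education: bachelor's degree}.
Table 13
Aa01    {gender: male, education: bachelor's degree}
For the second map result shown in Table 13, if the system has configured internal ID 1 for the tag value "gender: male", internal ID 2 for "gender: female", internal ID 3 for "education: bachelor's degree" and internal ID 4 for "education: junior college", the value of the key Aa01 in Table 13 can also be recorded as {1,3}.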
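The second-type map processing described in step 303 can be sketched as below. The function name `second_type_map` and the English tag strings are assumptions for illustration; the data table identifier "A" and the internal IDs come from the example.

```python
TABLE_ID = "A"   # data table identifier from the example

# Internal IDs configured for tag values, as in the example.
TAG_IDS = {
    "gender: male": 1,
    "gender: female": 2,
    "education: bachelor's degree": 3,
    "education: junior college": 4,
}

def second_type_map(record, table_id=TABLE_ID, tag_ids=None):
    """Prefix the carrier identifier with the table identifier to form the key."""
    carrier_id, tag_values = record
    if tag_ids is None:
        value = list(tag_values)                  # keep the tag values themselves
    else:
        value = [tag_ids[t] for t in tag_values]  # or record their internal IDs
    return table_id + carrier_id, value

record = ("a01", ["gender: male", "education: bachelor's degree"])
plain = second_type_map(record)
compact = second_type_map(record, tag_ids=TAG_IDS)
```

Because each record is mapped independently of every other record, the calls can be distributed across workers without coordination.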
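After the preset map/reduce model has performed the second-type map processing on the at least one data record, at least one second map result is obtained; that is, each data record yields a second map result in the format of Table 12. The at least one second map result then needs to undergo second-type classification through step 304 below.
Step 304: Classify the at least one second map result according to the partition information of the M second reduce partitions to obtain at least one second map set, where each second map set corresponds to one second reduce partition.
In the preset map/reduce model, different reduce partitions can process the data that falls in their own partition intervals in parallel. Each of the at least one second map result therefore needs to be assigned to its corresponding second reduce partition.
For each of the at least one second map result, the partition interval to which the carrier identifier in that second map result belongs is looked up among the partition intervals of the M second reduce partitions, based on the carrier identifier and the data table identifier in that second map result. This classifies the at least one second map result. After classification, at least one second map set is obtained, and each second map set includes at least one second map result.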
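The interval lookup in step 304 can be sketched with a binary search over the upper bounds of the partition intervals. This is an illustrative sketch: the names `BOUNDARIES`, `partition_of` and `classify` are hypothetical, only the first three boundaries are listed, and plain string comparison of keys is an assumption about how intervals are ordered.

```python
from bisect import bisect_right
from collections import defaultdict

# Upper bounds of [, Aa1), [Aa1, Aa2), [Aa2, Aa3); a shortened, hypothetical list.
BOUNDARIES = ["Aa1", "Aa2", "Aa3"]

def partition_of(key, boundaries=BOUNDARIES):
    """Return the index of the half-open interval that contains the key."""
    idx = bisect_right(boundaries, key)
    if idx >= len(boundaries):
        raise ValueError(f"key {key!r} is beyond the last partition interval")
    return idx

def classify(map_results):
    """Group (key, value) map results into one map set per reduce partition."""
    map_sets = defaultdict(list)
    for key, value in map_results:
        map_sets[partition_of(key)].append((key, value))
    return dict(map_sets)

map_sets = classify([("Aa01", [1, 3]), ("Aa15", [2]), ("Aa02", [2, 4])])
```

Since the key already carries the table identifier, a key built for the data table can never fall into an interval that belongs to another table or index, which is the reason the identifier is prepended in the first place.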
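Step 305: Perform second-type reduce processing on the at least one second map set in parallel through the second reduce partitions respectively corresponding to the at least one second map set, to obtain the data of each data partition.
Because the second reduce partitions process their respective second map sets in parallel, the following explains the second-type reduce processing for a single second map set. As with step 205 in FIG. 2, the second-type reduce processing consists of two stages:
(1) For each second map set, determine the second reduce partition corresponding to that second map set, and have that second reduce partition sort the second map results in the set by the carrier identifier in each second map result.
For how the second reduce partition sorts the second map results in the second map set, refer to how the first reduce partition sorts the first map results in the first map set in step 205 in FIG. 2; this is not described in detail again here.
(2) For each sorted second map result, in sorted order, generate at least one record for that second map result, where each record includes one carrier identifier and one tag value, to obtain the data of the data partition corresponding to the second map set.
After the second map results in the second map set have been sorted, the second reduce partition of the preset map/reduce model that corresponds to the second map set processes each second map result in turn, in sorted order.
When the second map result is in the key-value format shown in Table 12 in step 303, generating the at least one record for it means removing the data table identifier from the key of the second map result to obtain data whose key is the carrier identifier and whose value is the at least one tag value, and then converting that data into at least one record, each of which includes the carrier identifier and one tag value.
The at least one record can also be output in key-value format; that is, for each record, the carrier identifier is the key and the single tag value is the value.
For example, for the second map result in Table 13, step 305 yields the two records shown in Table 14. The first record has the key a01 and the value {gender: male}; the second record has the key a01 and the value {education: bachelor's degree}.
Table 14
a01    gender: male
a01    education: bachelor's degree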
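The two stages of the second-type reduce processing can be sketched as follows. The function name `second_type_reduce` is hypothetical, and stripping the table identifier by its length assumes a fixed, known prefix.

```python
def second_type_reduce(map_set, table_id="A"):
    """Sort one second map set by key, then emit one record per tag value."""
    rows = []
    for key, tag_values in sorted(map_set, key=lambda kv: kv[0]):
        carrier_id = key[len(table_id):]   # remove the data table identifier
        for tag in tag_values:
            rows.append((carrier_id, tag))
    return rows

rows = second_type_reduce(
    [("Aa01", ["gender: male", "education: bachelor's degree"])]
)
for carrier_id, tag in rows:
    print(carrier_id, tag)
```

For the second map result of Table 13 this produces the two records of Table 14, each pairing the carrier identifier a01 with a single tag value.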
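Step 306: Store the obtained data of each data partition into the corresponding data partition.
For each second map set, once the data of the data partitions has been obtained through step 305, it can be stored directly into the corresponding data partition. Because different second map sets go through step 305 in parallel, the preset map/reduce model in this embodiment of the present invention stores data records that belong to different data partitions into their data partitions in parallel, which improves the efficiency of storing data.
In this embodiment of the present invention, when at least one data record is obtained, the preset map/reduce model can determine, from the at least one data record, the data of each data partition of the data table, thereby storing the at least one data record into the corresponding data partitions so that the tags of a given carrier can later be queried from the data table. In addition, the preset map/reduce model can determine the data of the different data partitions in parallel, which improves the efficiency of storing data.
It should be noted that, because the different reduce partitions in the preset map/reduce model work in parallel, the N first reduce partitions and the M second reduce partitions of the preset map/reduce model make it possible to build the data table and the bitmap index from the at least one data record at the same time. The following embodiment describes this in detail.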
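Referring to FIG. 4, an embodiment of the present invention provides a data storage method for the scenario in which the at least one data record is stored into the data table and the bitmap index at the same time. As shown in FIG. 4, the method includes the following steps:
Step 401: Obtain at least one data record, where each data record includes one carrier identifier and at least one tag value.
Step 401 is implemented in essentially the same way as step 201 in FIG. 2 and is not described in detail again here.
After the at least one data record is obtained, the bitmap index and the data table are built at the same time through steps 402 to 406 below.
Step 402: Determine the N first reduce partitions and the M second reduce partitions in the preset map/reduce model.
For the implementation of step 402, refer to step 202 in FIG. 2 and step 302 in FIG. 3.
That is, to build the bitmap index and the data table at the same time, when the at least one data record is obtained, the N first reduce partitions corresponding one-to-one to the N bitmap index partitions and the M second reduce partitions corresponding one-to-one to the M data partitions can be obtained together.
Step 403: Perform first-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one first map result, where each first map result includes the bitmap index table identifier, a carrier identifier and at least one tag value; at the same time, perform second-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one second map result, where each second map result includes the data table identifier, a carrier identifier and at least one tag value.
For the implementation of step 403, refer to step 203 in FIG. 2 and step 303 in FIG. 3.
That is, in this embodiment of the present invention, the first-type map processing of step 203 in FIG. 2 and the second-type map processing of step 303 in FIG. 3 can run in parallel, so that the first map result and the second map result of each data record are obtained at the same time.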
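As a rough sketch of step 403, a single pass over a record can emit both map results. The function name `map_record` and the bitmap index table identifier "I" are hypothetical; only the data table identifier "A" appears in the example above.

```python
def map_record(record, index_id="I", table_id="A"):
    """Emit the first map result (bitmap index) and second map result (data table)."""
    carrier_id, tag_values = record
    first_result = (index_id + carrier_id, list(tag_values))
    second_result = (table_id + carrier_id, list(tag_values))
    return first_result, second_result

first_result, second_result = map_record(("a01", ["gender: male"]))
```

The two keys differ in their prefix, so the first map result is routed to a first reduce partition and the second map result to a second reduce partition, and the two kinds of reduce partition never compete for the same key.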
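Step 404: Classify the at least one first map result according to the partition information of the N first reduce partitions to obtain at least one first map set, where each first map set corresponds to one first reduce partition; at the same time, classify the at least one second map result according to the partition information of the M second reduce partitions to obtain at least one second map set, where each second map set corresponds to one second reduce partition.
For the implementation of step 404, refer to step 204 in FIG. 2 and step 304 in FIG. 3.
That is, the first-type classification of step 204 in FIG. 2 and the second-type classification of step 304 in FIG. 3 can run in parallel, so that the at least one first map result and the at least one second map result are classified at the same time.
Step 405: Perform first-type reduce processing on the at least one first map set in parallel through the first reduce partitions respectively corresponding to the at least one first map set, to obtain the bitmaps of the tag values in each bitmap index partition; at the same time, perform second-type reduce processing on the at least one second map set in parallel through the second reduce partitions respectively corresponding to the at least one second map set, to obtain the data of each data partition.
For the implementation of step 405, refer to step 205 in FIG. 2 and step 305 in FIG. 3.
That is, in this embodiment of the present invention, whether among the N first reduce partitions, among the M second reduce partitions, or between first and second reduce partitions, every reduce partition processes its own data in parallel with the others. The reduce partitions process data independently of one another, so that the data of each bitmap index partition and the data of each data partition are determined at the same time.
Step 406: Store the obtained bitmaps of the tag values in each bitmap index partition into the corresponding bitmap index partition; at the same time, store the obtained data of each data partition into the corresponding data partition.
For the implementation of step 406, refer to step 206 in FIG. 2 and step 306 in FIG. 3.
Because the different reduce partitions are independent of one another, the at least one data record can be stored into the bitmap index and the data table at the same time.
In this embodiment of the present invention, when at least one data record is obtained, the preset map/reduce model can store the at least one data record into the corresponding bitmap index partitions and data partitions at the same time, so that the carrier identifiers corresponding to a given tag value can later be queried from the bitmap index partitions and the tags of a given carrier can be queried from the data partitions. This improves the efficiency of storing data.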
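In addition to the data storage method described above, an embodiment of the present invention further provides a data storage apparatus. Referring to FIG. 5A, the data storage apparatus 500 includes an obtaining module 501, a first classification module 502, a first reduce module 503 and a first storage module 504.
The obtaining module 501 is configured to perform step 201 in FIG. 2 or step 301 in FIG. 3;
the first classification module 502 is configured to perform, based on the carrier identifier included in each data record and according to the partition information of the N first reduce partitions included in the preset map/reduce model, first-type classification on the at least one data record to obtain at least one first map set, where each first map set corresponds to one first reduce partition;
where the N first reduce partitions are determined according to the partition information of the N bitmap index partitions included in the bitmap index, N is a positive integer, each bitmap index partition corresponds to one first reduce partition, each bitmap index partition includes at least one bitmap, each bitmap corresponds to one tag value, each bitmap includes at least one bitmap bit, and each bitmap bit records whether the carrier corresponding to one carrier identifier has the tag value corresponding to the current bitmap;
the first reduce module 503 is configured to perform step 205 in FIG. 2;
the first storage module 504 is configured to perform step 206 in FIG. 2.
Optionally, the partition information of each first reduce partition consists of the bitmap index table identifier and carrier identifiers within a preset interval range;
referring to FIG. 5B, the first classification module 502 includes a first mapping unit 5021 and a first classification unit 5022:
the first mapping unit 5021 is configured to perform step 203 in FIG. 2;
the first classification unit 5022 is configured to perform step 204 in FIG. 2.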
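Optionally, the first reduce module 503 includes:
a determining unit, configured to determine, for each first map set, the first reduce partition corresponding to the first map set;
a sorting unit, configured to sort, through the first reduce partition, the first map results in the first map set by the carrier identifier in each first map result;
an updating unit, configured to obtain, for each sorted first map result and in sorted order, the bitmap of each of the at least one tag value included in the first map result from the bitmap index partition corresponding to the first reduce partition, and to update the bitmap of the tag value according to the bitmap bit of the carrier identifier.
Optionally, the first reduce module 503 further includes:
a first execution unit, configured to perform, when the first map result further includes the bitmap bit of the carrier identifier, the operation of updating the bitmap of the tag value according to the bitmap bit of the carrier identifier; or
a second execution unit, configured to obtain, when the first map result does not include the bitmap bit of the carrier identifier, the bitmap bit of the carrier identifier, and to perform the operation of updating the bitmap of the tag value according to the bitmap bit of the carrier identifier.
Optionally, the second execution unit is further configured to:
store the correspondence between the bitmap bit of the carrier identifier and the carrier identifier.
Optionally, the apparatus 500 further includes:
a first determining module, configured to determine the partition information of the bitmap index, where the partition information of the bitmap index describes the set of carrier identifiers corresponding to each bitmap index partition in the bitmap index;
a second determining module, configured to determine the N first reduce partitions in the preset map/reduce model according to the partition information of the bitmap index.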
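Optionally, referring to FIG. 5C, the apparatus 500 further includes a second classification module 505, a second reduce module 506 and a second storage module 507:
the second classification module 505 is configured to perform, based on the carrier identifier included in each data record and according to the partition information of the M second reduce partitions included in the preset map/reduce model, second-type classification on the at least one data record to obtain at least one second map set, where each second map set corresponds to one second reduce partition;
where the M second reduce partitions are determined according to the partition information of the M data partitions included in the data table, M is a positive integer, each data partition corresponds to one second reduce partition, and each data partition records the correspondence between carrier identifiers and tag values;
the second reduce module 506 is configured to perform step 305 in FIG. 3;
the second storage module 507 is configured to perform step 306 in FIG. 3.
Optionally, the partition information of each second reduce partition consists of the data table identifier and carrier identifiers within a preset interval range;
referring to FIG. 5D, the second classification module 505 includes a second mapping unit 5051 and a second classification unit 5052:
the second mapping unit 5051 is configured to perform step 303 in FIG. 3;
the second classification unit 5052 is configured to perform step 304 in FIG. 3.
Optionally, N is less than or equal to M, N is greater than or equal to 2, each of the M data partitions belongs to exactly one bitmap index partition, and each of the N bitmap index partitions contains at least one data partition.
In this embodiment of the present invention, when at least one data record is obtained, the preset map/reduce model can determine, from the at least one data record, the bitmaps of the tag values in each bitmap index partition of the bitmap index, thereby storing the at least one data record into the corresponding bitmap index partitions. Because the bitmap index includes at least one bitmap and each bitmap corresponds to one tag value, the carrier identifiers that have a given tag value can be looked up through the bitmap index, which improves the efficiency of tag-value-based queries. In addition, the preset map/reduce model can determine the bitmaps of the tag values in the different bitmap index partitions in parallel, which improves the efficiency of storing data.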
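FIG. 6 is a schematic diagram of another data storage apparatus provided by an embodiment of the present invention. The data storage apparatus 600 may be a computer device, and the computer device may be the server described above. The data storage apparatus 600 includes at least one processor 601, a communication bus 602, a memory 603 and at least one communication interface 604.
The processor 601 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the solution of the present invention.
The communication bus 602 may include a path for transferring information between the above components. The communication interface 604 uses any transceiver-type apparatus to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN) or a wireless local area network (WLAN).
The memory 603 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus, or it may be integrated with the processor.
The memory 603 is configured to store the program code for executing the solution of the present invention, and execution is controlled by the processor 601. The processor 601 is configured to execute the program code stored in the memory 603.
In a specific implementation, as an embodiment, the processor 601 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 6.
In a specific implementation, as an embodiment, the data storage apparatus 600 may include multiple processors, for example the processor 601 and the processor 608 in FIG. 6. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
In a specific implementation, as an embodiment, the data storage apparatus 600 may further include an output device 605 and an input device 606. The output device 605 communicates with the processor 601 and can display information in a variety of ways. For example, the output device 605 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 606 communicates with the processor 601 and can accept user input in a variety of ways. For example, the input device 606 may be a mouse, a keyboard, a touchscreen device or a sensing device.
The data storage apparatus 600 may be a general-purpose computer device or a special-purpose computer device. In a specific implementation, the data storage apparatus 600 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, or a device with a structure similar to that in FIG. 6. This embodiment of the present invention does not limit the type of the data storage apparatus 600.
One or more software modules are stored in the memory of the data storage apparatus. The data storage apparatus can implement the software modules through the processor and the program code in the memory, thereby implementing the data storage method of the foregoing embodiments.
An embodiment of this application further provides a computer storage medium that stores instructions. When a data storage apparatus (which may be a computer device, such as a server) executes the instructions, for example when a processor in the computer device executes them, the data storage apparatus implements the data storage method of the foregoing embodiments.
An embodiment of this application provides a computer program product that includes instructions. When a data storage apparatus (which may be a computer device, such as a server) executes the instructions, the data storage apparatus performs the data storage method of the foregoing method embodiments.
The foregoing are embodiments provided by this application and are not intended to limit this application. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.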

Claims (20)

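  1. A data storage method, characterized in that the method comprises:
    obtaining at least one data record, each data record comprising one carrier identifier and at least one tag value;
    performing, based on the carrier identifier comprised in each data record and according to partition information of N first reduce partitions comprised in a preset map/reduce model, first-type classification on the at least one data record to obtain at least one first map set, each first map set corresponding to one first reduce partition;
    wherein the N first reduce partitions are determined according to partition information of N bitmap index partitions comprised in a bitmap index, N is a positive integer, each bitmap index partition corresponds to one first reduce partition, each bitmap index partition comprises at least one bitmap, each bitmap corresponds to one tag value, each bitmap comprises at least one bitmap bit, and each bitmap bit records whether the carrier corresponding to one carrier identifier has the tag value corresponding to the current bitmap;
    performing first-type reduce processing on the at least one first map set in parallel through the first reduce partitions respectively corresponding to the at least one first map set, to obtain bitmaps of the tag values in each bitmap index partition;
    storing the obtained bitmaps of the tag values in each bitmap index partition into the corresponding bitmap index partition.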
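  2. The method according to claim 1, characterized in that the partition information of each first reduce partition consists of a bitmap index table identifier and carrier identifiers within a preset interval range;
    the performing, according to partition information of N first reduce partitions comprised in a preset map/reduce model, first-type classification on the at least one data record to obtain at least one first map set comprises:
    performing first-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one first map result, each first map result comprising the bitmap index table identifier, a carrier identifier and at least one tag value;
    classifying the at least one first map result according to the partition information of the N first reduce partitions to obtain at least one first map set.
  3. The method according to claim 2, characterized in that the performing first-type reduce processing on the at least one first map set in parallel through the first reduce partitions respectively corresponding to the at least one first map set, to obtain bitmaps of the tag values in each bitmap index partition, comprises:
    for each first map set, determining the first reduce partition corresponding to the first map set;
    sorting, through the first reduce partition, the first map results in the first map set by the carrier identifier in each first map result;
    for each sorted first map result, in sorted order, obtaining the bitmap of each of the at least one tag value comprised in the first map result from the bitmap index partition corresponding to the first reduce partition, and updating the bitmap of the tag value according to the bitmap bit of the carrier identifier.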
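  4. The method according to claim 3, characterized in that, before the updating the bitmap of the tag value according to the bitmap bit of the carrier identifier, the method further comprises:
    when the first map result further comprises the bitmap bit of the carrier identifier, performing the operation of updating the bitmap of the tag value according to the bitmap bit of the carrier identifier; or
    when the first map result does not comprise the bitmap bit of the carrier identifier, obtaining the bitmap bit of the carrier identifier, and performing the operation of updating the bitmap of the tag value according to the bitmap bit of the carrier identifier.
  5. The method according to claim 4, characterized in that, after the obtaining the bitmap bit of the carrier identifier, the method further comprises:
    storing the correspondence between the bitmap bit of the carrier identifier and the carrier identifier.
  6. The method according to any one of claims 1 to 5, characterized in that, before the performing, based on the carrier identifier comprised in each data record and according to partition information of N first reduce partitions comprised in a preset map/reduce model, first-type classification on the at least one data record, the method further comprises:
    determining partition information of the bitmap index, the partition information of the bitmap index describing the set of carrier identifiers corresponding to each bitmap index partition in the bitmap index;
    determining the N first reduce partitions in the preset map/reduce model according to the partition information of the bitmap index.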
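  7. The method according to claim 1, characterized in that, after the obtaining at least one data record, the method further comprises:
    performing, based on the carrier identifier comprised in each data record and according to partition information of M second reduce partitions comprised in the preset map/reduce model, second-type classification on the at least one data record to obtain at least one second map set, each second map set corresponding to one second reduce partition;
    wherein the M second reduce partitions are determined according to partition information of M data partitions comprised in a data table, M is a positive integer, each data partition corresponds to one second reduce partition, and each data partition records the correspondence between carrier identifiers and tag values;
    performing second-type reduce processing on the at least one second map set in parallel through the second reduce partitions respectively corresponding to the at least one second map set, to obtain data of each data partition;
    storing the obtained data of each data partition into the corresponding data partition.
  8. The method according to claim 7, characterized in that the partition information of each second reduce partition consists of a data table identifier and carrier identifiers within a preset interval range;
    the performing, according to partition information of M second reduce partitions comprised in the preset map/reduce model, second-type classification on the at least one data record comprises:
    performing second-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one second map result, each second map result comprising the data table identifier, a carrier identifier and at least one tag value;
    classifying the at least one second map result according to the partition information of the M second reduce partitions to obtain at least one second map set.
  9. The method according to any one of claims 1 to 8, characterized in that N is less than or equal to M, N is greater than or equal to 2, each of the M data partitions belongs to exactly one bitmap index partition, and each of the N bitmap index partitions contains at least one data partition.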
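  10. A data storage apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain at least one data record, each data record comprising one carrier identifier and at least one tag value;
    a first classification module, configured to perform, based on the carrier identifier comprised in each data record and according to partition information of N first reduce partitions comprised in a preset map/reduce model, first-type classification on the at least one data record to obtain at least one first map set, each first map set corresponding to one first reduce partition;
    wherein the N first reduce partitions are determined according to partition information of N bitmap index partitions comprised in a bitmap index, N is a positive integer, each bitmap index partition corresponds to one first reduce partition, each bitmap index partition comprises at least one bitmap, each bitmap corresponds to one tag value, each bitmap comprises at least one bitmap bit, and each bitmap bit records whether the carrier corresponding to one carrier identifier has the tag value corresponding to the current bitmap;
    a first reduce module, configured to perform first-type reduce processing on the at least one first map set in parallel through the first reduce partitions respectively corresponding to the at least one first map set, to obtain bitmaps of the tag values in each bitmap index partition;
    a first storage module, configured to store the obtained bitmaps of the tag values in each bitmap index partition into the corresponding bitmap index partition.
  11. The apparatus according to claim 10, characterized in that the partition information of each first reduce partition consists of a bitmap index table identifier and carrier identifiers within a preset interval range;
    the first classification module comprises:
    a first mapping unit, configured to perform first-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one first map result, each first map result comprising the bitmap index table identifier, a carrier identifier and at least one tag value;
    a first classification unit, configured to classify the at least one first map result according to the partition information of the N first reduce partitions to obtain at least one first map set.
  12. The apparatus according to claim 11, characterized in that the first reduce module comprises:
    a determining unit, configured to determine, for each first map set, the first reduce partition corresponding to the first map set;
    a sorting unit, configured to sort, through the first reduce partition, the first map results in the first map set by the carrier identifier in each first map result;
    an updating unit, configured to obtain, for each sorted first map result and in sorted order, the bitmap of each of the at least one tag value comprised in the first map result from the bitmap index partition corresponding to the first reduce partition, and to update the bitmap of the tag value according to the bitmap bit of the carrier identifier.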
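  13. The apparatus according to claim 12, characterized in that the first reduce module further comprises:
    a first execution unit, configured to perform, when the first map result further comprises the bitmap bit of the carrier identifier, the operation of updating the bitmap of the tag value according to the bitmap bit of the carrier identifier; or
    a second execution unit, configured to obtain, when the first map result does not comprise the bitmap bit of the carrier identifier, the bitmap bit of the carrier identifier, and to perform the operation of updating the bitmap of the tag value according to the bitmap bit of the carrier identifier.
  14. The apparatus according to claim 13, characterized in that the second execution unit is further configured to:
    store the correspondence between the bitmap bit of the carrier identifier and the carrier identifier.
  15. The apparatus according to any one of claims 10 to 14, characterized in that the apparatus further comprises:
    a first determining module, configured to determine partition information of the bitmap index, the partition information of the bitmap index describing the set of carrier identifiers corresponding to each bitmap index partition in the bitmap index;
    a second determining module, configured to determine the N first reduce partitions in the preset map/reduce model according to the partition information of the bitmap index.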
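  16. The apparatus according to claim 10, characterized in that the apparatus further comprises:
    a second classification module, configured to perform, based on the carrier identifier comprised in each data record and according to partition information of M second reduce partitions comprised in the preset map/reduce model, second-type classification on the at least one data record to obtain at least one second map set, each second map set corresponding to one second reduce partition;
    wherein the M second reduce partitions are determined according to partition information of M data partitions comprised in a data table, M is a positive integer, each data partition corresponds to one second reduce partition, and each data partition records the correspondence between carrier identifiers and tag values;
    a second reduce module, configured to perform second-type reduce processing on the at least one second map set in parallel through the second reduce partitions respectively corresponding to the at least one second map set, to obtain data of each data partition;
    a second storage module, configured to store the obtained data of each data partition into the corresponding data partition.
  17. The apparatus according to claim 16, characterized in that the partition information of each second reduce partition consists of a data table identifier and carrier identifiers within a preset interval range;
    the second classification module comprises:
    a second mapping unit, configured to perform second-type map processing on the at least one data record in parallel through the preset map/reduce model to obtain at least one second map result, each second map result comprising the data table identifier, a carrier identifier and at least one tag value;
    a second classification unit, configured to classify the at least one second map result according to the partition information of the M second reduce partitions to obtain at least one second map set.
  18. The apparatus according to any one of claims 10 to 17, characterized in that N is less than or equal to M, N is greater than or equal to 2, each of the M data partitions belongs to exactly one bitmap index partition, and each of the N bitmap index partitions contains at least one data partition.
  19. A data storage apparatus, characterized in that the apparatus comprises a memory and a processor, the memory stores instructions, and the processor, by executing the instructions stored in the memory, causes the data storage apparatus to implement the data storage method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions, and a data storage apparatus, by executing the instructions, implements the data storage method according to any one of claims 1 to 9.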
PCT/CN2018/087377 2017-09-18 2018-05-17 Data storage method, apparatus, and storage medium WO2019052209A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710841916.3 2017-09-18
CN201710841916.3A CN107704527B (zh) 2017-09-18 2017-09-18 Data storage method, apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2019052209A1 true WO2019052209A1 (zh) 2019-03-21

Family

ID=61172880

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/087377 WO2019052209A1 (zh) 2017-09-18 2018-05-17 Data storage method, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN107704527B (zh)
WO (1) WO2019052209A1 (zh)

Also Published As

Publication number Publication date
CN107704527A (zh) 2018-02-16
CN107704527B (zh) 2020-05-08


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18857006

Country of ref document: EP

Kind code of ref document: A1