CN112540972A - Roaring bitmap-based massive user efficient selection method and device - Google Patents
Roaring bitmap-based massive user efficient selection method and device
- Publication number: CN112540972A
- Application number: CN202011482828.7A
- Authority: CN (China)
- Prior art keywords: bitmap, user, data, cube, creating
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/212: Schema design and management with details for data modelling support
- G06F16/2282: Tablespace storage structures; Management thereof
- G06F16/2453: Query optimisation
- G06F16/2462: Approximate or statistical queries
- G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Abstract
The invention discloses a Roaring bitmap-based method and device for efficiently selecting users from a massive user base. The method comprises the following steps: creating a tag library of user behaviors and a tag library of service classifications from user behavior data, creating a data model from the tag libraries, creating a bitmap partition table from the data model to store historical user data, and computing results through a user-defined function; synchronizing data from the data warehouse into the data model, backfilling the bitmap partition table from the data model according to rules, and, where required, applying cube acceleration to the bitmap partition table to generate a cube_bitmap table; and, depending on the dimensions the user selects, either querying the result table directly or querying through a stored procedure, which chooses between the accelerated cube query and the bitmap partition table query. The method and device can accurately and efficiently select the users to be counted from a massive user base.
Description
Technical Field
The invention relates to the field of mobile communications and the Internet of Things, and in particular to a method and a device for efficiently selecting users from a massive user base using Roaring bitmaps (an efficient bitmap computation technique).
Background
With the widespread use of mobile communication devices, the volume of mobile Internet data generated by users grows day by day. Selecting users along different dimensions is usually handled either by pre-aggregation with OLAP-Druid or by distributed in-memory computation with Spark. However, OLAP-Druid loses precision in deduplication scenarios, so its statistical results are inaccurate. Spark pulls the detailed data into memory for computation; although the computation is distributed, the detailed data occupies a large amount of memory, and running count(distinct) over massive data is inefficient. For efficient selection among massive users, conventional technical solutions therefore give a poor user experience.
Disclosure of Invention
To solve the problem that conventional technical solutions give a poor user experience when efficiently selecting among massive users, the invention provides a Roaring bitmap-based method and device for efficiently selecting massive users.
To achieve this purpose, the invention adopts the following technical solution:
In an embodiment of the present invention, a Roaring bitmap-based method for efficiently selecting massive users is provided, which includes:
data model creation: creating a tag library of user behaviors and a tag library of service classifications from user behavior data; creating a data model from the two tag libraries; creating a bitmap partition table from the data model; creating a query-statistics function func() for the bitmap partition table; determining whether results are to be calculated directly and, if so, creating a result table; and determining whether a cube_bitmap table is to be created from the bitmap partition table and, if so, creating the cube_bitmap table and a query-statistics function func() for it;
data backfill: synchronizing data from the data warehouse into the data model and backfilling according to the corresponding function func(); synchronizing the IMSIs into a dictionary table for bitmap calculation; deleting, by partition, the data for the dates to be backfilled; backfilling the bitmap partition table from the data model data; and determining whether cube acceleration is to be performed and, if so, generating the model's cube_bitmap table;
and user-defined multi-dimensional user selection: at query time, determining whether the query is routed directly to the result table; if not, determining in the stored procedure whether the query is routed to the cube; if so, querying with the function func() corresponding to the model's cube_bitmap table; otherwise, calculating from the bitmap partition table; and returning a temporary result table on which the user selection statistics are performed.
Further, the data model consists of specified dimensions, including city, gender, hobby, age, terminal and occupation, and indicators, including active users, dormant users, new users and user Internet usage time; the data of the data model is synchronized from the data warehouse.
Further, the bitmap partition table consists of the dimensions to be counted plus the flag, status_date and Roaring bitmap fields; the daily and historical user bitmaps are calculated by summarizing, deduplicating and aggregating over the statistical dimensions with the query-statistics function func(), and the table is partitioned by date.
Further, the query-statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
Further, the cube_bitmap table is used for cube query acceleration.
Further, the result table is used to store results pre-calculated from the bitmaps.
Further, synchronizing the IMSIs into the dictionary table for bitmap calculation includes:
synchronizing the IMSIs into the dictionary table, generating a unique integer identifier for each user, storing the integers that identify users in a bitmap, and performing bitmap bit operations.
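The dictionary-table step can be illustrated with a minimal sketch. It assumes an in-memory dict stands in for the dictionary table and that integers are assigned sequentially in arrival order; the text does not specify the assignment scheme, and the class name `ImsiDictionary`, the helper `build_bitmap` and the IMSI strings are hypothetical. A plain Python integer is used as an uncompressed bitmap in place of a Roaring bitmap.

```python
class ImsiDictionary:
    """In-memory stand-in for the dictionary table (IMSI -> unique integer)."""

    def __init__(self):
        self._ids = {}

    def sync(self, imsis):
        """Assign the next free integer to every IMSI not seen before."""
        for imsi in imsis:
            if imsi not in self._ids:
                self._ids[imsi] = len(self._ids) + 1
        return self

    def lookup(self, imsi):
        return self._ids[imsi]


def build_bitmap(dictionary, imsis):
    """Set one bit per user; a Python int serves as an uncompressed bitmap."""
    bitmap = 0
    for imsi in imsis:
        bitmap |= 1 << dictionary.lookup(imsi)
    return bitmap


dictionary = ImsiDictionary().sync(["460000000000001", "460000000000002"])
dictionary.sync(["460000000000002", "460000000000003"])  # re-sync keeps old IDs

# The same IMSI appearing twice sets the same bit, so duplicates collapse.
day_bitmap = build_bitmap(
    dictionary, ["460000000000001", "460000000000003", "460000000000001"]
)
user_count = bin(day_bitmap).count("1")
```

Because an IMSI keeps its integer across syncs, bitmaps built on different days remain comparable bit for bit.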
In an embodiment of the present invention, a Roaring bitmap-based device for efficiently selecting massive users is further provided, where the device includes:
a data model creating module, used for creating a tag library of user behaviors and a tag library of service classifications from user behavior data; creating a data model from the two tag libraries; creating a bitmap partition table from the data model; creating a query-statistics function func() for the bitmap partition table; determining whether results are to be calculated directly and, if so, creating a result table; and determining whether a cube_bitmap table is to be created from the bitmap partition table and, if so, creating the cube_bitmap table and a query-statistics function func() for it;
a data backfill module, used for synchronizing data from the data warehouse into the data model and backfilling according to the corresponding function func(); synchronizing the IMSIs into the dictionary table for bitmap calculation; deleting, by partition, the data for the dates to be backfilled; backfilling the bitmap partition table from the data model data; and determining whether cube acceleration is to be performed and, if so, generating the model's cube_bitmap table;
and a user-defined multi-dimensional user selection module, used for determining at query time whether the query is routed directly to the result table; if not, determining in the stored procedure whether the query is routed to the cube; if so, querying with the function func() corresponding to the model's cube_bitmap table; otherwise, querying with the function func() corresponding to the bitmap partition table and calculating from the bitmap partition table; and returning a temporary result table on which the user selection statistics are performed.
Further, the data model consists of specified dimensions, including city, gender, hobby, age, terminal and occupation, and indicators, including active users, dormant users, new users and user Internet usage time; the data of the data model is synchronized from the data warehouse.
Further, the bitmap partition table consists of the dimensions to be counted plus the flag, status_date and Roaring bitmap fields; the daily and historical user bitmaps are calculated by summarizing, deduplicating and aggregating over the statistical dimensions with the query-statistics function func(), and the table is partitioned by date.
Further, the query-statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
Further, the cube_bitmap table is used for cube query acceleration.
Further, the result table is used to store results pre-calculated from the bitmaps.
Further, synchronizing the IMSIs into the dictionary table for bitmap calculation includes:
synchronizing the IMSIs into the dictionary table, generating a unique integer identifier for each user, storing the integers that identify users in a bitmap, and performing bitmap bit operations.
In an embodiment of the present invention, a computer device is further provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the aforementioned Roaring bitmap-based method for efficiently selecting massive users is implemented.
In an embodiment of the present invention, a computer-readable storage medium is further provided, which stores a computer program for executing the aforementioned Roaring bitmap-based method for efficiently selecting massive users.
Advantageous effects:
1. The method is applied to selecting active users and dormant users from a massive user base; users can freely choose the dimensions to be counted for efficient selection, query efficiency is high, and the needs of users in these scenarios are met.
2. Selecting among massive users is in essence a set of bit operations between bitmaps, which consumes little storage and computing resource and runs efficiently; this addresses the precision loss and low efficiency that currently affect deduplication over massive data.
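To illustrate why selection reduces to bit operations, here is a minimal sketch that uses a plain Python integer as an uncompressed bitmap. A real deployment would use a compressed Roaring bitmap; the helper names, the sample user IDs, and the definition of dormant users as "seen historically but not active today" are assumptions made for illustration only.

```python
def to_bitmap(user_ids):
    """Set bit i for every user ID i; repeated IDs set the same bit."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Number of set bits, i.e. the exact distinct user count."""
    return bin(bitmap).count("1")


history = to_bitmap([1, 2, 3, 4, 5])  # every user seen so far
today = to_bitmap([2, 4, 4, 2])       # raw activity log with repeats

active_today = today
dormant = history & ~today            # AND NOT: seen before, not active today

active_count = cardinality(active_today)
dormant_count = cardinality(dormant)
```

The distinct counts come straight from the set bits, so no detailed rows need to be pulled into memory and no approximation is involved.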
Drawings
FIG. 1 is a schematic flow chart of a Roaring bitmap-based method for efficiently selecting massive users according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the data model creation flow according to an embodiment of the invention;
FIG. 3 is a schematic diagram of the data backfill flow according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the user-defined multi-dimensional user selection flow according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the IMSI composition used in user selection according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a Roaring bitmap-based device for efficiently selecting massive users according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments, which should be understood to be presented only to enable those skilled in the art to better understand and implement the present invention, and not to limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to embodiments of the invention, a Roaring bitmap-based method and device for efficiently selecting massive users are provided. They support selection by user-defined dimensions, efficient selection among massive users, low computing-resource usage, low storage usage, and efficient, exact deduplication over massive data.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
FIG. 1 is a schematic flow chart of a Roaring bitmap-based method for efficiently selecting massive users according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
creating the bitmap model: generating the bitmap model and generating the query function func();
encapsulating the bitmap computing logic: calculating the model bitmap and calculating the cube_bitmap;
obtaining the bitmap results of the model and of the cube through the model's bitmap calculation, and serving user-side model queries and accelerated cube queries from them;
and deleting the temporary result table through timed IDE scheduling.
The flow is described in detail below:
1. Data model creation
FIG. 2 is a schematic diagram of the data model creation flow according to an embodiment of the invention. As shown in FIG. 2, a tag library of user behaviors and a tag library of service classifications are created from user behavior data; a data model is created from the two tag libraries (according to the dimensions and measures passed in); a bitmap partition table is created from the data model; a query-statistics function func() (parameters: time, dimension, measure sort, full-link ID) is created for the bitmap partition table; whether results are to be calculated directly is determined and, if so, a result table is created; and whether a cube_bitmap table is to be created from the bitmap partition table is determined and, if so, the cube_bitmap table is created together with a query-statistics function func() (parameters: time, dimension, measure sort, full-link ID) for it;
the data model consists of specified dimensions (city, gender, hobby, age, terminal, occupation and the like) and indicators (active users, dormant users, new users, user Internet usage time and the like), and the data of the data model is synchronized from the data warehouse;
the bitmap partition table of the data model consists mainly of the dimensions to be counted plus the flag, status_date and Roaring bitmap fields; the daily and historical user bitmaps are calculated by summarizing, deduplicating and aggregating over the statistical dimensions with the query-statistics function func(), and the table is partitioned by date;
the query-statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries;
the cube_bitmap table is mainly used for cube query acceleration;
and the result table stores pre-calculated results: its data is calculated from the bitmaps in advance, so no real-time bitmap computation is needed at query time.
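The table layout and the query function described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: Python sets stand in for Roaring bitmaps, a dict keyed by (status_date, flag, city) stands in for the bitmap partition table, and the signature of `func`, the helper `period_over_period` and the sample rows are hypothetical.

```python
# Stand-in for the bitmap partition table, keyed by (status_date, flag, city).
# flag 1 = users active on that day; flag 2 = all users seen up to that day.
PARTITION_TABLE = {
    ("0923", 1, "Nanjing"): {1, 3},
    ("0923", 1, "Shanghai"): {2},
    ("0923", 2, "Nanjing"): {1, 3},
    ("0923", 2, "Shanghai"): {2},
    ("0924", 1, "Nanjing"): {1, 4},
    ("0924", 1, "Shanghai"): set(),
    ("0924", 2, "Nanjing"): {1, 3, 4},
    ("0924", 2, "Shanghai"): {2},
}


def func(table, dates, city=None):
    """Distinct active users over the given dates, optionally for one city."""
    users = set()
    for (status_date, flag, row_city), bitmap in table.items():
        if flag == 1 and status_date in dates and city in (None, row_city):
            users |= bitmap  # OR the daily bitmaps; duplicates collapse
    return len(users)


def period_over_period(table, current_dates, previous_dates, city=None):
    """Relative change in distinct active users between two periods."""
    current = func(table, current_dates, city)
    previous = func(table, previous_dates, city)
    return (current - previous) / previous if previous else None


nanjing_two_days = func(PARTITION_TABLE, {"0923", "0924"}, city="Nanjing")
change = period_over_period(PARTITION_TABLE, {"0924"}, {"0923"})
```

A multi-day query only ORs one small bitmap per day and dimension value, which is why supporting several time periods adds little cost.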
2. Data backfill (backfilling the bitmap-related tables)
FIG. 3 is a schematic diagram of the data backfill flow according to an embodiment of the invention. As shown in FIG. 3, data from the data warehouse is synchronized into the data model and backfilled according to the corresponding function func(); the IMSIs are synchronized into the dictionary table for bitmap calculation; the data for the dates to be backfilled is deleted by partition; the bitmap partition table is backfilled from the data model data; and whether cube acceleration is to be performed is determined and, if so, the model's cube_bitmap table is generated. The specific steps are as follows:
first, synchronizing the data warehouse data into the data model, and then, according to custom rules, mapping it to the function func() that summarizes, deduplicates and aggregates over the specified dimensions, so that the backfill can be run manually or automatically;
synchronizing each day's IMSIs (unique user identifiers) into the dictionary table, which mainly stores each IMSI and its corresponding integer value; the integer value is the user's unique identifier, and subsequent operations are bit operations on these identifiers;
joining the data model with the dictionary table on IMSI to obtain each user's integer ID, and aggregating the IDs with rb_or_agg(rb_build(ARRAY[b.id::INT])) to obtain, per statistical dimension, the current day's user bitmap (flag = 1) stored in the bitmap partition table;
applying rb_or_agg(bitmap) to the current day's bitmap data (flag = 1) and the previous day's historical bitmap data (flag = 2) in the bitmap partition table of the data model, and inserting the merged data into the bitmap partition table as the current day's historical data (flag = 2);
the daily total statistics table has only the flag, status_date and Roaring bitmap fields and mainly records the bitmaps of the current day's users and of historical users; its current-day bitmap data (flag = 1) is obtained by joining the data model with the user table and aggregating with rb_or_agg(rb_build(ARRAY[b.id::INT]));
its historical bitmap data (flag = 2) is obtained by reading, from the table itself, the current day's flag = 1 data and the previous day's flag = 2 data and applying rb_or_agg(bitmap);
the cube acceleration backfill is based on the backfill of the bitmap partition table; its dimensions are a subset of the dimensions in the bitmap partition table, and it is mainly used for query acceleration.
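The backfill steps above can be sketched in a few lines. The sketch makes several assumptions: Python sets stand in for Roaring bitmaps (set union plays the role of rb_or_agg), a dict keyed by (status_date, flag, city) stands in for the bitmap partition table, and the function name `backfill_day`, the IMSI strings and the sample rows are hypothetical.

```python
def backfill_day(table, model_rows, dictionary, status_date, previous_date):
    """Rebuild the flag = 1 and flag = 2 rows of one date partition."""
    # Delete the partition being backfilled so that a rerun is idempotent.
    for key in [k for k in table if k[0] == status_date]:
        del table[key]
    # flag = 1: join on IMSI and collect the integer IDs per dimension value.
    for city, imsi in model_rows:
        table.setdefault((status_date, 1, city), set()).add(dictionary[imsi])
    # flag = 2: OR the previous day's history with the current day's bitmap.
    cities = {k[2] for k in table if k[0] in (status_date, previous_date)}
    for city in cities:
        today = table.setdefault((status_date, 1, city), set())
        history = table.get((previous_date, 2, city), set())
        table[(status_date, 2, city)] = history | today
    return table


dictionary = {"imsi-A": 1, "imsi-B": 2, "imsi-C": 3, "imsi-D": 4}
table = {}
backfill_day(
    table,
    [("Nanjing", "imsi-A"), ("Nanjing", "imsi-C"), ("Shanghai", "imsi-B")],
    dictionary, "0923", "0922",
)
backfill_day(
    table,
    [("Nanjing", "imsi-A"), ("Nanjing", "imsi-D"), ("Nanjing", "imsi-A")],
    dictionary, "0924", "0923",
)
```

Because each day's history depends only on the previous day's history and the current day's bitmap, a single date can be re-run without touching the detailed data of earlier days.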
3. User-defined multi-dimensional user selection
FIG. 4 is a schematic diagram of the user-defined multi-dimensional user selection flow according to an embodiment of the present invention. As shown in FIG. 4, at query time it is determined whether the query is routed directly to the result table; if not, the stored procedure determines whether the query is routed to the cube; if so, the query uses the function func() corresponding to the model's cube_bitmap table; otherwise, the result is computed with the function func() corresponding to the bitmap partition table and returned in a temporary result table. The specific steps are as follows:
at query time, deciding from the selected dimensions whether to route to the result table or to the stored procedure, where the stored procedure in turn decides between the accelerated cube query and a query against the bitmap partition table of the data model;
for either the bitmap partition table of the data model or the cube acceleration, obtaining bitmap_cur and bitmap_sum from the bitmap partition table corresponding to the dimension subset, obtaining the full bitmap_all data (flag = 2, dated the day before the query date) from the daily total statistics table, performing bitmap bit operations (the bitmap algorithm uses contiguous binary bits in memory to deduplicate and query large volumes of integer data), and finally inserting the data into the temporary result table;
and finally, performing the user selection statistics on the returned temporary result table.
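The routing decision can be sketched as a small function. The routing configuration below is hypothetical: the text only states that the route depends on the selected dimensions, so the rule used here (exact match for a precomputed result table, subset match for a cube) is an assumption, as are the names `RESULT_TABLE_DIMS`, `CUBE_DIMS` and `route_query`.

```python
# Hypothetical routing configuration: which dimension sets have a precomputed
# result table, and which dimension sets each cube_bitmap table covers.
RESULT_TABLE_DIMS = {frozenset({"city"})}
CUBE_DIMS = [frozenset({"city", "gender"}), frozenset({"gender", "age"})]


def route_query(selected_dims):
    """Pick the cheapest source that can answer the selected dimensions."""
    dims = frozenset(selected_dims)
    # Precomputed results need no bitmap computation at query time.
    if dims in RESULT_TABLE_DIMS:
        return "result_table"
    # Inside the stored procedure: use a cube if one covers every dimension.
    if any(dims <= cube for cube in CUBE_DIMS):
        return "cube_bitmap"
    # Otherwise fall back to the full bitmap partition table.
    return "bitmap_partition_table"


routes = [
    route_query(["city"]),
    route_query(["gender"]),
    route_query(["city", "age", "terminal"]),
]
```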
4. Cube acceleration
Cube acceleration removes statistical dimensions that are not needed and keeps only the dimensions to be counted, which reduces the cardinality after GROUP BY and improves computing efficiency.
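A minimal sketch of the cube idea, assuming Python sets stand in for Roaring bitmaps and that a cube row is the OR of all partition-table rows sharing the retained dimensions. The function name `build_cube`, the dimension names and the sample rows are hypothetical.

```python
def build_cube(rows, dimension_names, keep):
    """Roll bitmaps up to the retained dimensions by OR-ing matching rows."""
    positions = [dimension_names.index(name) for name in keep]
    cube = {}
    for dims, bitmap in rows.items():
        key = tuple(dims[i] for i in positions)
        cube[key] = cube.get(key, set()) | bitmap
    return cube


# Partition-table rows for one day, keyed by (city, terminal).
rows = {
    ("Nanjing", "phone"): {1, 3},
    ("Nanjing", "tablet"): {1, 4},  # user 1 appears on two terminals
    ("Shanghai", "phone"): {2},
}
cube = build_cube(rows, ["city", "terminal"], keep=["city"])
nanjing_users = len(cube[("Nanjing",)])
```

Three rows collapse into two, and user 1 is still counted once for Nanjing because OR-ing bitmaps deduplicates exactly, which summing per-row counts would not.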
It should be noted that although the operations of the method of the present invention have been described in the above embodiments and the accompanying drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the operations shown must be performed, to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
To explain the above Roaring bitmap-based method for efficiently selecting massive users more clearly, a specific embodiment is described below; it should be noted that the embodiment only serves to better explain the present invention and is not to be construed as an undue limitation on it.
Application examples are as follows:
FIG. 5 is a schematic diagram of the IMSI composition used in user selection according to an embodiment of the present invention. As shown in FIG. 5, the design is as follows:
<1> mapping each user's IMSI to a corresponding unique integer, on which the subsequent bitmap bit operations are based;
<2> Bitmap_Table_A counts the daily active users of each city; its statistical dimensions are only city + flag (1 = current day, 2 = history) + date. Two kinds of statistical records exist for each day: the current day's user bitmap set and the historical user bitmap set. On the second day, the historical set from the first day and the current day's set are combined with rb_or_agg to form the new historical bitmap set for that day;
<3> Bitmap_Table_B counts active users by the statistical dimensions (gender + age group) + flag + date. Two kinds of statistical records exist for each day, one for the current day and one for history: the current day's records are produced by a GROUP BY over all genders and age groups and are marked flag = 1; the historical records (flag = 2) combine the previous day's history with the current day's set using rb_or_agg to form the new historical bitmap set for that day.
<4> Statistics for two scenarios:
First, to count the active users in Nanjing on 0923, Bitmap_Table_A is used, as shown in Table 1 below:
TABLE 1
Date | City | Flag (1 = current day, 2 = history) | Bitmap set |
0923 | Nanjing | 1 | {A,C} |
0923 | Shanghai | 1 | {B} |
0923 | Nanjing | 2 | {A,C} |
0923 | Shanghai | 2 | {B} |
0924 | Nanjing | 1 | {A,D} |
0924 | Shanghai | 1 | {} |
0924 | Nanjing | 2 | {A,C,D} |
0924 | Shanghai | 2 | {B} |
The 0923 active-user statistic is obtained by performing the rb_and_clipping bit operation on the bitmap sets {A, D}, {A, C, D} and {A, C}; the result is {D}, so the number of active users is 1.
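The sets in Table 1 can be checked with ordinary Python sets standing in for Roaring bitmaps. This is only a sketch: the mapping of users A to D onto the integers 1 to 4 is hypothetical, and it assumes that the operation yielding {D} is a set difference (AND NOT), since removing {A, C} from either {A, D} or {A, C, D} leaves {D}; the text does not spell this out.

```python
A, B, C, D = 1, 2, 3, 4  # hypothetical integer IDs from the dictionary table

# Rows of Table 1, keyed by (date, city, flag).
table_a = {
    ("0923", "Nanjing", 1): {A, C},
    ("0923", "Shanghai", 1): {B},
    ("0923", "Nanjing", 2): {A, C},
    ("0923", "Shanghai", 2): {B},
    ("0924", "Nanjing", 1): {A, D},
    ("0924", "Shanghai", 1): set(),
    ("0924", "Nanjing", 2): {A, C, D},
    ("0924", "Shanghai", 2): {B},
}

# History rule: day-2 history = day-1 history OR day-2 current-day set.
rebuilt_history = table_a[("0923", "Nanjing", 2)] | table_a[("0924", "Nanjing", 1)]

# Assumed reading of the worked result: an AND NOT against the 0923 set.
from_current_day = table_a[("0924", "Nanjing", 1)] - table_a[("0923", "Nanjing", 2)]
from_history = table_a[("0924", "Nanjing", 2)] - table_a[("0923", "Nanjing", 2)]
result_count = len(from_current_day)
```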
Second, to count the active male users born in the 1980s on 0826, Bitmap_Table_B is used, as shown in Table 2 below:
TABLE 2
Date | Gender | Age group | Flag | Bitmap set |
0826 | Male | 80 | 1 | {A} |
0826 | Male | 90 | 1 | {C} |
0826 | Male | 00 | 1 | {B} |
0826 | Female | 80 | 2 | {A} |
0826 | Female | 90 | 2 | {C} |
0826 | Female | 00 | 2 | {B} |
0827 | Male | 80 | 1 | {D} |
0827 | Female | 90 | 1 | {A} |
0827 | Male | 80 | 2 | {D} |
0827 | Female | 90 | 2 | {C,A} |
The 0826 statistic for male active users is obtained by performing a bit operation on the bitmap sets {A} and {D}; the result is {A}, so the number of active users is 1.
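The GROUP BY step behind Bitmap_Table_B can be sketched with Python sets standing in for Roaring bitmaps. The detail rows below are hypothetical and cover only the male flag = 1 entries of Table 2, plus one repeated activity record added to show deduplication; reading the final bit operation as a set difference between {A} and {D} is likewise an assumption.

```python
from collections import defaultdict

A, B, C, D = 1, 2, 3, 4  # hypothetical integer IDs from the dictionary table

# Hypothetical detail rows: (date, gender, age group, user ID).
detail_rows = [
    ("0826", "male", "80", A),
    ("0826", "male", "90", C),
    ("0826", "male", "00", B),
    ("0826", "male", "80", A),  # repeated activity, added for illustration
    ("0827", "male", "80", D),
]

# GROUP BY (date, gender, age group): one current-day bitmap per group.
daily = defaultdict(set)
for status_date, gender, age_group, user_id in detail_rows:
    daily[(status_date, gender, age_group)].add(user_id)

male_80_on_0826 = daily[("0826", "male", "80")]
# Assumed operation: AND NOT against the 0827 set, which leaves {A} unchanged.
result = male_80_on_0826 - daily[("0827", "male", "80")]
active_count = len(result)
```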
Based on the same inventive concept, the invention also provides a Roaring bitmap-based device for efficiently selecting massive users. For the implementation of the device, reference may be made to the implementation of the method; repeated details are omitted. The term "module", as used below, may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the embodiments below are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 6 is a schematic structural diagram of a Roaring bitmap-based device for efficiently selecting massive users according to an embodiment of the present invention. As shown in FIG. 6, the device includes:
a data model creating module 101, configured to create a tag library of user behaviors and a tag library of service classifications from user behavior data; create a data model from the two tag libraries; create a bitmap partition table from the data model; create a query-statistics function func() for the bitmap partition table; determine whether results are to be calculated directly and, if so, create a result table; and determine whether a cube_bitmap table is to be created from the bitmap partition table and, if so, create the cube_bitmap table and a query-statistics function func() for it;
the data model consists of specified dimensions, including city, gender, hobby, age, terminal and occupation, and indicators, including active users, dormant users, new users and user Internet usage time, and the data of the data model is synchronized from the data warehouse;
the bitmap partition table consists of the dimensions to be counted plus the flag, status_date and Roaring bitmap fields; the daily and historical user bitmaps are calculated by summarizing, deduplicating and aggregating over the statistical dimensions with the query-statistics function func(), and the table is partitioned by date;
the query-statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries;
the cube_bitmap table is used for cube query acceleration;
the result table is used for storing results pre-calculated from the bitmaps;
a data backfill module 102, configured to synchronize data from the data warehouse into the data model and backfill according to the corresponding function func(); synchronize the IMSIs into the dictionary table for bitmap calculation, that is, synchronize the IMSIs into the dictionary table, generate a unique integer identifier for each user, store the integers that identify users in a bitmap, and perform bitmap bit operations; delete, by partition, the data for the dates to be backfilled; backfill the bitmap partition table from the data model data; and determine whether cube acceleration is to be performed and, if so, generate the model's cube_bitmap table;
and a user-defined multi-dimensional user selection module 103, configured to determine at query time whether the query is routed directly to the result table; if not, determine in the stored procedure whether the query is routed to the cube; if so, query with the function func() corresponding to the model's cube_bitmap table; otherwise, query with the function func() corresponding to the bitmap partition table and calculate from the bitmap partition table; and return a temporary result table on which the user selection statistics are performed.
It should be noted that although several modules of the Roaring bitmap-based massive user efficient selection device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.
Based on the aforementioned inventive concept, as shown in fig. 7, the present invention further provides a computer device 200, which includes a memory 210, a processor 220, and a computer program 230 stored on the memory 210 and executable on the processor 220, wherein the processor 220 implements the aforementioned Roaring bitmap-based massive user efficient selection method when executing the computer program 230.
Based on the foregoing inventive concept, the present invention further provides a computer-readable storage medium storing a computer program for executing the aforementioned Roaring bitmap-based massive user efficient selection method.
The Roaring bitmap-based massive user efficient selection method and device improve the performance of selecting active users and sleeping users by more than 10 times when selecting among 100 million users.
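To give a sense of the data sizes involved at the 100 million user scale, the following back-of-the-envelope calculation may help. It is a sketch under stated assumptions, not a measurement and not the source of the performance figure above: it assumes the dictionary table assigns dense integer ids from 0 to 99,999,999, and it uses the standard Roaring layout in which ids are grouped into chunks of 2^16 values, a chunk holding at most 4096 ids is stored as a sorted array of 16-bit values, and a fuller chunk is stored as a fixed 8 KiB bitmap (run containers are ignored here).

```python
USERS = 100_000_000

# One bit per possible user id versus one 32-bit integer per user id.
plain_bitmap_bytes = USERS // 8
id_list_bytes = USERS * 4

CHUNK_SIZE = 1 << 16   # ids sharing the same high 16 bits form one Roaring chunk
ARRAY_LIMIT = 4096     # chunks with more ids than this use a bitmap container

bitmap_container_bytes = CHUNK_SIZE // 8    # fixed size of a bitmap container
array_container_max_bytes = ARRAY_LIMIT * 2  # 16-bit low halves, at the limit
chunks = -(-USERS // CHUNK_SIZE)             # ceiling division

print(plain_bitmap_bytes, id_list_bytes, chunks)
```

Under these assumptions a bitmap covering every user occupies about 12.5 MB, compared with 400 MB for a list of 32-bit ids, and the id space splits into 1526 chunks, each of which costs at most 8 KiB whichever container type is chosen.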
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects, which is for convenience of description only, imply that features in these aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Those skilled in the art should understand the limits of the protection scope of the present invention: various modifications or changes that those skilled in the art can make without inventive effort on the basis of the technical solution of the present invention still fall within the protection scope of the present invention.
Claims (16)
1. A Roaring bitmap-based massive user efficient selection method, characterized by comprising the following steps:
data model creation: creating a label library of user behaviors and a label library of service classification according to user behavior data, creating a data model according to the label library of user behaviors and the label library of service classification, creating a bitmap partition table according to the data model, creating a query-statistics function func() according to the bitmap partition table, judging whether results are to be calculated directly, and if so, creating a result table; meanwhile, judging whether a cube_bitmap table is to be created according to the bitmap partition table, and if so, creating the cube_bitmap table and creating a query-statistics function func() according to the cube_bitmap table;
data complement: synchronizing data from the data warehouse into the data model, complementing it according to the corresponding function func(), synchronizing the IMSI into a dictionary table for bitmap calculation, deleting, by partition, the data of the dates to be complemented, complementing the bitmap partition table data from the data model data, judging whether cube acceleration is to be performed, and if so, generating a model cube_bitmap table;
user-defined multi-dimensional user selection: judging, at query time, whether to route the query directly to the result table; if not, judging in the stored procedure whether to route the query to the cube; if so, querying with the function func() corresponding to the model cube_bitmap table; otherwise, computing from the bitmap partition table, returning a temporary result table, and performing user selection statistics.
2. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the data model consists of specified dimensions, including city, gender, hobby, age, terminal and occupation, and indicators, including active users, sleeping users, new users and user online duration, and the data of the data model is synchronized from a data warehouse.
3. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the bitmap partition table is composed of the dimensions to be counted + flag + states_date + Roaring bitmap fields; daily and historical user bitmaps are calculated by summarizing along the statistical dimensions, deduplicating, aggregating and applying the query-statistics function func(), and the user bitmaps are finally stored in the bitmap partition table, partitioned by date.
4. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the query-statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
5. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the cube_bitmap table is used for cube query acceleration.
6. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the result table is used for storing the results pre-calculated from the bitmaps.
7. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the synchronizing the IMSI into a dictionary table for bitmap calculation comprises:
synchronizing the IMSI into the dictionary table, generating a unique integer identifier for each user, storing the integer identifying the user in a bitmap, and performing bitmap bit operations.
8. A Roaring bitmap-based massive user efficient selection device, characterized by comprising:
a data model creating module, used for creating a label library of user behaviors and a label library of service classification according to user behavior data, creating a data model according to the label library of user behaviors and the label library of service classification, creating a bitmap partition table according to the data model, creating a query-statistics function func() according to the bitmap partition table, judging whether results are to be calculated directly, and if so, creating a result table; meanwhile, judging whether a cube_bitmap table is to be created according to the bitmap partition table, and if so, creating the cube_bitmap table and creating a query-statistics function func() according to the cube_bitmap table;
a data complementing module, used for synchronizing data from the data warehouse into the data model, complementing it according to the corresponding function func(), synchronizing the IMSI into a dictionary table for bitmap calculation, deleting, by partition, the data of the dates to be complemented, complementing the bitmap partition table data from the data model data, judging whether cube acceleration is to be performed, and if so, generating a model cube_bitmap table;
and a user-defined multi-dimensional user selection module, used for judging, at query time, whether to route the query directly to the result table; if not, judging in the stored procedure whether to route the query to the cube; if so, querying with the function func() corresponding to the model cube_bitmap table; otherwise, querying with the function func() corresponding to the bitmap partition table, computing from the bitmap partition table, returning a temporary result table, and performing user selection statistics.
9. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the data model consists of specified dimensions, including city, gender, hobby, age, terminal and occupation, and indicators, including active users, sleeping users, new users and user online duration, and the data of the data model is synchronized from a data warehouse.
10. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the bitmap partition table is composed of the dimensions to be counted + flag + states_date + Roaring bitmap fields; daily and historical user bitmaps are calculated by summarizing along the statistical dimensions, deduplicating, aggregating and applying the query-statistics function func(), and the user bitmaps are finally stored in the bitmap partition table, partitioned by date.
11. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the query-statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
12. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the cube_bitmap table is used for cube query acceleration.
13. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the result table is used for storing the results pre-calculated from the bitmaps.
14. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the synchronizing the IMSI into a dictionary table for bitmap calculation comprises:
synchronizing the IMSI into the dictionary table, generating a unique integer identifier for each user, storing the integer identifying the user in a bitmap, and performing bitmap bit operations.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-7 when executing the computer program.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011482828.7A CN112540972A (en) | 2020-12-16 | 2020-12-16 | Roaring bitmap-based massive user efficient selection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011482828.7A CN112540972A (en) | 2020-12-16 | 2020-12-16 | Roaring bitmap-based massive user efficient selection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112540972A true CN112540972A (en) | 2021-03-23 |
Family
ID=75018842
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011482828.7A Pending CN112540972A (en) | 2020-12-16 | 2020-12-16 | Roaring bitmap-based massive user efficient selection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112540972A (en) |
2020-12-16 CN CN202011482828.7A patent/CN112540972A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018176623A1 (en) * | 2017-03-28 | 2018-10-04 | 上海跬智信息技术有限公司 | Olap precomputed model, automatic modeling method, and automatic modeling system |
CN111444165A (en) * | 2019-01-16 | 2020-07-24 | 苏宁易购集团股份有限公司 | Member data circling method and system for e-commerce platform |
CN110648185A (en) * | 2019-11-28 | 2020-01-03 | 苏宁云计算有限公司 | Target crowd circling method and device and computer equipment |
CN112000747A (en) * | 2020-07-08 | 2020-11-27 | 苏宁云计算有限公司 | Data multidimensional analysis method, device and system |
Non-Patent Citations (1)
Title |
---|
Ordinary netizen (普通网友): "How does Suning analyze its 600 million members accurately and quickly?" (苏宁6亿会员是如何做到精确快速分析的?), 《HTTPS://BLOG.CSDN.NET/K6T9Q8XKS6IIKZPPIFQ/ARTICLE/DETAILS/107096213》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113434513A (en) * | 2021-07-14 | 2021-09-24 | 上海浦东发展银行股份有限公司 | User tag data storage method, device, system, equipment and storage medium |
CN115934806A (en) * | 2023-02-07 | 2023-04-07 | 云账户技术(天津)有限公司 | Statistical method, device, equipment and medium for data deduplication based on RBM |
CN115934806B (en) * | 2023-02-07 | 2023-05-26 | 云账户技术(天津)有限公司 | Statistical method, device, equipment and medium for data deduplication based on RBM |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110413611B (en) | Data storage and query method and device | |
CN112540972A (en) | Roaring bitmap-based massive user efficient selection method and device | |
CN110738577A (en) | Community discovery method, device, computer equipment and storage medium | |
CN112396462B (en) | Crowd circling method and device based on click house | |
CN108874803A (en) | Date storage method, device and storage medium | |
CN104636349A (en) | Method and equipment for compression and searching of index data | |
CN109299101B (en) | Data retrieval method, device, server and storage medium | |
CN105468699B (en) | Duplicate removal data statistical approach and equipment | |
CN114328981B (en) | Knowledge graph establishing and data acquiring method and device based on mode mapping | |
CN105701128B (en) | A kind of optimization method and device of query statement | |
CN111666344A (en) | Heterogeneous data synchronization method and device | |
CN108304404B (en) | Data frequency estimation method based on improved Sketch structure | |
CN108549688B (en) | Data operation optimization method, device, equipment and storage medium | |
CN108549696B (en) | Time series data similarity query method based on memory calculation | |
CN109739854A (en) | A kind of date storage method and device | |
US20140067751A1 (en) | Compressed set representation for sets as measures in olap cubes | |
CN110019054B (en) | Log duplicate removal method and system, and content distribution network system | |
US8533167B1 (en) | Compressed set representation for sets as measures in OLAP cubes | |
CN114996552A (en) | Data acquisition method and terminal | |
CN105550236B (en) | A kind of distributed data duplicate removal treatment method and device | |
CN107203550B (en) | Data processing method and database server | |
CN114595215A (en) | Data processing method and device, electronic equipment and storage medium | |
EP3488359A1 (en) | Systems and methods for database compression and evaluation | |
CN110309367B (en) | Information classification method, information processing method and device | |
CN112527836A (en) | Big data query method based on T-BOX platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210323 |