CN112540972A - Roaring bitmap-based massive user efficient selection method and device - Google Patents

Roaring bitmap-based massive user efficient selection method and device

Info

Publication number
CN112540972A
CN112540972A CN202011482828.7A CN202011482828A CN112540972A CN 112540972 A CN112540972 A CN 112540972A CN 202011482828 A CN202011482828 A CN 202011482828A CN 112540972 A CN112540972 A CN 112540972A
Authority
CN
China
Prior art keywords
bitmap
user
data
cube
creating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011482828.7A
Other languages
Chinese (zh)
Inventor
毛春阳
闫一帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongying Youchuang Information Technology Co Ltd
Original Assignee
Zhongying Youchuang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongying Youchuang Information Technology Co Ltd filed Critical Zhongying Youchuang Information Technology Co Ltd
Priority to CN202011482828.7A priority Critical patent/CN112540972A/en
Publication of CN112540972A publication Critical patent/CN112540972A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention discloses a Roaring bitmap-based method and device for efficiently selecting among massive numbers of users. The method comprises: creating a tag library of user behaviors and a tag library of service classifications from user behavior data, creating a data model from the tag libraries, creating a bitmap partition table from the data model to store historical user data, and finally computing results through a user-defined function; synchronizing data from the data warehouse into the data model, complementing the bitmap partition table from the data model according to rules, and, where required, applying cube acceleration to the bitmap partition table to generate a cube_bitmap table; and, depending on the dimensions selected by the user, either querying the result table directly or querying through a stored procedure, which chooses between the accelerated cube query and the bitmap partition table query. The method and the device can accurately and efficiently select the users to be counted from a massive user base.

Description

Roaring bitmap-based massive user efficient selection method and device
Technical Field
The invention relates to the field of the mobile communication Internet of Things, and in particular to a method and a device for efficiently selecting massive users based on Roaring bitmap (efficient bitmap calculation).
Background
With the widespread use of mobile communication equipment, the volume of mobile internet data generated by users grows day by day. Selecting users along different dimensions is usually handled either by OLAP pre-aggregation with Druid or by distributed in-memory computation with Spark. However, Druid loses precision in deduplication scenarios, so its statistical results are inaccurate, while Spark pulls the detail data into memory for computation; although Spark is distributed, the detail data occupies a large amount of memory, and running count(distinct) over massive data is slow. For the business scenario of efficiently selecting among massive users, the conventional technical solutions therefore give a poor user experience.
Disclosure of Invention
In order to solve the problem that the user experience of a conventional technical scheme is poor for the efficient selection of the service scene of the mass users, the invention provides a method and a device for efficiently selecting the mass users based on Roaring bitmap.
In order to achieve the purpose, the invention adopts the following technical scheme:
In an embodiment of the present invention, a Roaring bitmap-based method for efficiently selecting massive users is provided, which includes:
data model creation: creating a tag library of user behaviors and a tag library of service classifications from user behavior data; creating a data model from the two tag libraries; creating a bitmap partition table from the data model; creating a query statistics function func() for the bitmap partition table; determining whether results are to be computed directly and, if so, creating a result table; at the same time determining whether a cube_bitmap table is to be created from the bitmap partition table and, if so, creating the cube_bitmap table and a query statistics function func() for it;
data complement: synchronizing data from the data warehouse into the data model and complementing it according to the corresponding function func(); synchronizing the IMSIs into a dictionary table for bitmap calculation; deleting, by partition, the data for the dates to be complemented; complementing the bitmap partition table data from the data model data; and determining whether cube acceleration is to be applied and, if so, generating the model's cube_bitmap table;
user-defined multi-dimensional user selection: at query time, determining whether the query is routed directly to the result table; if not, determining within a stored procedure whether the query is routed to the cube; if so, querying through the function func() corresponding to the model's cube_bitmap table; otherwise computing from the bitmap partition table; and returning a temporary result table on which user selection statistics are performed.
Further, the data model is composed of specified dimensions including city, gender, hobby, age, terminal and occupation, and indexes including active users, sleeping users, new users and user internet surfing time, and data of the data model is synchronized from the data warehouse.
Further, the bitmap partition table consists of the fields: dimensions to be counted + flag + status_date + Roaring bitmap; daily and historical user bitmaps are calculated according to the function func() for statistical dimension summarization, deduplication, aggregation and query statistics, and the table is finally partitioned by date.
Further, the query statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
Further, the cube_bitmap table is used for cube query acceleration.
Further, the result table is used to store the result pre-calculated by bitmap.
Further, synchronizing the IMSI into the dictionary table for bitmap calculation includes:
synchronizing the IMSI into a dictionary table, generating a unique integer identifier for each user, storing the integer for identifying the user into a bitmap, and performing bitmap bit operation.
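The dictionary step can be illustrated with a minimal sketch. It is not the patented implementation: a plain Python integer stands in for a Roaring bitmap, and the class name `ImsiDictionary`, the helper names and the sample IMSIs are all hypothetical. It shows only why mapping each IMSI to a unique integer makes deduplicated counting exact.

```python
# Hypothetical sketch: a Python int is used as an uncompressed bitset
# in place of a real Roaring bitmap.
class ImsiDictionary:
    """Assigns each IMSI a stable integer ID (the role of the dictionary table)."""

    def __init__(self):
        self._ids = {}

    def id_for(self, imsi: str) -> int:
        # Reuse the ID of a known IMSI; otherwise allocate the next integer.
        if imsi not in self._ids:
            self._ids[imsi] = len(self._ids)
        return self._ids[imsi]


def build_bitmap(ids):
    """Set one bit per user ID; repeated IDs collapse onto the same bit."""
    bitmap = 0
    for user_id in ids:
        bitmap |= 1 << user_id
    return bitmap


def cardinality(bitmap: int) -> int:
    """Exact distinct-user count: the number of set bits."""
    return bin(bitmap).count("1")


dictionary = ImsiDictionary()
events = ["460001111111111", "460002222222222", "460001111111111"]  # made-up IMSIs
day_bitmap = build_bitmap(dictionary.id_for(imsi) for imsi in events)
print(cardinality(day_bitmap))  # 2
```

Because duplicates land on the same bit, no separate deduplication pass is needed before counting.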
In an embodiment of the present invention, a Roaring bitmap-based device for efficiently selecting massive users is further provided, where the device includes:
the data model creating module, used for creating a tag library of user behaviors and a tag library of service classifications from user behavior data; creating a data model from the two tag libraries; creating a bitmap partition table from the data model; creating a query statistics function func() for the bitmap partition table; determining whether results are to be computed directly and, if so, creating a result table; and at the same time determining whether a cube_bitmap table is to be created from the bitmap partition table and, if so, creating the cube_bitmap table and a query statistics function func() for it;
the data complementing module, used for synchronizing data from the data warehouse into the data model and complementing it according to the corresponding function func(); synchronizing the IMSIs into the dictionary table for bitmap calculation; deleting, by partition, the data for the dates to be complemented; complementing the bitmap partition table data from the data model data; and determining whether cube acceleration is to be applied and, if so, generating the model's cube_bitmap table;
and the user-defined multi-dimensional user selection module, used for determining at query time whether the query is routed directly to the result table; if not, determining within a stored procedure whether the query is routed to the cube; if so, querying through the function func() corresponding to the model's cube_bitmap table; otherwise querying through the function func() corresponding to the bitmap partition table, computing from the bitmap partition table, and returning a temporary result table on which user selection statistics are performed.
Further, the data model is composed of specified dimensions including city, gender, hobby, age, terminal and occupation, and indexes including active users, sleeping users, new users and user internet surfing time, and data of the data model is synchronized from the data warehouse.
Further, the bitmap partition table consists of the fields: dimensions to be counted + flag + status_date + Roaring bitmap; daily and historical user bitmaps are calculated according to the function func() for statistical dimension summarization, deduplication, aggregation and query statistics, and the table is finally partitioned by date.
Further, the query statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
Further, the cube_bitmap table is used for cube query acceleration.
Further, the result table is used to store the result pre-calculated by bitmap.
Further, synchronizing the IMSI into the dictionary table for bitmap calculation includes:
synchronizing the IMSI into a dictionary table, generating a unique integer identifier for each user, storing the integer for identifying the user into a bitmap, and performing bitmap bit operation.
In an embodiment of the present invention, a computer device is further provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the aforementioned roaring bitmap-based massive user efficient selection method is implemented.
In an embodiment of the present invention, a computer-readable storage medium is further provided, where a computer program for executing the roaring bitmap-based massive user efficient circle selection method is stored in the computer-readable storage medium.
Advantageous effects:
1. The method is applied to select active users and sleeping users from a massive user base; users can freely choose the dimensions to be counted and perform efficient selection, query efficiency is high, and the needs of users in these scenarios are well met.
2. Massive user selection is in essence bit operations between bitmap sets, which consume little storage and computing resource and run efficiently, resolving the current problems of deduplication, precision loss and low efficiency on massive data.
Drawings
Fig. 1 is a schematic flow chart of a method for efficiently selecting massive users based on Roaring bitmap according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a data model creation process according to an embodiment of the invention;
Fig. 3 is a schematic diagram of a data complement process according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a flow of user-defined multi-dimensional user selection according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the IMSI design for user selection according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a device for efficiently selecting massive users based on Roaring bitmap according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments, which should be understood to be presented only to enable those skilled in the art to better understand and implement the present invention, and not to limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a method and a device for efficiently selecting massive users based on Roaring bitmap are provided, which support selection by user-defined dimensions; support efficient selection among massive users; occupy few computing resources; occupy few storage resources; and support efficient and accurate deduplication of massive data.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Fig. 1 is a schematic flow chart of a method for efficiently circling mass users based on roaring bitmap according to an embodiment of the present invention. As shown in fig. 1, the method includes:
creating the bitmap model: generating the bitmap model and generating the query function func();
encapsulating the bitmap computing logic: computing the model bitmap and computing the cube_bitmap;
obtaining the bitmap results of the model and the cube through the model's bitmap calculation, serving user-side model queries and accelerated cube queries, and deleting the temporary result table;
the temporary result table is deleted through timed IDE scheduling.
The following detailed flow description is made:
1. Data model creation
Fig. 2 is a schematic diagram of a data model creation process according to an embodiment of the invention. As shown in Fig. 2, a tag library of user behaviors and a tag library of service classifications are created from user behavior data; a data model is created from the two tag libraries (according to the dimensions and measures passed in); a bitmap partition table is created from the data model; a query statistics function func() (parameters: time, dimension, measure sort, full-link id) is created for the bitmap partition table; it is determined whether results are to be computed directly and, if so, a result table is created; at the same time it is determined whether a cube_bitmap table is to be created from the bitmap partition table and, if so, the cube_bitmap table is created together with a query statistics function func() (parameters: time, dimension, measure sort, full-link id) for it;
the data model consists of specified dimensions (city, gender, hobby, age, terminal, occupation and the like) and indexes (active users, sleeping users, new users, user internet surfing time and the like), and its data is synchronized from the data warehouse;
the bitmap partition table of the data model consists mainly of the fields: dimensions to be counted + flag + status_date + Roaring bitmap; the daily and historical user bitmaps are calculated according to the function func() for statistical dimension summarization, deduplication, aggregation and query statistics, and the table is finally set up as a partition table by date;
the query statistics function func() supports multiple time periods as well as year-over-year and period-over-period comparison queries;
the cube_bitmap table is mainly used for cube query acceleration;
and the result table stores pre-calculated results: its data is computed in advance from the bitmaps, so no real-time bitmap operation is needed at query time.
2. Data complement (bitmap correlation table complement)
Fig. 3 is a schematic diagram of a data complement process according to an embodiment of the invention. As shown in Fig. 3, data from the data warehouse is synchronized into the data model and complemented according to the corresponding function func(); the IMSIs are synchronized into the dictionary table for bitmap calculation; the data for the dates to be complemented is deleted by partition; the bitmap partition table data is complemented from the data model data; and it is determined whether cube acceleration is to be applied and, if so, the model's cube_bitmap table is generated. The specific steps are as follows:
first, data from the data warehouse is synchronized into the data model and then mapped, according to custom rules, to the function func() for summarization by the specified dimensions, deduplication, aggregation and query statistics, so that the complement can be run manually or automatically;
each day's IMSIs (unique user identifiers) are synchronized into the dictionary table, which mainly stores each IMSI and its corresponding integer value; the integer value is the user's unique identifier, and subsequent operations are bit operations on these identifiers;
the data model and the dictionary table are joined on IMSI to obtain the user identifier ID, which is converted with rb_or_agg(rb_build(ARRAY[b.id::INT])) to produce the current day's data (flag = 1) for users in each statistical dimension, stored in the bitmap partition table;
in the bitmap partition table of the data model, rb_or_agg(bitmap) is applied to the current day's bitmap data (flag = 1) and the previous day's historical bitmap data (flag = 2), and the merged data is inserted into the bitmap partition table as the current day's historical data (flag = 2);
the daily full statistics table has only the fields flag + status_date + Roaring bitmap and mainly records the bitmaps of the current day's users and of historical users; the current day's bitmap data (flag = 1) is computed by joining the data model with the user table and taking rb_or_agg(rb_build(ARRAY[b.id::INT]));
the current day's historical bitmap data (flag = 2) in the daily full statistics table is computed by taking, from the table itself, the current day's flag = 1 data and the previous day's flag = 2 data and applying rb_or_agg(bitmap);
the cube acceleration complement builds on the complement of the bitmap partition table; its dimensions are a subset of the dimensions in the bitmap partition table, and it is mainly used for query acceleration.
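The daily complement of the flag = 1 and flag = 2 rows can be sketched as follows. This is an illustration under stated assumptions, not the patented code: Python frozensets stand in for Roaring bitmaps, the Python function `rb_or_agg` only mimics the union aggregate named in the text, and `complement_day`, the table layout and the sample data are hypothetical.

```python
from functools import reduce


def rb_or_agg(bitmaps):
    """Union of an iterable of bitmaps (frozensets stand in for Roaring bitmaps)."""
    return reduce(lambda acc, b: acc | b, bitmaps, frozenset())


def complement_day(table, date, prev_date, todays_rows):
    """Write the flag=1 (day) and flag=2 (history) rows for one date.

    table maps (date, city, flag) -> frozenset of user IDs.
    todays_rows is an iterable of (city, user_id) detail records.
    """
    # Delete the partition being complemented so that reruns are idempotent.
    for key in [k for k in table if k[0] == date]:
        del table[key]
    # flag = 1: one bitmap per dimension value holding the day's users.
    by_city = {}
    for city, user_id in todays_rows:
        by_city.setdefault(city, []).append(frozenset([user_id]))
    for city, bitmaps in by_city.items():
        table[(date, city, 1)] = rb_or_agg(bitmaps)
    # flag = 2: previous day's history OR-ed with the current day's bitmap.
    cities = set(by_city) | {k[1] for k in table if k[0] == prev_date and k[2] == 2}
    for city in cities:
        table[(date, city, 2)] = rb_or_agg([
            table.get((prev_date, city, 2), frozenset()),
            table.get((date, city, 1), frozenset()),
        ])


table = {}
complement_day(table, "0923", "0922",
               [("Nanjing", "A"), ("Nanjing", "C"), ("Shanghai", "B")])
complement_day(table, "0924", "0923", [("Nanjing", "A"), ("Nanjing", "D")])
print(sorted(table[("0924", "Nanjing", 2)]))  # ['A', 'C', 'D']
```

Letters are used as user IDs only for readability; in the scheme described above they would be the integers taken from the dictionary table.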
3. Custom multi-dimensional user selection
Fig. 4 is a schematic diagram of a flow of user-defined multi-dimensional user selection according to an embodiment of the present invention. As shown in Fig. 4, at query time it is determined whether the query is routed directly to the result table; if not, it is determined within a stored procedure whether the query is routed to the cube; if so, the query runs through the function func() corresponding to the model's cube_bitmap table; otherwise the result is computed through the function func() corresponding to the bitmap partition table and returned in a temporary result table. The specific steps are as follows:
at query time, the selected dimensions determine whether the query is routed to the result table or to the stored procedure; within the stored procedure, the query either takes the accelerated cube path or reads from the bitmap partition table of the data model;
the bitmap partition table of the data model, or the accelerated cube, supplies bitmap_cur and bitmap_sum from the bitmap partition table corresponding to the dimension subset; the full bitmap_all data (flag = 2, dated the day before the query date) is fetched from the daily full statistics table; bitmap bit operations are then performed (the bitmap algorithm uses contiguous binary bits in memory to deduplicate and query large volumes of integer data), and the result is finally inserted into the temporary result table;
and finally, user selection statistics are performed on the returned temporary result table.
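The routing decision can be sketched as below. The text states only that routing depends on the selected dimensions, so the concrete rule used here (exact match for the result table, subset test for the cube and for the full model) is an assumption, and `route_query` and its parameters are hypothetical names.

```python
def route_query(selected_dims, result_dims, cube_dims, model_dims):
    """Return which store a query on `selected_dims` should hit.

    All arguments are sets of dimension names. The three *_dims sets are
    hypothetical catalogs describing what each table covers.
    """
    selected = frozenset(selected_dims)
    if selected == frozenset(result_dims):
        return "result_table"      # precomputed, no bitmap work at query time
    if selected <= frozenset(cube_dims):
        return "cube_bitmap"       # stored procedure, accelerated cube path
    if selected <= frozenset(model_dims):
        return "bitmap_partition"  # stored procedure, full bitmap partition table
    unknown = sorted(selected - frozenset(model_dims))
    raise ValueError(f"unsupported dimensions: {unknown}")


catalog = dict(
    result_dims={"city"},
    cube_dims={"city", "gender"},
    model_dims={"city", "gender", "age"},
)
print(route_query({"city"}, **catalog))         # result_table
print(route_query({"gender"}, **catalog))       # cube_bitmap
print(route_query({"city", "age"}, **catalog))  # bitmap_partition
```

Checking the cheapest store first mirrors the order of the decisions described above: result table, then cube, then bitmap partition table.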
4. Cube acceleration
Cube acceleration removes statistical dimensions that are not needed and keeps only the dimension information to be counted, which reduces the cardinality after the GROUP BY and improves computing efficiency.
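A minimal sketch of this roll-up, assuming Python sets in place of Roaring bitmaps; `build_cube`, the dimension names and the sample rows are hypothetical. Rows of the full-dimension table that collapse onto the same kept-dimension key are OR-merged, so the cube has fewer groups while the distinct counts stay exact.

```python
from collections import defaultdict


def build_cube(rows, all_dims, keep_dims):
    """Roll bitmap rows up to a subset of dimensions.

    rows: iterable of (dim_values, user_ids), with dim_values ordered as all_dims.
    Returns {kept_dim_values: frozenset_of_user_ids}.
    """
    keep_idx = [all_dims.index(d) for d in keep_dims]
    cube = defaultdict(set)
    for dim_values, users in rows:
        key = tuple(dim_values[i] for i in keep_idx)
        cube[key] |= users  # OR-merge bitmaps that collapse to the same key
    return {k: frozenset(v) for k, v in cube.items()}


rows = [
    (("Nanjing", "music", "phone"), {1, 2}),
    (("Nanjing", "sports", "phone"), {2, 3}),
    (("Shanghai", "music", "tablet"), {4}),
]
cube = build_cube(rows, ["city", "hobby", "terminal"], ["city"])
print(len(cube), len(cube[("Nanjing",)]))  # 2 3
```

User 2 appears under two hobbies but is counted once for Nanjing, which is the deduplication property the bitmap union provides.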
It should be noted that although the operations of the method of the present invention have been described in the above embodiments and the accompanying drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the operations shown must be performed, to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
For a clearer explanation of the above method for efficiently circling mass users based on the RoaringBitmap, a specific embodiment is described below, but it should be noted that the embodiment is only for better explaining the present invention and is not to be construed as an undue limitation on the present invention.
Examples of applications are as follows:
Fig. 5 is a schematic diagram of the IMSI design for user selection according to an embodiment of the present invention. As shown in Fig. 5, the design is as follows:
<1> the IMSI of each user is mapped to a corresponding unique integer, and subsequent bitmap bit operations are based on these integers;
<2> Bitmap_Table_A counts the active users of each city per day. Its only statistical dimensions are city + flag (1 = day, 2 = history) + date. Each day has two records: the bitmap set of that day's users and the bitmap set of historical users. On the second day, rb_or_agg is applied to the historical set computed on the first day and the second day's own set to form the new historical bitmap set for that day;
<3> Bitmap_Table_B counts active users by the statistical dimensions (gender + age group) + flag + date. Each day again has two kinds of records, day and history: the day records are computed with a GROUP BY over all genders and age groups, and the resulting bitmap sets are labeled flag 1; the history records (flag 2) are formed by applying rb_or_agg to the previous day's history and the current day's sets, giving the new historical bitmap set for the current day.
<4> statistics for two scenarios:
first, active users in Nanjing on 0923 are counted from Bitmap_Table_A, as shown in Table 1 below:
TABLE 1
Date   City       Flag (1 = day, 2 = history)   Bitmap set
0923   Nanjing    1                             {A,C}
0923   Shanghai   1                             {B}
0923   Nanjing    2                             {A,C}
0923   Shanghai   2                             {B}
0924   Nanjing    1                             {A,D}
0924   Shanghai   1                             {}
0924   Nanjing    2                             {A,C,D}
0924   Shanghai   2                             {B}
The 0923 active-user statistic is obtained by applying the rb_and_clipping bit operation to the bitmap sets {A,D}, {A,C,D} and {A,C}; the result is {D}, and the number of active users is 1.
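The text does not spell out how the three sets are combined. One reading that reproduces the stated result {D} is to intersect the first two sets and then remove the members of the third; the sketch below checks that arithmetic with Python sets. The operator sequence and the variable names are assumptions, not taken from the patent.

```python
# Nanjing rows of Table 1; Python sets stand in for Roaring bitmaps.
day_0924 = {"A", "D"}         # 0924, flag 1
hist_0924 = {"A", "C", "D"}   # 0924, flag 2
hist_0923 = {"A", "C"}        # 0923, flag 2

# Assumed sequence: AND the first two sets, then AND-NOT the third.
result = (day_0924 & hist_0924) - hist_0923
print(sorted(result), len(result))  # ['D'] 1
```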
Second, active male users in the post-80s age group on 0826 are counted from Bitmap_Table_B, as shown in Table 2 below:
TABLE 2
Date   Sex      Age group   Flag   Bitmap set
0826   Male     80          1      {A}
0826   Male     90          1      {C}
0826   Male     00          1      {B}
0826   Female   80          2      {A}
0826   Female   90          2      {C}
0826   Female   00          2      {B}
0827   Male     80          1      {D}
0827   Female   90          1      {A}
0827   Male     80          2      {D}
0827   Female   90          2      {C,A}
The 0826 statistic for active male users is obtained by a bit operation on the bitmap set {A} and the bitmap set {D}; the result is {A}, and the number of active users is 1.
Based on the same invention concept, the invention also provides a device for efficiently selecting the massive users based on the Roaring bitmap. The implementation of the device can be referred to the implementation of the method, and repeated details are not repeated. The term "module," as used below, may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 6 is a schematic structural diagram of a device for efficiently selecting massive users based on Roaring bitmap according to an embodiment of the present invention. As shown in Fig. 6, the device includes:
a data model creating module 101, configured to create a tag library of user behaviors and a tag library of service classifications from user behavior data; create a data model from the two tag libraries; create a bitmap partition table from the data model; create a query statistics function func() for the bitmap partition table; determine whether results are to be computed directly and, if so, create a result table; and at the same time determine whether a cube_bitmap table is to be created from the bitmap partition table and, if so, create the cube_bitmap table and a query statistics function func() for it;
the data model consists of specified dimensions including city, gender, hobby, age, terminal and occupation, and indexes including active users, sleeping users, new users and user internet surfing time, and data of the data model is synchronized from the data warehouse;
the bitmap partition table consists of the fields: dimensions to be counted + flag + status_date + Roaring bitmap; daily and historical user bitmaps are calculated according to the function func() for statistical dimension summarization, deduplication, aggregation and query statistics, and the table is finally partitioned by date;
the query statistics function func() supports multiple time periods as well as year-over-year and period-over-period comparison queries;
the cube_bitmap table is used for cube query acceleration;
the result table is used for storing the results pre-calculated from the bitmaps;
a data complementing module 102, configured to synchronize data from the data warehouse into the data model and complement it according to the corresponding function func(); synchronize the IMSIs into the dictionary table for bitmap calculation, that is, synchronize the IMSIs into the dictionary table, generate a unique integer identifier for each user, store the integers identifying the users into a bitmap, and perform bitmap bit operations; delete, by partition, the data for the dates to be complemented; complement the bitmap partition table data from the data model data; and determine whether cube acceleration is to be applied and, if so, generate the model's cube_bitmap table;
and a user-defined multi-dimensional user selection module 103, configured to determine at query time whether the query is routed directly to the result table; if not, determine within a stored procedure whether the query is routed to the cube; if so, query through the function func() corresponding to the model's cube_bitmap table; otherwise query through the function func() corresponding to the bitmap partition table, compute from the bitmap partition table, and return a temporary result table on which user selection statistics are performed.
It should be noted that although several modules of the RoaringBitmap-based device for efficiently selecting massive users are mentioned in the above detailed description, such partitioning is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.
Based on the aforementioned inventive concept, as shown in fig. 7, the present invention further provides a computer device 200, which includes a memory 210, a processor 220, and a computer program 230 stored on the memory 210 and executable on the processor 220, wherein the processor 220 implements the aforementioned Roaring bitmap-based massive user efficient selection method when executing the computer program 230.
Based on the foregoing inventive concept, the present invention further provides a computer-readable storage medium storing a computer program for executing the foregoing Roaring bitmap-based massive user efficient selection method.
The Roaring bitmap-based massive user efficient selection method and device improve the performance of selecting active users and dormant users by more than 10 times when selecting among 100 million users.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects, which is for convenience of presentation only, imply that features in these aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Those skilled in the art should understand the limits of the protection scope of the present invention: various modifications or variations that those skilled in the art can make on the basis of the technical solution of the present invention, without inventive effort, still fall within the protection scope of the present invention.

Claims (16)

1. A Roaring bitmap-based massive user efficient selection method, characterized by comprising the following steps:
data model creation: creating a label library of user behaviors and a label library of service classification according to user behavior data, creating a data model according to the label library of user behaviors and the label library of service classification, creating a bitmap partition table according to the data model, creating a query statistics function func() according to the bitmap partition table, judging whether to directly calculate a result and, if so, creating a result table, and at the same time judging whether to create a cube_bitmap table according to the bitmap partition table and, if so, creating the cube_bitmap table and creating a query statistics function func() according to the cube_bitmap table;
data complementing: synchronizing data from the data warehouse into the data model, complementing (backfilling) it according to the corresponding function func(), synchronizing the IMSI into a dictionary table for bitmap calculation, deleting, by partition, the data for the dates to be complemented, complementing the bitmap partition table data from the data model data, judging whether cube acceleration is to be performed and, if so, generating the model cube_bitmap table;
user-defined multi-dimensional user selection: judging, at query time, whether to route the query directly to the result table; otherwise, within the stored procedure, judging whether to route to the cube; if so, querying with the function func() corresponding to the model cube_bitmap table; otherwise, computing from the bitmap partition table, returning a temporary result table, and performing user selection statistics.
2. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the data model consists of specified dimensions and metrics, the dimensions including city, gender, hobby, age, terminal and occupation, and the metrics including active users, dormant users, new users and user internet usage duration, and the data of the data model is synchronized from a data warehouse.
3. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the bitmap partition table consists of the dimensions to be counted plus flag, states_date and Roaring bitmap fields; daily and historical user bitmaps are calculated by the function func(), which performs dimension summarization, deduplication, aggregation and query statistics, and the user bitmaps are finally stored in the bitmap partition table partitioned by date.
4. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the query statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
5. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the cube_bitmap table is used for cube query acceleration.
6. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein the result table is used for storing results pre-calculated from the bitmaps.
7. The Roaring bitmap-based massive user efficient selection method according to claim 1, wherein synchronizing the IMSI into a dictionary table for bitmap calculation comprises:
synchronizing the IMSI into the dictionary table, generating a unique integer identifier for each user, storing the integer identifying the user in a bitmap, and performing bitmap bit operations.
8. A Roaring bitmap-based massive user efficient selection device, characterized by comprising:
the data model creating module, used for creating a label library of user behaviors and a label library of service classification according to user behavior data, creating a data model according to the label library of user behaviors and the label library of service classification, creating a bitmap partition table according to the data model, creating a query statistics function func() according to the bitmap partition table, judging whether to directly calculate a result and, if so, creating a result table, and at the same time judging whether to create a cube_bitmap table according to the bitmap partition table and, if so, creating the cube_bitmap table and creating a query statistics function func() according to the cube_bitmap table;
the data complementing module, used for synchronizing data from the data warehouse into the data model, complementing (backfilling) it according to the corresponding function func(), synchronizing the IMSI into a dictionary table for bitmap calculation, deleting, by partition, the data for the dates to be complemented, complementing the bitmap partition table data from the data model data, judging whether cube acceleration is to be performed and, if so, generating the model cube_bitmap table;
and the user-defined multi-dimensional user selection module, used for judging, at query time, whether to route the query directly to the result table; otherwise, within the stored procedure, judging whether to route to the cube; if so, querying with the function func() corresponding to the model cube_bitmap table; otherwise, querying with the function func() corresponding to the bitmap partition table, computing from the bitmap partition table, returning a temporary result table, and performing user selection statistics.
9. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the data model consists of specified dimensions and metrics, the dimensions including city, gender, hobby, age, terminal and occupation, and the metrics including active users, dormant users, new users and user internet usage duration, and the data of the data model is synchronized from a data warehouse.
10. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the bitmap partition table consists of the dimensions to be counted plus flag, states_date and Roaring bitmap fields; daily and historical user bitmaps are calculated by the function func(), which performs dimension summarization, deduplication, aggregation and query statistics, and the user bitmaps are finally stored in the bitmap partition table partitioned by date.
11. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the query statistics function func() supports queries over multiple time periods as well as year-over-year and period-over-period comparison queries.
12. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the cube_bitmap table is used for cube query acceleration.
13. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein the result table is used for storing results pre-calculated from the bitmaps.
14. The Roaring bitmap-based massive user efficient selection device according to claim 8, wherein synchronizing the IMSI into a dictionary table for bitmap calculation comprises:
synchronizing the IMSI into the dictionary table, generating a unique integer identifier for each user, storing the integer identifying the user in a bitmap, and performing bitmap bit operations.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-7 when executing the computer program.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011482828.7A CN112540972A (en) 2020-12-16 2020-12-16 Roaring bitmap-based massive user efficient selection method and device

Publications (1)

Publication Number Publication Date
CN112540972A true CN112540972A (en) 2021-03-23

Family

ID=75018842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011482828.7A Pending CN112540972A (en) 2020-12-16 2020-12-16 Roaring bitmap-based massive user efficient selection method and device

Country Status (1)

Country Link
CN (1) CN112540972A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176623A1 (en) * 2017-03-28 2018-10-04 上海跬智信息技术有限公司 Olap precomputed model, automatic modeling method, and automatic modeling system
CN110648185A (en) * 2019-11-28 2020-01-03 苏宁云计算有限公司 Target crowd circling method and device and computer equipment
CN111444165A (en) * 2019-01-16 2020-07-24 苏宁易购集团股份有限公司 Member data circling method and system for e-commerce platform
CN112000747A (en) * 2020-07-08 2020-11-27 苏宁云计算有限公司 Data multidimensional analysis method, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
普通网友: "How does Suning analyze its 600 million members precisely and quickly?" (苏宁6亿会员是如何做到精确快速分析的?), 《HTTPS://BLOG.CSDN.NET/K6T9Q8XKS6IIKZPPIFQ/ARTICLE/DETAILS/107096213》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434513A (en) * 2021-07-14 2021-09-24 上海浦东发展银行股份有限公司 User tag data storage method, device, system, equipment and storage medium
CN115934806A (en) * 2023-02-07 2023-04-07 云账户技术(天津)有限公司 Statistical method, device, equipment and medium for data deduplication based on RBM
CN115934806B (en) * 2023-02-07 2023-05-26 云账户技术(天津)有限公司 Statistical method, device, equipment and medium for data deduplication based on RBM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210323)