CN114595215A - Data processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114595215A (application number CN202210229221.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- bitmap array
- bitmap
- dimension
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/2438—Embedded query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/244—Grouping and aggregation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Fuzzy Systems (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data processing method, a data processing device, electronic equipment and a storage medium. The data processing method comprises the following steps: firstly, grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension; secondly, performing aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions; then, acquiring a base number value of a bitmap array corresponding to each dimension data set; and finally, determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array. In the steps of the method, the data to be processed is converted into the Bitmap array (Bitmap), and the fast duplicate removal processing operation can be realized on the data under different dimensions through the characteristic of small storage space of the Bitmap array, so that the efficiency of data statistical query is improved.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of information technology, the volume of stored data keeps growing, and data management and analysis are required in a wide range of application scenarios. During data management and analysis, deduplicated statistics on data objects are usually required, for example to obtain metrics such as uv (the number of unique visitors).
With the increase of data volume, the calculation amount in query statistics will be larger and larger, the requirements on computer resources such as CPU, memory, and network IO will be higher and higher, and the processing speed will be slower and slower.
Disclosure of Invention
In order to solve the problems and disadvantages of the prior art, an object of the present invention is to provide a data processing method, an apparatus, an electronic device, and a storage medium, which can implement fast deduplication processing of data.
To achieve the above object, the present invention provides a data processing method, including:
grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension;
respectively carrying out aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions;
acquiring a base number value of a bitmap array corresponding to each dimension data set;
and determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array.
Optionally, the step of obtaining a base value of the bitmap array corresponding to each dimension data set includes:
obtaining a compressed bitmap array according to the bitmap array, wherein the storage space occupied by the compressed bitmap array is smaller than that occupied by the bitmap array;
and obtaining a corresponding bitmap array base value according to the compressed bitmap array.
Optionally, the data set of each dimension comprises a plurality of data groups; the step of obtaining the corresponding bitmap array base value according to the compressed bitmap array corresponding to each dimension data set comprises:
obtaining a base value of each data group according to the compressed bitmap array of each data group;
and determining the bitmap array base value corresponding to each dimension data set according to the base value of each data group.
Optionally, the data dimension includes at least a first grouping field and a second grouping field; the step of grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension comprises:
acquiring the grouping field of each piece of data to be processed;
dividing the data to be processed into a plurality of data sets according to the grouping fields of the data to be processed, the plurality of data sets comprising a first data set corresponding to the first grouping field and a second data set corresponding to the second grouping field.
Optionally, the first grouping field and the second grouping field each contain a plurality of data groups; the step of determining the data size of each dimension data set after deduplication according to the base number value of the bitmap array comprises:
determining the data size of each data group in the first grouping field after deduplication according to the base number value of the bitmap array corresponding to the data in the first grouping field;
and determining the data size of each data group in the second grouping field after deduplication according to the base number value of the bitmap array corresponding to the data in the second grouping field.
The invention also provides a data processing device, comprising:
the data grouping module is used for grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension;
the aggregation duplicate removal module is used for respectively carrying out aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions;
the cardinal number obtaining module is used for obtaining cardinal number values of the bitmap arrays corresponding to the dimension data sets;
and the data determination module is used for determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array.
Optionally, the radix obtaining module includes:
the compressed bitmap acquisition module is used for acquiring a compressed bitmap array according to the bitmap array, and the storage space occupied by the compressed bitmap array is smaller than that occupied by the bitmap array;
and the bitmap base number acquisition module is used for acquiring a corresponding bitmap array base number value according to the compressed bitmap array.
Optionally, the data set of each dimension includes a plurality of data groups, and the bitmap cardinality obtaining module includes:
the base value acquisition module is used for acquiring the base value of each data group according to the compressed bitmap array of each data group;
and the base value determining module is used for determining the bitmap array base values corresponding to the dimension data sets according to the base value of each data group.
The invention also provides an electronic device comprising a storage medium and a processor, wherein the storage medium stores a computer program, and the processor implements the steps of any one of the data processing methods when executing the computer program.
The invention also relates to a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the data processing method of any one of the above.
Compared with the prior art, the invention has the beneficial effects that: firstly, grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension; secondly, performing aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions; then, acquiring a base number value of a bitmap array corresponding to each dimension data set; and finally, determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array. In the steps of the method, the data to be processed is converted into the Bitmap array (Bitmap), and the fast duplicate removal processing operation can be realized on the data under different dimensions through the characteristic of small storage space of the Bitmap array, so that the efficiency of data statistical query is improved.
Drawings
In order to illustrate the embodiments or the technical solutions in the prior art more clearly, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the invention, and it is obvious for a person skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a first flowchart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a data processing method according to an embodiment of the present invention;
FIG. 3 is a third flowchart of a data processing method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of a radix acquisition module according to an embodiment of the present invention;
FIG. 6 is a block diagram of a bitmap cardinality acquisition module according to an embodiment of the invention;
FIG. 7 is an architecture diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Furthermore, in the embodiments of the present invention, the terms "first", "second", and the like are used only to distinguish between descriptions, and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In embodiments of the present invention, the terms "for example," "example," and "such as" are used to mean "serving as an example, instance, or illustration." Any embodiment described herein with "for example," "example," or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the invention. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and processes are not shown in detail to avoid obscuring the description of the invention with unnecessary detail. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
An embodiment of the present invention provides a data processing method which, as shown in fig. 1, includes step 100, step 200, step 300, and step 400, specifically as follows:
Step 100, grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension.
In one embodiment, step 100 may specifically include the following steps:
First, the grouping field of each piece of data to be processed is acquired.
Then, according to the grouping fields of the data to be processed, the data to be processed is divided into a plurality of data sets, including a first data set corresponding to the first grouping field and a second data set corresponding to the second grouping field. For example, in a user statistical query where the first grouping field is country and the second grouping field is age, the first data set includes the names or codes of the countries together with the IDs or names of the users, and the second data set includes the age segments together with the IDs or names of the users.
The dimension levels of the first grouping field and the second grouping field may be the same or different.
For example, where the dimension levels differ, the first grouping field is country, which includes data groups such as China, USA, UK, and Japan, and the second grouping field is province/city/county, which includes data groups such as Beijing, Wuhan, Hunan Province, Hokkaido, Shan County, Fudao County, and Arizona.
For example, where the dimension levels are the same, the first grouping field is age, which includes data groups for age segments such as 18, 25, 30, and 40 years; the second grouping field is education level, which includes data groups such as elementary school, junior high school, and university.
In one embodiment, the data dimension may further include a third grouping field, a fourth grouping field, or more grouping fields, which are not listed exhaustively here.
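The grouping in step 100 can be illustrated with a minimal Python sketch. The record layout (user_id, country, age band), the sample values, and the helper name group_by_field are assumptions made for illustration only and do not come from the embodiment.

```python
from collections import defaultdict

# Hypothetical records: (user_id, country, age_band). Values are illustrative.
records = [
    (101, "China", "18-25"),
    (102, "Japan", "18-25"),
    (103, "China", "26-30"),
    (101, "China", "18-25"),  # repeated row for user 101
]


def group_by_field(rows, field_index):
    """Split rows into data groups keyed by the value of one grouping field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field_index]].append(row[0])  # keep only the user_id
    return dict(groups)


by_country = group_by_field(records, 1)  # data set for the first grouping field
by_age = group_by_field(records, 2)      # data set for the second grouping field
```

At this stage the groups still contain duplicates (user 101 appears twice under "China"); removing them is the job of the later bitmap steps.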
Step 200, performing aggregation deduplication processing on the data sets of each dimension through an aggregation function to obtain a bitmap array corresponding to each dimension data set. For example, the first dimension data set is the data set corresponding to the first grouping field, and the second dimension data set is the data set corresponding to the second grouping field.
The aggregation function comprises an rbmcreate32 function and an rbmcardinality32 function, which are aggregation and deduplication functions implemented as extensions in the Impala execution engine. Impala is an MPP (massively parallel processing) SQL (Structured Query Language) query engine for processing large amounts of data stored in a Hadoop (distributed system infrastructure) cluster, and is open source software written in C++ and Java. SQL is a database query and programming language for accessing data and for querying, updating, and managing relational database systems.
The basic idea of a Bitmap array is to use one bit to mark the Value corresponding to an element, with the element itself serving as the Key (the index of the bit). Because data is stored in units of bits, storage space is greatly reduced.
For example, a data set is [2, 3, 5, 8], and its corresponding Bitmap array is [001101001]. In the data set, 2 corresponds to the position with index 2 in the Bitmap array, 3 to the position with index 3, 5 to the position with index 5, and 8 to the position with index 8. The Bitmap array contains only 0 and 1, where 0 represents that there is no data at the index position and 1 represents that there is data at the index position. In the Bitmap array [001101001], counting from left to right, the 0 in the first position represents that data 0 does not exist, the 0 in the second position represents that data 1 does not exist, the 1 in the third position represents that data 2 exists, the 1 in the fourth position represents that data 3 exists, and so on.
Step 300, acquiring a base number value of the bitmap array corresponding to each dimension data set. The base number value (cardinality) of a bitmap array is the number of 1 bits in the bitmap array. For example, in the bitmap array [001101001], the number of 1 bits is 4, and thus the base number value of the bitmap array is 4.
Because identical data map to the same index in the Bitmap array, and the corresponding index position only ever holds a single 1, the original data set is also deduplicated in the process of constructing the Bitmap array. For example, the data set [2, 3, 5, 5, 8, 2] is converted to the bitmap array [001101001].
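The construction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the rbmcreate32 implementation; the function names build_bitmap and cardinality are hypothetical, and the sketch assumes a non-empty list of non-negative integers.

```python
def build_bitmap(values):
    """Return a bit string in which position i is '1' if i occurs in values."""
    bits = [0] * (max(values) + 1)
    for v in values:
        bits[v] = 1  # setting an already-set bit changes nothing, so duplicates vanish
    return "".join(str(b) for b in bits)


def cardinality(bitmap):
    """Base number value of the bitmap: the number of '1' bits."""
    return bitmap.count("1")


bitmap = build_bitmap([2, 3, 5, 5, 8, 2])
print(bitmap)               # 001101001
print(cardinality(bitmap))  # 4
```

The input contains six values, but the cardinality is 4, which is the deduplicated count.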
In one embodiment, as shown in fig. 2, step 300 specifically includes step 310 and step 320, where:
Step 310, obtaining a compressed bitmap array according to the bitmap array, wherein the storage space occupied by the compressed bitmap array is smaller than that occupied by the bitmap array.
Specifically, a compressed bitmap (Roaring Bitmap) occupies less memory than an ordinary Bitmap. In a Roaring Bitmap, the 32-bit integer range is divided into 2^16 blocks. The first (high) 16 bits of any 32-bit integer determine which block it is placed in, and the last (low) 16 bits are the content stored in that block. For example, 0xFFFF0000 and 0xFFFF0001 both have FFFF as their first 16 bits, so the two numbers are placed in the same block; their last 16 bits are 0 and 1, respectively. Only 0 and 1 need to be saved in this block, not the whole integers. The Bitmap is thus compressed efficiently, and the space occupied by the bitmap array is greatly reduced.
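The block split can be sketched as follows. This is a simplified illustration under stated assumptions: a Python set stands in for each block's container, whereas real Roaring Bitmap implementations choose between several compact container types. The name split_into_blocks and the third sample value are hypothetical.

```python
from collections import defaultdict


def split_into_blocks(values):
    """Map each 32-bit integer to a block chosen by its high 16 bits."""
    blocks = defaultdict(set)
    for v in values:
        high = v >> 16    # first 16 bits: selects the block
        low = v & 0xFFFF  # last 16 bits: the only part stored in the block
        blocks[high].add(low)
    return dict(blocks)


blocks = split_into_blocks([0xFFFF0000, 0xFFFF0001, 0x00010005])
total = sum(len(lows) for lows in blocks.values())  # overall cardinality
```

Here 0xFFFF0000 and 0xFFFF0001 share block 0xFFFF and contribute only the 16-bit values 0 and 1 to it; the overall cardinality is the sum of the per-block counts.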
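Step 320, obtaining a corresponding bitmap array base value according to the compressed bitmap array. Where the data set of each dimension includes a plurality of data groups, as shown in fig. 3, step 320 specifically includes step 321 and step 322, as follows:
Step 321, obtaining a base value of each data group according to the compressed bitmap array of each data group.
Step 322, determining the bitmap array base value corresponding to each dimension data set according to the base value of each data group.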
For example, in the task of counting uv (unique visitor) numbers by nationality and province, the data groups for Japan are Hokkaido, Yagi, Shishou, Fudao, Sutian, and Qingsen, and the corresponding cardinality values are 18, 31, 23, 54, 42, and 37, respectively. The data groups Linaria, Alabama, and Illinois correspond to cardinality values of 8, 9, and 7, respectively.
Step 400, determining the data size of each dimension data set after deduplication according to the base number value of the bitmap array. This specifically includes the following:
First, the data size of each data group in the first grouping field after deduplication is determined according to the base number value of the bitmap array corresponding to the data in the first grouping field. For example, in the task of counting uv numbers by nationality and province, the first grouping field includes Japan and USA, where Japan corresponds to a cardinality value of 205 and USA corresponds to a cardinality value of 24, so the uv of Japan is 205 and the uv of USA is 24.
Then, the data size of each data group in the second grouping field after deduplication is determined according to the base number value of the bitmap array corresponding to the data in the second grouping field. For example, in the second grouping field, if the data groups corresponding to Japan are Hokkaido, Yagi, Shishou, Fudao, Sutian, and Qingsen, with cardinality values of 18, 31, 23, 54, 42, and 37, respectively, then the uv of Hokkaido, Yagi, Shishou, Fudao, Sutian, and Qingsen is 18, 31, 23, 54, 42, and 37, respectively. Among the data groups of the USA, the cardinality values of Linaria, Alabama, and Illinois are 8, 9, and 7, respectively, so the uv of Linaria, Alabama, and Illinois is 8, 9, and 7, respectively.
It should be noted that in some cases a single piece of data may belong to several different data groups of a grouping field. For example, a user may hold dual nationality; when uv numbers are counted by nationality, deduplication must be performed over the data as a whole so that the overlap within the first grouping field is removed.
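The overlap case can be made concrete with a short Python sketch. It uses a Python integer as a stand-in for a bitmap and hypothetical user IDs; merging two bitmaps with bitwise OR before counting removes the overlap, whereas adding the per-group counts would count the dual-nationality user twice.

```python
def to_bitmap(user_ids):
    """Build a bitmap (a Python int used as a bit set) from user IDs."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Number of 1 bits in the bitmap."""
    return bin(bitmap).count("1")


japan = to_bitmap([1, 2, 3])
usa = to_bitmap([3, 4])  # hypothetical user 3 holds dual nationality

naive_total = cardinality(japan) + cardinality(usa)  # 5: user 3 counted twice
dedup_total = cardinality(japan | usa)               # 4: overlap removed by OR
```

When the groups do not overlap, the two totals coincide, which is the situation in the Japan and USA figures above (205 and 24 equal the sums of their province-level values).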
The embodiment of the invention introduces a Bitmap calculation mode into the Impala engine. In order to accurately count the number of user_id values in each grouping field, the user_id details of each grouping field need to be saved, and the minimum unit of information in a computer is the bit.
This embodiment implements an rbmcreate32 function and an rbmcardinality32 function as extensions of the Impala execution engine, where the rbmcreate32 function can construct a corresponding bitmap for any attribute, and the bitmap can be passed upward for the aggregate calculation of the deduplicated value.
As shown in the following table, uv is counted in both the page and title dimensions.
page | title | user_id |
waimai | gaifan | 101 |
waimai | kuaican | 102 |
xiaoxiang | qiezi | 102 |
xiaoxiang | kuaican | 101 |
The first dimension is title and the second dimension is page. The uv metrics of title and page are calculated simultaneously using the rbmcreate32 function and the rbmcardinality32 function through the following SQL:
select page, title,
  rbmcardinality32(uv_bitmap) over (partition by page),
  rbmcardinality32(uv_bitmap) over (partition by title)
from (select page, title, rbmcreate32(distinct user_id) as uv_bitmap from table group by page, title);
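The effect of this query on the sample table can be emulated in plain Python. The sketch below is an assumption-laden illustration: a Python set stands in for the bitmap built by rbmcreate32, and it assumes that rbmcardinality32 over a partition merges the bitmaps in that partition and returns the count of the merged result. The helper name uv_by is hypothetical.

```python
from collections import defaultdict

# Rows of the sample table: (page, title, user_id).
rows = [
    ("waimai", "gaifan", 101),
    ("waimai", "kuaican", 102),
    ("xiaoxiang", "qiezi", 102),
    ("xiaoxiang", "kuaican", 101),
]

# Inner query: one deduplicated user set per (page, title) group.
uv_sets = defaultdict(set)
for page, title, user_id in rows:
    uv_sets[(page, title)].add(user_id)


def uv_by(key_index):
    """Merge the per-group sets along one dimension and count the result."""
    merged = defaultdict(set)
    for key, users in uv_sets.items():
        merged[key[key_index]] |= users  # union plays the role of bitmap OR
    return {k: len(v) for k, v in merged.items()}


uv_by_page = uv_by(0)   # {'waimai': 2, 'xiaoxiang': 2}
uv_by_title = uv_by(1)  # {'gaifan': 1, 'kuaican': 2, 'qiezi': 1}
```

Both dimensions are derived from the same per-group sets, so the detail rows are scanned once and each dimension only merges small intermediate structures.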
firstly, grouping a plurality of data to be processed according to data dimensions to obtain a data set of each dimension; secondly, performing aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions; then, acquiring a base number value of a bitmap array corresponding to each dimension data set; and finally, determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array. In the steps of the method, the data to be processed is converted into the Bitmap array (Bitmap), and by the characteristic of small storage space of the Bitmap array, the data processing operations such as quick duplicate removal, sorting, query and the like can be realized on the data under different dimensions, so that the efficiency of data query statistics is improved.
An embodiment of the present invention provides a data processing apparatus, as shown in fig. 4, including a data grouping module 500, an aggregation deduplication module 600, a radix number obtaining module 700, and a data determining module 800.
The data grouping module 500 is configured to group a plurality of data to be processed according to data dimensions, and obtain a data set of each dimension.
The aggregation deduplication module 600 is configured to perform aggregation deduplication processing on the data sets of the dimensions through an aggregation function, respectively, to obtain a bitmap array corresponding to each dimension data set.
The radix obtaining module 700 is configured to obtain a base value of the bitmap array corresponding to each dimension data set.
As shown in fig. 5, the radix obtaining module 700 specifically includes a compressed bitmap obtaining module 710 and a bitmap radix obtaining module 720. The compressed bitmap obtaining module 710 is configured to obtain a compressed bitmap array according to the bitmap array, where a storage space occupied by the compressed bitmap array is smaller than a storage space occupied by the bitmap array. The bitmap base number obtaining module 720 is configured to obtain a corresponding bitmap array base number value according to the compressed bitmap array.
Further, the data set of each dimension includes a plurality of data groups, and as shown in fig. 6, the bitmap cardinality acquisition module 720 includes a cardinality value acquisition module 721 and a cardinality value determination module 722. The radix value obtaining module 721 is configured to obtain a radix value of each data set according to the compressed bitmap array of each data set. The radix value determining module 722 is configured to determine a bitmap array radix value corresponding to each dimension data set according to the radix value of each data set.
The data determination module 800 is configured to determine the data size of each dimension data set after deduplication according to the base value of the bitmap array.
The data processing device of the embodiment of the invention firstly groups a plurality of data to be processed according to data dimensionality to obtain a data set of each dimensionality; secondly, performing aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions; then, acquiring a base number value of a bitmap array corresponding to each dimension data set; and finally, determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array. In the steps of the method, the data to be processed is converted into the Bitmap array (Bitmap), and by the characteristic of small storage space of the Bitmap array, the data processing operations such as quick duplicate removal, sorting, query and the like can be realized on the data under different dimensions, so that the efficiency of data query statistics is improved.
An electronic device according to an embodiment of the present invention, as shown in fig. 7, includes a storage medium and a processor, where the storage medium stores a computer program, and the processor implements the steps of the data processing method according to any one of the foregoing embodiments when executing the computer program.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by instructions (computer programs), which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the instructions (computer programs). To this end, the storage medium of the electronic device according to the embodiment of the present invention stores a plurality of instructions, and the instructions can be loaded by the processor to execute the steps of any embodiment of the data processing method according to the embodiment of the present invention.
In the electronic device of this embodiment, first, a plurality of pieces of data to be processed are grouped according to data dimensions, and a data set of each dimension is obtained; secondly, performing aggregation duplicate removal processing on the data sets of all dimensions through an aggregation function to obtain bitmap arrays corresponding to the data sets of all dimensions; then, acquiring a base number value of the bitmap array corresponding to each dimension data set; and finally, determining the data size of each dimension data set after the duplication removal according to the base number value of the bitmap array. In the steps of the method, the data to be processed is converted into the Bitmap array (Bitmap), and by the characteristic of small storage space of the Bitmap array, the data processing operations such as quick duplicate removal, sorting, query and the like can be realized on the data under different dimensions, so that the efficiency of data query statistics is improved.
The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the data processing methods provided by the above embodiments.
As shown in fig. 7, the storage medium and the processor are electrically connected, directly or indirectly, to enable transmission or interaction of data. For example, the elements may be electrically connected to each other via one or more communication buses or signal lines, such as via a bus. The storage medium stores computer-executable instructions for implementing the data processing method, and includes at least one software functional module which can be stored in the storage medium in the form of software or firmware, and the processor executes various functional applications and data processing by running the software programs and modules stored in the storage medium. The storage medium may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a programmable read-only memory (PROM), an erasable read-only memory (EPROM), an electrically erasable read-only memory (EEPROM), and the like. The storage medium is used for storing programs, and the processor executes the programs after receiving the execution instructions. Further, the software programs and modules within the storage medium described above may also include an operating system, which may include various software components and/or drivers for managing system tasks (e.g., memory management, storage device control, power management, etc.), and may communicate with various hardware or software components to provide an operating environment for other software components. The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, and may implement or perform the various methods, steps, and logic flow diagrams disclosed in this embodiment. A general-purpose processor may be a microprocessor or any conventional processor or the like.
Since the instructions stored in the storage medium can execute the steps of any data processing method embodiment provided by the embodiments of the present invention, they can achieve the beneficial effects of any such data processing method. These effects are detailed in the foregoing embodiments and are not repeated here.
The above description covers only the preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention falls within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A data processing method, comprising:
grouping a plurality of pieces of data to be processed according to data dimensions to obtain a data set for each dimension;
performing aggregation deduplication on the data set of each dimension through an aggregation function to obtain a bitmap array corresponding to the data set of each dimension;
acquiring a cardinality value of the bitmap array corresponding to the data set of each dimension;
and determining the deduplicated data volume of the data set of each dimension according to the cardinality value of the bitmap array.
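By way of non-limiting illustration, the steps of claim 1 can be sketched as follows. The sketch assumes that each piece of data carries a non-negative integer identifier and represents each bitmap array as a Python integer; the names `build_bitmaps`, `cardinality`, `city`, and `user_id` are hypothetical and do not appear in the claims, and the bitwise OR stands in for whatever aggregation function an actual implementation uses.

```python
from collections import defaultdict


def build_bitmaps(records, dimension_key, id_key):
    """Group records by one dimension and fold each group's IDs into a bitmap.

    Each bitmap is a Python int: bit i is set when ID i occurs in the group,
    so a repeated ID sets the same bit again and is deduplicated for free.
    IDs are assumed to be non-negative integers.
    """
    bitmaps = defaultdict(int)
    for record in records:
        bitmaps[record[dimension_key]] |= 1 << record[id_key]
    return dict(bitmaps)


def cardinality(bitmap):
    """Return the number of set bits, i.e. the number of distinct IDs."""
    return bin(bitmap).count("1")


# Hypothetical input: four rows, one of which repeats (Beijing, user 1).
records = [
    {"city": "Beijing", "user_id": 1},
    {"city": "Beijing", "user_id": 1},
    {"city": "Beijing", "user_id": 3},
    {"city": "Shanghai", "user_id": 2},
]

bitmaps = build_bitmaps(records, "city", "user_id")
dedup_counts = {dim: cardinality(bm) for dim, bm in bitmaps.items()}
print(dedup_counts)  # {'Beijing': 2, 'Shanghai': 1}
```

The deduplicated data volume of each dimension is read directly from the bitmap, so no separate distinct-value scan of the raw rows is needed.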
2. The data processing method according to claim 1, wherein the step of acquiring the cardinality value of the bitmap array corresponding to the data set of each dimension comprises:
obtaining a compressed bitmap array from the bitmap array, wherein the storage space occupied by the compressed bitmap array is smaller than the storage space occupied by the bitmap array;
and obtaining the corresponding bitmap array cardinality value according to the compressed bitmap array corresponding to the data set of each dimension.
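Claim 2 does not fix a compression scheme. As a minimal sketch under that caveat, the example below assumes a chunked sparse layout, loosely modeled on container-based compressed bitmaps, in which the bitmap is cut into fixed-width chunks and only non-empty chunks are stored. The chunk width and the names `compress` and `compressed_cardinality` are assumptions made for illustration.

```python
CHUNK_BITS = 16  # assumed chunk width; the claims do not specify one
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def compress(bitmap):
    """Split an int bitmap into fixed-width chunks, keeping only non-empty ones.

    Returns a dict mapping chunk index -> chunk bits.
    """
    chunks = {}
    index = 0
    while bitmap:
        low = bitmap & CHUNK_MASK
        if low:
            chunks[index] = low
        bitmap >>= CHUNK_BITS
        index += 1
    return chunks


def compressed_cardinality(chunks):
    """Count set bits directly on the compressed form, without decompressing."""
    return sum(bin(chunk).count("1") for chunk in chunks.values())


# A sparse bitmap: IDs 5 and 7 plus one far-away ID.
bitmap = (1 << 5) | (1 << 7) | (1 << 100_000)
chunks = compress(bitmap)

print(bitmap.bit_length())             # bits spanned by the uncompressed bitmap
print(len(chunks) * CHUNK_BITS)        # bits held by the stored chunks
print(compressed_cardinality(chunks))  # 3
```

For sparse identifiers the stored chunks cover far fewer bits than the uncompressed bitmap spans, while the cardinality value is unchanged.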
3. The data processing method according to claim 2, wherein the data set of each dimension comprises a plurality of data groups; the step of obtaining the corresponding bitmap array cardinality value according to the compressed bitmap array corresponding to the data set of each dimension comprises:
obtaining a cardinality value of each data group according to the compressed bitmap array of that data group;
and determining the bitmap array cardinality value corresponding to the data set of each dimension according to the cardinality values of the data groups.
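Claim 3 does not state how the per-group cardinality values are combined. The following sketch shows two possibilities under stated assumptions: summing the per-group values, which is exact only when the data groups hold disjoint identifiers, and merging the group bitmaps before counting, which remains correct when groups overlap. The function names and the `groups_disjoint` flag are hypothetical.

```python
def popcount(bitmap):
    """Return the number of set bits in an int bitmap."""
    return bin(bitmap).count("1")


def dimension_cardinality(group_bitmaps, groups_disjoint):
    """Combine per-group bitmaps into one cardinality value for the dimension.

    If the groups are known to be disjoint, per-group counts can simply be
    added. Otherwise the bitmaps are OR-ed first so that an ID present in
    several groups is counted once.
    """
    if groups_disjoint:
        return sum(popcount(b) for b in group_bitmaps)
    merged = 0
    for b in group_bitmaps:
        merged |= b
    return popcount(merged)


disjoint_groups = [0b0011, 0b1100]     # IDs {0, 1} and {2, 3}
overlapping_groups = [0b0011, 0b0110]  # IDs {0, 1} and {1, 2}

print(dimension_cardinality(disjoint_groups, groups_disjoint=True))      # 4
print(dimension_cardinality(overlapping_groups, groups_disjoint=False))  # 3
```

Summing the overlapping groups would give 4 rather than 3, which is why the disjointness assumption matters when only per-group counts are available.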
4. The data processing method according to claim 1, wherein the data dimensions comprise at least a first grouping field and a second grouping field; the step of grouping a plurality of pieces of data to be processed according to data dimensions to obtain a data set for each dimension comprises:
acquiring the grouping fields of each piece of data to be processed;
and dividing the data to be processed into a plurality of data sets according to the grouping fields of the data to be processed, wherein the plurality of data sets comprise a first data set corresponding to the first grouping field and a second data set corresponding to the second grouping field.
5. The data processing method according to claim 4, wherein the first grouping field and the second grouping field each contain a plurality of data groups; the step of determining the deduplicated data volume of the data set of each dimension according to the cardinality value of the bitmap array comprises:
determining the deduplicated data volume of each data group in the first grouping field according to the cardinality value of the bitmap array corresponding to the data in the first grouping field;
and determining the deduplicated data volume of each data group in the second grouping field according to the cardinality value of the bitmap array corresponding to the data in the second grouping field.
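As a hedged illustration of claims 4 and 5, the sketch below builds bitmaps for two grouping fields in a single pass and then reads the deduplicated data volume of every data group under each field. The field names `region` and `channel`, the identifier `user_id`, and the function name are hypothetical; identifiers are assumed to be non-negative integers.

```python
from collections import defaultdict


def bitmaps_by_grouping_fields(records, grouping_fields, id_key):
    """Build one bitmap per (grouping field, field value) pair in a single pass.

    Returns {field: {value: bitmap}}, where each bitmap is a Python int with
    bit i set when ID i occurs in that data group.
    """
    result = {field: defaultdict(int) for field in grouping_fields}
    for record in records:
        for field in grouping_fields:
            result[field][record[field]] |= 1 << record[id_key]
    return result


# Hypothetical input with a repeated (south, web, 2) row.
records = [
    {"region": "north", "channel": "web", "user_id": 1},
    {"region": "north", "channel": "app", "user_id": 1},
    {"region": "south", "channel": "web", "user_id": 2},
    {"region": "south", "channel": "web", "user_id": 2},
]

bitmaps = bitmaps_by_grouping_fields(records, ["region", "channel"], "user_id")
counts = {
    field: {value: bin(bm).count("1") for value, bm in groups.items()}
    for field, groups in bitmaps.items()
}
print(counts)
```

Each grouping field yields its own set of data groups, and the same record contributes to one group under every field.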
6. A data processing apparatus, comprising:
a data grouping module, configured to group a plurality of pieces of data to be processed according to data dimensions to obtain a data set for each dimension;
an aggregation deduplication module, configured to perform aggregation deduplication on the data set of each dimension through an aggregation function to obtain a bitmap array corresponding to the data set of each dimension;
a cardinality acquisition module, configured to acquire a cardinality value of the bitmap array corresponding to the data set of each dimension;
and a data determining module, configured to determine the deduplicated data volume of the data set of each dimension according to the cardinality value of the bitmap array.
7. The data processing apparatus according to claim 6, wherein the cardinality acquisition module comprises:
a compressed bitmap acquisition module, configured to obtain a compressed bitmap array from the bitmap array, wherein the storage space occupied by the compressed bitmap array is smaller than that occupied by the bitmap array;
and a bitmap cardinality acquisition module, configured to obtain the corresponding bitmap array cardinality value according to the compressed bitmap array.
8. The data processing apparatus according to claim 7, wherein the data set of each dimension comprises a plurality of data groups, and the bitmap cardinality acquisition module comprises:
a cardinality value acquisition module, configured to obtain a cardinality value of each data group according to the compressed bitmap array of that data group;
and a cardinality value determining module, configured to determine the cardinality value of the bitmap array corresponding to the data set of each dimension according to the cardinality values of the data groups.
9. An electronic device comprising a storage medium and a processor, the storage medium storing a computer program, wherein the processor implements the steps of the data processing method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the data processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210229221.0A CN114595215A (en) | 2022-03-10 | 2022-03-10 | Data processing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210229221.0A CN114595215A (en) | 2022-03-10 | 2022-03-10 | Data processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114595215A (en) | 2022-06-07 |
Family
ID=81808549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210229221.0A (Pending) CN114595215A (en) | Data processing method and device, electronic equipment and storage medium | 2022-03-10 | 2022-03-10 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114595215A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115083615A (en) * | 2022-07-20 | 2022-09-20 | Zhejiang Lab (之江实验室) | Method and device for chain type parallel statistics of number of patients in multi-center treatment |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110321344B (en) | Information query method and device for associated data, computer equipment and storage medium | |
CN110413611B (en) | Data storage and query method and device | |
US11036685B2 (en) | System and method for compressing data in a database | |
CN109885614B (en) | Data synchronization method and device | |
CN111061758B (en) | Data storage method, device and storage medium | |
CN108427736B (en) | Method for querying data | |
CN111159184A (en) | Metadata tracing method and device and server | |
CN105740405A (en) | Data storage method and device | |
CN114741368A (en) | Log data statistical method based on artificial intelligence and related equipment | |
CN113918605A (en) | Data query method, device, equipment and computer storage medium | |
CN112765163A (en) | Data index storage method, system and device capable of extending dimensionality at will | |
CN114595215A (en) | Data processing method and device, electronic equipment and storage medium | |
CN113297266B (en) | Data processing method, device, equipment and computer storage medium | |
CN116719822B (en) | Method and system for storing massive structured data | |
CN113297204A (en) | Index generation method and device | |
CN101799803B (en) | Method, module and system for processing information | |
CN113435501B (en) | Clustering-based metric space data partitioning and performance measuring method and related components | |
CN114328486A (en) | Data quality checking method and device based on model | |
CN116701386A (en) | Key value pair retrieval method, device and storage medium | |
CN110046180B (en) | Method and device for locating similar examples and electronic equipment | |
CN108984720B (en) | Data query method and device based on column storage, server and storage medium | |
CN112667859A (en) | Data processing method and device based on memory | |
CN112069164A (en) | Data query method and device, electronic equipment and computer readable storage medium | |
CN117785889B (en) | Index management method for graph database and related equipment | |
CN114238258B (en) | Database data processing method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||