CN114595215A - Data processing method, device, electronic device and storage medium - Google Patents
Data processing method, device, electronic device and storage medium
- Publication number
- CN114595215A CN202210229221.0A
- Authority
- CN
- China
- Prior art keywords
- data
- dimension
- bitmap array
- bitmap
- cardinality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/2438—Embedded query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/244—Grouping and aggregation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Fuzzy Systems (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of data processing, and in particular to a data processing method, apparatus, electronic device, and storage medium.
Background
With the rapid development of information technology, the volume of stored data keeps growing, and the need for data management and analysis arises in a wide range of application scenarios. Data management and analysis often require deduplicated counts of data objects, for example the number of UVs (unique visitors).
As the data volume grows, the computation required by such statistical queries grows with it, the demands on computer resources such as CPU, memory, and network I/O rise, and processing becomes slower.
Summary of the Invention
To address the above problems and deficiencies of the prior art, an object of the present invention is to provide a data processing method, apparatus, electronic device, and storage medium that enable fast data deduplication.
To achieve this object, the present invention first provides a data processing method, comprising:
grouping a plurality of data items to be processed according to data dimensions to obtain a data set for each dimension;
performing aggregation and deduplication on the data set of each dimension by means of an aggregation function to obtain the bitmap array corresponding to the data set of each dimension;
obtaining the cardinality of the bitmap array corresponding to the data set of each dimension; and
determining, from the cardinality of the bitmap array, the deduplicated data volume of the data set of each dimension.
Optionally, the step of obtaining the cardinality of the bitmap array corresponding to the data set of each dimension comprises:
obtaining a compressed bitmap array from the bitmap array, the compressed bitmap array occupying less storage space than the bitmap array; and
obtaining the corresponding bitmap array cardinality from the compressed bitmap array.
Optionally, the data set of each dimension comprises multiple data groups, and the step of obtaining the corresponding bitmap array cardinality from the compressed bitmap array of the data set of each dimension comprises:
obtaining the cardinality of each data group from the compressed bitmap array of that data group; and
determining, from the cardinality of each data group, the bitmap array cardinality corresponding to the data set of each dimension.
Optionally, the data dimensions comprise at least a first grouping field and a second grouping field, and the step of grouping a plurality of data items to be processed according to data dimensions to obtain a data set for each dimension comprises:
obtaining the grouping field of each data item to be processed; and
dividing the data to be processed into multiple data sets according to the grouping fields, the multiple data sets comprising a first data set corresponding to the first grouping field and a second data set corresponding to the second grouping field.
Optionally, the first grouping field and the second grouping field each contain multiple data groups, and the step of determining, from the cardinality of the bitmap array, the deduplicated data volume of the data set of each dimension comprises:
determining the deduplicated data volume of each data group in the first grouping field from the cardinality of the bitmap array corresponding to the data in the first grouping field; and
determining the deduplicated data volume of each data group in the second grouping field from the cardinality of the bitmap array corresponding to the data in the second grouping field.
The present invention also provides a data processing apparatus, comprising:
a data grouping module, configured to group a plurality of data items to be processed according to data dimensions to obtain a data set for each dimension;
an aggregation and deduplication module, configured to perform aggregation and deduplication on the data set of each dimension by means of an aggregation function to obtain the bitmap array corresponding to the data set of each dimension;
a cardinality acquisition module, configured to obtain the cardinality of the bitmap array corresponding to the data set of each dimension; and
a data determination module, configured to determine, from the cardinality of the bitmap array, the deduplicated data volume of the data set of each dimension.
Optionally, the cardinality acquisition module comprises:
a compressed bitmap acquisition module, configured to obtain a compressed bitmap array from the bitmap array, the compressed bitmap array occupying less storage space than the bitmap array; and
a bitmap cardinality acquisition module, configured to obtain the corresponding bitmap array cardinality from the compressed bitmap array.
Optionally, the data set of each dimension comprises multiple data groups, and the bitmap cardinality acquisition module comprises:
a cardinality value acquisition module, configured to obtain the cardinality of each data group from the compressed bitmap array of that data group; and
a cardinality value determination module, configured to determine, from the cardinality of each data group, the bitmap array cardinality corresponding to the data set of each dimension.
The present invention further provides an electronic device comprising a storage medium and a processor, the storage medium storing a computer program, wherein the processor, when executing the computer program, implements the steps of any one of the data processing methods described above.
The present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the data processing methods described above.
Compared with the prior art, the beneficial effects of the present invention are as follows. First, a plurality of data items to be processed are grouped according to data dimensions to obtain a data set for each dimension. Second, an aggregation function performs aggregation and deduplication on the data set of each dimension to obtain the corresponding bitmap array. Then the cardinality of the bitmap array corresponding to the data set of each dimension is obtained. Finally, the deduplicated data volume of the data set of each dimension is determined from that cardinality. In these steps the data to be processed is converted into a bitmap array (Bitmap); because a bitmap array occupies little storage space, data under different dimensions can be deduplicated quickly, which improves the efficiency of statistical queries.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments or of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a first flowchart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a data processing method according to an embodiment of the present invention;
FIG. 3 is a third flowchart of a data processing method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of a cardinality acquisition module according to an embodiment of the present invention;
FIG. 6 is a block diagram of a bitmap cardinality acquisition module according to an embodiment of the present invention;
FIG. 7 is an architecture diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it. In the embodiments, the terms "first", "second", and so on are used only to distinguish items in the description; they neither indicate or imply relative importance nor implicitly specify the number of the technical features referred to. The expressions "for example", "example", and "such as" mean "serving as an example, illustration, or explanation", and any embodiment described with these expressions is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the invention. Details are set forth for the purpose of explanation, and a person of ordinary skill in the art will recognize that the invention may be practiced without these specific details. In other instances, well-known structures and processes are not described in detail, so that unnecessary detail does not obscure the description. Thus, the present invention is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
An embodiment of the present invention provides a data processing method which, as shown in FIG. 1, includes step 100, step 200, step 300, and step 400. The method is as follows:
Step 100: group a plurality of data items to be processed according to data dimensions to obtain a data set for each dimension. The data dimensions include at least a first grouping field and a second grouping field, and the first grouping field and the second grouping field each contain multiple data groups.
In one embodiment, step 100 may specifically include the following steps:
First, obtain the grouping field of each data item to be processed.
Then, divide the data to be processed into multiple data sets according to the grouping fields. The multiple data sets include a first data set corresponding to the first grouping field and a second data set corresponding to the second grouping field. For example, in a user statistics query, the first grouping field is country and the second grouping field is age; the first data set contains the country names or codes together with the ID or name of each user, and the second data set contains the age brackets together with the ID or name of each user.
The dimension levels of the first grouping field and the second grouping field may be the same or different.
For example, the dimension level of the first grouping field may be higher than that of the second grouping field: the first grouping field is country, with data groups such as China, the United States, the United Kingdom, and Japan, and the second grouping field is province/city/county, with province-, city-, and county-level data groups such as Beijing, Wuhan, Hunan Province, Hokkaido, Yamagata Prefecture, Fukushima Prefecture, and Arizona.
As another example, the two grouping fields may be at the same dimension level: the first grouping field is age, with data groups for age brackets such as 18, 25, 30, and 40, and the second grouping field is education level, with data groups such as primary school, junior high school, senior high school, and university.
In one embodiment, the data dimensions may further include a third grouping field, a fourth grouping field, or more grouping fields, which are not listed exhaustively here.
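The grouping in step 100 can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the field names `country` and `age`, and the helper `group_by_dimension` are hypothetical and do not come from the patent text.

```python
from collections import defaultdict

def group_by_dimension(records, field):
    """Map each value of `field` to the user IDs recorded under it (duplicates kept)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec["user_id"])
    return dict(groups)

# Hypothetical input records; user 1 appears twice on purpose.
records = [
    {"user_id": 1, "country": "CN", "age": 25},
    {"user_id": 2, "country": "US", "age": 30},
    {"user_id": 1, "country": "CN", "age": 25},
    {"user_id": 3, "country": "CN", "age": 30},
]

# One data set per dimension (grouping field).
by_country = group_by_dimension(records, "country")
by_age = group_by_dimension(records, "age")
```

Duplicates are deliberately kept at this stage; deduplication happens when each group is turned into a bitmap in step 200.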
Step 200: perform aggregation and deduplication on the data set of each dimension by means of an aggregation function to obtain the bitmap array corresponding to the data set of each dimension. For example, the first-dimension data set is the data set corresponding to the first grouping field, and the second-dimension data set is the data set corresponding to the second grouping field.
The aggregation functions include the rbmcreate32 function and the rbmcardinality32 function, which are extensions implemented in the Impala execution engine for aggregation and deduplication. Impala is an MPP (massively parallel processing) SQL (Structured Query Language) query engine for processing large volumes of data stored in Hadoop (a distributed system infrastructure) clusters; it is open-source software written in C++ and Java. SQL is a database query and programming language used to access data and to query, update, and manage relational database systems.
The basic idea of a bitmap (Bitmap) array is to use a single bit to mark the value corresponding to an element, with the element itself serving as the key. Because data is stored at bit granularity, storage space is greatly reduced.
For example, the data set [2, 3, 5, 8] corresponds to the bitmap array [001101001]. The value 2 maps to index 2 of the bitmap array, 3 maps to index 3, 5 maps to index 5, and 8 maps to index 8. A bitmap array contains only 0s and 1s: 0 means there is no data at that index, and 1 means there is. Reading [001101001] from left to right, the 0 in the first position means the value 0 is absent, the 0 in the second position means the value 1 is absent, the 1 in the third position means the value 2 is present, the 1 in the fourth position means the value 3 is present, and so on.
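The mapping from a data set to a bitmap array can be sketched as follows. This is a minimal illustration that uses a Python integer as the bit store; the helper names `to_bitmap` and `bitmap_string` are hypothetical, and the string is printed with index 0 on the left to match the notation used above.

```python
def to_bitmap(values):
    """Set bit v for every value v in the input."""
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits

def bitmap_string(bits, width):
    """Render the bitmap with index 0 as the leftmost character."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(width))

bitmap = to_bitmap([2, 3, 5, 8])
rendered = bitmap_string(bitmap, 9)
print(rendered)  # 001101001
```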
Step 300: obtain the cardinality of the bitmap array corresponding to the data set of each dimension. The cardinality of a bitmap array is the number of 1s it contains. For example, the bitmap array [001101001] contains four 1s, so its cardinality is 4.
Because identical values map to the same index of the bitmap array, and that index simply holds a 1, building the bitmap array also deduplicates the original data set. For example, the data set [2, 3, 5, 5, 8, 2] also converts to the bitmap array [001101001].
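The two points above, that cardinality is a count of set bits and that building the bitmap deduplicates the input, can be checked with a short sketch. The helper names are hypothetical, and a Python integer again stands in for the bitmap array.

```python
def to_bitmap(values):
    """Set bit v for every value v; repeated values set the same bit again."""
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits

def cardinality(bits):
    """Count the 1 bits in the bitmap."""
    return bin(bits).count("1")

with_duplicates = to_bitmap([2, 3, 5, 5, 8, 2])
without_duplicates = to_bitmap([2, 3, 5, 8])

same_bitmap = with_duplicates == without_duplicates
distinct_count = cardinality(with_duplicates)
print(same_bitmap, distinct_count)  # True 4
```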
In one embodiment, as shown in FIG. 2, step 300 specifically includes step 310 and step 320:
Step 310: obtain a compressed bitmap array from the bitmap array, the compressed bitmap array occupying less storage space than the bitmap array.
Specifically, a compressed bitmap (Roaring Bitmap) occupies less storage space than an ordinary bitmap. In a Roaring Bitmap, the 32-bit integers are divided into 2^16 blocks. The upper 16 bits of any 32-bit integer determine which block it belongs to, and the lower 16 bits are the content stored in that block. For example, 0xFFFF0000 and 0xFFFF0001 share the upper 16 bits FFFF, so both belong to the same block; their lower 16 bits are 0 and 1 respectively. The block only needs to store 0 and 1, not the full integers. This compresses the bitmap efficiently and greatly reduces the space the bitmap array occupies.
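The block layout described above can be sketched as follows. This is a simplified illustration, not the Roaring Bitmap implementation: a real Roaring Bitmap chooses among specialized container types per block, whereas this sketch uses a plain Python set as a stand-in, and the names `split_key` and `ChunkedBitmap` are hypothetical.

```python
def split_key(value):
    """Split a 32-bit integer into (upper 16 bits, lower 16 bits)."""
    return value >> 16, value & 0xFFFF

class ChunkedBitmap:
    """Stores only the lower 16 bits of each value, keyed by the upper 16 bits."""

    def __init__(self):
        self.chunks = {}  # upper 16 bits -> set of lower 16-bit values

    def add(self, value):
        high, low = split_key(value)
        self.chunks.setdefault(high, set()).add(low)

    def cardinality(self):
        # The total is the sum of the per-block counts.
        return sum(len(chunk) for chunk in self.chunks.values())

bm = ChunkedBitmap()
for v in (0xFFFF0000, 0xFFFF0001, 5, 0xFFFF0001):
    bm.add(v)
```

Blocks are created only for upper-16-bit prefixes that actually occur, which is why sparse data does not pay for the full 2^32-bit range.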
Step 320: obtain the corresponding bitmap array cardinality from the compressed bitmap array. As shown in FIG. 3, this specifically includes step 321 and step 322, where the data set of each dimension comprises multiple data groups:
Step 321: obtain the cardinality of each data group from the compressed bitmap array of that data group.
Step 322: determine, from the cardinality of each data group, the bitmap array cardinality corresponding to the data set of each dimension.
For example, in a task that counts UVs (unique visitors) by nationality and province, the second-dimension data set contains the data groups Hokkaido, Yamagata Prefecture, Iwate Prefecture, Fukushima Prefecture, Akita Prefecture, and Aomori Prefecture, whose cardinalities are 18, 31, 23, 54, 42, and 37 respectively. The data groups Arizona, Alabama, and Illinois have cardinalities 8, 9, and 7 respectively.
Step 400: determine, from the cardinality of the bitmap array, the deduplicated data volume of the data set of each dimension. This specifically includes the following steps:
First, determine the deduplicated data volume of each data group in the first grouping field from the cardinality of the bitmap array corresponding to the data in the first grouping field. For example, in the task of counting UVs by nationality and province, the first grouping field contains Japan and the United States; the cardinality for Japan is 205 and the cardinality for the United States is 24, so Japan has 205 UVs and the United States has 24 UVs.
Then, determine the deduplicated data volume of each data group in the second grouping field from the cardinality of the bitmap array corresponding to the data in the second grouping field. For example, in the second grouping field, the data groups for Japan are Hokkaido, Yamagata Prefecture, Iwate Prefecture, Fukushima Prefecture, Akita Prefecture, and Aomori Prefecture, with cardinalities 18, 31, 23, 54, 42, and 37, so their UV counts are 18, 31, 23, 54, 42, and 37 respectively. Among the data groups for the United States, Arizona, Alabama, and Illinois have cardinalities 8, 9, and 7, so their UV counts are 8, 9, and 7 respectively.
It should be noted that in some cases a single data item may fall under several different groups. For example, a user may hold dual nationality; when counting UVs by nationality, the statistics must then be computed over the entire data so that items that overlap under the first grouping field are deduplicated.
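The overlap case can be illustrated with a small sketch. Under the assumption that each group's members are held in a bitmap, combining the bitmaps with a bitwise OR before counting deduplicates members shared between groups, whereas adding per-group counts would count them twice. The group contents and helper names below are hypothetical.

```python
def to_bitmap(user_ids):
    """Set one bit per user ID."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits

def cardinality(bits):
    """Count the 1 bits in the bitmap."""
    return bin(bits).count("1")

# Hypothetical groups under a nationality field; user 3 holds dual nationality.
groups = {"Japan": [1, 2, 3], "United States": [3, 4]}

bitmaps = {name: to_bitmap(ids) for name, ids in groups.items()}
per_group = {name: cardinality(bm) for name, bm in bitmaps.items()}

naive_total = sum(per_group.values())  # user 3 is counted twice

merged = 0
for bm in bitmaps.values():
    merged |= bm                        # bitwise OR merges and deduplicates
overall = cardinality(merged)
```

When the groups are disjoint, as with the prefecture figures in the example above (18 + 31 + 23 + 54 + 42 + 37 = 205), the sum and the merged count agree.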
This embodiment introduces bitmap-based computation into the Impala engine. To count exactly the number of user_id values in each grouping field, the individual user_id values of each grouping field must be retained, and the smallest unit in which a computer retains information is the bit.
This embodiment extends the Impala execution engine with the rbmcreate32 function and the rbmcardinality32 function. The rbmcreate32 function can construct the corresponding bitmap for any attribute, and this bitmap can be passed upward for aggregate computation of distinct counts.
As shown in the table below, UV data is counted separately under the two dimensions page and title.
The first dimension is title and the second dimension is page. To compute the user UV metric under each of the two dimensions at the same time, the rbmcreate32 function and the rbmcardinality32 function can be used in the following SQL:
select page, title,
  rbmcardinality32(uv_bitmap) over (partition by page),
  rbmcardinality32(uv_bitmap) over (partition by title)
from (select page, title, rbmcreate32(distinct user_id) as uv_bitmap from table group by page, title)
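The intent of the query above can be mimicked in plain Python. This sketch rests on an assumption about semantics that the text does not spell out: that rbmcardinality32 applied over a partition merges the bitmaps in that partition before counting. Python sets stand in for the bitmaps, and the sample rows and the helper `uv_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (page, title, user_id) rows.
rows = [
    ("home", "Welcome", 1),
    ("home", "Welcome", 2),
    ("home", "Welcome", 1),
    ("home", "Sale", 2),
    ("cart", "Sale", 3),
]

# Inner query analog: one deduplicated user set per (page, title) pair.
inner = defaultdict(set)
for page, title, uid in rows:
    inner[(page, title)].add(uid)

def uv_by(position):
    """Merge the per-pair sets along one key component, then count each result."""
    merged = defaultdict(set)
    for key, users in inner.items():
        merged[key[position]] |= users
    return {k: len(v) for k, v in merged.items()}

uv_by_page = uv_by(0)   # analog of: over (partition by page)
uv_by_title = uv_by(1)  # analog of: over (partition by title)
```

The point of the two-level structure is that the raw rows are scanned once; both dimensions are then answered from the much smaller per-pair bitmaps.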
In this embodiment, a plurality of data items to be processed are first grouped according to data dimensions to obtain a data set for each dimension. Second, an aggregation function performs aggregation and deduplication on the data set of each dimension to obtain the corresponding bitmap array. Then the cardinality of the bitmap array corresponding to the data set of each dimension is obtained. Finally, the deduplicated data volume of the data set of each dimension is determined from that cardinality. In these steps the data to be processed is converted into a bitmap array (Bitmap); because a bitmap array occupies little storage space, data processing operations such as deduplication, sorting, and querying can be performed quickly on data under different dimensions, which improves the efficiency of statistical queries.
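An embodiment of the present invention provides a data processing apparatus which, as shown in FIG. 4, includes a data grouping module 500, an aggregation and deduplication module 600, a cardinality acquisition module 700, and a data determination module 800.
The data grouping module 500 is configured to group the plurality of data items to be processed by data dimension to obtain a data set for each dimension.
The aggregation and deduplication module 600 is configured to perform aggregation and deduplication on the data set of each dimension through an aggregation function to obtain the bitmap array corresponding to each dimension's data set.
The cardinality acquisition module 700 is configured to obtain the cardinality of the bitmap array corresponding to each dimension's data set.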
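As shown in FIG. 5, the cardinality acquisition module 700 specifically includes a compressed bitmap acquisition module 710 and a bitmap cardinality acquisition module 720. The compressed bitmap acquisition module 710 is configured to obtain a compressed bitmap array from the bitmap array, where the compressed bitmap array occupies less storage space than the bitmap array. The bitmap cardinality acquisition module 720 is configured to obtain the corresponding bitmap array cardinality from the compressed bitmap array.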
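To make the idea of a compressed bitmap array concrete, the following Python sketch stores only the 16-bit chunks that actually contain an id, so a sparse set of 32-bit ids needs far less space than a flat 2^32-bit bitmap. The source does not specify the compression scheme, so this chunked layout is an assumption chosen for illustration, and the names `ChunkedBitmap`, `CHUNK_BITS`, and `stored_bits` are hypothetical.

```python
CHUNK_BITS = 16  # assumed chunk size: ids are split into high and low 16 bits


class ChunkedBitmap:
    """Sparse bitmap: only chunks holding at least one id are stored."""

    def __init__(self):
        # Maps the high 16 bits of an id to an int used as a 65536-bit bitmap.
        self.chunks = {}

    def add(self, value):
        if not 0 <= value < 2 ** 32:
            raise ValueError("value must fit in 32 bits")
        high = value >> CHUNK_BITS
        low = value & ((1 << CHUNK_BITS) - 1)
        self.chunks[high] = self.chunks.get(high, 0) | (1 << low)

    def cardinality(self):
        """Sum the set bits of every stored chunk."""
        return sum(bin(chunk).count("1") for chunk in self.chunks.values())

    def stored_bits(self):
        """Upper bound on the bits held by the non-empty chunks."""
        return len(self.chunks) * (1 << CHUNK_BITS)


bitmap = ChunkedBitmap()
for user_id in [7, 7, 65536, 4_000_000_000]:
    bitmap.add(user_id)
print(bitmap.cardinality())
print(bitmap.stored_bits(), "bits stored vs", 2 ** 32, "for a flat bitmap")
```

The cardinality is computed chunk by chunk, so it never requires expanding the compressed form back into a full bitmap.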
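Further, the data set of each dimension includes multiple data groups. As shown in FIG. 6, the bitmap cardinality acquisition module 720 includes a cardinality value acquisition module 721 and a cardinality value determination module 722. The cardinality value acquisition module 721 is configured to obtain the cardinality of each data group from that group's compressed bitmap array. The cardinality value determination module 722 is configured to determine the bitmap array cardinality corresponding to each dimension's data set from the cardinalities of the data groups.
The data determination module 800 is configured to determine the deduplicated data volume of each dimension's data set from the cardinality of the bitmap array.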
The data processing apparatus of this embodiment first groups the plurality of data items to be processed by data dimension to obtain a data set for each dimension. Next, an aggregation function performs aggregation and deduplication on the data set of each dimension to obtain the bitmap array corresponding to that data set. The cardinality of the bitmap array corresponding to each dimension's data set is then obtained. Finally, the deduplicated data volume of each dimension's data set is determined from the cardinality of the bitmap array. In these method steps, the data to be processed is converted into a bitmap array (Bitmap). Because a bitmap array occupies little storage space, data processing operations such as deduplication, sorting, and querying can be performed quickly on data in different dimensions, which improves the efficiency of data query and statistics.
An embodiment of the present invention provides an electronic device which, as shown in FIG. 7, includes a storage medium and a processor. The storage medium stores a computer program, and the processor, when executing the computer program, implements the steps of any one of the data processing methods provided in the above embodiments.
Those of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments can be carried out by instructions (a computer program), or by instructions (a computer program) controlling the relevant hardware. The instructions can be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, the storage medium of the electronic device in this embodiment of the present invention stores a plurality of instructions that can be loaded by the processor to execute the steps of any embodiment of the device control method provided in the embodiments of the present invention.
The electronic device of this embodiment first groups the plurality of data items to be processed by data dimension to obtain a data set for each dimension. Next, an aggregation function performs aggregation and deduplication on the data set of each dimension to obtain the bitmap array corresponding to that data set. The cardinality of the bitmap array corresponding to each dimension's data set is then obtained. Finally, the deduplicated data volume of each dimension's data set is determined from the cardinality of the bitmap array. In these method steps, the data to be processed is converted into a bitmap array (Bitmap). Because a bitmap array occupies little storage space, data processing operations such as deduplication, sorting, and querying can be performed quickly on data in different dimensions, which improves the efficiency of data query and statistics.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps of any of the data processing methods provided in the above embodiments.
As shown in the figure, the storage medium and the processor are electrically connected, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines, such as a bus. The storage medium stores computer-executable instructions for implementing the data access control method, including at least one software function module that can be stored in the storage medium in the form of software or firmware. The processor runs the software programs and modules stored in the storage medium to execute various functional applications and data processing. The storage medium may be, but is not limited to, a random access memory (Random Access Memory, RAM for short), a read-only memory (Read-Only Memory, ROM for short), a programmable read-only memory (Programmable Read-Only Memory, PROM for short), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM for short), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), and the like.
The storage medium is used to store the program, and the processor executes the program after receiving an execution instruction. Further, the software programs and modules in the storage medium may also include an operating system, which may include various software components and/or drivers for managing system tasks (such as memory management, storage device control, and power management) and may communicate with various hardware or software components to provide a runtime environment for other software components. The processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), and the like. It can implement or execute the methods, steps, and logic flow block diagrams disclosed in this embodiment. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Because the instructions stored in the storage medium can execute the steps of any data processing method embodiment provided by the embodiments of the present invention, they can achieve the beneficial effects achievable by any of those data processing methods. For details, refer to the foregoing embodiments, which are not repeated here.
The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210229221.0A CN114595215A (en) | 2022-03-10 | 2022-03-10 | Data processing method, device, electronic device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210229221.0A CN114595215A (en) | 2022-03-10 | 2022-03-10 | Data processing method, device, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114595215A true CN114595215A (en) | 2022-06-07 |
Family
ID=81808549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210229221.0A Pending CN114595215A (en) | 2022-03-10 | 2022-03-10 | Data processing method, device, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114595215A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115083615A (en) * | 2022-07-20 | 2022-09-20 | 之江实验室 | Method and device for chain type parallel statistics of number of patients in multi-center treatment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110955685A (en) * | 2019-11-29 | 2020-04-03 | 北京锐安科技有限公司 | Big data base estimation method, system, server and storage medium |
CN111563109A (en) * | 2020-04-26 | 2020-08-21 | 北京奇艺世纪科技有限公司 | Radix statistics method, apparatus, system, device and computer readable storage medium |
CN113641669A (en) * | 2021-06-30 | 2021-11-12 | 北京邮电大学 | A hybrid engine-based multidimensional data query method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108427725B (en) | Data processing method, device and system | |
US10496624B2 (en) | Index key generating device, index key generating method, and search method | |
CN110909266B (en) | Deep paging method and device and server | |
CN110795458B (en) | Interactive data analysis method, device, electronic equipment and computer readable storage medium | |
CN112287182A (en) | Graph data storage and processing method and device and computer storage medium | |
WO2017096892A1 (en) | Index construction method, search method, and corresponding device, apparatus, and computer storage medium | |
CN111061758B (en) | Data storage method, device and storage medium | |
CN110674101A (en) | Data processing method, device and cloud server for file system | |
CN107133329A (en) | Data processing method, data processing equipment and storage medium | |
WO2021004266A1 (en) | Data insertion method and apparatus, device and storage medium | |
CN114741368A (en) | Log data statistical method based on artificial intelligence and related equipment | |
WO2023124217A1 (en) | Method and device for acquiring comprehensively sorted data of multi-column data | |
CN104809246A (en) | Method and device for processing charging data | |
WO2020088262A1 (en) | Data analysis method and device, and storage medium | |
CN109388659B (en) | Data storage method, device and computer readable storage medium | |
CN113918605A (en) | Data query method, device, equipment and computer storage medium | |
CN114595215A (en) | Data processing method, device, electronic device and storage medium | |
CN112307062A (en) | Database aggregation query method, device and system | |
CN111813773A (en) | A grid meter reading data storage method, uploading method, device and storage device | |
CN114860722A (en) | Data fragmentation method, device, equipment and medium based on artificial intelligence | |
CN112732711B (en) | Data storage method and device and electronic equipment | |
CN113297266B (en) | Data processing method, apparatus, device and computer storage medium | |
CN117349328A (en) | Multiple data source aggregation methods, devices, equipment and storage media | |
CN115840539A (en) | Data processing method and device, electronic equipment and storage medium | |
CN113485683A (en) | RTC code amount statistical method, RTC code amount statistical system, RTC code amount statistical medium and RTC code amount statistical terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||