CN110569263B - Real-time data deduplication counting method and device - Google Patents

Info

Publication number
CN110569263B
Authority
CN
China
Prior art keywords
real
dimension
time data
judgment
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910795939.4A
Other languages
Chinese (zh)
Other versions
CN110569263A (en)
Inventor
汪凯
张盼盼
韩振旭
李成
孙迁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suning Cloud Computing Co ltd
SuningCom Co ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201910795939.4A (CN110569263B)
Publication of CN110569263A
Priority to CA3152844A (CA3152844A1)
Priority to PCT/CN2020/097839 (WO2021036452A1)
Application granted
Publication of CN110569263B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24553: Query execution of query operations

Abstract

The embodiment of the invention discloses a real-time data deduplication counting method and device that compute the deduplicated count of a metric over real-time data accurately while occupying little memory during deduplication. The method comprises: acquiring real-time data, dimension information, dimension combination information and metric information; decomposing all dimension combinations of the real-time data into 1-dimensional dimension combinations according to the dimension information, the dimension combination information and the metric information, and generating deduplication keys; and performing a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication count result. This guarantees that each metric field value of the real-time data is counted only once under each dimension combination, so the duplicate check yields an accurate deduplicated count. Only the deduplication keys of the 1-dimensional dimension combinations need to be stored for the duplicate check, which frees a large amount of memory, and because the length of the deduplication keys is fixed, Redis memory usage can be estimated.

Description

Real-time data deduplication method and device
Technical Field
The invention belongs to the technical field of data analysis, and particularly relates to a real-time data deduplication counting method and device.
Background
In big data analysis, deduplicated statistics are often required for a metric field, but the commonly used online analytical processing (OLAP) tools cannot compute the deduplicated count of real-time data accurately. The conventional deduplication scheme uses the Cartesian product of the dimension value sets of each dimension combination as the deduplication keys, so the maximum number of deduplication keys for a dimension combination under one metric field is the product of the cardinalities of all dimensions in that combination. For example, a Cube [1] with the four dimensions time, item, location and supplier has 16 dimension combinations, that is, 16 Cuboids [2], as shown in FIG. 1. If the cardinalities of the four dimensions are a, b, c and d respectively, the maximum number of deduplication keys is a × b for the combination (time, item), a × b × c for the combination (time, item, location), and a × b × c × d for the combination (time, item, location, supplier). By analogy, the maximum total number of deduplication keys is 1 + (a + b + c + d) + (a × b + a × c + a × d + b × c + b × d + c × d) + (a × b × c + a × b × d + a × c × d + b × c × d) + a × b × c × d. The conventional deduplication method therefore occupies a large amount of memory, wastes resources, and makes memory usage difficult to estimate.
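The key count of the conventional scheme can be checked numerically. The following is a minimal sketch, not part of the patent: the function name and the sample cardinalities are hypothetical, and it simply sums the product of cardinalities over every dimension combination, including the empty 0-dimensional one.

```python
from itertools import combinations
from math import prod


def conventional_max_keys(cardinalities):
    """Sum, over every dimension combination, the product of its cardinalities."""
    total = 0
    for size in range(len(cardinalities) + 1):
        for combo in combinations(cardinalities, size):
            # prod(()) is 1, which accounts for the 0-dimensional Cuboid.
            total += prod(combo)
    return total


# Hypothetical cardinalities a, b, c, d for time, item, location, supplier.
sample = [2, 3, 4, 5]
total_keys = conventional_max_keys(sample)

# The same total in closed form: (1 + a)(1 + b)(1 + c)(1 + d).
closed_form = prod(1 + c for c in sample)
```

The closed form shows why the conventional key space grows multiplicatively with each added dimension.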
Disclosure of Invention
The embodiment of the invention provides a real-time data deduplication counting method and device to solve the above problems.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
In a first aspect, an embodiment of the present invention provides a real-time data deduplication counting method, including:
acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and metric information;
decomposing all dimension combinations of the real-time data into 1-dimensional dimension combinations according to the dimension information, the dimension combination information and the metric information to generate deduplication keys;
and performing a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication count result of the real-time data.
With reference to the first aspect, in a first possible implementation manner of the first aspect, performing the Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication count result of the real-time data specifically includes:
determining whether contention exists among the pieces of data in the real-time data; if not, performing the Redis batch duplicate check on all the deduplication keys; if so, locking one piece of contending data, performing the Redis batch duplicate check on its deduplication keys and releasing the lock after processing, then polling to lock the other contending data, performing the Redis batch duplicate check on the deduplication keys of each piece of data that is locked successfully and releasing the lock after processing, and repeating until the deduplication keys of all contending data have been checked; wherein the Redis batch duplicate check specifically comprises checking in Redis, in batches, whether each deduplication key already exists and, if it does not, adding 1 to the metric value of each non-repeated dimension combination among all the dimension combinations associated with that deduplication key, to generate a pre-calculation result;
and accumulating the metric values of the same metric field value under the same dimension combination in the pre-calculation result to generate the deduplication count result of the real-time data.
With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the method further includes: storing the deduplication count result of the real-time data in a database.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, storing the deduplication count result of the real-time data in a database specifically includes:
converting the deduplication count result into lookup fields and values, wherein the lookup fields correspond to the values one to one; looking up in turn whether the same lookup field already exists in the database; if so, adding the values corresponding to the same lookup field and storing the sum in the database; and if not, storing the lookup field and its corresponding value in the database.
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the dimension information includes a dimension code, the dimension combination information includes a dimension combination code, and the metric information includes a metric code.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, generating the deduplication key specifically includes splicing the dimension code of the 1-dimensional dimension combination with the current date, the metric field value of the real-time data and the dimension field value of the real-time data to form the deduplication key.
In a second aspect, an embodiment of the present invention provides a real-time data deduplication counting device, including:
an acquisition module, used for acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and metric information;
a deduplication key generation module, used for obtaining the 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the metric information and generating deduplication keys;
and a counting module, used for performing a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication count result of the real-time data.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the counting module includes:
a distribution calculation unit, used for determining whether contention exists among the pieces of data in the real-time data; if not, sending all the deduplication keys to the Redis batch duplicate check unit; if so, locking one piece of contending data, sending its deduplication keys to the Redis batch duplicate check unit and releasing the lock after processing, then polling to lock the other contending data, sending the deduplication keys of each piece of data that is locked successfully to the Redis batch duplicate check unit and releasing the lock after processing, and repeating until the deduplication keys of all contending data have been sent to the Redis batch duplicate check unit;
a Redis batch duplicate check unit, used for checking in Redis, in batches, whether each deduplication key already exists and, if it does not, adding 1 to the metric value of each non-repeated dimension combination among all the dimension combinations associated with that deduplication key, to generate a pre-calculation result;
and an accumulation unit, used for accumulating the metric values of the same metric field value under the same dimension combination in the pre-calculation result obtained by the Redis batch duplicate check unit to generate the deduplication count result of the real-time data.
With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the device further includes:
a storage module, used for storing the deduplication count result of the real-time data in a database.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the storage module is further configured to convert the deduplication count result into lookup fields and values, wherein the lookup fields correspond to the values one to one; look up in turn whether the same lookup field already exists in the database; if so, add the values corresponding to the same lookup field and store the sum in the database; and if not, store the lookup field and its corresponding value in the database.
The real-time data deduplication counting method and device provided by the embodiment of the invention compute the deduplicated count of a metric over real-time data accurately, and the deduplication process occupies little memory. In the embodiment of the invention, all dimension combinations of the real-time data are decomposed into 1-dimensional dimension combinations, deduplication keys are generated from the 1-dimensional dimension combinations, and a Redis [3] batch duplicate check is performed on the deduplication keys under the control of a distributed lock. This guarantees that each metric field value of the real-time data is counted only once under each dimension combination, so the duplicate check yields an accurate deduplicated count. Moreover, only the deduplication keys of the 1-dimensional dimension combinations need to be stored for the duplicate check. Compared with the existing deduplication method, the number of deduplication keys is small, a large amount of memory is freed, and because the length of the deduplication keys is fixed, Redis memory usage is easy to estimate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is an exemplary diagram of Cube comprising four dimensions;
FIG. 2 is a flowchart of a real-time data deduplication method according to an embodiment of the present invention;
FIG. 3 is a system architecture diagram of a process flow for operating an embodiment of the present invention;
FIG. 4 is a diagram of thread contention relationships;
FIG. 5 is a block diagram of a real-time data deduplication device according to another embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present invention and are not construed as limiting the present invention. It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The method flow of the embodiment of the present invention is specifically described below with reference to fig. 2.
The embodiment of the invention provides a real-time data deduplication counting method, which comprises the following steps:
S10: acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and metric information.
According to one embodiment of the invention, Kafka [4] is used as the data source and real-time data is acquired from Kafka in batches. Each acquired batch contains multiple pieces of data, and each piece of data includes at least a date, a metric field value and a dimension field value.
MySQL [5] is queried to obtain the counting information, which includes dimension information, dimension combination information and metric information. Dimensions are the non-statistical items in the statistical data, such as the region ("large area") and category fields; the corresponding dimension combinations are the region Cuboid, the category Cuboid and the region-category Cuboid. Metrics are the statistical items in the statistical data, such as the number of members, the number of visits and the number of orders.
In some embodiments, to simplify record keeping throughout the method flow, each dimension is assigned a dimension code, each dimension combination a dimension combination code, and each metric a metric code. The dimension information includes the dimension codes, the dimension combination information includes the dimension combination codes, and the metric information includes the metric codes.
S20: obtaining all 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the metric information, and generating deduplication keys.
All Cuboids of each piece of data in the batch of real-time data are traversed, each Cuboid is decomposed into one or more 1-dimensional Cuboids, and each distinct 1-dimensional Cuboid generates one deduplication key. For example, if the two dimensions A and B are selected for a metric, the Cube formed by the real-time data contains the dimensions A and B and has 4 dimension combinations: one 0-dimensional Cuboid, two 1-dimensional Cuboids (the A Cuboid and the B Cuboid) and one 2-dimensional Cuboid (the A-B Cuboid). The A-B Cuboid is decomposed into the A Cuboid and the B Cuboid; the A Cuboid generates one deduplication key and the B Cuboid generates another, so each piece of data generates two deduplication keys. In actual processing only the A Cuboid and the B Cuboid need to be selected: the A Cuboid generates deduplication key1 and the B Cuboid generates deduplication key2. The A Cuboid, and the A-B Cuboid that decomposes into it, are associated with key1; the B Cuboid, and the A-B Cuboid that decomposes into it, are associated with key2. With key1 and key2 alone, all Cuboids of the data can be deduplicated.
According to one embodiment of the invention, the dimension code of the 1-dimensional Cuboid is spliced with the date, the metric field value of the real-time data and the dimension field value of the real-time data to form the deduplication key.
In the conventional deduplication method, the topic, consumer group, date, metric field value, dimension combination code and dimension values are spliced to form the deduplication key. Because a Cuboid is a combination of several dimensions, the length of the deduplication key is not fixed, and each Cuboid has its own set of deduplication keys, which makes Redis memory usage difficult to estimate. The embodiment of the invention stores only the deduplication keys of the 1-dimensional Cuboids. The key length is fixed, the key space does not expand with the combinations of different dimension values of a Cuboid, the keys are shorter and fewer than those of the conventional deduplication method, less memory is occupied, and memory usage can be estimated.
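The key generation of step S20 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name, the record layout and the dictionary of dimension codes are hypothetical, and the field order and the hyphen separator follow the keys quoted in the worked example later in the description (for instance 20190520-52001-WD1020-10001).

```python
def dedup_keys(record, dimension_codes):
    """Build one deduplication key per 1-dimensional Cuboid of a record.

    record          -- dict with 'date', 'metric_value' and one entry per dimension
    dimension_codes -- dict mapping a dimension name to its dimension code
    """
    return [
        # Assumed layout: date, metric field value, dimension code, dimension value.
        f"{record['date']}-{record['metric_value']}-{code}-{record[dimension]}"
        for dimension, code in dimension_codes.items()
    ]


# Hypothetical record shaped like the first order of the worked example.
record = {"date": "20190520", "metric_value": "52001",
          "region": "10001", "category": "00001"}
codes = {"region": "WD1020", "category": "WD1029"}
keys = dedup_keys(record, codes)
```

Each record yields as many keys as there are dimensions, regardless of how many multi-dimensional Cuboids are configured.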
S30: performing a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication count result of the real-time data.
According to one embodiment of the invention, it is determined whether contention exists among the pieces of data in the real-time data. If not, the deduplication keys of all the data are checked for duplicates. If so, one piece of contending data is locked, its deduplication keys are checked and the lock is released after processing; the other contending data are then polled for locking, the deduplication keys of each piece of data that is locked successfully are checked and the lock is released after processing; this repeats until the deduplication keys of all contending data have been checked.
To determine whether contention exists, a lock key is generated for each piece of data by splicing the date with the metric field value. If the lock keys of all pieces of data differ, no contention exists within the batch of real-time data; if some lock keys are identical, contention exists among the data sharing the same lock key.
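The contention test described above can be sketched as a grouping of records by lock key. This is a minimal sketch with hypothetical names; the hyphen separator in the lock key is an assumption, since the source only states that the date is spliced with the metric field value.

```python
from collections import defaultdict


def lock_key(record):
    """Lock key: the date spliced with the metric field value (separator assumed)."""
    return f"{record['date']}-{record['metric_value']}"


def split_by_contention(records):
    """Separate records that can run freely from groups that must be serialised."""
    groups = defaultdict(list)
    for record in records:
        groups[lock_key(record)].append(record)
    free = [group[0] for group in groups.values() if len(group) == 1]
    contended = {key: group for key, group in groups.items() if len(group) > 1}
    return free, contended


# Hypothetical batch: the second and third records share a member number.
batch = [
    {"id": "r1", "date": "20190520", "metric_value": "52001"},
    {"id": "r2", "date": "20190520", "metric_value": "52002"},
    {"id": "r3", "date": "20190520", "metric_value": "52002"},
]
free, contended = split_by_contention(batch)
```

Records in `free` can be checked concurrently, while each group in `contended` has to be processed one record at a time under its lock.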
The Redis batch duplicate check specifically comprises checking in Redis, in batches, whether each deduplication key already exists and, if it does not, adding 1 to the metric value of each non-repeated dimension combination among all the dimension combinations associated with that deduplication key.
The metric values of the same metric field value under the same dimension combination of the real-time data are then accumulated to obtain the deduplication count value under each dimension combination, which forms the deduplication count result of the real-time data.
Because the duplicate check relies on the deduplication keys of the 1-dimensional Cuboids, the result would be wrong if several pieces of data with the same metric field value were processed concurrently by different threads. The embodiment of the invention uses a distributed lock mechanism to control concurrency and prevent this, so the duplicate check only needs to store the dimension values of the 1-dimensional Cuboids, which frees a large amount of memory.
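The reason for the lock can be illustrated with a small threaded sketch. Everything here is a stand-in chosen for illustration: an in-memory set replaces the Redis key space, one `threading.Lock` per lock key replaces the distributed lock, and the function and variable names are hypothetical. The check-then-insert sequence on a deduplication key runs inside the lock, so concurrent copies of the same record are counted once.

```python
import threading
from collections import defaultdict

redis_keys = set()                        # in-memory stand-in for the Redis key space
counts = defaultdict(int)                 # metric value per (dimension code, dimension value)
key_locks = defaultdict(threading.Lock)   # stand-in for one distributed lock per lock key
registry_guard = threading.Lock()         # protects the lock registry and the counters


def process(date, metric_value, dim_code, dim_value):
    """Check one deduplication key under the lock of its record's lock key."""
    lock_key = f"{date}-{metric_value}"
    with registry_guard:
        lock = key_locks[lock_key]
    with lock:  # records sharing a lock key are serialised here
        dedup_key = f"{date}-{metric_value}-{dim_code}-{dim_value}"
        if dedup_key in redis_keys:
            return False                  # already seen: nothing is counted
        redis_keys.add(dedup_key)
        with registry_guard:
            counts[(dim_code, dim_value)] += 1
        return True


# Eight threads submit the same hypothetical record at the same time.
threads = [
    threading.Thread(target=process, args=("20190520", "52002", "WD1020", "10001"))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

A real deployment would replace the set and the local locks with Redis commands and a distributed lock; which commands the patented system uses is not stated in the source.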
According to an embodiment of the invention, the method further comprises:
S40: storing the deduplication count result in a database.
Specifically, the deduplication count result is converted into lookup fields and values, with the lookup fields and values in one-to-one correspondence. The database is searched in turn for each lookup field; if the same lookup field already exists, the corresponding values are added and the sum is stored in the database, and if not, the lookup field and its corresponding value are stored in the database.
Storing the deduplication count result of the real-time data in the database allows the result to be queried quickly in real time.
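The storage rule of step S40 amounts to an add-or-insert merge. The sketch below uses a plain dictionary as a hypothetical stand-in for the database; the function name and the sample lookup fields are illustrative only.

```python
def store_results(database, results):
    """Merge deduplication counts into the store: add to an existing field, else insert."""
    for field, value in results.items():
        if field in database:
            database[field] += value      # same lookup field: add the values
        else:
            database[field] = value       # new lookup field: store as is
    return database


# Hypothetical contents: one field already stored, one new field arriving.
database = {"VV254,20190520-10001": 2}
store_results(database, {"VV254,20190520-10001": 1, "VV254,20190520-10002": 1})
```

Because each batch contributes only counts for keys that were new in that batch, adding batch results to the stored values keeps the running total deduplicated.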
The method provided by the embodiment of the invention can run in the system shown in FIG. 3, with Kafka as the data source, HBase [6] as the storage engine, and Spark Streaming [7] as the computing engine that executes the method flow.
The specific process of the method is described below, taking as an example the calculation of the metric "number of buyers" over the two dimensions region and category.
101) Real-time data is acquired from Kafka in batches; the acquired batch of order data is shown in Table 1:
TABLE 1
Order number   Order line number   Payment date   Member number   Region (large area)   Category
19052001       1905200101          20190520       52001           10001                 00001
19052002       1905200201          20190520       52002           10001                 00001
19052002       1905200202          20190520       52002           10001                 00002
19052003       1905200301          20190520       52003           10002                 00001
The above data is the real-time order data of May 20, 2019 that takes part in the calculation of the number of buyers. Under "Region", the field value 10001 denotes the Nanjing region and the field value 10002 denotes the Wuxi region; under "Category", the field value 00001 denotes air conditioners and the field value 00002 denotes refrigerators and washing machines.
102) The configuration information in MySQL is queried to obtain the dimension information: region (code: WD1020) and category (code: WD1029); the dimension combination information: region Cuboid (code: VV254), category Cuboid (code: VV151) and region-category Cuboid (code: VV153); and the metric information: number of buyers (code: ZB_YY_0001,SJ_01). In this example the metric "number of buyers" is calculated. The number of buyers corresponds to the member number in Table 1, that is, the statistical item is the member number in Table 1.
201) The deduplication keys of each order in the batch of real-time data are generated according to the dimension information, the dimension combination information and the metric information.
Specifically, the order data in Table 1 has 3 Cuboids: the region Cuboid (code: VV254), the category Cuboid (code: VV151) and the region-category Cuboid (code: VV153). The region-category Cuboid is decomposed into the region Cuboid and the category Cuboid. The dimension code of the region Cuboid is spliced with the current date, the metric field value and the dimension field value to form one deduplication key, and the dimension code of the category Cuboid is spliced with the current date, the metric field value and the dimension field value to form another, so each order generates two deduplication keys, as shown in Table 2. Here the metric field value is the field value under "Member number" in Table 1, and the dimension field value is the field value under "Region" or "Category" in Table 1.
TABLE 2
[Table 2: deduplication keys of each order; provided as an image in the original publication]
202) A lock key is generated for each order in the batch of real-time data. Specifically, the current date is spliced with the metric field value to form the lock key, as shown in Table 3.
TABLE 3
[Table 3: lock key of each order; provided as an image in the original publication]
203) All deduplication keys are submitted to Redis in batches and checked for duplicates concurrently to obtain the pre-calculation results. Each pre-calculation result consists of a deduplication result key and a deduplication result: the dimension combination code, together with the date spliced with the dimension field values, serves as the deduplication result key, and the metric code with the metric value serves as the deduplication result.
Specifically, whether contention exists among the four orders of the batch is determined from their lock keys shown in Table 3. The lock keys of order 1905200101 and order 1905200301 differ from all others, so these two orders do not contend and are checked for duplicates directly. The lock keys of order 1905200201 and order 1905200202 are identical, so these two orders contend. As shown in FIG. 4, order 1905200201 is locked and checked for duplicates, and the lock is released when processing finishes. Order 1905200202 polls and tries to acquire the lock, and is checked for duplicates once locking succeeds.
The detailed steps of the duplicate check are as follows, taking order 1905200101 as an example. The deduplication key 20190520-52001-WD1020-10001 is found not to exist, that is, the metric field value (member number) is not repeated for the region Cuboid, so there is one buyer from the region perspective, the corresponding metric value is 1, and the pre-calculation result ("VV254,20190520-10001", {"ZB_YY_0001,SJ_01": 1}) is generated. Next, the deduplication key 20190520-52001-WD1029-00001 is found not to exist, that is, the metric field value (member number) is not repeated for the category Cuboid, so there is one buyer from the category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV151,20190520-00001", {"ZB_YY_0001,SJ_01": 1}) is generated. Since the metric field value (member number) is repeated for neither the region Cuboid nor the category Cuboid, there is one buyer from the region-category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV153,20190520-10001-00001", {"ZB_YY_0001,SJ_01": 1}) is generated. The pre-calculation results of order 1905200201 and order 1905200202 are obtained in the same way.
When order 1905200202 is checked, the deduplication key 20190520-52002-WD1020-10001 is found to exist, that is, the metric field value (member number) is repeated for the region Cuboid. The deduplication key 20190520-52002-WD1029-00002 is then found not to exist, that is, the metric field value (member number) is not repeated for the category Cuboid, so there is one buyer from the category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV151,20190520-00002", {"ZB_YY_0001,SJ_01": 1}) is generated. Because the metric field value is not repeated for the category Cuboid, it is not repeated for the region-category Cuboid either, so there is one buyer from the region-category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV153,20190520-10001-00002", {"ZB_YY_0001,SJ_01": 1}) is generated.
The pre-calculation results obtained for the real-time order data of Table 1 are shown in Table 4.
TABLE 4
[Table 4: pre-calculation results of the orders in Table 1; provided as an image in the original publication]
204) In the pre-calculation results, the metric values corresponding to the same deduplication result key are accumulated. Specifically, accumulating the metric values of all "VV254,20190520-10001" entries in Table 4 gives a deduplication count of 2; the deduplication count of "VV254,20190520-10002" is 1, that of "VV151,20190520-00001" is 3, that of "VV151,20190520-00002" is 1, that of "VV153,20190520-10001-00001" is 2, that of "VV153,20190520-10001-00002" is 1, and that of "VV153,20190520-10002-00001" is 1.
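Steps 201 to 204 can be traced on the order data of Table 1 with a short sequential sketch. It is an illustration under assumptions rather than the patented implementation: an in-memory set stands in for Redis, the orders are processed one at a time so no lock is needed, the names are hypothetical, and the rule that a Cuboid is incremented when at least one of its 1-dimensional deduplication keys is new is taken from the way the worked example treats order 1905200202.

```python
from collections import Counter

DIMENSION_CODES = {"region": "WD1020", "category": "WD1029"}
CUBOIDS = {"VV254": ("region",), "VV151": ("category",),
           "VV153": ("region", "category")}

# Order line number, payment date, member number, region, category (Table 1).
ORDERS = [
    ("1905200101", "20190520", "52001", "10001", "00001"),
    ("1905200201", "20190520", "52002", "10001", "00001"),
    ("1905200202", "20190520", "52002", "10001", "00002"),
    ("1905200301", "20190520", "52003", "10002", "00001"),
]


def precompute(order, seen):
    """Return (deduplication result key, 1) pairs for one order."""
    _, date, member, region, category = order
    values = {"region": region, "category": category}
    new_dimensions = set()
    for dimension, code in DIMENSION_CODES.items():
        key = f"{date}-{member}-{code}-{values[dimension]}"
        if key not in seen:               # stand-in for the Redis existence check
            seen.add(key)
            new_dimensions.add(dimension)
    results = []
    for cuboid, dimensions in CUBOIDS.items():
        if new_dimensions.intersection(dimensions):
            joined = "-".join(values[d] for d in dimensions)
            results.append((f"{cuboid},{date}-{joined}", 1))
    return results


seen_keys = set()
totals = Counter()
for order in ORDERS:
    for result_key, value in precompute(order, seen_keys):
        totals[result_key] += value       # step 204: accumulate per result key
```

Under these assumptions the accumulated totals agree with the deduplication counts listed in step 204, and only eight 1-dimensional keys are held in `seen_keys` for the four orders.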
301) The above accumulation result is stored in HBase.
Specifically, the dimension combination code spliced with the metric code, the current date and the dimension field values is mapped to the RowKey of HBase, and the deduplication count value is mapped to the Value of HBase, as shown in Table 5. HBase is then queried in batches by RowKey; if a RowKey already exists, the Values are summed and stored, and if not, the record is stored directly.
TABLE 5
[Table 5: RowKey and Value mapping in HBase; provided as an image in the original publication]
The real-time data deduplication counting method provided by the embodiment of the invention computes the deduplicated count of a metric over real-time data accurately, and the deduplication process occupies little memory. In the embodiment of the invention, all dimension combinations of the real-time data are decomposed into 1-dimensional dimension combinations, deduplication keys are generated from the 1-dimensional dimension combinations, and a Redis batch duplicate check is performed on the deduplication keys under the control of a distributed lock. This guarantees that each metric field value of the real-time data is counted only once under each dimension combination, so the duplicate check yields an accurate deduplicated count. Moreover, only the deduplication keys of the 1-dimensional dimension combinations need to be stored for the duplicate check. Compared with the existing deduplication method, the number of deduplication keys is small, a large amount of memory is freed, and because the length of the deduplication keys is fixed, Redis memory usage can be estimated. Taking FIG. 1 as an example, with the deduplication counting method of the embodiment of the invention the maximum number of deduplication keys is a + b for the combination (time, item), a + b + c for the combination (time, item, location), and a + b + c + d for the combination (time, item, location, supplier). The 16 dimension combinations need only the 0-dimensional and 1-dimensional Cuboids, and the maximum total number of deduplication keys is 1 + (a + b + c + d), that is, the sum of the cardinalities of all dimensions plus 1.
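The difference in scale between the two totals can be made concrete with a small calculation. This is a sketch with hypothetical names and made-up cardinalities; it uses the fact that the conventional total stated in the Background equals (1 + a)(1 + b)(1 + c)(1 + d).

```python
from math import prod


def proposed_max_keys(cardinalities):
    """Maximum total with 1-dimensional keys only: 1 + (a + b + c + d)."""
    return 1 + sum(cardinalities)


def conventional_max_keys(cardinalities):
    """Maximum total of the conventional scheme in closed form: the product of (1 + c)."""
    return prod(1 + c for c in cardinalities)


# Hypothetical cardinalities for time, item, location and supplier.
cardinalities = [100, 50, 20, 10]
proposed = proposed_max_keys(cardinalities)
conventional = conventional_max_keys(cardinalities)
```

For these sample values the proposed scheme needs at most 181 keys where the conventional one may need more than a million, and the gap widens with every additional dimension.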
Another embodiment of the present invention provides a real-time data deduplication counting device, as shown in fig. 5, including:
the acquisition module is used for acquiring real-time data, dimension information, dimension combination information and measurement information;
the duplicate-judgment key generation module is used for obtaining the 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the measurement information, and generating duplicate-judgment keys;
and the counting module is used for performing Redis batch duplicate judgment on the duplicate-judgment keys by adopting a distributed lock mechanism to obtain a deduplication counting result of the real-time data.
Optionally, the counting module includes:
the distribution calculation unit is used for judging, for each piece of data in the real-time data, whether contention exists; if not, all the duplicate-judgment keys are sent to the Redis batch duplicate-judgment unit; if so, one piece of contended data is locked, its duplicate-judgment key is sent to the Redis batch duplicate-judgment unit, and the lock is released after processing; the other data are then polled to attempt locking, the duplicate-judgment key of each piece of data that is successfully locked is sent to the Redis batch duplicate-judgment unit, and the lock is released after processing; this repeats until the duplicate-judgment keys of all contended data have been sent to the Redis batch duplicate-judgment unit;
the Redis batch duplicate-judgment unit is used for checking in Redis, in batches, whether each duplicate-judgment key exists and, if not, adding 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with that duplicate-judgment key, to generate a pre-calculation result;
and the accumulation unit is used for accumulating the metric values of the same metric field value under the same dimension combination in the pre-calculation result obtained by the Redis batch duplicate-judgment unit, to generate the deduplication counting result of the real-time data.
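The interplay of the three units can be sketched in a single process as follows. This is an illustration under several labeled assumptions, not the patented implementation: a Python set stands in for Redis, `InMemoryLocks` stands in for a distributed lock service, a piece of data is treated as contended when another piece in the same batch carries the same measurement field value, a dimension combination is incremented once per piece of data if any of its 1-dimensional keys is new, and pre-calculation rows are grouped by dimension combination and dimension field values. The key layout (dimension code, date, measurement field value, dimension field value) follows the description; the `|` separator and all names are hypothetical.

```python
from collections import Counter, deque


class InMemoryLocks:
    """Single-process stand-in for a distributed lock service (hypothetical)."""

    def __init__(self):
        self._held = set()

    def try_lock(self, name):
        if name in self._held:
            return False
        self._held.add(name)
        return True

    def unlock(self, name):
        self._held.discard(name)


def batch_judge(records, dims, combos, measure, date, seen):
    """Check 1-dimensional keys against `seen`; emit +1 pre-calculation rows."""
    pre = []
    for rec in records:
        keys = {d: f"{d}|{date}|{rec[measure]}|{rec[d]}" for d in dims}
        new_dims = {d for d, key in keys.items() if key not in seen}
        seen.update(keys.values())
        for combo in combos:
            # Assumed rule: count a combination once if any of its dimensions is new.
            if new_dims.intersection(combo):
                pre.append(((combo, tuple(rec[d] for d in combo)), 1))
    return pre


def dispatch(records, dims, combos, measure, date, seen, locks):
    """Send uncontended data as one batch; lock contended data one at a time."""
    freq = Counter(rec[measure] for rec in records)
    free = [rec for rec in records if freq[rec[measure]] == 1]
    pending = deque(rec for rec in records if freq[rec[measure]] > 1)
    pre = batch_judge(free, dims, combos, measure, date, seen)
    while pending:
        rec = pending.popleft()
        name = f"lock|{rec[measure]}"
        if not locks.try_lock(name):
            pending.append(rec)  # lock is busy: poll this piece again later
            continue
        try:
            pre += batch_judge([rec], dims, combos, measure, date, seen)
        finally:
            locks.unlock(name)
    return pre


def accumulate(pre):
    """Sum the +1 rows that share a dimension combination and field values."""
    totals = Counter()
    for group, increment in pre:
        totals[group] += increment
    return dict(totals)


records = [
    {"time": "T1", "item": "I1", "user": "U1"},
    {"time": "T1", "item": "I1", "user": "U1"},  # exact repeat, adds nothing
    {"time": "T1", "item": "I2", "user": "U1"},
    {"time": "T1", "item": "I1", "user": "U2"},
]
combos = [("time",), ("item",), ("time", "item")]
result = accumulate(
    dispatch(records, ("time", "item"), combos, "user", "20190827", set(), InMemoryLocks())
)
```

In this example `result` counts 2 distinct users for time T1, 2 for item I1, 1 for item I2, 2 for (T1, I1) and 1 for (T1, I2). In a real deployment the set would be replaced by batched Redis lookups and the lock by a shared lock, both of which are outside the scope of this sketch.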
Optionally, the apparatus further comprises:
and the storage module is used for storing the deduplication counting result of the real-time data into a database.
Optionally, the storage module is further configured to convert the deduplication counting result into search fields and values in one-to-one correspondence, and to look up in turn whether the same search field already exists in the database; if so, the values corresponding to the same search field are added together and the sum is stored in the database, and if not, the search field and its corresponding value are stored in the database.
The real-time data deduplication counting device provided by the embodiment of the invention can accurately calculate the deduplication count value of a measurement of the real-time data, while the deduplication process occupies little memory. In the embodiment of the invention, all dimension combinations of the real-time data are decomposed into 1-dimensional dimension combinations, duplicate-judgment keys are generated from the 1-dimensional dimension combinations, and Redis batch duplicate judgment is performed on the duplicate-judgment keys under the control of a distributed lock. This ensures the uniqueness of the measurement field of the real-time data under each dimension combination, so that the duplicate judgment yields an accurate deduplication count value. Meanwhile, only the duplicate-judgment keys of the 1-dimensional dimension combinations need to be stored for duplicate judgment. Compared with existing deduplication methods, the number of duplicate-judgment keys is small, which frees a large amount of memory, and because the length of the duplicate-judgment keys is fixed, the Redis memory usage is easy to estimate.
All the embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant points may be found in the description of the method embodiment. Those skilled in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Some technical terms in the specification are explained as follows:
[1] Cube: A data set composed of data of different dimensions.
[2] Cuboid: The data aggregated under one combination of dimensions; the basic unit of a Cube.
[3] Redis: An open-source, log-type key-value database written in ANSI C that supports network access, can run in memory, and can also persist data.
[4] Kafka: a high throughput distributed publish-subscribe messaging system.
[5] MySQL: a relational database management system.
[6] HBase: a distributed, column-oriented, open-source database built on top of the Hadoop file system.
[7] Spark Streaming: A scalable, fault-tolerant stream-processing framework.

Claims (6)

1. A method for real-time data deduplication, comprising:
acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and measurement information;
obtaining the 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the measurement information, and generating duplicate-judgment keys;
performing Redis batch duplicate judgment on the duplicate-judgment keys by adopting a distributed lock mechanism to obtain a deduplication counting result of the real-time data;
wherein performing Redis batch duplicate judgment on the duplicate-judgment keys by adopting the distributed lock mechanism to obtain the deduplication counting result of the real-time data specifically comprises:
judging, for each piece of data in the real-time data, whether contention exists; if not, performing Redis batch duplicate judgment on all the duplicate-judgment keys; if so, locking one piece of contended data, performing Redis batch duplicate judgment on the duplicate-judgment key of that data, and releasing the lock after the judgment is finished; polling the other data to attempt locking, performing Redis batch duplicate judgment on the duplicate-judgment key of each piece of data that is successfully locked, and releasing the lock after the judgment is finished; and repeating until the duplicate-judgment keys of all contended data have completed batch duplicate judgment; wherein the Redis batch duplicate judgment specifically comprises: checking in Redis, in batches, whether the duplicate-judgment key exists and, if not, adding 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with the duplicate-judgment key, to generate a pre-calculation result;
accumulating the metric values of the same metric field value under the same dimension combination in the pre-calculation result to generate a de-duplication counting result of the real-time data;
the dimension information comprises dimension codes, the dimension combination information comprises dimension combination codes, and the measurement information comprises measurement codes;
the generating of the duplicate-judgment key specifically includes: concatenating the dimension code of the 1-dimensional dimension combination with the date, the measurement field value of the real-time data and the dimension field value of the real-time data, and taking the result as the duplicate-judgment key.
2. The method of claim 1, further comprising: and storing the deduplication counting result of the real-time data into a database.
3. The method according to claim 2, wherein storing the deduplication counting result of the real-time data in a database specifically comprises:
converting the deduplication counting result into search fields and values in one-to-one correspondence, and looking up in turn whether the same search field exists in the database; if so, adding the values corresponding to the same search field and storing the sum in the database; and if not, storing the search field and its corresponding value in the database.
4. A real-time data deduplication counting apparatus, comprising:
the acquisition module is used for acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and measurement information;
the duplicate-judgment key generation module is used for obtaining the 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the measurement information, and generating duplicate-judgment keys; the dimension information comprises dimension codes, the dimension combination information comprises dimension combination codes, and the measurement information comprises measurement codes; the generating of the duplicate-judgment key specifically includes: concatenating the dimension code of the 1-dimensional dimension combination with the date, the measurement field value of the real-time data and the dimension field value of the real-time data, and taking the result as the duplicate-judgment key;
the counting module is used for performing Redis batch duplicate judgment on the duplicate-judgment keys by adopting a distributed lock mechanism to obtain a deduplication counting result of the real-time data;
the counting module comprises:
the distribution calculating unit is used for judging, for each piece of data in the real-time data, whether contention exists; if not, all the duplicate-judgment keys are sent to the Redis batch duplicate-judgment unit; if so, one piece of contended data is locked, its duplicate-judgment key is sent to the Redis batch duplicate-judgment unit, and the lock is released after processing; the other data are then polled to attempt locking, the duplicate-judgment key of each piece of data that is successfully locked is sent to the Redis batch duplicate-judgment unit, and the lock is released after processing; this repeats until the duplicate-judgment keys of all contended data have been sent to the Redis batch duplicate-judgment unit;
the Redis batch duplicate-judgment unit is used for checking in Redis, in batches, whether each duplicate-judgment key exists and, if not, adding 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with that duplicate-judgment key, to generate a pre-calculation result;
and the accumulation unit is used for accumulating the metric values of the same metric field value under the same dimension combination in the pre-calculation result generated by the Redis batch duplicate-judgment unit, to generate a deduplication counting result of the real-time data.
5. The apparatus of claim 4, further comprising:
and the storage module is used for storing the deduplication counting result of the real-time data into a database.
6. The apparatus of claim 5, wherein the storage module is further configured to convert the deduplication counting result into search fields and values in one-to-one correspondence, and to look up in turn whether the same search field exists in the database; if so, to add the values corresponding to the same search field and store the sum in the database; and if not, to store the search field and its corresponding value in the database.
CN201910795939.4A 2019-08-27 2019-08-27 Real-time data deduplication counting method and device Active CN110569263B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201910795939.4A CN110569263B (en) 2019-08-27 2019-08-27 Real-time data deduplication counting method and device
CA3152844A CA3152844A1 (en) 2019-08-27 2020-06-24 Real-time data deduplication counting method and device
PCT/CN2020/097839 WO2021036452A1 (en) 2019-08-27 2020-06-24 Real-time data deduplication counting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910795939.4A CN110569263B (en) 2019-08-27 2019-08-27 Real-time data deduplication counting method and device

Publications (2)

Publication Number Publication Date
CN110569263A CN110569263A (en) 2019-12-13
CN110569263B true CN110569263B (en) 2022-11-22

Family

ID=68776261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910795939.4A Active CN110569263B (en) 2019-08-27 2019-08-27 Real-time data deduplication counting method and device

Country Status (3)

Country Link
CN (1) CN110569263B (en)
CA (1) CA3152844A1 (en)
WO (1) WO2021036452A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569263B (en) * 2019-08-27 2022-11-22 苏宁云计算有限公司 Real-time data deduplication counting method and device
CN112685445A (en) * 2020-12-29 2021-04-20 杭州旷云金智科技有限公司 Data query method and device, storage medium and electronic equipment
CN113609123B (en) * 2021-08-26 2023-06-02 四川效率源信息安全技术股份有限公司 HBase-based mass user data deduplication storage method and device
CN114265849B (en) * 2022-02-28 2022-06-10 杭州广立微电子股份有限公司 Data aggregation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104125163A (en) * 2013-04-25 2014-10-29 腾讯科技(深圳)有限公司 Data processing method, device and terminal
CN107665144A (en) * 2016-07-29 2018-02-06 北京京东尚科信息技术有限公司 The balance dispatching center of distributed task scheduling, mthods, systems and devices
CN108804242A (en) * 2018-05-23 2018-11-13 武汉斗鱼网络科技有限公司 A kind of data counts De-weight method, system, server and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376020B (en) * 2013-08-16 2019-01-29 腾讯科技(深圳)有限公司 The processing method and processing device of multidimensional data
CN106682004A (en) * 2015-11-06 2017-05-17 网宿科技股份有限公司 Redis Key management method and system
CN105893421A (en) * 2015-12-02 2016-08-24 乐视网信息技术(北京)股份有限公司 UV calculation method and apparatus
CN107665241B (en) * 2017-09-07 2020-09-29 北京京东尚科信息技术有限公司 Real-time data multi-dimensional duplicate removal method and device
CN108334554B (en) * 2017-12-29 2021-10-01 上海跬智信息技术有限公司 Novel OLAP pre-calculation model and construction method
US11010380B2 (en) * 2018-02-13 2021-05-18 International Business Machines Corporation Minimizing processing using an index when non-leading columns match an aggregation key
CN109816536B (en) * 2018-12-14 2023-08-25 中国平安财产保险股份有限公司 List deduplication method, device and computer equipment
CN109800225A (en) * 2018-12-24 2019-05-24 北京奇艺世纪科技有限公司 Acquisition methods, device, server and the computer readable storage medium of operational indicator
CN110569263B (en) * 2019-08-27 2022-11-22 苏宁云计算有限公司 Real-time data deduplication counting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SS-dedup: A high throughput stateful data routing algorithm for cluster deduplication system;Zhihao Huang 等;《2016 IEEE International Conference on Big Data (Big Data)》;20170206;第2991-2992页 *

Also Published As

Publication number Publication date
WO2021036452A1 (en) 2021-03-04
CA3152844A1 (en) 2021-03-04
CN110569263A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN110569263B (en) Real-time data deduplication counting method and device
Pandey et al. A general-purpose counting filter: Making every bit count
Shen et al. Mining frequent graph patterns with differential privacy
US8782219B2 (en) Automated discovery of template patterns based on received server requests
US7478083B2 (en) Method and system for estimating cardinality in a database system
US8370326B2 (en) System and method for parallel computation of frequency histograms on joined tables
EP3101563A1 (en) Automated determination of network motifs
CN105517644B (en) Data partitioning method and equipment
CN109325062B (en) Data dependency mining method and system based on distributed computation
WO2014021978A4 (en) Aggregating data in a mediation system
Maree et al. Real-valued evolutionary multi-modal optimization driven by hill-valley clustering
Mo et al. Cleaning uncertain data for top-k queries
Whitman et al. Spatio-temporal join on apache spark
US10877973B2 (en) Method for efficient one-to-one join
CN108073641B (en) Method and device for querying data table
US20210248142A1 (en) Dual filter histogram optimization
CN107203550B (en) Data processing method and database server
Ganguly et al. Deterministic k-set structure
US20150213086A1 (en) System and method for determining an index of an object in a sequence of objects
EP2657862B1 (en) Parallel set aggregation
CN110096529B (en) Network data mining method and system based on multidimensional vector data
CN117076460A (en) Data processing method and device, electronic equipment and storage medium
Liu Database redundant attribute detection using fractal dimension
Dong et al. Multi-dimensional data analysis technology of business application system based on Spark framework
CN113032400A (en) High-performance TopN query method, system and medium for mass data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210042

Patentee after: Jiangsu Suning cloud computing Co.,Ltd.

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210042

Patentee before: Suning Cloud Computing Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240112

Address after: 210000, 1-5 story, Jinshan building, 8 Shanxi Road, Nanjing, Jiangsu.

Patentee after: SUNING.COM Co.,Ltd.

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210042

Patentee before: Jiangsu Suning cloud computing Co.,Ltd.
