Disclosure of Invention
The embodiments of the invention provide a method and a device for real-time data deduplication counting, which are used to solve the problem that existing deduplication methods occupy a large amount of Redis memory whose usage is difficult to estimate.
In order to achieve the above purpose, the embodiments of the invention adopt the following technical solutions:
In a first aspect, an embodiment of the present invention provides a real-time data deduplication counting method, including:
acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and metric information;
decomposing all dimension combinations of the real-time data into 1-dimensional dimension combinations according to the dimension information, the dimension combination information and the metric information, and generating deduplication keys;
and performing a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication counting result of the real-time data.
With reference to the first aspect, in a first possible implementation manner of the first aspect, performing the Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication counting result of the real-time data specifically includes:
judging whether contention exists among the pieces of data in the real-time data; if not, performing the Redis batch duplicate check on all the deduplication keys; if so, locking one piece of contending data, performing the Redis batch duplicate check on the deduplication keys of that piece of data, and releasing the lock after processing, while the other pieces of data poll to acquire the lock; the deduplication keys of whichever piece of data acquires the lock are then checked and the lock is released after processing, and this cycle repeats until the deduplication keys of all contending data have been checked; wherein the Redis batch duplicate check specifically comprises checking in Redis, in batches, whether each deduplication key exists, and if not, adding 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with that deduplication key to generate a pre-calculation result;
and accumulating the metric values of the same metric under the same dimension combination in the pre-calculation result to generate the deduplication counting result of the real-time data.
With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the method further includes: storing the deduplication counting result of the real-time data into a database.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, storing the deduplication counting result of the real-time data into a database specifically includes:
converting the deduplication counting result into lookup fields and values in one-to-one correspondence, and searching the database for each lookup field in turn; if the same lookup field already exists, adding the new value to the stored value and storing the sum in the database; if not, storing the lookup field and its corresponding value in the database.
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the dimension information includes a dimension code, the dimension combination information includes a dimension combination code, and the metric information includes a metric code.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, generating the deduplication key specifically includes concatenating the dimension code of a 1-dimensional dimension combination with the current date, the metric field value of the real-time data and the dimension field value of the real-time data to form the deduplication key.
In a second aspect, an embodiment of the present invention provides a real-time data deduplication counting device, including:
an acquisition module, configured to acquire real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and metric information;
a deduplication key generation module, configured to obtain the 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the metric information, and to generate deduplication keys;
and a counting module, configured to perform a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication counting result of the real-time data.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the counting module includes:
a distribution calculation unit, configured to judge whether contention exists among the pieces of data in the real-time data; if not, to send all the deduplication keys to the Redis batch duplicate check unit; if so, to lock one piece of contending data, send the deduplication keys of that piece of data to the Redis batch duplicate check unit, and release the lock after processing, while the other pieces of data poll to acquire the lock; the deduplication keys of whichever piece of data acquires the lock are then sent to the Redis batch duplicate check unit and the lock is released after processing, and this cycle repeats until the deduplication keys of all contending data have been sent to the Redis batch duplicate check unit;
a Redis batch duplicate check unit, configured to check in Redis, in batches, whether each deduplication key exists, and if not, to add 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with that deduplication key to generate a pre-calculation result;
and an accumulation unit, configured to accumulate the metric values of the same metric under the same dimension combination in the pre-calculation result obtained by the Redis batch duplicate check unit to generate the deduplication counting result of the real-time data.
With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the device further includes:
a storage module, configured to store the deduplication counting result of the real-time data into a database.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the storage module is further configured to convert the deduplication counting result into lookup fields and values in one-to-one correspondence and to search the database for each lookup field in turn; if the same lookup field already exists, the new value is added to the stored value and the sum is stored in the database; if not, the lookup field and its corresponding value are stored in the database.
The method and device for real-time data deduplication counting provided by the embodiments of the invention can accurately calculate the deduplicated count of a metric over real-time data, while the deduplication process occupies little memory. In the embodiments of the invention, all dimension combinations of the real-time data are decomposed into 1-dimensional dimension combinations, deduplication keys are generated from the 1-dimensional dimension combinations, and Redis [3] batch deduplication is performed on the deduplication keys under the control of a distributed lock. This guarantees the uniqueness of the metric field of the real-time data under each dimension combination, so that the duplicate check yields an accurate deduplication count value for the real-time data. Moreover, only the deduplication keys of the 1-dimensional dimension combinations need to be stored for the duplicate check. Compared with existing deduplication methods, there are fewer deduplication keys, which frees a large amount of memory, and because the key length is fixed, Redis memory usage is easy to estimate.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present invention and are not construed as limiting the present invention. It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The method flow of the embodiment of the present invention is specifically described below with reference to fig. 2.
The embodiment of the invention provides a real-time data deduplication counting method, which comprises the following steps:
S10, acquiring real-time data and counting information, wherein the counting information comprises dimension information, dimension combination information and metric information.
According to one embodiment of the invention, Kafka [4] is used as the data source and real-time data is acquired from Kafka in batches. Each acquired batch of real-time data comprises multiple pieces of data, each of which includes at least a date, a metric field value and a dimension field value.
MySQL [5] is queried to obtain the counting information, which includes dimension information, dimension combination information and metric information. Dimensions are the non-statistical items in the statistical data, such as the fields large area and category; the corresponding dimension combinations are the large-area Cuboid, the category Cuboid and the large-area-category Cuboid. Metrics are the statistical items in the statistical data, such as the number of members, the number of visits and the number of orders.
In some embodiments, to facilitate record keeping throughout the method flow, each dimension is given a corresponding dimension code, each dimension combination a corresponding dimension combination code, and each metric a corresponding metric code. The dimension information includes the dimension codes, the dimension combination information includes the dimension combination codes, and the metric information includes the metric codes.
S20, obtaining all 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the metric information, and generating deduplication keys.
All Cuboids of each piece of data in the batch of real-time data are traversed, each Cuboid is decomposed into one or more 1-dimensional Cuboids, and one deduplication key is generated for each distinct 1-dimensional Cuboid. For example, if two dimensions A and B are selected for a metric, the Cube formed by the real-time data has the two dimensions A and B and 4 dimension combinations: one 0-dimensional Cuboid, two 1-dimensional Cuboids (the A Cuboid and the B Cuboid), and one 2-dimensional Cuboid (the A-B Cuboid). The A-B Cuboid is decomposed into the A Cuboid and the B Cuboid; the A Cuboid generates one deduplication key and the B Cuboid generates another, so each piece of data generates two deduplication keys. In actual processing, only the A Cuboid and the B Cuboid need to be selected: the A Cuboid generates deduplication key1 and the B Cuboid generates deduplication key2. The A Cuboid, together with the A-B Cuboid that decomposes into it, is associated with key1, and the B Cuboid, together with the A-B Cuboid that decomposes into it, is associated with key2. With only key1 and key2, all Cuboids of the data can be deduplicated.
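The decomposition and association described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names all_cuboids and associate are hypothetical, and a Cuboid is modeled simply as a frozenset of dimension names.

```python
from itertools import combinations

def all_cuboids(dimensions):
    """Enumerate every dimension combination (Cuboid), including the 0-dimensional one."""
    dims = sorted(dimensions)
    return [frozenset(c) for n in range(len(dims) + 1) for c in combinations(dims, n)]

def associate(cuboids):
    """Map each 1-dimensional Cuboid to every Cuboid that decomposes into it."""
    association = {}
    for cuboid in cuboids:
        for dim in cuboid:  # each member dimension is one 1-dimensional Cuboid
            association.setdefault(dim, []).append(cuboid)
    return association

cuboids = all_cuboids(["A", "B"])
association = associate(cuboids)
print(len(cuboids))         # 4 dimension combinations for two dimensions
print(sorted(association))  # ['A', 'B']: only two deduplication keys are needed
```

Here association["A"] holds the A Cuboid and the A-B Cuboid, which mirrors the statement that both are associated with key1.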
According to one embodiment of the invention, the dimension code of the 1-dimensional Cuboid is concatenated with the date, the metric field value of the real-time data and the dimension field value of the real-time data to form the deduplication key.
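As a small sketch of this concatenation, the following Python function builds one deduplication key. The function name build_dedup_key is hypothetical, and the field order and the "-" separator are assumptions taken from the example keys that appear in the worked example later in this description (for instance 20190520-52001-WD1020-10001).

```python
def build_dedup_key(date, metric_value, dim_code, dim_value, sep="-"):
    """Concatenate date, metric field value, dimension code and dimension field value."""
    return sep.join([date, metric_value, dim_code, dim_value])

# Values of order 1905200101 from the worked example: member 52001, large area 10001.
key = build_dedup_key("20190520", "52001", "WD1020", "10001")
print(key)  # 20190520-52001-WD1020-10001
```

Because every key has exactly these four parts, its length does not grow with the number of dimensions in a Cuboid.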
In the conventional deduplication method, the topic, consumer group, date, metric field value, dimension combination code and dimension values are concatenated to form the deduplication key. Because a Cuboid is a combination of multiple dimensions, the key length is not fixed, and each Cuboid has its own group of deduplication keys, which makes Redis memory usage hard to estimate. The embodiment of the invention stores only the deduplication keys of 1-dimensional Cuboids. Their length is fixed, so the key explosion caused by combining the different dimension values of a Cuboid does not occur. Compared with the conventional method, the keys are shorter and fewer, occupy less memory, and their memory usage can be estimated.
S30, performing a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication counting result of the real-time data.
According to one embodiment of the invention, whether contention exists among the pieces of data in the real-time data is judged. If not, the deduplication keys of all the data are checked for duplicates. If so, one piece of contending data is locked, its deduplication keys are checked, and the lock is released after processing; the other pieces of data poll to acquire the lock, the deduplication keys of whichever piece acquires it are checked, and the lock is released after processing. This cycle repeats until the deduplication keys of all contending data have been checked.
Whether contention exists among the pieces of real-time data is judged as follows: a lock key is generated for each piece of data by concatenating the date with the metric field value. If the lock keys of all pieces of data differ, no contention exists among the data of the batch; if some lock keys are identical, contention exists among the pieces of data that share a lock key.
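The contention test can be illustrated with a short Python sketch. The names lock_key and split_by_contention, the record layout, and the "-" separator in the lock key are assumptions made for illustration; only the rule (date concatenated with metric field value, identical lock keys mean contention) comes from the description.

```python
from collections import defaultdict

def lock_key(record):
    """Lock key: date concatenated with the metric field value (separator assumed)."""
    return f"{record['date']}-{record['member']}"

def split_by_contention(records):
    """Separate records with a unique lock key from groups sharing a lock key."""
    groups = defaultdict(list)
    for record in records:
        groups[lock_key(record)].append(record)
    free = [rs[0] for rs in groups.values() if len(rs) == 1]
    contended = {k: rs for k, rs in groups.items() if len(rs) > 1}
    return free, contended

batch = [
    {"order": "1905200101", "date": "20190520", "member": "52001"},
    {"order": "1905200201", "date": "20190520", "member": "52002"},
    {"order": "1905200202", "date": "20190520", "member": "52002"},
    {"order": "1905200301", "date": "20190520", "member": "52003"},
]
free, contended = split_by_contention(batch)
print([r["order"] for r in free])  # ['1905200101', '1905200301']
print(list(contended))             # ['20190520-52002']
```

Records in free can be checked directly, while each group in contended must be serialized through the lock.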
The Redis batch duplicate check specifically includes checking in Redis, in batches, whether each deduplication key exists, and if not, adding 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with that deduplication key.
The metric values of the same metric under the same dimension combination of the real-time data are then accumulated to obtain the deduplication count value under each dimension combination, which gives the deduplication counting result of the real-time data.
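A minimal sketch of the duplicate check for one piece of data follows. It makes several assumptions that are not stated in the source: a Python set stands in for the keys stored in Redis (with a real Redis server, the existence test and the insert could be combined in a single SET command with the NX option), the name check_record and the record layout are hypothetical, and a Cuboid is treated as non-repeated when at least one of its 1-dimensional deduplication keys is new, which is how the worked example later in this description handles order 1905200202.

```python
def check_record(seen_keys, record, dim_codes, cuboids):
    """Return the codes of the Cuboids under which this record's metric value is new.

    seen_keys : set standing in for the deduplication keys stored in Redis
    dim_codes : {dimension name: dimension code}
    cuboids   : {Cuboid code: tuple of dimension names}
    """
    new_dims = set()
    for dim, code in dim_codes.items():
        key = f"{record['date']}-{record['member']}-{code}-{record[dim]}"
        if key not in seen_keys:  # key absent: value not yet seen for this dimension
            seen_keys.add(key)
            new_dims.add(dim)
    # Assumption: a Cuboid is non-repeated if any of its 1-dimensional keys was new.
    return [c for c, dims in cuboids.items() if new_dims & set(dims)]

dim_codes = {"area": "WD1020", "category": "WD1029"}
cuboids = {"VV254": ("area",), "VV151": ("category",), "VV153": ("area", "category")}
seen = set()
first = check_record(seen, {"date": "20190520", "member": "52002",
                            "area": "10001", "category": "00001"}, dim_codes, cuboids)
second = check_record(seen, {"date": "20190520", "member": "52002",
                             "area": "10001", "category": "00002"}, dim_codes, cuboids)
print(first)   # ['VV254', 'VV151', 'VV153']
print(second)  # ['VV151', 'VV153']
```

Each returned Cuboid code corresponds to one pre-calculation entry with metric value 1; accumulating those entries per Cuboid and dimension value yields the deduplication count.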
The duplicate check is performed with the deduplication keys of the 1-dimensional Cuboids. If several pieces of data with the same metric field value were processed concurrently by different threads, the duplicate check could give a wrong result. The embodiment of the invention uses a distributed lock mechanism to control concurrency and thereby prevents this, so deduplication is achieved by storing only the dimension values of the 1-dimensional Cuboids, which frees a large amount of memory.
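The lock, process, release and poll cycle can be sketched as follows. This is an in-process simulation only: the LockTable class is a hypothetical stand-in for a distributed lock service, threads stand in for distributed workers, and the polling interval is an arbitrary choice.

```python
import threading
import time

class LockTable:
    """In-process stand-in for a distributed lock service (hypothetical helper)."""

    def __init__(self):
        self._held = set()
        self._guard = threading.Lock()

    def try_lock(self, key):
        """Acquire the lock for key if it is free; return whether it was acquired."""
        with self._guard:
            if key in self._held:
                return False
            self._held.add(key)
            return True

    def unlock(self, key):
        with self._guard:
            self._held.discard(key)

def process_with_lock(locks, key, work, poll_interval=0.001):
    """Poll until the lock for key is acquired, run work, then release the lock."""
    while not locks.try_lock(key):
        time.sleep(poll_interval)
    try:
        return work()
    finally:
        locks.unlock(key)

seen, outcomes = set(), []

def check():
    is_new = "k" not in seen  # duplicate check
    time.sleep(0.005)         # widen the window in which a race could occur
    seen.add("k")
    outcomes.append(is_new)

locks = LockTable()
workers = [threading.Thread(target=process_with_lock,
                            args=(locks, "20190520-52002", check)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(outcomes.count(True))  # 1: only one worker sees the key as new
```

Without the lock, both workers could read the key as absent before either writes it, and the same buyer would be counted twice.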
According to an embodiment of the invention, the method further comprises:
S40, storing the deduplication counting result into a database.
Specifically, the deduplication counting result is converted into lookup fields and values in one-to-one correspondence. The database is searched for each lookup field in turn: if the same lookup field already exists, the new value is added to the stored value and the sum is stored in the database; if not, the lookup field and its value are stored in the database.
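The storage rule amounts to an add-or-insert merge, sketched below. A Python dict stands in for the database, and the function name merge_into_store and the sample lookup fields are illustrative assumptions.

```python
def merge_into_store(store, results):
    """Add each value to the stored value for the same lookup field, or insert it."""
    for field, value in results.items():
        if field in store:
            store[field] += value  # field exists: sum and store
        else:
            store[field] = value   # field absent: store directly
    return store

store = {"VV254,20190520-10001": 2}  # counts stored by an earlier batch (illustrative)
merge_into_store(store, {"VV254,20190520-10001": 1, "VV254,20190520-10002": 1})
print(store)  # {'VV254,20190520-10001': 3, 'VV254,20190520-10002': 1}
```

Summing is valid here because each batch contributes only metric values that passed the duplicate check, so the same value is never added twice.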
According to the embodiment of the invention, the deduplication counting result of the real-time data is stored in the database, so that the deduplication counting result can be inquired quickly in real time.
The method provided by the embodiment of the invention can specifically run in the system shown in Fig. 3, which uses Kafka as the data source, HBase [6] as the storage engine, and Spark Streaming [7] as the computing engine that executes the method flow provided by the embodiment of the invention.
The following describes a specific process of the method according to the embodiment of the present invention, taking as an example the calculation of the metric "number of buyers" over the two dimensions large area and category.
101 Real-time data is acquired from Kafka in batches; the acquired batch of order data is shown in Table 1:
TABLE 1
Order number | Order line number | Date of payment | Member number | Large area | Category
19052001 | 1905200101 | 20190520 | 52001 | 10001 | 00001
19052002 | 1905200201 | 20190520 | 52002 | 10001 | 00001
19052002 | 1905200202 | 20190520 | 52002 | 10001 | 00002
19052003 | 1905200301 | 20190520 | 52003 | 10002 | 00001
The above data is the real-time order data participating in the calculation of the number of buyers on May 20, 2019. A field value of 10001 under "large area" indicates the Nanjing large area and 10002 indicates the Wuxi large area; a field value of 00001 under the category field indicates air conditioners and 00002 indicates refrigerators and washing machines.
102 The configuration information in MySQL is queried to obtain the dimension information: large area (code: WD1020) and category (code: WD1029); the dimension combination information: large-area Cuboid (code: VV254), category Cuboid (code: VV151) and large-area-category Cuboid (code: VV153); and the metric information: number of buyers (code: ZB_YY_0001, SJ_01). In this example the metric to be calculated is the number of buyers, which corresponds to the member number in Table 1; that is, the statistical item is the member number in Table 1.
201 According to the dimension information, the dimension combination information and the metric information, the deduplication keys of each order in the batch of real-time data are generated.
Specifically, the order data in Table 1 has 3 Cuboids: the large-area Cuboid (code: VV254), the category Cuboid (code: VV151) and the large-area-category Cuboid (code: VV153). The large-area-category Cuboid is decomposed into the large-area Cuboid and the category Cuboid. The dimension code of the large-area Cuboid is concatenated with the current date, the metric field value and the dimension field value to form one deduplication key, and the dimension code of the category Cuboid is concatenated with the current date, the metric field value and the dimension field value to form another, so each order generates two deduplication keys, as shown in Table 2. Here the metric field value is the field value under "member number" in Table 1, and the dimension field value is the field value under "large area" or the category field in Table 1.
TABLE 2
202 A lock key is generated for each order in the batch of real-time data. Specifically, the current date is concatenated with the metric field value to form the lock key, as shown in Table 3.
TABLE 3
203 All the deduplication keys are submitted to Redis in batches and deduplicated concurrently to obtain the pre-calculation results. A pre-calculation result comprises a deduplication result key and the corresponding deduplication result: the dimension combination code concatenated with the date and the dimension field values forms the deduplication result key, and the metric code together with the metric value forms the deduplication result.
Specifically, whether contention exists among the four orders of the batch of real-time data is judged from the lock keys of the four orders shown in Table 3. The lock keys of order 1905200101 and order 1905200301 are unique, so these two orders are not in contention and are checked for duplicates directly. The lock keys of order 1905200201 and order 1905200202 are identical, so these two orders are in contention. As shown in Fig. 4, order 1905200201 is locked and checked for duplicates, and the lock is released when processing finishes. Order 1905200202 polls to acquire the lock and is checked for duplicates once locking succeeds.
The detailed steps of the duplicate check are as follows. Taking order 1905200101 as an example, the deduplication key 20190520-52001-WD1020-10001 is found not to exist, i.e., the metric field value (member number) is not repeated under the large-area Cuboid, so there is one buyer from the large-area perspective, the corresponding metric value is 1, and the pre-calculation result ("VV254,20190520-10001", {"ZB_YY_0001,SJ_01": 1}) is generated. Next, the deduplication key 20190520-52001-WD1029-00001 is found not to exist, i.e., the metric field value (member number) is not repeated under the category Cuboid, so there is one buyer from the category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV151,20190520-00001", {"ZB_YY_0001,SJ_01": 1}) is generated. Since the metric field value (member number) is repeated under neither the large-area Cuboid nor the category Cuboid, it is not repeated under the large-area-category Cuboid either; there is one buyer from the large-area-category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV153,20190520-10001-00001", {"ZB_YY_0001,SJ_01": 1}) is generated. The pre-calculation results of order 1905200201 and order 1905200202 are obtained in the same way.
When order 1905200202 is checked for duplicates, the deduplication key 20190520-52002-WD1020-10001 is found to exist, i.e., the metric field value (member number) is repeated under the large-area Cuboid. Next, the deduplication key 20190520-52002-WD1029-00002 is found not to exist, i.e., the metric field value (member number) is not repeated under the category Cuboid, so there is one buyer from the category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV151,20190520-00002", {"ZB_YY_0001,SJ_01": 1}) is generated. Because the metric field value is not repeated under the category Cuboid, it is not repeated under the large-area-category Cuboid either; there is one buyer from the large-area-category perspective, the corresponding metric value is 1, and the pre-calculation result ("VV153,20190520-10001-00002", {"ZB_YY_0001,SJ_01": 1}) is generated.
The pre-calculation results obtained for the real-time order data in Table 1 are shown in Table 4.
TABLE 4
204 In the pre-calculation results, the metric values corresponding to the same deduplication result key are accumulated. Specifically, the metric values of all entries with the key "VV254,20190520-10001" in Table 4 are accumulated, giving a deduplication count of 2. Likewise, the deduplication count of "VV254,20190520-10002" is 1, that of "VV151,20190520-00001" is 3, that of "VV151,20190520-00002" is 1, that of "VV153,20190520-10001-00001" is 2, that of "VV153,20190520-10001-00002" is 1, and that of "VV153,20190520-10002-00001" is 1.
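The accumulated counts of step 204 can be reproduced from the rows of Table 1 with a short Python sketch. It is a minimal illustration rather than the implementation of the embodiment: a Python set stands in for Redis, the orders are processed one at a time so no lock is modeled, and the rule that a Cuboid counts as non-repeated when at least one of its 1-dimensional deduplication keys is new is an assumption inferred from the handling of order 1905200202 above.

```python
from collections import Counter

DATE = "20190520"
ORDERS = [  # (member number, large area, category) for the four rows of Table 1
    ("52001", "10001", "00001"),
    ("52002", "10001", "00001"),
    ("52002", "10001", "00002"),
    ("52003", "10002", "00001"),
]

seen = set()        # stand-in for the deduplication keys held in Redis
counts = Counter()  # deduplication result key -> accumulated metric value

for member, area, category in ORDERS:
    area_key = f"{DATE}-{member}-WD1020-{area}"
    cat_key = f"{DATE}-{member}-WD1029-{category}"
    new_area, new_cat = area_key not in seen, cat_key not in seen
    seen.update({area_key, cat_key})
    if new_area:
        counts[f"VV254,{DATE}-{area}"] += 1
    if new_cat:
        counts[f"VV151,{DATE}-{category}"] += 1
    if new_area or new_cat:  # assumed rule for the 2-dimensional Cuboid
        counts[f"VV153,{DATE}-{area}-{category}"] += 1

for key in sorted(counts):
    print(key, counts[key])
```

Running the sketch gives 2 and 1 for the two large areas, 3 and 1 for the two categories, and 2, 1 and 1 for the three large-area-category pairs, which agrees with the counts listed in step 204.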
301 The above accumulation results are stored in HBase.
Specifically, the dimension combination code concatenated with the metric code, the current date and the dimension field values is mapped to the RowKey of HBase, and the deduplication count value is mapped to the Value of HBase, as shown in Table 5. HBase is then queried in batches by RowKey: if the RowKey already exists, the Values are summed and the sum is stored; if not, the Value is stored directly.
TABLE 5
The real-time data deduplication counting method provided by the embodiment of the invention can accurately calculate the deduplicated count of a metric over real-time data, while the deduplication process occupies little memory. In the embodiment of the invention, all dimension combinations of the real-time data are decomposed into 1-dimensional dimension combinations, deduplication keys are generated from the 1-dimensional dimension combinations, and Redis batch deduplication is performed on the deduplication keys under the control of a distributed lock. This guarantees the uniqueness of the metric field of the real-time data under each dimension combination, so that the duplicate check yields an accurate deduplication count value for the real-time data. Moreover, only the deduplication keys of the 1-dimensional dimension combinations need to be stored for the duplicate check. Compared with existing deduplication methods, there are fewer deduplication keys, which frees a large amount of memory, and because the key length is fixed, Redis memory usage can be estimated. Taking Fig. 1 as an example, let a, b, c and d be the cardinalities of the four dimensions. With the deduplication counting method of the embodiment of the invention, the maximum number of deduplication keys for the combination (time, item) is a + b, for the combination (time, item, location) it is a + b + c, and for the combination (time, item, location, summary) it is a + b + c + d. The 16 dimension combinations need only the 0-dimensional and 1-dimensional Cuboids, and the maximum total number of deduplication keys is 1 + (a + b + c + d), i.e., the sum of the cardinalities of all dimensions plus 1.
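The bound stated above is simple arithmetic, shown here as a short Python sketch. The function name max_dedup_keys and the sample cardinalities 3, 4, 5 and 6 are hypothetical values chosen only for illustration.

```python
def max_dedup_keys(cardinalities):
    """Upper bound on deduplication keys: sum of dimension cardinalities plus 1."""
    return 1 + sum(cardinalities)

cardinalities = [3, 4, 5, 6]          # illustrative values for a, b, c, d
print(2 ** len(cardinalities))        # 16 dimension combinations for four dimensions
print(max_dedup_keys(cardinalities))  # 19 = 1 + (3 + 4 + 5 + 6)
```

The bound grows with the sum of the cardinalities rather than with the number of dimension combinations, which is why the memory requirement can be estimated in advance.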
Another embodiment of the present invention provides a real-time data deduplication device, as shown in fig. 5, including:
an acquisition module, configured to acquire real-time data, dimension information, dimension combination information and metric information;
a deduplication key generation module, configured to obtain the 1-dimensional dimension combinations of the real-time data according to the dimension information, the dimension combination information and the metric information, and to generate deduplication keys;
and a counting module, configured to perform a Redis batch duplicate check on the deduplication keys under a distributed lock mechanism to obtain the deduplication counting result of the real-time data.
Optionally, the counting module includes:
a distribution calculation unit, configured to judge whether contention exists among the pieces of data in the real-time data; if not, to send all the deduplication keys to the Redis batch duplicate check unit; if so, to lock one piece of contending data, send the deduplication keys of that piece of data to the Redis batch duplicate check unit, and release the lock after processing, while the other pieces of data poll to acquire the lock; the deduplication keys of whichever piece of data acquires the lock are then sent to the Redis batch duplicate check unit and the lock is released after processing, and this cycle repeats until the deduplication keys of all contending data have been sent to the Redis batch duplicate check unit;
a Redis batch duplicate check unit, configured to check in Redis, in batches, whether each deduplication key exists, and if not, to add 1 to the metric value under each non-repeated dimension combination among all the dimension combinations associated with that deduplication key to generate a pre-calculation result;
and an accumulation unit, configured to accumulate the metric values of the same metric under the same dimension combination in the pre-calculation result obtained by the Redis batch duplicate check unit to generate the deduplication counting result of the real-time data.
Optionally, the apparatus further comprises:
a storage module, configured to store the deduplication counting result of the real-time data into a database.
Optionally, the storage module is further configured to convert the deduplication counting result into lookup fields and values in one-to-one correspondence and to search the database for each lookup field in turn; if the same lookup field already exists, the new value is added to the stored value and the sum is stored in the database; if not, the lookup field and its corresponding value are stored in the database.
The real-time data deduplication counting device provided by the embodiment of the invention can accurately calculate the deduplicated count of a metric over real-time data, while the deduplication process occupies little memory. In the embodiment of the invention, all dimension combinations of the real-time data are decomposed into 1-dimensional dimension combinations, deduplication keys are generated from the 1-dimensional dimension combinations, and Redis batch deduplication is performed on the deduplication keys under the control of a distributed lock. This guarantees the uniqueness of the metric field of the real-time data under each dimension combination, so that the duplicate check yields an accurate deduplication count value for the real-time data. Moreover, only the deduplication keys of the 1-dimensional dimension combinations need to be stored for the duplicate check. Compared with existing deduplication methods, there are fewer deduplication keys, which frees a large amount of memory, and because the key length is fixed, Redis memory usage is easy to estimate.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Some technical terms in the specification are explained as follows:
[1] Cube: a data set composed of data of different dimensions.
[2] Cuboid: data aggregated under each combination of dimensions, the base unit of Cube.
[3] Redis: an open-source, log-type key-value database written in ANSI C that supports networking and can run in memory or with persistence.
[4] Kafka: a high throughput distributed publish-subscribe messaging system.
[5] MySQL: a relational database management system.
[6] HBase: a distributed, column-oriented, open-source database built on top of the Hadoop file system.
[7] Spark Streaming: a scalable, fault-tolerant stream processing framework.