CN111177137A

CN111177137A - Data deduplication method, device, equipment and storage medium

Info

Publication number: CN111177137A
Application number: CN201911395228.4A
Authority: CN
Inventors: 叶伟成
Original assignee: Guangzhou Kugou Computer Technology Co Ltd
Current assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2020-05-19
Anticipated expiration: 2039-12-30
Also published as: CN111177137B

Abstract

The application discloses a data deduplication method, apparatus, device, and storage medium, and belongs to the technical field of data processing. The method comprises the following steps: receiving a real-time data stream; determining a data attribute of currently received first data, the first data belonging to the real-time data stream; and, if the current deduplication cycle corresponding to the data attribute has started, storing the first data into the data set corresponding to the current deduplication cycle when that data set does not include the first data. The starting time point of the current deduplication cycle is the time point of storing first data in the data set, or the ending time point of the previous deduplication cycle corresponding to the data set. If the ending time point of the current deduplication cycle is reached, the data stored in the data set is determined as first deduplication data belonging to the data attribute in the current deduplication cycle, and second deduplication data of the data attribute in the real-time data stream is determined based on the first deduplication data belonging to the data attribute determined in multiple deduplication cycles. In this manner, data deduplication may be achieved.

Description

Data deduplication method, device, equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for data deduplication.

Background

In the field of data processing technology, to make data processing simpler and more convenient, data may first be deduplicated so that duplicate data is removed, and the deduplicated data is then processed. This reduces the volume of data to be processed and the computation pressure on the device, and also reduces the storage space used and the storage pressure on the device. Data deduplication is therefore important for data processing, and how to deduplicate data has become a problem that urgently needs to be solved.

Disclosure of Invention

The application provides a data deduplication method, a data deduplication device, data deduplication equipment and a storage medium, which can solve the problem of how to perform deduplication processing on data in the related art. The technical scheme is as follows:

in one aspect, a method for data deduplication is provided, where the method includes:

receiving a real-time data stream;

determining a data attribute of currently received first data, wherein the first data belongs to the real-time data stream;

if the current deduplication cycle corresponding to the data attribute is started, storing the first data into the data set when the data set corresponding to the current deduplication cycle does not include the first data, wherein the starting time point of the current deduplication cycle is the time point of storing the first data into the data set, or the ending time point of the previous deduplication cycle corresponding to the data set, and the cycle duration is a specified threshold;

if the ending time point of the current deduplication cycle is reached, determining the stored data in the data set as first deduplication data belonging to the data attribute in the current deduplication cycle;

determining second deduplication data of a data attribute in the real-time data stream based on first deduplication data belonging to the data attribute determined over a plurality of deduplication cycles.

In a possible implementation manner of the present application, after determining the data attribute of the currently received first data, the method further includes:

and if the current deduplication cycle corresponding to the data attribute is not started, storing the first data into the data set, taking the current time point as the starting time point of the current deduplication cycle corresponding to the data set, and starting timing for the current deduplication cycle.

In a possible implementation manner of the present application, the determining second deduplication data of the data attribute in the real-time data stream based on first deduplication data belonging to the data attribute determined in multiple deduplication cycles includes:

after the current deduplication period is finished, acquiring first deduplication data in a specified number of deduplication periods corresponding to the data attributes before the current time point;

deleting duplicate data from the first deduplication data in the specified number of deduplication cycles to obtain third deduplication data within a target duration corresponding to the current deduplication cycle, wherein the target duration is the cycle duration multiplied by the specified number;

and when the current deduplication cycle is the last deduplication cycle, determining third deduplication data within the target duration corresponding to the multiple deduplication cycles as second deduplication data of the data attribute in the real-time data stream.

In a possible implementation manner of the present application, after determining that the stored data in the data set is the first deduplication data belonging to the data attribute in the current deduplication cycle, the method further includes:

counting the number of data in the data set corresponding to the current deduplication period;

accordingly, the deleting of duplicate data from the first deduplication data in the specified number of deduplication cycles includes:

checking the first deduplication data in the specified number of deduplication cycles according to the number of data in the data sets corresponding to the specified number of deduplication cycles;

and when the check is passed, deleting the duplicate data from the first deduplication data in the specified number of deduplication cycles.

and determining first deduplication data belonging to the data attribute in the multiple deduplication cycles as second deduplication data of the data attribute in the real-time data stream.

In one possible implementation manner of the present application, the method further includes:

and when the data set corresponding to the current deduplication period comprises the first data, discarding the first data.

In another aspect, an apparatus for data deduplication is provided, the apparatus comprising:

a receiving module for receiving a real-time data stream;

a first determining module, configured to determine a data attribute of currently received first data, where the first data belongs to the real-time data stream;

a storage module, configured to store the first data into a data set when a data set corresponding to the current deduplication cycle does not include the first data if a current deduplication cycle corresponding to the data attribute is started, where a starting time point of the current deduplication cycle is a time point of storing the first data in the data set, or an ending time point of a previous deduplication cycle corresponding to the data set, and a cycle duration is a specified threshold;

a second determining module, configured to determine, if an end time point of the current deduplication cycle is reached, data stored in the data set as first deduplication data belonging to the data attribute in the current deduplication cycle;

a third determining module, configured to determine second deduplication data of the data attribute in the real-time data stream based on the determined first deduplication data belonging to the data attribute in multiple deduplication cycles.

In one possible implementation manner of the present application, the first determining module is further configured to:

In one possible implementation manner of the present application, the third determining module is configured to:

In one possible implementation manner of the present application, the third determining module is further configured to:

In another aspect, an apparatus is provided, which includes a memory for storing a computer program and a processor for executing the computer program stored in the memory to implement the steps of the data deduplication method described above.

In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for data deduplication described above.

In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the steps of the above-described method of data deduplication.

The technical scheme provided by the application can at least bring the following beneficial effects:

A real-time data stream is received, and a data attribute of currently received first data is determined, the first data belonging to the real-time data stream. If the current deduplication cycle corresponding to the data attribute has started, the first data is stored into the data set corresponding to the current deduplication cycle when that data set does not include the first data, where the starting time point of the current deduplication cycle is the time point of storing first data in the data set, or the ending time point of the previous deduplication cycle corresponding to the data set, and the cycle duration is a specified threshold. If the ending time point of the current deduplication cycle is reached, the data stored in the data set is determined as first deduplication data belonging to the data attribute in the current deduplication cycle. That is, within the current deduplication cycle and before its ending time point is reached, first data is deduplicated as soon as it is received and is stored only when the data set does not already contain it, which reduces the use of storage space. Second deduplication data of the data attribute in the real-time data stream is then determined based on the first deduplication data belonging to the data attribute determined over multiple deduplication cycles. In this way, for each of the different data attributes, the second deduplication data belonging to that data attribute in the real-time data stream can be determined, thereby deduplicating the data in the real-time data stream.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram illustrating a method of data deduplication in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method of data deduplication in accordance with another exemplary embodiment;

FIG. 3 is a block diagram illustrating an apparatus for deduplication of data according to an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating the structure of an apparatus according to an exemplary embodiment;

FIG. 5 is a schematic diagram of a device according to another exemplary embodiment.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Before explaining the data deduplication method provided in the embodiment of the present application in detail, an application scenario and an implementation environment provided in the embodiment of the present application are introduced.

First, an application scenario of the data deduplication method provided in the embodiment of the present application is introduced.

At present, Flink (a pipelined real-time computing engine) is generally used to deduplicate data. A specific implementation may include: receiving a real-time data stream sent by the first device; grouping the data according to the data attributes of the currently received data to obtain multiple groups of data; storing the multiple groups of data; and, once every deduplication period, applying a window deduplication aggregation function to the multiple groups of data stored in the previous deduplication period, so as to obtain the deduplication data belonging to each of the data attributes in the previous deduplication period.

However, in the above method, the grouped data is deduplicated only once every deduplication period, and all data received during the period is stored by group until then. As a result, the amount of data to be computed at deduplication time is large, which increases the data processing pressure on the device and may reduce its data processing efficiency.
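The difference between deduplicating at the end of a period and deduplicating on arrival can be sketched in a few lines. The following is a minimal illustration in plain Python, not Flink code; the function names and the sample stream are assumptions made for this example, and a Python `set` stands in for whatever storage the device actually uses.

```python
def dedup_at_window_end(records):
    """Baseline: buffer every record, then deduplicate when the period ends."""
    buffered = list(records)          # every record is held until the end
    return set(buffered), len(buffered)


def dedup_on_arrival(records):
    """Alternative: check each record on arrival and keep only unseen values."""
    seen = set()
    for record in records:
        seen.add(record)              # adding an existing value is a no-op
    return seen, len(seen)


# Hypothetical records received during one deduplication period.
stream = ["A", "B", "A", "C", "C", "B", "F", "A"]
baseline_result, baseline_stored = dedup_at_window_end(stream)
arrival_result, arrival_stored = dedup_on_arrival(stream)
```

Both functions return the same deduplicated values, but the baseline holds all eight records until the period ends, while the on-arrival variant never holds more than the four distinct values.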

The method for removing duplicate data provided by the embodiment of the present application can solve the above technical problem, and specific implementation thereof can be seen in the following embodiments.

Next, an implementation environment of the data deduplication method provided in the embodiment of the present application is described.

The implementation environment may include a first device and a second device, and a communication connection may be established between the first device and the second device, and the communication connection may be a wired or wireless connection, which is not limited in this application.

Wherein the first device may be adapted to transmit the real-time data stream to the second device. The second device may be used to de-duplicate the real-time data stream.

The first device may be any electronic product that can perform human-computer interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example, a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC), a tablet computer, a smart in-vehicle device, a smart television, a smart speaker, and the like. Alternatively, the first device may be a server, either a single server or a server cluster composed of multiple servers.

The second device may be a terminal, that is, any electronic product that can perform human-computer interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example, a PC, a mobile phone, a smart phone, a PDA, a wearable device, a PPC, a tablet computer, a smart in-vehicle device, a smart television, a smart speaker, and the like.

Alternatively, the second device may be a server, which may be a single server, a server cluster composed of multiple servers, or a cloud computing service center.

It will be understood by those skilled in the art that the first and second devices described above are exemplary only, and that other first and second devices, now known or later developed, that may be suitable for use in the present application are also encompassed within the scope of the present application and are hereby incorporated by reference.

After the application scenario and the implementation environment provided by the embodiment of the present application are introduced, a detailed explanation is next provided for the data deduplication method provided by the embodiment of the present application.

Fig. 1 is a flow chart illustrating a method of data deduplication, according to an example embodiment, as applied in a second device of the above-described implementation environment. Referring to fig. 1, the method may include the following steps.

Step 101: a real-time data stream is received.

Referring to FIG. 2, the first device may store the generated real-time data stream through Kafka (a distributed publish-subscribe messaging system) and send the real-time data stream to the second device. After receiving the real-time data stream, the second device may determine a data source for the real-time data stream according to actual needs. The data source may be a DataStreamSource or a TableSource.

The system used to store the real-time data stream in the first device may be Kafka, or may be RabbitMQ, ActiveMQ, ZeroMQ, a Redis (Remote Dictionary Server) database, the Pulsar messaging system, or the like.

Further, the real-time data stream may be a data stream generated in real time by a user's operations on a website. Illustratively, it may be the data stream generated in real time as the user browses, searches, or clicks web pages on the website.

Step 102: data attributes of currently received first data are determined, the first data belonging to a real-time data stream.

In implementation, after the real-time data stream is received, the data attribute of the first data may be determined based on a specified field of the currently received first data. The KeyedProcessFunction interface is then called for subsequent processing.

The designated field may be selected by a user according to actual needs, or may be selected by default by the device, which is not limited in the embodiment of the present application.

For convenience of description, the currently received data is referred to as first data.

That is, since the real-time data stream is continuously transmitted, the real-time data stream includes a plurality of first data, the second device receives one first data at a time, and determines the data attribute of the first data according to the specified field of the currently received first data as long as the first data is received.
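Determining the data attribute from a specified field can be sketched as follows. This is a minimal plain-Python illustration, not the Flink keying API; the record layout, the field name `action`, and the function names are all assumptions made for this example.

```python
def data_attribute(record, specified_field):
    """Return the data attribute of one record, read from the specified field."""
    return record[specified_field]


def route(stream, specified_field):
    """Yield (data attribute, record) for each record as it is received."""
    for record in stream:
        yield data_attribute(record, specified_field), record


# Hypothetical website events; "action" plays the role of the specified field.
events = [
    {"action": "browse", "user_id": 1},
    {"action": "search", "user_id": 2},
    {"action": "browse", "user_id": 3},
]
routed = list(route(events, "action"))
attributes = [attribute for attribute, _ in routed]
```

Each record is handled one at a time, so the attribute is known as soon as the record arrives and no grouping pass over stored data is needed.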

It should be noted that, after the KeyedProcessFunction interface is called, a custom class (K class) may be created, and multiple data attributes of the data may be declared in the K class, or a V class may be created, which is used to receive the first data and store the first data according to the attributes of the first data and the current time.

Step 103: and if the current deduplication cycle corresponding to the data attribute is started, storing the first data into the data set when the data set corresponding to the current deduplication cycle does not include the first data. The starting time point of the current deduplication period is a time point of storing first data in a data set, or is an ending time point of a previous deduplication period corresponding to the data set, and the period duration is a specified threshold.

As an example, in the process of performing deduplication processing on data, data deduplication processing may be performed periodically, that is, every other cycle duration, and in one deduplication period, first data is received through class V, and after performing deduplication processing on the first data belonging to the same data attribute according to a Bitmap algorithm, the first data is stored in a data set corresponding to the data attribute and the deduplication period.

It should be noted that the specified threshold may be set by a user according to actual needs, or may be set by default by a device, which is not limited in this embodiment of the application. Illustratively, the cycle time period may be 5 minutes, 1 hour, etc.

As one example, the data attributes correspond to data sets, with data of different data attributes being stored in different data sets. A data attribute may correspond to multiple data sets that correspond to deduplication cycles, i.e., each data set corresponds to a different deduplication cycle for the data attribute.

As an example, the mapping relationship between the data set and the starting time point of the deduplication cycle may be represented by < Long, Bitmap > contained in the V class, and the deduplication cycle may be determined according to the starting time point of the deduplication cycle. Wherein Long is used for representing the starting time point of the deduplication cycle, and Bitmap is used for representing the data set.

Illustratively, assume that data attribute i corresponds to two data sets, namely data set a and data set b. The data in data set a is stored in the first deduplication cycle, 10:00-10:05, and the data in data set b is stored in the second deduplication cycle, 10:05-10:10. The mapping between data set a and the starting time point of its deduplication cycle may be represented as <10:00, a>, and the mapping between data set b and the starting time point of its deduplication cycle may be represented as <10:05, b>.
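The mapping from a cycle's starting time point to its data set can be sketched as a dictionary keyed by a millisecond timestamp. The following is a minimal plain-Python illustration under stated assumptions: a `set` stands in for the Bitmap, the 5-minute cycle duration and the helper names `at` and `find_data_set` are hypothetical, and the stored values are placeholders.

```python
CYCLE_MS = 5 * 60 * 1000  # assumed cycle duration: 5 minutes


def at(hour, minute):
    """Milliseconds since midnight, standing in for the Long starting time point."""
    return (hour * 60 + minute) * 60 * 1000


data_set_a = {"x", "y"}   # placeholder contents for data set a
data_set_b = {"z"}        # placeholder contents for data set b

# One mapping per data attribute: cycle starting time point -> data set.
cycles_for_attribute_i = {
    at(10, 0): data_set_a,   # cycle 10:00-10:05
    at(10, 5): data_set_b,   # cycle 10:05-10:10
}


def find_data_set(cycles, time_ms, cycle_ms=CYCLE_MS):
    """Return the data set whose cycle covers time_ms, or None if no cycle does."""
    for start, data_set in cycles.items():
        if start <= time_ms < start + cycle_ms:
            return data_set
    return None
```

Because each cycle is identified by its starting time point, a cycle's ending time point can be derived as the start plus the cycle duration, so it need not be stored separately.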

In implementation, after determining the data attribute of the first data, it is determined whether the first data is to be stored in the corresponding data set. And if the current deduplication cycle corresponding to the data attribute is started, storing the first data into the data set when the data set corresponding to the current deduplication cycle does not include the first data.

As an example, if the currently received first data is not the first-received data corresponding to the data attribute, it may be considered that data corresponding to the data attribute has already been received before the first data, and it may be determined that the current deduplication cycle corresponding to the data attribute has started. When the data set corresponding to the current deduplication cycle contains no data, the first data is stored into the data set; when the data set corresponding to the current deduplication cycle contains data, whether the data set includes the first data is determined, and if not, the first data is stored into the data set.

In a possible implementation manner, if a current deduplication cycle corresponding to a data attribute is started, the current deduplication cycle may be a first deduplication cycle corresponding to the data attribute, in this case, a starting time point of the current deduplication cycle is a time point of storing first data in the data set, and the current time point may be between a starting time point and an ending time point of the current deduplication cycle. In addition, data belonging to the data attribute is already stored in the data set corresponding to the current deduplication cycle, the first data is compared with the data stored in the data set, and if the data stored in the data set does not include the same data as the first data, that is, the data set does not include the first data, the first data can be stored in the data set.

In another possible implementation manner, if the current deduplication cycle corresponding to the data attribute is started, the current deduplication cycle may be an nth deduplication cycle corresponding to the data attribute, in this case, the starting time point of the current deduplication cycle is the ending time point of the previous deduplication cycle corresponding to the data set, and the current time point may be the starting time point of the current deduplication cycle, or the current time point may be between the starting time point and the ending time point of the current deduplication cycle. In addition, data belonging to the data attribute may already be stored in the data set corresponding to the current deduplication cycle, or data may not be stored. If the data set does not store data, the first data can be stored in the data set; if the data set stores data, the data stored in the data set is compared with first data, and if the data stored in the data set does not include data identical to the first data, that is, the data set does not include the first data, the first data can be stored in the data set.

Wherein N is an integer greater than 1.

For example, assume that the data attribute is "letter", the current deduplication cycle corresponding to the data attribute "letter" has started, and data A and data B are already stored in the data set. If the first data is C, the first data may be stored in the data set.

Further, if the current deduplication period corresponding to the data attribute is started, when the data set corresponding to the current deduplication period includes the first data, discarding the first data.

That is, if the data set corresponding to the current deduplication cycle includes the first data, the first data is not repeatedly stored, and the first data may be discarded. Therefore, the effect of data duplicate removal can be achieved, the storage pressure of the second equipment is reduced, and the use of the storage space is reduced.

Continuing with the above example, assume that the data attribute is "letter", the current deduplication cycle corresponding to the data attribute "letter" has started, and data A and data B are already stored in the data set. If the first data is A, the first data does not need to be stored in the data set and can simply be discarded.

Further, after determining the data attribute of the first data, it is necessary to determine whether to store the first data in the corresponding data set. And if the current deduplication cycle corresponding to the data attribute is not started, storing the first data into the data set, taking the current time point as the starting time point of the current deduplication cycle corresponding to the data set, and starting timing the current deduplication cycle.

In implementation, for any data attribute in the present application, after the previous deduplication cycle corresponding to the data attribute ends, the current deduplication cycle of the data attribute is started directly, regardless of whether first data belonging to the data attribute is received at the current time point; the current time point is then both the ending time point of the previous deduplication cycle corresponding to the data attribute and the starting time point of the current deduplication cycle. By contrast, if the currently received first data is the first-received data corresponding to the data attribute, the current deduplication cycle corresponding to the data attribute can be considered not started; the first data is then stored in the data set, and a register timer (timer trigger) is called. That is, referring to FIG. 2, when first data corresponding to a data attribute is received for the first time, the register timer may be called through the K class, the current time point is taken as the starting time point of the current deduplication cycle corresponding to the data set, timing of the current deduplication cycle is started, and the current deduplication cycle is the first deduplication cycle corresponding to the data attribute.
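The cycle start rules described above can be sketched as a small state object for one data attribute. This is a minimal plain-Python illustration, not a Flink KeyedProcessFunction: the class name `CycleState` and its members are hypothetical, timestamps are passed in explicitly, and the timer is simulated by checking elapsed time when an element arrives rather than by a real timer callback.

```python
class CycleState:
    """Deduplication state for a single data attribute (hypothetical sketch)."""

    def __init__(self, cycle_ms):
        self.cycle_ms = cycle_ms
        self.cycle_start = None   # None means no cycle has started yet
        self.data_set = set()     # stand-in for the Bitmap of the current cycle
        self.finished = []        # (cycle start, first deduplication data)

    def _close_finished_cycles(self, now_ms):
        # Simulated timer: close each cycle whose ending time point is reached.
        while now_ms >= self.cycle_start + self.cycle_ms:
            end = self.cycle_start + self.cycle_ms
            self.finished.append((self.cycle_start, frozenset(self.data_set)))
            self.data_set = set()
            self.cycle_start = end    # next cycle starts where the last ended

    def on_element(self, value, now_ms):
        """Store value if unseen in the current cycle; return True if stored."""
        if self.cycle_start is None:
            self.cycle_start = now_ms  # first data starts the first cycle
        else:
            self._close_finished_cycles(now_ms)
        if value in self.data_set:
            return False               # duplicate within the cycle: discard
        self.data_set.add(value)
        return True


state = CycleState(cycle_ms=300_000)          # assumed 5-minute cycle
stored_first = state.on_element("A", 1_000)    # starts the first cycle
stored_again = state.on_element("A", 2_000)    # duplicate, discarded
stored_next = state.on_element("B", 301_000)   # arrives as the next cycle begins
```

The first cycle starts when the first data for the attribute arrives, and every later cycle starts at the ending time point of the one before it, whether or not data arrives at that moment.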

Step 104: and if the ending time point of the current deduplication period is reached, determining the stored data in the data set as the first deduplication data belonging to the data attribute in the current deduplication period.

If the ending time point of the current deduplication period is reached, namely, the register timer is triggered, the onTimer method can be called, the data stored in the data set corresponding to the current deduplication period is obtained, and the stored data is determined as the first deduplication data belonging to the data attribute in the current deduplication period.

For example, assume that the data attribute is "letter", the cycle duration is 5 minutes, and the current deduplication cycle is 10:00-10:05. If the data received in the current deduplication cycle is A, B, A, C, C, B, F, A, the data stored in the data set corresponding to the data attribute in the current deduplication cycle is A, B, C, and F, and A, B, C, and F may be determined as the first deduplication data belonging to the data attribute "letter" in 10:00-10:05.
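What is emitted when a cycle's ending time point is reached can be sketched for several data attributes at once. The following is a minimal plain-Python illustration; the function name `first_dedup_data`, the second attribute `digit`, and its values are assumptions added for this example, and a `dict` is used only because it preserves insertion order.

```python
def first_dedup_data(received):
    """Build per-attribute data sets for one cycle from (attribute, value) pairs.

    Returns {attribute: [distinct values in arrival order]}.
    """
    data_sets = {}
    for attribute, value in received:
        # dict keys act as an insertion-ordered set for this attribute
        data_sets.setdefault(attribute, {})[value] = True
    return {attribute: list(values) for attribute, values in data_sets.items()}


# Data received in one cycle; the "digit" records are hypothetical.
received = [("letter", v) for v in "ABACCBFA"]
received += [("digit", "1"), ("digit", "1"), ("digit", "2")]

result = first_dedup_data(received)
counts = {attribute: len(values) for attribute, values in result.items()}
```

The per-attribute counts computed at the end correspond to the optional step of counting the number of data in each cycle's data set.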

It should be noted that, in the data deduplication method provided in the embodiment of the present application, after the data attribute of the first data is determined, the first data is directly stored according to the Bitmap algorithm, and no repeated data is stored, that is, deduplication processing on data in the deduplication period has been completed in this process, so that the problem that storage space is wasted because all data belonging to the current deduplication period is stored is avoided. Moreover, if the stored data is not deduplicated until the end time of the current deduplication cycle is reached, the amount of data to be processed is relatively large, a computation peak may occur, and the data processing pressure of the second device may be increased.

The Bitmap algorithm can be replaced by BitSet, Roaring64NavigableMap, and the like.
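The idea behind bitmap-based deduplication can be sketched with a Python integer used as a bit field. This is a minimal illustration and not the Bitmap, BitSet, or Roaring64NavigableMap implementation; the class and method names are hypothetical, and it assumes every piece of data has already been mapped to a non-negative integer ID.

```python
class Bitmap:
    """Tiny bitmap over non-negative integer IDs (hypothetical sketch)."""

    def __init__(self, bits=0):
        self._bits = bits

    def add(self, n):
        """Set bit n; return True if it was newly set, False if already set."""
        mask = 1 << n
        if self._bits & mask:
            return False
        self._bits |= mask
        return True

    def __contains__(self, n):
        return (self._bits >> n) & 1 == 1

    def __len__(self):
        return bin(self._bits).count("1")   # number of distinct IDs stored

    def union(self, other):
        """Merge two bitmaps; shared IDs are counted once."""
        return Bitmap(self._bits | other._bits)


bitmap = Bitmap()
newly_added = bitmap.add(3)
added_twice = bitmap.add(3)
bitmap.add(7)
merged = bitmap.union(Bitmap(0b1010))   # other bitmap holds IDs 1 and 3
```

Membership testing and insertion are single bit operations, and merging the bitmaps of several cycles is a bitwise OR, which is why a bitmap suits both the per-cycle check and the later merge across cycles.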

Step 105: second deduplication data of the data attributes in the real-time data stream is determined based on the first deduplication data belonging to the data attributes determined over the plurality of deduplication cycles.

In implementation, the second deduplication data of the data attribute in the real-time data stream may be determined by a sliding window manner, or may be determined by a rolling window manner.

Wherein the window may be used to indicate a deduplication period.

In the sliding window manner, a large window is divided into multiple small windows, the first deduplication data corresponding to each small window is determined, and the second deduplication data of the data attribute in the real-time data stream within the large window is then determined from the first deduplication data of those small windows. For example, the large window may be 1 hour long and the small window 5 minutes long; every 5 minutes, the second deduplication data of the data attribute in the real-time data stream within the 1 hour before the current time point may be determined from the first deduplication data of the twelve 5-minute windows in that hour.

In the rolling window manner, the first deduplication data in a window is determined directly and is taken as the second deduplication data of the data attribute in the real-time data stream. For example, the window may be 5 minutes long, and every 5 minutes the second deduplication data of the data attribute in the real-time data stream within the 5 minutes before the current time point may be determined from the first deduplication data determined within those 5 minutes.

In a possible implementation manner, if the data is deduplicated in the sliding window manner, determining the second deduplication data of the data attribute in the real-time data stream based on the first deduplication data belonging to the data attribute determined in multiple deduplication cycles may include: after the current deduplication cycle ends, acquiring the first deduplication data in a specified number of deduplication cycles corresponding to the data attribute before the current time point; and deleting duplicate data from the first deduplication data in the specified number of deduplication cycles to obtain third deduplication data within a target duration corresponding to the current deduplication cycle.

The target duration is the cycle duration multiplied by the specified number. The target duration corresponds to the large window duration in the above example, and the cycle duration corresponds to the small window duration in the above example.

In implementation, after the current deduplication period is finished, first deduplication data in a specified number of deduplication periods corresponding to data attributes before the current time point are obtained according to requirements, and secondary deduplication processing is performed on the first deduplication data in the specified number of deduplication periods to obtain third deduplication data of the data attributes in the target time length.

As an example, assuming that data is deduplicated in the sliding window manner, the cycle duration of the deduplication cycle is 5 minutes, and the specified number is 4, the target duration is 20 minutes. That is, every 5 minutes, the third deduplication data of the data attribute within the 20 minutes before the current time point is determined.

Illustratively, assuming that the first deduplication data in the first 5 minutes is A and B, the first deduplication data in the second 5 minutes is D, the first deduplication data in the third 5 minutes is C and B, and the first deduplication data in the fourth 5 minutes is F, the duplicate data in A, B, D, C, B and F may be deleted to obtain A, B, D, C and F, and A, B, D, C and F may be determined as the third deduplication data in the 20 minutes corresponding to the fourth 5 minutes.
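The secondary deduplication in this example can be sketched as follows. This is only a minimal illustration, not the implementation of the application: the function name `merge_windows` is hypothetical, and keeping the items in first-seen order is an assumption made so that the output is easy to compare with the example.

```python
from typing import Hashable, Iterable, List, Sequence


def merge_windows(windows: Sequence[Iterable[Hashable]]) -> List[Hashable]:
    """Delete duplicates across per-cycle results, keeping first-seen order.

    Hypothetical helper: `windows` holds the first deduplication data of
    each small window, oldest window first.
    """
    seen = set()
    merged = []
    for window in windows:
        for item in window:
            if item not in seen:  # keep only the first occurrence
                seen.add(item)
                merged.append(item)
    return merged


# First deduplication data of the four 5-minute cycles in the example.
first_dedup = [["A", "B"], ["D"], ["C", "B"], ["F"]]
third_dedup = merge_windows(first_dedup)
print(third_dedup)  # ['A', 'B', 'D', 'C', 'F']
```

Because each small window has already been deduplicated, the merge only has to remove items that recur across windows, such as B in the third window.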

In some embodiments, after the third deduplication data in the target duration corresponding to the current deduplication period is obtained, when the current deduplication period is the last deduplication period, the third deduplication data in the target duration corresponding to the multiple deduplication periods is determined as the second deduplication data of the data attribute in the real-time data stream.

That is, after the current deduplication cycle is ended, if the data of the data attribute is no longer received, the current deduplication cycle may be considered as the last deduplication cycle, and the third deduplication data within the target duration corresponding to the multiple deduplication cycles is determined as the second deduplication data of the data attribute in the real-time data stream.

Further, after the stored data in the data set is determined as the first deduplication data belonging to the data attribute in the current deduplication period, the number of data in the data set corresponding to the current deduplication period may be counted.

Accordingly, a specific implementation of deleting the duplicate data in the first deduplication data in the specified number of deduplication cycles may include: checking the first deduplication data in the specified number of deduplication cycles according to the number of data in the data sets corresponding to the specified number of deduplication cycles, and deleting the duplicate data in the first deduplication data in the specified number of deduplication cycles when the check passes.

That is to say, the number of data in the data set corresponding to each deduplication cycle may be counted. If, for each of the specified number of deduplication cycles, the number of data in the corresponding data set is the same as the number of data in the first deduplication data of that cycle, it may be determined that the check passes, that is, the currently determined specified number of deduplication cycles are the deduplication cycles within the target duration corresponding to the current deduplication cycle, and the duplicate data in the first deduplication data in the specified number of deduplication cycles may be deleted.

For example, assuming that the duration of the large window is 20 minutes, the duration of the small window is 5 minutes, the numbers of data in the data sets corresponding to the 4 small windows before the current time are 2, 1 and 0, respectively, and the numbers of data in the first deduplication data in the specified number of deduplication cycles are 2, 1 and 0, respectively, it may be determined that the check passes, and the duplicate data in the first deduplication data in the specified number of deduplication cycles may be deleted.
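The count-based check described above can be sketched as follows. The function name `checked_merge`, the choice to return `None` when the check fails, and the sample values are all assumptions for illustration; the application does not specify what happens when the check does not pass.

```python
from typing import Hashable, List, Optional, Sequence


def checked_merge(
    windows: Sequence[Sequence[Hashable]],
    recorded_counts: Sequence[int],
) -> Optional[List[Hashable]]:
    """Merge per-cycle results only if each cycle matches its recorded count.

    Hypothetical helper: `recorded_counts[i]` is the number of data counted
    in the data set of cycle i when that cycle ended.
    """
    if len(windows) != len(recorded_counts):
        return None  # a cycle is missing or extra: check fails
    for window, count in zip(windows, recorded_counts):
        if len(window) != count:
            return None  # fetched data does not match the recorded count
    seen = set()
    merged = []
    for window in windows:
        for item in window:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Hypothetical sample: four cycles whose recorded counts are 2, 1, 2 and 0.
windows = [["A", "B"], ["D"], ["C", "B"], []]
ok = checked_merge(windows, [2, 1, 2, 0])
bad = checked_merge(windows, [2, 1, 1, 0])
print(ok)   # ['A', 'B', 'D', 'C']
print(bad)  # None
```

The check is cheap because it compares only counts, yet it catches the case where the fetched cycles are not the ones that belong to the target duration.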

It should be noted that, in the embodiment of the present application, after the previous deduplication cycle corresponding to the data attribute ends, the current deduplication cycle is started directly. After the current deduplication cycle ends, the first deduplication data corresponding to the data attribute in the multiple deduplication cycles before the current time point is acquired, and the third deduplication data in the target duration corresponding to the current deduplication cycle is determined. This avoids the situation in which the step of determining the third deduplication data is not triggered because no first deduplication data exists in the current deduplication cycle, so that the third deduplication data for the target duration corresponding to the current deduplication cycle would still be the third deduplication data of the previous target duration. The problem of stale data is thereby avoided.

For example, assume that the cycle duration is 5 minutes, the target duration is 20 minutes, and the target duration corresponding to the current deduplication cycle is 11:40-12:00. The number of data in the first deduplication data of the data attribute is 1 for 11:35-11:40, 1 for 11:40-11:45, 2 for 11:45-11:50, 1 for 11:50-11:55, and 0 for 11:55-12:00. The third deduplication data of the data attribute corresponding to 11:40-12:00 needs to be determined at the current time point, but there is no data in the data set of the data attribute corresponding to 11:55-12:00. If the step of determining the third deduplication data corresponding to 11:40-12:00 were not started, the third deduplication data for the target duration corresponding to the current deduplication cycle would still be the third deduplication data corresponding to 11:35-11:55, which includes stale data from 11:35-11:40. In the present application, even if no data exists in the data set of the data attribute corresponding to 11:55-12:00, the step of determining the third deduplication data corresponding to 11:40-12:00 is still started, and the resulting third deduplication data corresponding to 11:40-12:00 does not include stale data.
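The effect of this example can be reproduced with a small sketch in which every cycle end, including the end of an empty cycle, triggers the determination of the third deduplication data. The names `recent`, `on_cycle_end` and the item values `x`, `a`, `b`, `c`, `d` are hypothetical; only the per-cycle counts (1, 1, 2, 1, 0) are taken from the example.

```python
from collections import deque
from typing import Hashable, Iterable, List

WINDOW_COUNT = 4  # specified number of cycles: 20 minutes / 5 minutes

# Holds the first deduplication data of the most recent cycles only;
# the oldest cycle is dropped automatically once the deque is full.
recent: deque = deque(maxlen=WINDOW_COUNT)


def on_cycle_end(first_dedup: Iterable[Hashable]) -> List[Hashable]:
    """Called at the end of every cycle, even when the cycle is empty."""
    recent.append(list(first_dedup))
    seen = set()
    merged = []
    for window in recent:
        for item in window:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Cycles 11:35-11:40, 11:40-11:45, 11:45-11:50, 11:50-11:55, 11:55-12:00.
cycles = [["x"], ["a"], ["b", "c"], ["d"], []]
results = [on_cycle_end(c) for c in cycles]
print(results[3])  # ['x', 'a', 'b', 'c', 'd']  (11:35-11:55)
print(results[4])  # ['a', 'b', 'c', 'd']       (11:40-12:00, 'x' is gone)
```

If `on_cycle_end` were skipped for the empty last cycle, the most recent result would remain `results[3]`, which still contains the stale item from 11:35-11:40.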

In another possible implementation manner, if the data is deduplicated in the rolling window manner, the specific implementation of determining the second deduplication data of the data attribute in the real-time data stream based on the first deduplication data belonging to the data attribute determined in multiple deduplication cycles may include: determining the first deduplication data belonging to the data attribute in the multiple deduplication cycles as the second deduplication data of the data attribute in the real-time data stream.

That is, if the data is deduplicated by using a rolling window, when the current deduplication cycle is the last deduplication cycle, the first deduplication data belonging to the data attribute in the multiple deduplication cycles may be determined as the second deduplication data of the data attribute in the real-time data stream.

It should be noted that the embodiment of the present application describes the data deduplication method by taking only one data attribute as an example; data deduplication can be implemented in the above manner for each of multiple data attributes. As an example, data deduplication is performed per data attribute, and different data attributes do not interfere with each other. Because the first data belonging to different data attributes is received at different time points, the time points at which data belonging to different data attributes is stored in the data sets differ, so the start time points of the deduplication cycles corresponding to different data attributes may differ. However, the cycle durations of the deduplication cycles corresponding to different data attributes are the same, and the method for deduplicating the data corresponding to each data attribute is the same as the method described above.

In the embodiment of the application, a real-time data stream is received, and a data attribute of currently received first data belonging to the real-time data stream is determined. If a current deduplication cycle corresponding to the data attribute has been started, the first data is stored in the data set corresponding to the current deduplication cycle when the data set does not include the first data. The starting time point of the current deduplication cycle is the time point at which the first piece of data is stored in the data set, or the ending time point of the previous deduplication cycle corresponding to the data set, and the cycle duration is a specified threshold. If the ending time point of the current deduplication cycle is reached, the data stored in the data set is determined as first deduplication data belonging to the data attribute in the current deduplication cycle. That is to say, within the current deduplication cycle and before its ending time point is reached, received first data is deduplicated immediately and is stored only when it is not already in the data set, which reduces the use of storage space. Second deduplication data of the data attribute in the real-time data stream is then determined based on the first deduplication data belonging to the data attribute determined in multiple deduplication cycles. In this way, the second deduplication data belonging to each of multiple data attributes in the real-time data stream can be determined per data attribute, thereby deduplicating the data in the real-time data stream.
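The per-attribute flow summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the application: the class `CycleDeduplicator`, its methods `receive` and `tick`, and the attribute and item names are hypothetical, time is passed in explicitly as a number of seconds instead of being read from a timer, and the data set is modeled as an in-memory `dict` used as an insertion-ordered set.

```python
from typing import Dict, Hashable, List


class CycleDeduplicator:
    """Per-attribute deduplication with fixed-length cycles (sketch)."""

    def __init__(self, cycle_seconds: float) -> None:
        self.cycle_seconds = cycle_seconds          # the specified threshold
        self._start: Dict[str, float] = {}          # cycle start per attribute
        self._current: Dict[str, dict] = {}         # data set per attribute
        self.completed: Dict[str, List[list]] = {}  # first dedup data per cycle

    def _roll(self, attribute: str, now: float) -> None:
        """Close every cycle of `attribute` whose end time has been reached."""
        while now >= self._start[attribute] + self.cycle_seconds:
            snapshot = list(self._current[attribute])
            self.completed.setdefault(attribute, []).append(snapshot)
            self._current[attribute] = {}
            # The next cycle starts at the end of the previous one.
            self._start[attribute] += self.cycle_seconds

    def receive(self, attribute: str, item: Hashable, now: float) -> None:
        """Handle one piece of data from the stream."""
        if attribute not in self._start:
            # Cycle not started: it starts when the first data is stored.
            self._start[attribute] = now
            self._current[attribute] = {}
        else:
            self._roll(attribute, now)
        # Store the item only if the data set does not already include it.
        self._current[attribute].setdefault(item, None)

    def tick(self, attribute: str, now: float) -> None:
        """Advance time without new data, so empty cycles are still closed."""
        if attribute in self._start:
            self._roll(attribute, now)


dedup = CycleDeduplicator(cycle_seconds=300)
dedup.receive("song", "u1", now=0)
dedup.receive("song", "u2", now=10)
dedup.receive("song", "u1", now=20)    # duplicate within the cycle, ignored
dedup.receive("album", "u1", now=100)  # other attribute, independent cycle
dedup.receive("song", "u3", now=310)   # first cycle (0-300) is closed first
dedup.tick("song", now=600)            # second cycle (300-600) is closed
dedup.tick("album", now=400)           # album cycle (100-400) is closed
print(dedup.completed["song"])   # [['u1', 'u2'], ['u3']]
print(dedup.completed["album"])  # [['u1']]
```

Only one copy of each item is held per attribute and cycle, which matches the storage-saving point made above, and the two attributes keep separate cycle start times while sharing the same cycle duration.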

Fig. 3 is a schematic structural diagram illustrating a data deduplication apparatus according to an exemplary embodiment, which may be implemented by software, hardware or a combination of the two as part of or all of a second device. Referring to fig. 3, the apparatus may include: a receiving module 301, a first determining module 302, a storing module 303, a second determining module 304 and a third determining module 305.

A receiving module 301, configured to receive a real-time data stream;

a first determining module 302, configured to determine a data attribute of currently received first data, where the first data belongs to a real-time data stream;

a storage module 303, configured to store first data into a data set when a data set corresponding to a current deduplication cycle does not include the first data if the current deduplication cycle corresponding to the data attribute is started, where a starting time point of the current deduplication cycle is a time point of storing the first data in the data set, or an ending time point of a previous deduplication cycle corresponding to the data set, and a cycle duration is a specified threshold;

a second determining module 304, configured to determine, if an end time point of the current deduplication cycle is reached, data already stored in the data set as first deduplication data belonging to a data attribute in the current deduplication cycle;

a third determining module 305, configured to determine second deduplication data of the data attribute in the real-time data stream based on the determined first deduplication data belonging to the data attribute in the multiple deduplication cycles.

In one possible implementation manner of the present application, the first determining module 302 is further configured to:

and if the current deduplication cycle corresponding to the data attribute is not started, storing the first data into the data set, taking the current time point as the starting time point of the current deduplication cycle corresponding to the data set, and starting timing the current deduplication cycle.

In one possible implementation manner of the present application, the third determining module 305 is configured to:

after the current deduplication period is finished, acquiring first deduplication data in a specified number of deduplication periods corresponding to data attributes before the current time point;

deleting duplicate data in the first deduplication data in the specified number of deduplication cycles to obtain third deduplication data in a target duration corresponding to the current deduplication cycle, wherein the target duration is the cycle duration multiplied by the specified number;

and when the current deduplication period is the last deduplication period, determining third deduplication data in the target duration corresponding to the multiple deduplication periods as second deduplication data of the data attribute in the real-time data stream.

In one possible implementation manner of the present application, the third determining module 305 is further configured to:

counting the number of data in a data set corresponding to the current deduplication period;

according to the number of data in a data set corresponding to the specified number of deduplication cycles, checking first deduplication data in the specified number of deduplication cycles;

when the check passes, the duplicate data in the first deduplication data within a specified number of deduplication cycles is deleted.

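In one possible implementation manner of the present application, the third determining module 305 is configured to:

determine first deduplication data belonging to the data attribute in a plurality of deduplication cycles as second deduplication data of the data attribute in the real-time data stream.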

It should be noted that, when the data deduplication apparatus provided in the foregoing embodiment performs data deduplication, the division into the functional modules described above is merely an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the data deduplication apparatus provided in the foregoing embodiment and the data deduplication method embodiment belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiment and is not described herein again.

FIG. 4 is a schematic diagram illustrating the structure of an apparatus according to an exemplary embodiment. The device 400 may be a server. The device 400 includes a Central Processing Unit (CPU)401, a system memory 404 including a Random Access Memory (RAM)402 and a Read Only Memory (ROM)403, and a system bus 405 connecting the system memory 404 and the central processing unit 401. The device 400 also includes a basic input/output system (I/O system) 406, which facilitates the transfer of information between devices within the computer, and a mass storage device 407 for storing an operating system 413, application programs 414, and other program modules 415.

The basic input/output system 406 includes a display 408 for displaying information and an input device 409, such as a mouse or a keyboard, for a user to input information. The display 408 and the input device 409 are both connected to the central processing unit 401 through an input/output controller 410 connected to the system bus 405. The basic input/output system 406 may also include the input/output controller 410 for receiving and processing input from a number of other devices, such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 410 may also provide output to a display screen, a printer, or another type of output device.

The mass storage device 407 is connected to the central processing unit 401 through a mass storage controller (not shown) connected to the system bus 405. The mass storage device 407 and its associated computer-readable media provide non-volatile storage for the device 400. That is, the mass storage device 407 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 404 and mass storage device 407 described above may be collectively referred to as memory.

According to various embodiments of the present application, the device 400 may also operate through a remote computer connected over a network, such as the Internet. That is, the device 400 may be connected to the network 412 through the network interface unit 411 attached to the system bus 405, or the network interface unit 411 may be used to connect to other types of networks or remote computer systems (not shown).

The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.

Fig. 5 is a block diagram illustrating a structure of a device 500 according to another exemplary embodiment. The device 500 may be a terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The device 500 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the apparatus 500 includes: a processor 501 and a memory 502.

The processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.

Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the method of data deduplication provided by method embodiments herein.

In some embodiments, the apparatus 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, touch screen display 505, camera 506, audio circuitry 507, positioning components 508, and power supply 509.

The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 504 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.

The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, the display screen 505 also has the ability to capture touch signals on or over the surface of the display screen 505. The touch signal may be input to the processor 501 as a control signal for processing. In this case, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, disposed on the front panel of the device 500; in other embodiments, there may be at least two display screens 505, each disposed on a different surface of the device 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display disposed on a curved surface or on a folded surface of the device 500. Furthermore, the display screen 505 may be arranged in a non-rectangular irregular shape, that is, a specially shaped screen. The display screen 505 may be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).

The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined to realize a background blurring function, and the main camera and the wide-angle camera can be combined to realize panoramic shooting and VR (Virtual Reality) shooting functions or other combined shooting functions. In some embodiments, the camera assembly 506 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuit 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 501 for processing, or inputting the electrical signals to the radio frequency circuit 504 to realize voice communication. For stereo sound acquisition or noise reduction, there may be multiple microphones placed at different locations on the device 500. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a conventional membrane speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into a sound wave audible to humans, but also into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 507 may also include a headphone jack.

The positioning component 508 is used to locate the current geographic location of the device 500 for navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

A power supply 509 is used to power the various components in the device 500. The power supply 509 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 509 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast charging technology.

In some embodiments, the device 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.

The acceleration sensor 511 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the apparatus 500. For example, the acceleration sensor 511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 501 may control the touch screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 512 may detect a body direction and a rotation angle of the device 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user with respect to the device 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 513 may be disposed on a side bezel of the device 500 and/or underneath the touch display screen 505. When the pressure sensor 513 is disposed on the side bezel of the device 500, a holding signal of the user on the device 500 can be detected, and the processor 501 performs left-hand or right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed underneath the touch display screen 505, the processor 501 controls an operable control on the UI according to the pressure operation of the user on the touch display screen 505. The operable control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 514 may be provided on the front, back, or side of the device 500. When a physical key or vendor Logo is provided on the device 500, the fingerprint sensor 514 may be integrated with the physical key or vendor Logo.

The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 505 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 505 is decreased. In another embodiment, the processor 501 may also dynamically adjust the shooting parameters of the camera assembly 506 based on the ambient light intensity collected by the optical sensor 515.

A proximity sensor 516, also known as a distance sensor, is typically provided on the front panel of the device 500. The proximity sensor 516 is used to capture the distance between the user and the front surface of the device 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the device 500 gradually decreases, the processor 501 controls the touch display screen 505 to switch from the screen-on state to the screen-off state; when the proximity sensor 516 detects that the distance between the user and the front surface of the device 500 gradually increases, the processor 501 controls the touch display screen 505 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the configuration shown in fig. 5 does not constitute a limitation of the device 500, and that the device 500 may include more or fewer components than those shown, may combine some components, or may employ a different arrangement of components.

In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, implements the steps of the method for data deduplication in the above embodiments. For example, the computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It is noted that the computer-readable storage medium referred to herein may be a non-volatile storage medium, in other words, a non-transitory storage medium.

It should be understood that all or part of the steps for implementing the above embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.

That is, in some embodiments, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the above-described method of data deduplication.
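The deduplication steps that such a program carries out can be sketched as follows. This is a minimal, illustrative Python sketch based on the abstract, not the claimed implementation: the class name `CycleDeduplicator`, the `cycle_seconds` parameter, the caller-supplied timestamps, and the use of a set union to combine the per-cycle results are all assumptions. It uses the variant in which a deduplication cycle starts at the time point when the first data is stored in the data set.

```python
class CycleDeduplicator:
    """Per-attribute deduplication over fixed-length cycles (illustrative only)."""

    def __init__(self, cycle_seconds):
        self.cycle_seconds = cycle_seconds  # assumed fixed cycle length
        self.current = {}      # data attribute -> data set of the current cycle
        self.cycle_start = {}  # data attribute -> start time of the current cycle
        self.merged = {}       # data attribute -> union of all closed cycles

    def close_if_due(self, attribute, now):
        """Close the current cycle if its end time point has been reached.

        Returns the stored data of the closed cycle (the "first deduplication
        data"), or None if no cycle was closed.
        """
        start = self.cycle_start.get(attribute)
        if start is None or now < start + self.cycle_seconds:
            return None
        first = self.current.pop(attribute)
        del self.cycle_start[attribute]
        # Assumption: results of multiple cycles are combined by set union.
        self.merged.setdefault(attribute, set()).update(first)
        return first

    def receive(self, attribute, item, now):
        """Handle one item of the real-time stream; return a closed cycle, if any."""
        closed = self.close_if_due(attribute, now)
        if attribute not in self.current:
            # No cycle running: storing this item starts a new cycle.
            self.current[attribute] = set()
            self.cycle_start[attribute] = now
        # A set keeps the item only if the data set does not already include it.
        self.current[attribute].add(item)
        return closed

    def second_dedup(self, attribute):
        """Deduplicated data for the attribute across all closed cycles."""
        return set(self.merged.get(attribute, set()))
```

For example, with a 10-second cycle, receiving `u1`, `u1`, and `u2` for the same attribute at times 0, 3, and 5, and then `u1` at time 12, closes the first cycle with the data set `{u1, u2}` and starts a new cycle containing `u1`. In this sketch a cycle is closed only when `close_if_due` or `receive` is called, so a real system would also need a timer to close cycles for attributes that stop receiving data.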

The above-mentioned embodiments are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims

1. A method of data deduplication, the method comprising:

receiving a real-time data stream;

2. The method of claim 1, wherein after determining the data attribute of the currently received first data, further comprising:

3. The method of claim 1, wherein determining second deduplication data for a data attribute in the real-time data stream based on first deduplication data belonging to the data attribute determined over a plurality of deduplication cycles comprises:

4. The method of claim 3, wherein after determining the stored data in the data set as the first deduplication data belonging to the data attribute in the current deduplication cycle, further comprising:

5. The method of claim 1, wherein determining second deduplication data for a data attribute in the real-time data stream based on first deduplication data belonging to the data attribute determined over a plurality of deduplication cycles comprises:

6. The method of claim 1, wherein the method further comprises:

7. An apparatus for data deduplication, the apparatus comprising:

a receiving module for receiving a real-time data stream;

8. The apparatus of claim 7, wherein the first determining module is further configured to:

9. The apparatus of claim 7, wherein the third determining module is configured to:

10. The apparatus of claim 9, wherein the third determination module is further configured to:

11. The apparatus of claim 7, wherein the third determining module is configured to:

12. The apparatus of claim 7, wherein the third determining module is further configured to:

13. An apparatus comprising a memory for storing a computer program and a processor for executing the computer program stored in the memory to perform the steps of the method of any one of claims 1 to 6.

14. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.