CN111177137B - Method, device, equipment and storage medium for data deduplication

Info

Publication number: CN111177137B
Application number: CN201911395228.4A
Authority: CN
Inventors: 叶伟成
Original assignee: Guangzhou Kugou Computer Technology Co Ltd
Current assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2023-10-13
Anticipated expiration: 2039-12-30
Also published as: CN111177137A

Abstract

The application discloses a data deduplication method, apparatus, device and storage medium, and belongs to the technical field of data processing. The method comprises the following steps: receiving a real-time data stream; determining a data attribute of currently received first data, the first data belonging to the real-time data stream; if the current deduplication period corresponding to the data attribute has started, storing the first data in the data set corresponding to the current deduplication period when that data set does not include the first data, wherein the starting time point of the current deduplication period is the time point at which first data is stored in the data set, or the ending time point of the previous deduplication period corresponding to the data set; if the ending time point of the current deduplication period is reached, determining the data stored in the data set as first deduplication data belonging to the data attribute in the current deduplication period; and determining second deduplication data of the data attribute in the real-time data stream based on the first deduplication data belonging to the data attribute determined in a plurality of deduplication periods. In this way, data deduplication can be achieved.

Description

Method, device, equipment and storage medium for data deduplication

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for data deduplication.

Background

In the technical field of data processing, in order to make the data processing process simpler and more convenient, data deduplication processing can be generally performed on data first, repeated data are removed, and then data processing is performed on the deduplicated data. Therefore, the data volume of data processing can be reduced, the calculation pressure of the equipment can be reduced, the storage space can be reduced, and the storage pressure of the equipment can be reduced. It can be seen that data deduplication is important for data processing, and therefore how to perform deduplication on data becomes a problem that needs to be solved.

Disclosure of Invention

The application provides a data deduplication method, a device, equipment and a storage medium, which can solve the problem of how to perform deduplication processing on data in the related technology. The technical scheme is as follows:

in one aspect, a method for deduplication of data is provided, the method comprising:

receiving a real-time data stream;

determining the data attribute of the first data currently received, wherein the first data belongs to the real-time data stream;

if the current deduplication period corresponding to the data attribute is started, when the data set corresponding to the current deduplication period does not comprise the first data, storing the first data into the data set, wherein the starting time point of the current deduplication period is the time point of storing the first data in the data set, or is the ending time point of the last deduplication period corresponding to the data set, and the period duration is a specified threshold;

if the ending time point of the current deduplication period is reached, determining the data stored in the data set as first deduplication data belonging to the data attribute in the current deduplication period;

and determining second deduplication data of the data attribute in the real-time data stream based on the first deduplication data belonging to the data attribute determined in a plurality of deduplication periods.

In one possible implementation manner of the present application, after determining the data attribute of the first data currently received, the method further includes:

and if the current deduplication period corresponding to the data attribute is not started, storing the first data into the data set, taking the current time point as the starting time point of the current deduplication period corresponding to the data set, and starting to time the current deduplication period.

In one possible implementation manner of the present application, the determining, based on the first deduplication data belonging to the data attribute determined in the multiple deduplication periods, second deduplication data of the data attribute in the real-time data stream includes:

after the current deduplication period is finished, first deduplication data in a specified number of deduplication periods corresponding to the data attribute before the current time point are obtained;

deleting repeated data from the first deduplication data in the specified number of deduplication periods to obtain third deduplication data in a target duration corresponding to the current deduplication period, wherein the target duration is the product of the period duration and the specified number;

and when the current deduplication period is the last deduplication period, determining third deduplication data in target time periods corresponding to the multiple deduplication periods as second deduplication data of the data attribute in the real-time data stream.

In one possible implementation manner of the present application, after determining the data stored in the data set as the first deduplication data belonging to the data attribute in the current deduplication cycle, the method further includes:

counting the number of data in a data set corresponding to the current deduplication period;

accordingly, the deleting the repeated data in the first deduplication data in the specified number of deduplication cycles includes:

according to the number of data in the data set corresponding to the specified number of the deduplication periods, verifying the first deduplication data in the specified number of the deduplication periods;

and deleting the repeated data from the first deduplication data in the specified number of deduplication periods when the verification passes.

and determining the first deduplication data belonging to the data attribute in the plurality of deduplication periods as the second deduplication data of the data attribute in the real-time data stream.

In one possible implementation manner of the present application, the method further includes:

and discarding the first data when the first data is included in the data set corresponding to the current deduplication cycle.

In another aspect, an apparatus for deduplicating data is provided, the apparatus comprising:

a receiving module for receiving a real-time data stream;

a first determining module, configured to determine a data attribute of first data currently received, where the first data belongs to the real-time data stream;

the storage module is used for storing the first data into the data set when the first data is not included in the data set corresponding to the current deduplication period if the current deduplication period corresponding to the data attribute is started, wherein the starting time point of the current deduplication period is the time point of storing the first data in the data set, or is the ending time point of the last deduplication period corresponding to the data set, and the period duration is a specified threshold;

The second determining module is used for determining the data stored in the data set as first deduplication data belonging to the data attribute in the current deduplication period if the end time point of the current deduplication period is reached;

and a third determining module, configured to determine second deduplication data of the data attribute in the real-time data stream based on the first deduplication data belonging to the data attribute determined in the multiple deduplication periods.

In one possible implementation manner of the present application, the first determining module is further configured to:

In one possible implementation manner of the present application, the third determining module is configured to:

In a possible implementation manner of the present application, the third determining module is further configured to:

In another aspect, an apparatus is provided, the apparatus including a memory for storing a computer program and a processor for executing the computer program stored on the memory to implement the steps of the method for deduplication of data described above.

In another aspect, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of the method for deduplication of data described above.

In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the steps of the method of deduplication of data as described above.

The technical scheme provided by the application has at least the following beneficial effects:

A real-time data stream is received, and the data attribute of the currently received first data, which belongs to the real-time data stream, is determined. If the current deduplication period corresponding to the data attribute has started, the first data is stored in the data set corresponding to the current deduplication period when that data set does not include the first data, wherein the starting time point of the current deduplication period is the time point at which first data is stored in the data set, or the ending time point of the previous deduplication period corresponding to the data set, and the period duration is a specified threshold. If the ending time point of the current deduplication period is reached, the data stored in the data set is determined as first deduplication data belonging to the data attribute in the current deduplication period. That is, when first data is received before the ending time point of the current deduplication period, it is deduplicated immediately and stored only if the data set does not already contain it, which reduces the use of storage space. When the ending time point of the current deduplication period is reached, the data stored in the data set can be determined directly as the first deduplication data, which reduces the amount of computation on the device. Second deduplication data of the data attribute in the real-time data stream is then determined based on the first deduplication data belonging to the data attribute determined in a plurality of deduplication periods. Because data is handled separately for each data attribute, the second deduplication data belonging to each of multiple data attributes in the real-time data stream can be determined in this way, thereby deduplicating the data in the real-time data stream.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart illustrating a method of data deduplication in accordance with an exemplary embodiment;

FIG. 2 is a flow chart illustrating a method of data deduplication in accordance with another exemplary embodiment;

FIG. 3 is a schematic diagram illustrating an apparatus for deduplication of data according to an exemplary embodiment;

FIG. 4 is a schematic diagram of an apparatus according to an exemplary embodiment;

FIG. 5 is a schematic structural diagram of an apparatus according to another exemplary embodiment.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

Before the data deduplication method provided by the embodiments of the present application is explained in detail, the application scenario and the implementation environment involved are introduced.

Firstly, an application scenario of the data deduplication method provided by the embodiment of the application is introduced.

Currently, data is commonly deduplicated using Flink (a pipelined real-time computing engine). A specific implementation may include: receiving a real-time data stream sent by the first device; grouping the data according to the data attribute of the currently received data to obtain multiple groups of data; storing the multiple groups of data; and, once per deduplication period, applying a window deduplication aggregation function to the groups of data stored during the previous deduplication period, thereby obtaining the deduplicated data for each data attribute in the previous deduplication period.

However, in this approach the grouped data is stored in full throughout the deduplication period and is deduplicated only once per period, so the amount of data to be computed during deduplication is large. This increases the data processing pressure on the device and may reduce its data processing efficiency.

The method for data deduplication provided by the embodiment of the application can solve the technical problems, and the specific implementation of the method can be seen in the following embodiments.

Next, an implementation environment of the method for data deduplication provided by the embodiment of the present application will be described.

The implementation environment may include a first device and a second device, and a communication connection may be established between the first device and the second device, where the communication connection may be a wired or wireless connection, and the application is not limited in this regard.

Wherein the first device may be used to send a real-time data stream to the second device. The second device may be used to de-duplicate the real-time data stream.

The first device may be any electronic product capable of human-machine interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a PPC (Pocket PC), a tablet computer, a smart in-vehicle unit, a smart television, a smart speaker, and the like. Alternatively, the first device may be a server, either a single server or a server cluster formed by multiple servers.

The second device may be a terminal, that is, any electronic product capable of human-machine interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example a PC, a mobile phone, a smart phone, a PDA, a wearable device, a PPC, a tablet computer, a smart in-vehicle unit, a smart television, a smart speaker, and the like.

Alternatively, the second device may be a server, which may be a single server, a server cluster formed by multiple servers, or a cloud computing service center.

Those skilled in the art will appreciate that the first device and the second device described above are merely examples. Other first devices or second devices, whether existing now or arising in the future, are also within the scope of the present application where applicable, and are incorporated herein by reference.

After the application scenario and the implementation environment provided by the embodiment of the present application are introduced, a detailed explanation of the method for data deduplication provided by the embodiment of the present application follows.

FIG. 1 is a flow chart illustrating a method of data deduplication in a second device of the above-described implementation environment, according to an exemplary embodiment. Referring to fig. 1, the method may include the following steps.

Step 101: a real-time data stream is received.

Referring to fig. 2, the first device may store the generated real-time data stream in Kafka (a distributed publish-subscribe messaging system) and transmit it to the second device. After the second device receives the real-time data stream, a data source may be determined for it according to actual needs. The data source may be DataStreamSource or TableSource.

The system used to store the real-time data stream in the first device may be Kafka, RabbitMQ, ActiveMQ, ZeroMQ, a Redis (Remote Dictionary Server) database, the Pulsar messaging system, or the like.

Further, the real-time data stream may be a data stream generated in real-time for a user's operation in a website. By way of example, it may be a data stream generated in real time by a user performing operations such as web browsing, searching, or clicking in a web site.

Step 102: a data attribute of the currently received first data is determined, the first data belonging to a real-time data stream.

In an implementation, after the real-time data stream is received, the data attribute of the first data may be determined from a specified field of the currently received first data. The KeyedProcessFunction interface is then called to carry out the subsequent processing.

The specified field may be selected by the user according to actual needs, or may be selected by default by the device, which is not limited in the embodiment of the present application.

For convenience of description, the currently received data will be referred to as first data.

That is, because the real-time data stream is transmitted continuously, it contains many pieces of first data. The second device receives one piece of first data at a time and, each time it does so, determines the data attribute from the specified field of that piece of data.

It should be noted that, after the KeyedProcessFunction interface is called, a custom class (class K) may be created in which the multiple data attributes of the data are declared, and a class V may be created to receive the first data and store it according to its data attribute and the current time.
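As a rough illustration of Step 102, the following Python sketch reads the data attribute from a specified field and routes each record accordingly. It is not the Flink implementation referred to above: the record layout, the field name `event_type`, and the helper names are assumptions made only for this example.

```python
from collections import defaultdict

# Hypothetical record layout: each record is a dict, and "event_type" is the
# assumed "specified field" whose value serves as the data attribute.
SPECIFIED_FIELD = "event_type"


def data_attribute(record, field=SPECIFIED_FIELD):
    """Return the data attribute of one record, read from the specified field."""
    return record[field]


def route_by_attribute(stream, field=SPECIFIED_FIELD):
    """Route records one at a time to their data attribute, keeping arrival order."""
    routed = defaultdict(list)
    for record in stream:
        routed[data_attribute(record, field)].append(record)
    return dict(routed)


stream = [
    {"event_type": "click", "user_id": 1},
    {"event_type": "search", "user_id": 2},
    {"event_type": "click", "user_id": 3},
]
routed = route_by_attribute(stream)
```

Here `routed["click"]` holds the two click records and `routed["search"]` holds the single search record, so every later step can work on one data attribute at a time.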

Step 103: if the current deduplication cycle corresponding to the data attribute is started, when the data set corresponding to the current deduplication cycle does not comprise the first data, the first data is stored in the data set. The starting time point of the current deduplication period is the time point of storing first data in the data set, or the ending time point of the last deduplication period corresponding to the data set, and the period duration is a specified threshold.

As an example, data deduplication may be performed periodically, once per deduplication period. Within one deduplication period, the first data is received through class V, first data belonging to the same data attribute is deduplicated according to the Bitmap algorithm, and the result is stored in the data set corresponding to that data attribute and deduplication period.

It should be noted that, the specified threshold may be set by the user according to actual needs, or may be set by default by the device, which is not limited in the embodiment of the present application. Illustratively, the period duration may be 5 minutes, 1 hour, etc.

As one example, data attributes correspond to data sets, and data with different data attributes is stored in different data sets. One data attribute may correspond to multiple data sets, each of which corresponds to a different deduplication period of that data attribute.

As an example, class V may contain a <Long, Bitmap> mapping that expresses the relationship between the starting time point of a deduplication period and its data set, so that a deduplication period can be identified by its starting time point. Here Long represents the starting time point of the deduplication period, and Bitmap represents the data set.

Illustratively, assume that data attribute I corresponds to two data sets, data set a and data set b. The data in data set a was stored during the first deduplication period, 10:00-10:05, and the data in data set b was stored during the second deduplication period, 10:05-10:10. The mapping between data set a and the starting time point of its deduplication period can be expressed as <10:00, a>, and the mapping between data set b and the starting time point of its deduplication period can be expressed as <10:05, b>.
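The mapping from starting time point to data set can be sketched in Python as a dictionary held per data attribute. This is only an illustration: the class name `AttributeState`, the use of minutes since midnight as the key, and the stored values `"x"` and `"y"` are assumptions, and a plain `set` stands in for the Bitmap.

```python
class AttributeState:
    """State for one data attribute: cycle start time -> data set of that cycle."""

    def __init__(self):
        self.sets_by_start = {}  # start time in minutes since midnight -> set

    def data_set(self, start_minute):
        """Return the data set of the cycle starting at start_minute, creating it if needed."""
        return self.sets_by_start.setdefault(start_minute, set())


def hhmm(minutes):
    """Format minutes since midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


state = AttributeState()              # state for data attribute I
state.data_set(10 * 60).add("x")      # data set a, cycle 10:00-10:05
state.data_set(10 * 60 + 5).add("y")  # data set b, cycle 10:05-10:10
mapping = {hhmm(start): sorted(items) for start, items in state.sets_by_start.items()}
```

Each cycle is identified by its own starting time point, so the two data sets of attribute I are kept apart under the keys 10:00 and 10:05.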

In an implementation, after determining the data attributes of the first data, it is necessary to determine whether the first data is to be stored in the corresponding data set. If the current deduplication cycle corresponding to the data attribute is started, when the data set corresponding to the current deduplication cycle does not comprise the first data, the first data is stored in the data set.

As an example, if the first data is not the first piece of data received for the data attribute, then data with that data attribute was already received earlier, and it can therefore be determined that the current deduplication period corresponding to the data attribute has started. When the data set corresponding to the current deduplication period contains no data, the first data is stored in the data set; when it does contain data, it is checked whether the data set includes the first data, and if not, the first data is stored in the data set.

In one possible implementation, if the current deduplication period corresponding to the data attribute has started, the current deduplication period may be the first deduplication period corresponding to the data attribute. In this case, the starting time point of the current deduplication period is the time point at which first data was stored in the data set, and the current time point lies between the starting time point and the ending time point of the current deduplication period. Data belonging to the data attribute is already stored in the data set corresponding to the current deduplication period, so the first data is compared with the stored data. If none of the stored data is the same as the first data, that is, the data set does not include the first data, the first data can be stored in the data set.

In another possible implementation, if the current deduplication period corresponding to the data attribute has started, the current deduplication period may be the Nth deduplication period corresponding to the data attribute. In this case, the starting time point of the current deduplication period is the ending time point of the previous deduplication period corresponding to the data set, and the current time point lies between the starting time point and the ending time point of the current deduplication period. The data set corresponding to the current deduplication period may or may not already hold data belonging to the data attribute. If the data set holds no data, the first data can be stored in it. If the data set holds data, the stored data is compared with the first data, and if none of the stored data is the same as the first data, that is, the data set does not include the first data, the first data is stored in the data set.

Wherein N is an integer greater than 1.

For example, assume that the data attribute is letter, the current deduplication period corresponding to that data attribute has started, and the data set already stores data A and data B. If the first data is C, the first data can be stored in the data set.

Further, if the current deduplication cycle corresponding to the data attribute is started, discarding the first data when the data set corresponding to the current deduplication cycle includes the first data.

That is, if the data set corresponding to the current deduplication cycle includes the first data, the first data is not stored repeatedly, and the first data may be discarded. Therefore, the data deduplication effect can be achieved, the storage pressure of the second device is reduced, and the use of the storage space is reduced.

Continuing the above example, assume that the data attribute is letter, the current deduplication period has started, and the data set already stores data A and data B. If the first data is A, it does not need to be stored in the data set again and can be discarded directly.
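The store-or-discard decision in the two examples above can be sketched in a few lines of Python. The function name `offer` is hypothetical, and a plain `set` is used in place of the data set described in this embodiment.

```python
def offer(data_set, first_data):
    """Store first_data if the data set lacks it; return True if stored, False if discarded."""
    if first_data in data_set:
        return False  # already present in this cycle: discard
    data_set.add(first_data)
    return True


data_set = {"A", "B"}            # data already stored in the current cycle
stored_c = offer(data_set, "C")  # C is new, so it is stored
stored_a = offer(data_set, "A")  # A is a repeat, so it is discarded
```

After both calls the data set holds A, B and C, matching the examples: C was stored and the second A was dropped.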

Further, after determining the data attribute of the first data, it is necessary to determine whether the first data is to be stored in the corresponding data set. If the current deduplication period corresponding to the data attribute is not started, storing the first data into the data set, taking the current time point as the starting time point of the current deduplication period corresponding to the data set, and starting to time the current deduplication period.

In implementation, for any data attribute, once a deduplication period of that data attribute ends, the next deduplication period starts immediately, whether or not first data belonging to the data attribute is received at that time point. That time point is both the ending time point of the previous deduplication period and the starting time point of the current one. Consequently, the current deduplication period can only be considered not started when the currently received first data is the first piece of data received for the data attribute. In that case the first data is stored in the data set and registerTimer (a time trigger) is invoked. Referring to fig. 2, when first data corresponding to a data attribute is received for the first time, registerTimer may be called through class K, the current time point is taken as the starting time point of the current deduplication period corresponding to the data set, and timing of the current deduplication period begins. This deduplication period is the first deduplication period corresponding to the data attribute.

Step 104: and if the ending time point of the current deduplication cycle is reached, determining the data stored in the data set as first deduplication data belonging to the data attribute in the current deduplication cycle.

If the ending time point of the current deduplication period is reached, that is, the timer registered through registerTimer fires, the onTimer method may be called to obtain the data stored in the data set corresponding to the current deduplication period, and that data is determined as the first deduplication data belonging to the data attribute in the current deduplication period.

For example, assume that the data attribute is letter, the period duration is 5 minutes, and the current deduplication period is 10:00-10:05. If the data received in the current deduplication period is A, B, A, C, C, B, F, A, then the data stored in the data set corresponding to the data attribute in the current deduplication period is A, B, C and F, and A, B, C and F can be determined as the first deduplication data belonging to the data attribute letter for 10:00-10:05.
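Steps 103 and 104 can be simulated together in plain Python. The sketch below is not Flink code: the class `CycleDeduplicator` and its methods are hypothetical, time is passed in explicitly as seconds, and the timer is emulated by checking for elapsed cycles whenever data arrives or `flush` is called.

```python
class CycleDeduplicator:
    """Per-attribute deduplication over fixed-length cycles (times in seconds)."""

    def __init__(self, period):
        self.period = period
        self.start = {}     # attribute -> start time of the current cycle
        self.data_set = {}  # attribute -> data stored in the current cycle
        self.results = []   # (attribute, cycle start, first deduplication data)

    def _fire_due_timers(self, attribute, now):
        # Close every cycle whose ending time point has been reached.
        while attribute in self.start and now >= self.start[attribute] + self.period:
            begin = self.start[attribute]
            self.results.append((attribute, begin, sorted(self.data_set[attribute])))
            # The next cycle starts at the ending time point of the closed one.
            self.start[attribute] = begin + self.period
            self.data_set[attribute] = set()

    def process(self, attribute, value, now):
        self._fire_due_timers(attribute, now)
        if attribute not in self.start:
            # First data for this attribute: the first cycle starts now.
            self.start[attribute] = now
            self.data_set[attribute] = set()
        self.data_set[attribute].add(value)  # a set silently ignores repeats

    def flush(self, now):
        """Emulate timers firing at time now for every known attribute."""
        for attribute in list(self.start):
            self._fire_due_timers(attribute, now)


dedup = CycleDeduplicator(period=300)  # 5-minute cycles
for offset, letter in enumerate("ABACCBFA"):
    dedup.process("letter", letter, now=offset * 30)  # one record every 30 s
dedup.flush(now=300)  # the ending time point of the first cycle is reached
```

With the eight records of the example arriving inside one 5-minute cycle, `dedup.results` contains a single entry for the attribute `letter` whose data is A, B, C and F.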

It should be noted that, after determining the data attribute of the first data, the method for data deduplication provided by the embodiment of the application directly stores the first data according to the Bitmap algorithm, and does not store duplicate data, that is, the deduplication processing of the data in the deduplication period is completed in the process, so that the problem that the storage space is wasted due to the fact that the data belonging to the current deduplication period is stored completely is avoided. Moreover, if the stored data is not subjected to the deduplication process until the end time of the current deduplication cycle is reached, the amount of data to be processed is relatively large, and a calculation peak may occur, so that the data processing pressure of the second device is increased.
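The storage difference described above can be made concrete with a small comparison. This Python sketch assumes the eight records A, B, A, C, C, B, F, A arrive within one cycle; the variable names are illustrative only.

```python
received = list("ABACCBFA")

# Store everything, then deduplicate once at the end of the cycle.
buffer = []
for item in received:
    buffer.append(item)
peak_buffered = len(buffer)        # items held just before the cycle ends
result_at_end = sorted(set(buffer))

# Deduplicate on arrival, as in this embodiment.
data_set = set()
peak_stored = 0
for item in received:
    data_set.add(item)
    peak_stored = max(peak_stored, len(data_set))
result_incremental = sorted(data_set)
```

Both approaches yield the same four letters, but the first holds eight items at the end of the cycle while the second never holds more than four, and it has no deduplication work left when the cycle ends.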

The Bitmap algorithm may be replaced by Bitset, Roaring NavigableMap, and the like.
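To show why a bitmap suits this kind of deduplication, the following Python sketch implements a minimal bitmap on top of a single integer. It assumes that each piece of data can be mapped to a non-negative integer ID (for example a user ID); the class `Bitmap` here is written for illustration and is not the library class used by the embodiment.

```python
class Bitmap:
    """A minimal bitmap backed by one arbitrary-precision Python integer."""

    def __init__(self):
        self.bits = 0

    def add(self, item_id):
        """Set the bit for item_id; return True if it was not set before."""
        mask = 1 << item_id
        if self.bits & mask:
            return False  # bit already set: this ID is a duplicate
        self.bits |= mask
        return True

    def __contains__(self, item_id):
        return bool((self.bits >> item_id) & 1)

    def cardinality(self):
        """Number of distinct IDs stored."""
        return bin(self.bits).count("1")


# Hypothetical IDs arriving within one deduplication period.
bitmap = Bitmap()
newly_stored = [bitmap.add(i) for i in [3, 7, 3, 42, 7]]
```

Membership testing and insertion each touch one bit, so a repeated ID is detected without storing the ID a second time; `bitmap.cardinality()` then gives the number of distinct IDs in the period.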

Step 105: second deduplication data of the data attribute in the real-time data stream is determined based on the first deduplication data belonging to the data attribute determined in the plurality of deduplication cycles.

In implementation, when determining the second deduplication data of the data attribute in the real-time data stream, the determination may be performed by means of a sliding window, or may be performed by means of a scrolling window.

Wherein a window may be used to indicate a deduplication cycle.

In the sliding window manner, a large window is divided into multiple small windows, the first deduplication data corresponding to each small window is determined, and the second deduplication data of the data attribute in the real-time data stream within the large window is then determined from the first deduplication data of the small windows. Illustratively, the large window duration may be 1 hour and the small window duration 5 minutes. Every 5 minutes, the second deduplication data of the data attribute in the real-time data stream for the hour preceding the current time point is determined from the first deduplication data of the twelve 5-minute periods in that hour.

In the scrolling (tumbling) window manner, the first deduplication data in a single window is determined and is used directly as the second deduplication data of the data attribute in the real-time data stream. For example, the window duration may be 5 minutes. Every 5 minutes, the second deduplication data of the data attribute in the real-time data stream for the 5 minutes preceding the current time point is determined from the first deduplication data determined in those 5 minutes.
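The two window manners differ only in how many recent deduplication periods contribute to each result. The Python sketch below treats both uniformly; the function name `windowed_results` and the sample per-period data are assumptions for illustration.

```python
from collections import deque


def windowed_results(first_dedup_per_cycle, cycles_per_window):
    """After each cycle ends, return the union of the most recent cycles' data."""
    recent = deque(maxlen=cycles_per_window)  # drops the oldest cycle automatically
    results = []
    for cycle_data in first_dedup_per_cycle:
        recent.append(set(cycle_data))
        results.append(sorted(set().union(*recent)))
    return results


# Hypothetical first deduplication data of four consecutive cycles.
cycles = [["A", "B"], ["B", "C"], ["D"], ["A"]]
scrolling = windowed_results(cycles, cycles_per_window=1)
sliding = windowed_results(cycles, cycles_per_window=2)
```

With one cycle per window, each result is just that cycle's own data. With two cycles per window, each result merges the current cycle with the one before it, so B appears only once in the second result even though both cycles contained it.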

In one possible implementation manner, if the data is deduplicated by adopting a sliding window manner, determining, based on the first deduplication data belonging to the data attribute determined in a plurality of deduplication periods, a specific implementation of the second deduplication data of the data attribute in the real-time data stream may include: and after the current deduplication period is ended, acquiring first deduplication data in a specified number of deduplication periods corresponding to the data attribute before the current time point. And deleting the repeated data in the first repeated data in the designated number of repeated cycles to obtain third repeated data in the target time length corresponding to the current repeated cycle.

The target duration is the product of the period duration and the specified number. The target duration corresponds to the large window duration in the above example, and the period duration corresponds to the small window duration.

In implementation, after the current deduplication period is finished, first deduplication data in a specified number of deduplication periods corresponding to the data attribute before the current time point is obtained according to requirements, and second deduplication processing is carried out on the first deduplication data in the specified number of deduplication periods, so that third deduplication data of the data attribute in the target duration is obtained.

As an example, assuming that the period duration of the current deduplication period is 5 minutes and the specified number in the sliding window manner is 4, the target duration is 20 minutes. That is, every 5 minutes, the third deduplication data of the data attribute within the 20 minutes before the current time point is determined.

Illustratively, assume that the first deduplication data in the first 5 minutes is A and B, in the second 5 minutes is D, in the third 5 minutes is C and B, and in the fourth 5 minutes is F. The duplicates among A, B, D, C, B and F may then be deleted to obtain A, B, D, C and F, which are determined as the third deduplication data in the 20 minutes corresponding to the fourth 5 minutes.
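The merge step in the above example can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the application: the function name `merge_periods` and the use of Python lists are hypothetical, and the first occurrence of each item is kept only so that the result appears in the same order as listed above.

```python
from typing import Iterable, List


def merge_periods(periods: Iterable[Iterable[str]]) -> List[str]:
    """Combine the first deduplication data of several periods, dropping repeats.

    Hypothetical helper: keeps the first occurrence of each item.
    """
    seen = set()
    merged = []
    for period in periods:
        for item in period:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# First deduplication data of the four 5-minute periods in the example.
periods = [["A", "B"], ["D"], ["C", "B"], ["F"]]
third = merge_periods(periods)
print(third)  # ['A', 'B', 'D', 'C', 'F']
```

The repeated B from the third period is dropped, which gives the five items named in the example.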

In some embodiments, after the third deduplication data in the target duration corresponding to the current deduplication period is obtained, when the current deduplication period is the last deduplication period, the third deduplication data in the target durations corresponding to the multiple deduplication periods is determined as the second deduplication data of the data attribute in the real-time data stream.

That is, after the current deduplication period is finished, if the data of the data attribute is not received any more, the current deduplication period is considered to be the last deduplication period, and the third deduplication data in the target duration corresponding to the multiple deduplication periods is determined to be the second deduplication data of the data attribute in the real-time data stream.

Further, after the stored data in the data set is determined to be the first deduplication data belonging to the data attribute in the current deduplication period, the number of data in the data set corresponding to the current deduplication period can be counted.

Accordingly, a specific implementation of deleting the duplicate data in the first deduplication data in the specified number of deduplication periods may include: checking the first deduplication data in the specified number of deduplication periods against the number of data items in the data sets corresponding to those deduplication periods, and, when the check passes, deleting the duplicate data in the first deduplication data in the specified number of deduplication periods.

That is, the number of data items in the data set corresponding to each deduplication period may be counted. If, for each of the specified number of deduplication periods, the counted number of data items in the corresponding data set is the same as the number of data items in the first deduplication data of that period, the check is determined to pass; the currently determined deduplication periods are then the deduplication periods within the target duration corresponding to the current deduplication period, and the duplicate data in their first deduplication data may be deleted.

For example, assume that the large window duration is 20 minutes and the small window duration is 5 minutes. If the numbers of data items in the data sets corresponding to the 4 small windows before the current time are 2, 1 and 0, and the numbers of data items in the first deduplication data in the specified number of deduplication periods are likewise 2, 1 and 0, the check may be determined to pass, and the duplicate data in the first deduplication data in the specified number of deduplication periods may be deleted.
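The count-based check described above can be sketched as follows. This is a hedged illustration only: the names `counts_match` and `merge_if_verified` are hypothetical, the item sets are borrowed from the earlier four-period example (A and B; D; C and B; F), and returning `None` when the check fails is an assumption, since the description does not state what is done in that case.

```python
from typing import Optional, Sequence, Set


def counts_match(recorded_counts: Sequence[int],
                 period_data: Sequence[Set[str]]) -> bool:
    """Check that each period's recorded count equals the size of its data."""
    if len(recorded_counts) != len(period_data):
        return False
    return all(count == len(data)
               for count, data in zip(recorded_counts, period_data))


def merge_if_verified(recorded_counts: Sequence[int],
                      period_data: Sequence[Set[str]]) -> Optional[Set[str]]:
    """Delete duplicates across periods only when the check passes.

    Returning None on failure is an assumption made for this sketch.
    """
    if not counts_match(recorded_counts, period_data):
        return None
    merged: Set[str] = set()
    for data in period_data:
        merged |= data
    return merged


period_data = [{"A", "B"}, {"D"}, {"C", "B"}, {"F"}]
merged = merge_if_verified([2, 1, 2, 1], period_data)    # check passes
rejected = merge_if_verified([2, 1, 2, 0], period_data)  # last count differs
```

Here the check confirms that the periods fetched are the ones whose counts were recorded, before any cross-period duplicates are removed.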

It should be noted that, in the embodiment of the present application, the current deduplication period starts directly after the previous deduplication period corresponding to the data attribute ends, and the third deduplication data in the target duration corresponding to the current deduplication period is determined based on the first deduplication data corresponding to the data attribute in the multiple deduplication periods before the current time point. This avoids the case in which, because the current deduplication period contains no first deduplication data, the step of determining the third deduplication data is never started and the third deduplication data for the target duration corresponding to the current deduplication period remains that of the previous target duration. The problem of outdated data is thereby avoided.

For example, assume that the period duration is 5 minutes and the target duration is 20 minutes, and that the number of data items in the first deduplication data of the data attribute is 1 for 11:35-11:40, 1 for 11:40-11:45, 2 for 11:45-11:50, 1 for 11:50-11:55, and 0 for 11:55-12:00. At the current time point, the third deduplication data of the data attribute for 11:40-12:00 needs to be determined, but the data set of the data attribute for 11:55-12:00 contains no data. If the step of determining the third deduplication data for 11:40-12:00 were not started, the third deduplication data determined in the current deduplication period would still be that of the previous target duration and would therefore include outdated data. In the present application, the step of determining the third deduplication data for 11:40-12:00 is started even though the data set of the data attribute for 11:55-12:00 contains no data, so the third deduplication data obtained for 11:40-12:00 does not include outdated data.
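The effect described in this example can be sketched as follows, assuming that the first deduplication data of the most recent periods is kept in a fixed-length queue. The class name `SlidingWindow`, the method `close_period`, and the item values A to E are all hypothetical; only the per-period counts (1, 1, 2, 1, 0) come from the example.

```python
from collections import deque
from typing import Deque, Set


class SlidingWindow:
    """Keeps the first deduplication data of the most recent periods."""

    def __init__(self, specified_number: int) -> None:
        # A deque with maxlen drops the oldest period automatically.
        self._periods: Deque[Set[str]] = deque(maxlen=specified_number)

    def close_period(self, first_dedup: Set[str]) -> Set[str]:
        """Record a finished period and return the third deduplication data.

        The period is recorded even when it is empty, so the oldest
        period still leaves the window.
        """
        self._periods.append(set(first_dedup))
        merged: Set[str] = set()
        for period in self._periods:
            merged |= period
        return merged


window = SlidingWindow(specified_number=4)
window.close_period({"A"})               # 11:35-11:40, 1 item
window.close_period({"B"})               # 11:40-11:45, 1 item
window.close_period({"C", "D"})          # 11:45-11:50, 2 items
at_1155 = window.close_period({"E"})     # 11:50-11:55, 1 item
at_1200 = window.close_period(set())     # 11:55-12:00, no data
```

At 11:55 the result covers 11:35-11:55 and contains A. At 12:00 the empty period still triggers the computation, so the result covers 11:40-12:00 and A, which is now outdated, is no longer included.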

In another possible implementation manner, if the data is deduplicated in the rolling window manner, a specific implementation of determining the second deduplication data of the data attribute in the real-time data stream, based on the first deduplication data belonging to the data attribute determined in a plurality of deduplication periods, may include: determining the first deduplication data belonging to the data attribute in the plurality of deduplication periods as the second deduplication data of the data attribute in the real-time data stream.

That is, if the data is deduplicated by using a rolling window, when the current deduplication cycle is the last deduplication cycle, the first deduplication data belonging to the data attribute in the multiple deduplication cycles can be determined as the second deduplication data of the data attribute in the real-time data stream.

It should be noted that the embodiment of the present application describes the data deduplication method using only one data attribute as an example; for each of multiple data attributes, data deduplication may be implemented in the above manner. As an example, deduplication is performed per data attribute, and different data attributes do not interfere with each other. Because the first data of different data attributes is received at different time points, data belonging to different data attributes is stored into the data sets at different time points, so the starting time points of the deduplication periods corresponding to different data attributes may differ. The period durations of the deduplication periods corresponding to different data attributes are nevertheless the same, and data of each data attribute is deduplicated by the same method as described above.

In the embodiment of the application, a real-time data stream is received, and a data attribute of currently received first data belonging to the real-time data stream is determined. If the current deduplication period corresponding to the data attribute has started, the first data is stored in the data set corresponding to the current deduplication period when that data set does not already include the first data. The starting time point of the current deduplication period is either the time point at which the first data is stored in the data set or the ending time point of the previous deduplication period corresponding to the data set, and the period duration is a specified threshold. If the ending time point of the current deduplication period is reached, the data stored in the data set is determined as first deduplication data belonging to the data attribute in the current deduplication period. That is, before the ending time point of the current deduplication period is reached, first data is deduplicated as soon as it is received and is stored only if it is not already in the data set, which reduces the use of storage space; when the ending time point is reached, the data stored in the data set can be taken directly as the first deduplication data, which reduces the amount of computation performed by the device. Second deduplication data of the data attribute in the real-time data stream is then determined based on the first deduplication data belonging to the data attribute determined over the plurality of deduplication periods. Second deduplication data belonging to each of multiple data attributes in the real-time data stream can be determined in this manner, so that the data in the real-time data stream is deduplicated.
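The per-attribute handling summarized above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation of the application: the class `AttributeDeduper`, its methods, and the attribute names `song_id` and `user_id` are hypothetical, and time points are passed in explicitly as seconds so that the example is deterministic.

```python
from typing import Dict, List, Set


class AttributeDeduper:
    """In-period deduplication, kept separately for each data attribute."""

    def __init__(self, period_seconds: float) -> None:
        self.period_seconds = period_seconds
        self._start: Dict[str, float] = {}    # period start per attribute
        self._data: Dict[str, Set[str]] = {}  # data set per attribute

    def flush(self, attribute: str, now: float) -> List[Set[str]]:
        """Return the first deduplication data of every period ended by `now`."""
        finished: List[Set[str]] = []
        while (attribute in self._start
               and now >= self._start[attribute] + self.period_seconds):
            finished.append(self._data[attribute])
            # The next period starts where the previous one ended.
            self._start[attribute] += self.period_seconds
            self._data[attribute] = set()
        return finished

    def receive(self, attribute: str, item: str, now: float) -> List[Set[str]]:
        """Store `item` unless already present; report any periods that ended."""
        finished = self.flush(attribute, now)
        if attribute not in self._start:
            # Period not started: it starts when the first data is stored.
            self._start[attribute] = now
            self._data[attribute] = set()
        if item not in self._data[attribute]:
            self._data[attribute].add(item)
        return finished


deduper = AttributeDeduper(period_seconds=300)
deduper.receive("song_id", "A", now=0)
deduper.receive("song_id", "A", now=10)    # duplicate, not stored again
deduper.receive("user_id", "X", now=100)   # separate attribute, own period
deduper.receive("song_id", "B", now=20)
first_cycle = deduper.receive("song_id", "C", now=310)
second_cycle = deduper.flush("song_id", now=600)
user_cycle = deduper.flush("user_id", now=400)
```

In this run the `song_id` periods start at 0 and 300 seconds while the `user_id` period starts at 100 seconds, so the two attributes share the same period duration but have different starting time points.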

Fig. 3 is a schematic structural diagram of an apparatus for data deduplication, which may be implemented as part or all of the second device by software, hardware, or a combination of both, according to an exemplary embodiment. Referring to fig. 3, the apparatus may include: a receiving module 301, a first determining module 302, a storing module 303, a second determining module 304 and a third determining module 305.

a receiving module 301, configured to receive a real-time data stream;

a first determining module 302, configured to determine a data attribute of first data currently received, where the first data belongs to a real-time data stream;

a storage module 303, configured to, if the current deduplication period corresponding to the data attribute has started, store the first data into the data set corresponding to the current deduplication period when the data set does not include the first data, where a starting time point of the current deduplication period is the time point of storing the first data into the data set or the ending time point of the previous deduplication period corresponding to the data set, and a period duration is a specified threshold;

a second determining module 304, configured to determine, if the end time point of the current deduplication cycle is reached, the data stored in the data set as first deduplication data belonging to the data attribute in the current deduplication cycle;

a third determining module 305, configured to determine second deduplication data of the data attribute in the real-time data stream based on the first deduplication data belonging to the data attribute determined in the plurality of deduplication cycles.

In one possible implementation of the present application, the first determining module 302 is further configured to:

if the current deduplication period corresponding to the data attribute is not started, storing the first data into the data set, taking the current time point as the starting time point of the current deduplication period corresponding to the data set, and starting to time the current deduplication period.

In one possible implementation of the present application, the third determining module 305 is configured to:

deleting duplicate data in the first deduplication data in the specified number of deduplication periods to obtain third deduplication data in a target duration corresponding to the current deduplication period, wherein the target duration is the product of the period duration and the specified number;

and when the current deduplication period is the last deduplication period, determining third deduplication data in a target duration corresponding to the multiple deduplication periods as second deduplication data of the data attribute in the real-time data stream.

In one possible implementation of the present application, the third determining module 305 is further configured to:

when the check passes, the duplicate data in the first deduplication data in the specified number of deduplication periods is deleted.

first deduplication data belonging to the data attribute in a plurality of deduplication periods is determined as second deduplication data of the data attribute in the real-time data stream.

It should be noted that: in the data deduplication device provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the apparatus for data deduplication and the method embodiment for data deduplication provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiment are detailed in the method embodiment, and are not repeated here.

Fig. 4 is a schematic diagram of an apparatus according to an exemplary embodiment. The device 400 may be a server. The apparatus 400 includes a Central Processing Unit (CPU) 401, a system memory 404 including a Random Access Memory (RAM) 402 and a Read Only Memory (ROM) 403, and a system bus 405 connecting the system memory 404 and the central processing unit 401. Device 400 also includes a basic input/output system (I/O system) 406, which helps to transfer information between various devices within the computer, and a mass storage device 407 for storing an operating system 413, application programs 414, and other program modules 415.

The basic input/output system 406 includes a display 408 for displaying information and an input device 409, such as a mouse, keyboard, etc., for user input of information. Wherein both the display 408 and the input device 409 are coupled to the central processing unit 401 via an input output controller 410 coupled to the system bus 405. The basic input/output system 406 may also include an input/output controller 410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 410 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 407 is connected to the central processing unit 401 through a mass storage controller (not shown) connected to the system bus 405. The mass storage device 407 and its associated computer-readable medium provide non-volatile storage for the device 400. That is, mass storage device 407 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.

Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 404 and mass storage device 407 described above may be collectively referred to as memory.

According to various embodiments of the application, the device 400 may also be operated through a remote computer connected to it over a network, such as the Internet. That is, the device 400 may be connected to the network 412 through a network interface unit 411 coupled to the system bus 405, or the network interface unit 411 may be used to connect to other types of networks or remote computer systems (not shown).

The memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU.

Fig. 5 is a block diagram illustrating an apparatus 500 according to another exemplary embodiment. The device 500 may be a terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The device 500 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the apparatus 500 comprises: a processor 501 and a memory 502.

Processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 501 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in an awake state; the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 501 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the method of data deduplication provided by method embodiments of the present application.

In some embodiments, the apparatus 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502, and peripheral interface 503 may be connected by buses or signal lines. The individual peripheral devices may be connected to the peripheral device interface 503 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, touch display 505, camera assembly 506, audio circuitry 507, positioning assembly 508, and power supply 509.

Peripheral interface 503 may be used to connect at least one Input/Output (I/O) related peripheral to processor 501 and memory 502. In some embodiments, processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, memory 502, and peripheral interface 503 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.

The radio frequency circuit 504 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 504 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 504 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may also include NFC (Near Field Communication) related circuitry, which is not limited by the present application.

The display 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 505 is a touch display, the display 505 also has the ability to collect touch signals at or above the surface of the display 505. The touch signal may be input as a control signal to the processor 501 for processing. In this case, the display 505 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, there may be one display 505, disposed on the front panel of the device 500; in other embodiments, there may be at least two displays 505, respectively disposed on different surfaces of the device 500 or in a folded design; in still other embodiments, the display 505 may be a flexible display disposed on a curved surface or a folded surface of the device 500. The display 505 may even be arranged in a non-rectangular irregular pattern, that is, as an irregularly shaped screen. The display 505 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode) or other materials.

The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fused shooting functions. In some embodiments, camera assembly 506 may also include a flash. The flash may be a single color temperature flash or a dual color temperature flash. A dual color temperature flash is a combination of a warm light flash and a cold light flash, and can be used for light compensation under different color temperatures.

The audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 for voice communication. The microphone may be provided in a plurality of different locations of the apparatus 500 for stereo acquisition or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuitry 507 may also include a headphone jack.

The positioning component 508 is used to locate the current geographic location of the device 500 to enable navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

A power supply 509 is used to power the various components in the device 500. The power supply 509 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 509 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the device 500 further includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: an acceleration sensor 511, a gyro sensor 512, a pressure sensor 513, a fingerprint sensor 514, an optical sensor 515, and a proximity sensor 516.

The acceleration sensor 511 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the apparatus 500. For example, the acceleration sensor 511 may be used to detect components of gravitational acceleration on three coordinate axes. The processor 501 may control the touch display 505 to display a user interface in a landscape view or a portrait view according to a gravitational acceleration signal acquired by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 512 may detect a body direction and a rotation angle of the apparatus 500, and the gyro sensor 512 may collect a 3D motion of the user to the apparatus 500 in cooperation with the acceleration sensor 511. The processor 501 may implement the following functions based on the data collected by the gyro sensor 512: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.

The pressure sensor 513 may be disposed on a side frame of the device 500 and/or on an underlying layer of the touch screen 505. When the pressure sensor 513 is disposed on a side frame of the apparatus 500, a grip signal of the apparatus 500 by a user may be detected, and the processor 501 performs a left-right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the touch display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 505. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 514 is used for collecting the fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 514 may be provided on the front, back or side of the device 500. When a physical key or vendor logo is provided on the device 500, the fingerprint sensor 514 may be integrated with the physical key or vendor logo.

The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 505 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 505 is turned down. In another embodiment, the processor 501 may also dynamically adjust the shooting parameters of the camera assembly 506 based on the ambient light intensity collected by the optical sensor 515.

A proximity sensor 516, also known as a distance sensor, is typically provided on the front panel of the device 500. The proximity sensor 516 is used to collect the distance between the user and the front of the device 500. In one embodiment, when the proximity sensor 516 detects a gradual decrease in the distance between the user and the front face of the device 500, the processor 501 controls the touch display 505 to switch from the bright screen state to the off screen state; when the proximity sensor 516 detects that the distance between the user and the front of the device 500 gradually increases, the touch display 505 is controlled by the processor 501 to switch from the off-screen state to the on-screen state.

Those skilled in the art will appreciate that the structure shown in fig. 5 does not limit the apparatus 500, which may include more or fewer components than shown, may combine certain components, or may employ a different arrangement of components.

In some embodiments, a computer readable storage medium is also provided, in which a computer program is stored, which when executed by a processor, implements the steps of the method of deduplication of data in the above embodiments. For example, the computer readable storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

It is noted that the computer readable storage medium mentioned in the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.

It should be understood that all or part of the steps of the above-described embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, the steps may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.

That is, in some embodiments, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the steps of the method of deduplication of data described above.

The above embodiments are not intended to limit the present application, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present application should be included in the scope of the present application.

Claims

1. A method of deduplication of data, the method comprising:

receiving a real-time data stream;

determining second deduplication data of the data attribute in the real-time data stream by a sliding window mode or a rolling window mode based on first deduplication data belonging to the data attribute determined in a plurality of deduplication periods;

the sliding window mode comprises: after the current deduplication period ends, obtaining the first deduplication data in a specified number of deduplication periods corresponding to the data attribute before the current time point; deleting repeated data from the first deduplication data in the specified number of deduplication periods to obtain third deduplication data in a target duration corresponding to the current deduplication period, wherein the target duration is the product of the period duration and the specified number; and when the current deduplication period is the last deduplication period, determining the third deduplication data in the target durations corresponding to the plurality of deduplication periods as the second deduplication data of the data attribute in the real-time data stream;

the rolling window mode comprises: determining the first deduplication data belonging to the data attribute in the plurality of deduplication periods as the second deduplication data of the data attribute in the real-time data stream.
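
For illustration only, the two window modes recited in claim 1 may be sketched in Python as follows. This is a minimal sketch and not the claimed implementation. The function names `rolling_window` and `sliding_window`, the parameter names `period_sets` and `window`, and the representation of each period's first deduplication data as a Python set are all assumptions made for this example. The sketch further assumes that the specified number of periods includes the period that has just ended, and that early periods with fewer than the specified number of predecessors use whatever periods are available.

```python
from typing import Hashable, List, Sequence, Set


def rolling_window(period_sets: Sequence[Set[Hashable]]) -> List[Set[Hashable]]:
    """Rolling window mode: each period's first deduplication data is
    reported unchanged as the second deduplication data."""
    return [set(s) for s in period_sets]


def sliding_window(
    period_sets: Sequence[Set[Hashable]], window: int
) -> List[Set[Hashable]]:
    """Sliding window mode: after each period ends, merge the first
    deduplication data of the most recent `window` periods and drop repeats.

    `window` is the hypothetical name for the "specified number" of periods;
    the target duration would be period_duration * window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    results: List[Set[Hashable]] = []
    for end in range(1, len(period_sets) + 1):
        # Assumption: early periods use however many periods exist so far.
        start = max(0, end - window)
        merged: Set[Hashable] = set()
        for first_dedup in period_sets[start:end]:
            merged |= first_dedup  # set union removes repeated data
        results.append(merged)  # third deduplication data for this period
    return results


# Example: three periods of first deduplication data for one data attribute.
periods = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
print(rolling_window(periods))
print(sliding_window(periods, window=2))
```

In this sketch, the rolling window mode yields one non-overlapping result per period, whereas the sliding window mode yields one result per period that covers the target duration ending at that period, so adjacent results may share data.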

2. The method of claim 1, wherein after determining the data attribute of the first data currently received, further comprising:

3. The method of claim 1, wherein the determining the stored data in the data set as the first deduplication data belonging to the data attribute within a current deduplication cycle further comprises:

4. The method of claim 1, wherein the method further comprises:

5. An apparatus for deduplication of data, the apparatus comprising:

a receiving module for receiving a real-time data stream;

a third determining module, configured to determine, based on first deduplication data belonging to a data attribute determined in a plurality of deduplication periods, second deduplication data of the data attribute in the real-time data stream by a sliding window manner or a rolling window manner;

6. The apparatus of claim 5, wherein the first determination module is further to:

7. The apparatus of claim 5, wherein the third determination module is further for:

8. The apparatus of claim 5, wherein the third determination module is further for:

9. An apparatus comprising a memory for storing a computer program and a processor for executing the computer program stored on the memory to perform the steps of the method of any of the preceding claims 1-4.

10. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-4.