CN109522305B - Big data deduplication method and device - Google Patents

Big data deduplication method and device

Info

Publication number
CN109522305B
Authority
CN
China
Prior art keywords
data
redis
deduplicated
key value
occurrence time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811488881.0A
Other languages
Chinese (zh)
Other versions
CN109522305A (en
Inventor
郭冰
程广艺
罗天成
夏曙东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing China Transinfo Stock Co ltd
Original Assignee
Beijing China Transinfo Stock Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing China Transinfo Stock Co ltd filed Critical Beijing China Transinfo Stock Co ltd
Priority to CN201811488881.0A priority Critical patent/CN109522305B/en
Publication of CN109522305A publication Critical patent/CN109522305A/en
Application granted granted Critical
Publication of CN109522305B publication Critical patent/CN109522305B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data duplicate removal method and a device, wherein the method comprises the following steps: receiving data to be deduplicated, wherein the data to be deduplicated comprises occurrence time and a data character string; generating a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string; and inserting the Redis key value pair into the Redis server pair, and determining whether the data to be deduplicated is repeated data according to a returned result of the Redis server pair. According to the invention, the server cluster is used for carrying out big data deduplication, and data operation is dispersed to different nodes in the cluster environment as far as possible. And a key value pair database Redis with high concurrent access is adopted during deduplication, so that the minimal system resource occupation of deduplication operation is ensured from the space and time perspectives. The occurrence time of the data to be deduplicated is extended to a plurality of adjacent times, approximate data with close time can be effectively filtered, the deduplication accuracy is high, the precision is high, the universality is good, and the method can be applied to big data application scenes with various data having the characteristic of time continuity.

Description

Big data deduplication method and device
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a big data duplicate removal method and device.
Background
At present, big data technology is widely applied in various fields. In some big data application scenes, certain time continuity exists in data, for example, when a vehicle passes through a reader of a gate in traffic big data, the reader uploads vehicle passing records of the vehicle to a big data platform, the vehicle passing records have certain time continuity, and if the vehicle slowly moves or is static at the gate, the reader can repeatedly upload the vehicle passing records of the vehicle in a short time, so that the big data platform stores a lot of repeated or approximate data. Therefore, the big data platform needs to perform deduplication processing on the received data.
In the related art, a data deduplication method is provided: each time a piece of data is received within a deduplication period, a preset number of keywords are determined from the data, and it is judged whether any other data received within the same deduplication period contains those keywords. If so, the data is deleted; if not, the data is stored.
However, in the related technology, simple keyword deduplication cannot eliminate approximate data, deduplication accuracy is poor, a large amount of data redundancy still exists after deduplication, a large amount of storage space is wasted, information pollution is also formed, and real valuable information is covered.
Disclosure of Invention
In order to solve the problems, the invention provides a big data deduplication method and a big data deduplication device, which can effectively filter approximate data with close time by extending the occurrence time of data to be deduplicated to multiple close times, and have high deduplication accuracy and precision. The present invention solves the above problems by the following aspects.
In a first aspect, an embodiment of the present invention provides a big data deduplication method, where the method includes:
receiving data to be deduplicated, wherein the data to be deduplicated comprises occurrence time and a data character string;
generating a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string;
and inserting the Redis key value pair into a Redis server pair, and determining whether the data to be deduplicated is repeated data according to a returned result of the Redis server pair.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the generating, according to the occurrence time and the data character string, a Redis key-value pair corresponding to the data to be deduplicated includes:
generating a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data character string;
generating a key value corresponding to the Redis key according to the occurrence time;
and forming the Redis key and the key value into a Redis key value pair corresponding to the data to be deduplicated.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the generating a Redis key corresponding to the to-be-deduplicated data according to the occurrence time and the data character string includes:
calculating a period identifier corresponding to the data to be deduplicated according to the occurrence time and a preset period length;
and generating a Redis key corresponding to the data to be deduplicated according to the data character string and the period identifier.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the generating, according to the occurrence time, a key value corresponding to the Redis key includes:
expanding the occurrence time to a preset number of adjacent times;
and determining the preset number of adjacent times as the key values corresponding to the Redis key.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the determining, according to a returned result of the Redis server pair, whether the data to be deduplicated is duplicate data includes:
judging whether the returned result of the Redis server pair is equal to the number of key values included in the Redis key value pair;
if so, determining that the data to be deduplicated is not repeated data;
if not, determining that the data to be deduplicated is repeated data, and discarding the data to be deduplicated.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the inserting the Redis key value pair into a Redis server pair further includes:
calculating a first boundary coefficient and a second boundary coefficient according to the occurrence time and a preset cycle length;
if the first boundary coefficient is smaller than or equal to a preset threshold value, generating a first boundary key value pair corresponding to the Redis key value pair, and inserting the first boundary key value pair into a Redis server pair;
and if the second boundary coefficient is smaller than or equal to the preset threshold, generating a second boundary key value pair corresponding to the Redis key value pair, and inserting the second boundary key value pair into the Redis server pair.
In a second aspect, an embodiment of the present invention provides a big data deduplication apparatus, where the apparatus includes:
the receiving module is used for receiving data to be deduplicated, wherein the data to be deduplicated comprises occurrence time and a data character string;
the generating module is used for generating a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string;
and the determining module is used for inserting the Redis key value pair into a Redis server pair and determining whether the data to be deduplicated is repeated data according to a return result of the Redis server pair.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, where the generating module includes:
the generating unit is used for generating a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data character string; generating a key value corresponding to the Redis key according to the occurrence time;
and the composition unit is used for composing the Redis key and the key value into a Redis key value pair corresponding to the data to be deduplicated.
In a third aspect, an embodiment of the present invention provides a big data deduplication device, including:
one or more processors;
storage means for storing one or more programs;
the one or more programs are executable by the one or more processors to cause the one or more processors to implement the method of the first aspect described above or any one of the possible implementations of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method according to the first aspect or any one of the possible implementation manners of the first aspect.
In the embodiment of the invention, data to be deduplicated is received, wherein the data to be deduplicated comprises occurrence time and a data character string; generating a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string; and inserting the Redis key value pair into the Redis server pair, and determining whether the data to be deduplicated is repeated data according to a returned result of the Redis server pair. And performing big data deduplication through the server cluster, and dispersing data operation to different nodes in the cluster environment as far as possible. And a key value pair database Redis with high concurrent access is adopted during deduplication, so that the minimal system resource occupation of deduplication operation is ensured from the space and time perspectives. The occurrence time of the data to be deduplicated is extended to a plurality of adjacent times, approximate data with close time can be effectively filtered, the deduplication accuracy is high, the precision is high, the universality is good, and the method can be applied to big data application scenes with various data having the characteristic of time continuity.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic diagram illustrating a network architecture on which a big data deduplication method provided in embodiment 1 of the present invention is based;
fig. 2 is a schematic flowchart illustrating a big data deduplication method provided in embodiment 1 of the present invention;
fig. 3 is a schematic flowchart illustrating another big data deduplication method provided in embodiment 1 of the present invention;
fig. 4 is a schematic flowchart illustrating the passing record deduplication provided in embodiment 1 of the present invention;
fig. 5 is a schematic structural diagram illustrating a big data deduplication device provided in embodiment 2 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example 1
The embodiment of the invention provides a big data deduplication method. Referring to fig. 1, the network architecture on which the method is based includes a data acquisition device, a server cluster, and Redis server pairs. The data acquisition device is used for acquiring data to be deduplicated and uploading the data to be deduplicated to the server cluster. The server cluster comprises a plurality of servers, and the execution subject of the embodiment of the invention is a server in the cluster. The deduplication work is dispersed to different nodes of the cluster environment as much as possible so as to obtain the maximum computing capacity. In the embodiment of the invention, a plurality of Redis server pairs are arranged, and each Redis server pair comprises a Redis main server and a Redis standby server. Intermediate data produced during big data deduplication is stored in the Redis server pairs, and the main and standby servers in each pair improve the stability of that storage. Redis is a Key-Value storage system that supports highly concurrent access. During big data deduplication, atomic operations are performed on this key-value database, which keeps resource occupation as low as possible in terms of both space and time.
The embodiment of the invention can be applied to various big data scenes with time continuity of data. For example, in a traffic big data scene, the passing record has time continuity. The reader of the gate uploads the vehicle passing record of the vehicle passing through the gate to the server cluster, wherein the vehicle passing record comprises the license plate number and the license plate type of the vehicle, the equipment identification and the equipment type of the reader and the occurrence time of the vehicle identified by the reader. The number plate type of the vehicle can be a large-sized vehicle or a small-sized vehicle. When a vehicle is slowly driven or is static at the gate, the reader of the gate can upload the vehicle passing records of the vehicle for many times in a short time, and the uploaded vehicle passing records are only different in occurrence time and are approximate data, so the method provided by the embodiment of the invention can be used for identifying the approximate data and deduplicating the approximate data, only the first vehicle passing record of the vehicle uploaded by the reader of the gate is reserved, the subsequent vehicle passing records uploaded for many times are removed, and the repeated data and the approximate data can be effectively filtered, so that excessive redundant data are prevented from being stored in the server cluster.
Referring to fig. 2, the method specifically includes the following steps:
step 101: receiving data to be deduplicated, wherein the data to be deduplicated comprises occurrence time and a data character string.
The server receives data to be deduplicated uploaded by the data acquisition device, wherein the data to be deduplicated comprises occurrence time and a data character string. For example, in traffic big data, the server receives a vehicle passing record of a certain vehicle uploaded by the reader of a gate. The vehicle passing record comprises the occurrence time at which the reader identified the vehicle (for example, the 241st second), the license plate number and the license plate type of the vehicle, and the equipment identification and the equipment type of the reader. The data character string is the character string included in the data to be deduplicated; in this example it is composed of the license plate number and the license plate type of the vehicle and the equipment identification and the equipment type of the reader.
In an embodiment of the present invention, the occurrence time may be UNIX time, which is the total number of seconds from 00:00:00 on January 1, 1970 to the time when the data acquisition device acquires the data to be deduplicated. The occurrence time may also be the total number of seconds from a certain preset time to the time when the data acquisition device acquires the data to be deduplicated, and the preset time may be 00:00:00 on January 1, 2018, or 00:00:00 on November 1, 2018, and the like.
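The two ways of expressing the occurrence time can be sketched in Python as follows. This is only an illustrative sketch: the function name `occurrence_time`, the sample timestamp, and the choice of UTC for the preset time are assumptions, not part of the embodiment.

```python
from datetime import datetime, timezone

# UNIX epoch: 00:00:00 on January 1, 1970 (UTC).
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def occurrence_time(collected_at: datetime, origin: datetime = UNIX_EPOCH) -> int:
    """Whole seconds elapsed from `origin` to the moment the data was collected."""
    return int((collected_at - origin).total_seconds())


# Hypothetical preset time: 00:00:00 on January 1, 2018 (UTC assumed).
origin_2018 = datetime(2018, 1, 1, tzinfo=timezone.utc)
# Hypothetical collection moment, 4 minutes and 1 second after the preset time.
sample = datetime(2018, 1, 1, 0, 4, 1, tzinfo=timezone.utc)

print(occurrence_time(sample, origin_2018))  # 241
```

Using a later preset time only shifts every occurrence time by a constant, so it mainly keeps the numbers smaller than UNIX time.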
Step 102: and generating a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string.
The embodiment of the invention generates the Redis key-value pair corresponding to the data to be deduplicated by the following operations of steps A1-A3, and specifically comprises the following steps:
A1: and generating a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data character string.
Specifically, according to the occurrence time and the preset period length included in the data to be deduplicated, the period identifier corresponding to the data to be deduplicated is calculated by the following formula (1).
CI = INT(t / CL)    (1)
In the above formula (1), CI is the period identifier, t is the occurrence time of the data to be deduplicated, CL is the preset period length, and INT() is the function that takes the integer part of its argument.
In the embodiment of the present invention, the preset period length may be one day or two days, etc., and the unit of the preset period length is expressed by seconds, that is, assuming that the preset period length is one day, the value of the preset period length is 86400 seconds of the total seconds of one day.
After the period identifier corresponding to the data to be deduplicated is calculated by the above formula (1), the Redis key corresponding to the data to be deduplicated is generated by the following formula (2) according to the data character string included in the data to be deduplicated and the calculated period identifier.
Key = "{a}-{CI}"    (2)
In the above formula (2), Key is a Redis Key, { a } is a data string included in the data to be deduplicated, and { CI } is a period identifier corresponding to the data to be deduplicated. The cycle identifier is used for identifying the current deduplication cycle to which the data to be deduplicated belongs.
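Formulas (1) and (2) can be illustrated with a short Python sketch. The function names and the sample data character string are hypothetical, and floor division is assumed to stand in for INT() since the occurrence time is non-negative.

```python
def period_identifier(t: int, cycle_length: int) -> int:
    """Formula (1): CI = INT(t / CL), with floor division assumed for INT()."""
    return t // cycle_length


def redis_key(data_string: str, t: int, cycle_length: int) -> str:
    """Formula (2): Key = "{a}-{CI}"."""
    return f"{data_string}-{period_identifier(t, cycle_length)}"


# Hypothetical data character string: plate number, plate type,
# device identifier, and device type joined with "|".
print(redis_key("A12345|small|dev1|reader", 241, 100))
# A12345|small|dev1|reader-2
```

Because the period identifier is part of the key, records from different deduplication periods land in different Redis sets even when their data character strings are identical.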
A2: and generating a key value corresponding to the Redis key according to the occurrence time of the data to be deduplicated.
The occurrence time is expanded to a preset number of adjacent times, and the preset number of adjacent times are determined as the key values corresponding to the Redis key.
The preset number may be 3 or 5, etc. In the embodiment of the invention, in order to simplify the operation, the occurrence time of the data to be deduplicated is generalized to a certain time unit; specifically, the occurrence time is converted into a time representation in that time unit by the following formula (3):
M = INT(t / c)    (3)
In the above formula (3), M is the occurrence time after the time generalization operation, t is the occurrence time before the time generalization operation, and c is a preset time unit. The value of c may be 10 seconds or 20 seconds, etc. In the embodiment of the present invention, the deduplication period may be an integer multiple of the time unit c, such as 3c or 5c.
And after the occurrence time is converted to the time representation M under the time unit c, expanding the occurrence time to a preset number of adjacent times. For example, assuming that the preset number is 3, the occurrence time can be extended to 3 adjacent times (M-1), M, and (M + 1). Assuming that the predetermined number is 5, the occurrence time can be extended to 5 adjacent times (M-2), (M-1), M, (M +1), and (M + 2).
And taking the extended preset number of adjacent time as the key value of the Redis key corresponding to the data to be deduplicated.
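Formula (3) and the expansion to adjacent times can be sketched as below. The function name `adjacent_times` is hypothetical, and the sketch assumes the preset number is odd (as in the examples of 3 and 5) so that the adjacent times are symmetric around M.

```python
from typing import List


def adjacent_times(t: int, unit: int, count: int = 3) -> List[int]:
    """Generalize t to M = INT(t / c) and expand it to `count` adjacent times."""
    if count < 1 or count % 2 == 0:
        # Assumption of this sketch: only odd counts give a symmetric window.
        raise ValueError("count must be a positive odd number")
    m = t // unit          # formula (3), floor division assumed for INT()
    half = count // 2
    return [m + offset for offset in range(-half, half + 1)]


print(adjacent_times(241, 10, 3))  # [23, 24, 25]
print(adjacent_times(241, 10, 5))  # [22, 23, 24, 25, 26]
```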
A3: and forming the Redis key value pair corresponding to the data to be deduplicated by the generated Redis key and the key value.
Assuming that Key represents Redis Key, the Key values generated in step A2 are (M-1), M and (M +1), and then a Redis Key value pair (Key, ((M-1), M, (M +1))) corresponding to the data to be deduplicated can be composed.
Step 103: inserting the Redis key value pair corresponding to the data to be deduplicated into the Redis server pair, and determining whether the data to be deduplicated is repeated data according to a returned result of the Redis server pair.
Inserting the Redis key value pair corresponding to the data to be deduplicated into the Redis server pair through an insertion function SADD () of the Redis storage system. The insertion function SADD () is characterized as inserting one or more elements into a set, and elements that duplicate elements already present in the set are ignored and only elements that are not currently present in the set are inserted. The return value of the insert function SADD () is the number of elements inserted in the set. Therefore, the server cluster receives the return result sent by the Redis server pair, and judges whether the return result of the Redis server pair is equal to the number of the key values included in the Redis key value pair or not; if so, determining that the data to be deduplicated is not the repeated data, and subsequently storing the data to be deduplicated. If not, namely the return result is smaller than the number of the key values included in the Redis key value pair, determining that the data to be deduplicated is repeated data, and discarding the data to be deduplicated.
For example, assuming that the Redis key value pair corresponding to the data to be deduplicated is (Key, ((M-1), M, (M+1))), an insertion operation SADD (Key, ((M-1), M, (M+1))) is executed in the Redis server pair, that is, the three elements (M-1), M and (M+1) are inserted into the set Key. If none of the three elements exists in the set Key, all three elements are inserted into the set Key and a return result with the value 3 is sent to the server cluster; the server cluster determines from the return result that the data to be deduplicated is not repeated data and stores the data to be deduplicated. If one or more of the three elements already exist in the set Key, the elements that already exist are ignored, the remaining elements are inserted into the set Key, and a return result with a value less than 3 is sent to the server cluster; the server cluster determines from the return result that the data to be deduplicated is repeated data and discards the data to be deduplicated.
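The decision rule based on the SADD return value can be sketched without a running Redis instance. The class `FakeRedisSets` below is a hypothetical in-memory stand-in that only mimics the behavior described above (existing members are ignored, and the number of newly added members is returned); with a real deployment, a client call such as redis-py's `sadd(name, *values)` would take its place.

```python
from collections import defaultdict


class FakeRedisSets:
    """Hypothetical in-memory stand-in mimicking the SADD return value."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, *members):
        target = self._sets[key]
        new_members = set(members) - target   # members not yet in the set
        target |= new_members                 # existing members are ignored
        return len(new_members)               # number of members actually added


def is_duplicate(store, key, values):
    """Duplicate if fewer members were added than key values were submitted."""
    added = store.sadd(key, *values)
    return added < len(values)


store = FakeRedisSets()
print(is_duplicate(store, "Key", [23, 24, 25]))  # False: all three were new
print(is_duplicate(store, "Key", [24, 25, 26]))  # True: 24 and 25 already existed
```

Because the insert and the check happen in one set operation, no separate lookup is needed before writing.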
Because the data to be deduplicated is continuous in time, it may cross a period boundary. If the occurrence time included in the data to be deduplicated is close to 00:00 of a day, the data to be deduplicated may be repeated data or approximate data with respect to data close to 24:00 of the previous day; therefore, the Redis key value pair of the deduplication period preceding the current deduplication period corresponding to the data to be deduplicated also needs to be inserted into the Redis server pair. If the occurrence time included in the data to be deduplicated is close to 24:00 of a day, the data to be deduplicated may be repeated data or approximate data with respect to data close to 00:00 of the following day; therefore, the Redis key value pair of the deduplication period following the current deduplication period corresponding to the data to be deduplicated needs to be inserted into the Redis server pair.
Specifically, as shown in fig. 3, before step 103 is executed, boundary-crossing processing is also performed through the following operations of steps S1-S5:
S1: and calculating a first boundary coefficient and a second boundary coefficient according to the occurrence time of the data to be deduplicated and the preset period length.
Calculating a first boundary coefficient and a second boundary coefficient by the following formulas (4) and (5) according to the occurrence time and the preset cycle length of the data to be deduplicated:
R1 = t % CL    (4)
R2 = CL - t % CL    (5)
In the above formulas (4) and (5), R1 is the first boundary coefficient, R2 is the second boundary coefficient, t is the occurrence time, CL is the preset period length, and % is the remainder operator.
S2: and judging whether the first boundary coefficient is less than or equal to a preset threshold value, if so, executing the step S3, and if not, executing the step S4.
The preset threshold may be c, 2c, etc., where c is the preset time unit. If the first boundary coefficient is less than or equal to the preset threshold, the data to be deduplicated may be repeated or approximate with respect to data near 24:00 of the previous day.
S3: and generating a first boundary key value pair corresponding to the Redis key value pair, inserting the first boundary key value pair into the Redis server pair, and then executing step 103.
And adjusting the Redis key included in the Redis key value pair generated in the step 102 to be the Redis key corresponding to the previous deduplication period of the current deduplication period. Specifically, the period identification in the Redis key included in the Redis key value pair generated in step 102 is decremented by one. Namely, the new Redis key is "{ a } - { CI-1 }", and the new Redis key and the key value included in the Redis key value pair generated in step 102 are combined into a first boundary key value pair. Assuming that the Redis key-value pair generated in step 102 includes key values of (M-1), M and (M +1), the first boundary key-value pair is ("{ a } - { CI-1 }", ((M-1), M, (M +1))), and the first boundary key-value pair is inserted into the Redis server pair. An insert operation SADD ("{ a } - { CI-1 }", ((M-1), M, (M +1))) is performed in the Redis server pair, i.e., three elements (M-1), M, and (M +1) are inserted into the set "{ a } - { CI-1 }".
Because the set "{ a } - { CI-1 }" has already performed the deduplication operation in the previous deduplication period, only the key values corresponding to the data to be deduplicated are inserted into the set "{ a } - { CI-1 }" in the current deduplication period, and a Redis server pair is not required to send the return result to the server cluster.
S4: and judging whether the second boundary coefficient is less than or equal to the preset threshold, if so, executing step S5, and if not, executing step 103.
The preset threshold may be c, 2c, etc., where c is the preset time unit. If the second boundary coefficient is less than or equal to the preset threshold, the data to be deduplicated may be repeated or approximate with respect to data near 00:00 of the following day.
S5: and generating a second boundary key value pair corresponding to the Redis key value pair, inserting the second boundary key value pair into the Redis server pair, and then executing step 103.
The Redis key included in the Redis key value pair generated in step 102 is adjusted to the Redis key corresponding to the deduplication period following the current deduplication period. Specifically, the period identifier in the Redis key included in the Redis key value pair generated in step 102 is incremented by one. That is, the new Redis key is "{ a } - { CI +1 }", and the new Redis key and the key values included in the Redis key value pair generated in step 102 are combined into a second boundary key value pair. Assuming that the Redis key value pair generated in step 102 includes the key values (M-1), M and (M+1), the second boundary key value pair is ("{ a } - { CI +1 }", ((M-1), M, (M+1))), and the second boundary key value pair is inserted into the Redis server pair. An insert operation SADD ("{ a } - { CI +1 }", ((M-1), M, (M+1))) is performed in the Redis server pair, i.e., the three elements (M-1), M, and (M+1) are inserted into the set "{ a } - { CI +1 }".
Because the set "{ a } - { CI +1 }" will be used for deduplication when the following deduplication period arrives, in the current deduplication period the key values corresponding to the data to be deduplicated only need to be inserted into the set "{ a } - { CI +1 }", and the Redis server pair is not required to send a return result to the server cluster.
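The boundary-crossing flow of steps S1-S5 can be sketched as a single function that decides which neighbouring period, if any, also receives the key values. The function name `boundary_key` is hypothetical; the sketch follows the order of fig. 3 as described above (the first boundary coefficient is checked first, and the second only if the first check fails) and writes the key in the compact "{a}-{CI}" form of formula (2).

```python
from typing import Optional


def boundary_key(data_string: str, t: int, cycle_length: int,
                 threshold: int) -> Optional[str]:
    """Return the Redis key of the neighbouring period that should also
    receive the key values, or None when t is not near a period boundary."""
    ci = t // cycle_length                 # formula (1)
    r1 = t % cycle_length                  # formula (4): distance from period start
    r2 = cycle_length - t % cycle_length   # formula (5): distance to period end
    if r1 <= threshold:                    # S2 -> S3: near the previous period
        return f"{data_string}-{ci - 1}"
    if r2 <= threshold:                    # S4 -> S5: near the following period
        return f"{data_string}-{ci + 1}"
    return None                            # go straight to step 103


print(boundary_key("a", 205, 100, 10))  # a-1
print(boundary_key("a", 295, 100, 10))  # a-3
print(boundary_key("a", 241, 100, 10))  # None
```

The key values written to the returned key are the same (M-1), M, (M+1) generated in step 102, and the return value of that extra insert is not used for the duplicate decision.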
The key values corresponding to the data to be deduplicated are preset number of near time expanded from the occurrence time of the data to be deduplicated, repeated data and approximate data are identified between the near time by using the characteristics of the Redis storage system, the approximate data in the big data can be effectively filtered, the deduplication precision is high, a large amount of redundant data is prevented from being stored, and the storage space is saved.
In order to facilitate understanding of the method provided by the embodiment of the invention, the following description takes the passing record deduplication as an example with reference to the accompanying drawings. For example, let the time unit c be 10 seconds, so that the deduplication period is 3 × c = 30 seconds, and let the preset period length CL be 100 seconds. Assume that vehicle A stays in front of device 1 from the 241st second and that device 1 reports the vehicle passing record of vehicle A successively at the 241st, 256th, and 271st seconds; the deduplication process is then as shown in fig. 4. In the end, only the first vehicle passing record is saved in the database, and the other two vehicle passing records are removed as duplicates.
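The example can be traced numerically with the sketch below, which applies formulas (1) to (3) with c = 10 and CL = 100. It does not reproduce fig. 4; the in-memory `sadd` helper is a hypothetical stand-in for the Redis SADD operation, and the data character string is invented for illustration. None of the three occurrence times lies within c of a period boundary, so boundary processing is omitted.

```python
from collections import defaultdict

C = 10     # time unit c, in seconds
CL = 100   # preset period length CL, in seconds

sets = defaultdict(set)  # hypothetical stand-in for the Redis sets


def sadd(key, members):
    """Add members to the set at `key`; return how many were newly added."""
    new = set(members) - sets[key]
    sets[key] |= new
    return len(new)


def process(data_string, t):
    m = t // C                              # formula (3)
    values = [m - 1, m, m + 1]              # three adjacent times
    key = f"{data_string}-{t // CL}"        # formulas (1) and (2)
    return "keep" if sadd(key, values) == len(values) else "discard"


results = [process("vehicleA|device1", t) for t in (241, 256, 271)]
print(results)  # ['keep', 'discard', 'discard']
```

At second 241 the values 23, 24, 25 are all new, so the record is kept. At second 256 the values 24 and 25 already exist, and at second 271 the value 26 already exists, so both later records are discarded.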
In the embodiment of the invention, the big data deduplication is carried out through the server cluster, and the data operation is dispersed to different nodes in the cluster environment as far as possible. And a key value pair database Redis with high concurrent access is adopted during deduplication, so that the minimal system resource occupation of deduplication operation is ensured from the space and time perspectives. The occurrence time of the data to be deduplicated is extended to a plurality of adjacent times, approximate data with close time can be effectively filtered, the deduplication accuracy is high, the precision is high, the universality is good, and the method can be applied to big data application scenes with various data having the characteristic of time continuity.
Example 2
Referring to fig. 5, an embodiment of the present invention provides a big data deduplication device, where the big data deduplication device is configured to perform the big data deduplication method provided in embodiment 1 above, and the device includes:
a receiving module 20, configured to receive data to be deduplicated, where the data to be deduplicated includes an occurrence time and a data character string;
a generating module 21, configured to generate a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string; and
a determining module 22, configured to insert the Redis key value pair into the Redis server pair, and to determine whether the data to be deduplicated is duplicate data according to a return result of the Redis server pair.
The generating module 21 includes:
a generating unit, configured to generate a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data character string, and to generate a key value corresponding to the Redis key according to the occurrence time; and
a composition unit, configured to combine the Redis key and the key value into a Redis key value pair corresponding to the data to be deduplicated.
The generating unit is configured to calculate a period identifier corresponding to the data to be deduplicated according to the occurrence time and a preset period length, and to generate the Redis key corresponding to the data to be deduplicated according to the data character string and the period identifier. It is further configured to expand the occurrence time to a preset number of adjacent times, and to take the preset number of adjacent times as the key values corresponding to the Redis key.
The determining module 22 is configured to determine whether the return result of the Redis server pair is equal to the number of key values included in the Redis key value pair; if so, to determine that the data to be deduplicated is not duplicate data; and if not, to determine that the data to be deduplicated is duplicate data and to discard it.
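The comparison performed by the determining module can be sketched as follows. The sketch assumes that the insertion is done with a set-add operation whose reply is the number of values that were newly added; the source only states that the return result is compared with the number of key values. `FakeSetClient` and `insert_and_check` are hypothetical names, and the fake client exists only so the example runs without a server. A real client such as redis-py exposes an `sadd(name, *values)` call of the same shape.

```python
class FakeSetClient:
    """In-memory stand-in that exposes only sadd(name, *values)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        target = self._sets.setdefault(name, set())
        new_values = set(values) - target
        target |= new_values
        return len(new_values)  # number of values that were not yet stored


def insert_and_check(client, key, values):
    """Insert the key values and return True if the record is a duplicate."""
    values = list(dict.fromkeys(values))  # drop repeated values, keep order
    added = client.sadd(key, *values)
    # Equal counts: every value was new, so the record is not a duplicate.
    return added != len(values)


client = FakeSetClient()
first = insert_and_check(client, "record:2", [23, 24, 25])   # all values new
second = insert_and_check(client, "record:2", [24, 25, 26])  # two already stored
print(first, second)  # False True
```

Because the check relies on a single insert call and its reply, no separate lookup is needed before writing, which is one way to read the claim that resource usage is kept low.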
In this embodiment of the invention, the device further includes the following module, which operates before the determining module 22 inserts the Redis key value pair into the Redis server pair:
a boundary-crossing processing module, configured to calculate a first boundary coefficient and a second boundary coefficient according to the occurrence time and the preset period length; if the first boundary coefficient is less than or equal to a preset threshold, to generate a first boundary key value pair corresponding to the Redis key value pair and insert the first boundary key value pair into the Redis server pair; and if the second boundary coefficient is less than or equal to the preset threshold, to generate a second boundary key value pair corresponding to the Redis key value pair and insert the second boundary key value pair into the Redis server pair.
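The boundary check can be illustrated with a short sketch. The two coefficients follow formulas (4) and (5) of the claims, R1 = t % CL and R2 = CL - t % CL. Everything else is an assumption made for illustration: the period identifier is taken as `t // CL`, the first boundary pair is assumed to belong to the previous period and the second to the next period, and the threshold of 10 is an arbitrary value. The function names are hypothetical.

```python
def boundary_coefficients(t, cl):
    """Return (R1, R2) for occurrence time t and period length cl."""
    r1 = t % cl        # formula (4): distance from the start of the period
    r2 = cl - t % cl   # formula (5): distance to the end of the period
    return r1, r2


def boundary_periods(t, cl, threshold):
    """Return the neighbouring period identifiers that need a boundary pair."""
    r1, r2 = boundary_coefficients(t, cl)
    period_id = t // cl                    # assumed period identifier
    neighbours = []
    if r1 <= threshold:
        neighbours.append(period_id - 1)   # assumed: first boundary -> previous period
    if r2 <= threshold:
        neighbours.append(period_id + 1)   # assumed: second boundary -> next period
    return neighbours


print(boundary_coefficients(241, 100))  # (41, 59)
print(boundary_periods(241, 100, 10))   # []
print(boundary_periods(203, 100, 10))   # [1]
print(boundary_periods(295, 100, 10))   # [3]
```

With CL = 100, a record at second 241 is far from both ends of its period and needs no boundary pair, whereas records at seconds 203 and 295 fall within the threshold of a period boundary and would each get one additional key value pair.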
Example 3
An embodiment of the present invention provides a big data deduplication apparatus, which includes one or more processors and one or more storage devices, where the one or more storage devices store one or more programs, and when the one or more programs are loaded and executed by the one or more processors, the big data deduplication method provided in embodiment 1 above is implemented.
Example 4
An embodiment of the present invention provides a computer-readable storage medium in which an executable program is stored; when the executable program is loaded and executed by a processor, the big data deduplication method provided in embodiment 1 above is implemented.
It should be noted that:
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the big data deduplication device according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (9)

1. A big data deduplication method, the method comprising:
receiving data to be deduplicated, wherein the data to be deduplicated comprises occurrence time and a data character string;
generating a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string;
calculating a first boundary coefficient and a second boundary coefficient according to the occurrence time of the data to be deduplicated and a preset period length, by the following formulas (4) and (5):
$$R_1 = t \bmod \mathrm{CL} \tag{4}$$
$$R_2 = \mathrm{CL} - (t \bmod \mathrm{CL}) \tag{5}$$
in the above formulas (4) and (5), $R_1$ is the first boundary coefficient, $R_2$ is the second boundary coefficient, $t$ is the occurrence time, and $\mathrm{CL}$ is the preset period length;
inserting the Redis key value pair into a Redis server pair, and determining whether the data to be deduplicated is repeated data according to a return result of the Redis server pair; if the first boundary coefficient is smaller than or equal to a preset threshold value, generating a first boundary key value pair corresponding to the Redis key value pair, and inserting the first boundary key value pair into a Redis server pair; and if the second boundary coefficient is smaller than or equal to the preset threshold, generating a second boundary key value pair corresponding to the Redis key value pair, and inserting the second boundary key value pair into the Redis server pair.
2. The method according to claim 1, wherein the generating a Redis key-value pair corresponding to the data to be deduplicated according to the occurrence time and the data string comprises:
generating a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data character string;
generating a key value corresponding to the Redis key according to the occurrence time;
and forming the Redis key and the key value into a Redis key value pair corresponding to the data to be deduplicated.
3. The method according to claim 2, wherein the generating a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data string comprises:
calculating a period identifier corresponding to the data to be deduplicated according to the occurrence time and a preset period length;
and generating a Redis key corresponding to the data to be deduplicated according to the data character string and the period identifier.
4. The method according to claim 2, wherein the generating, according to the occurrence time, a key value corresponding to the Redis key includes:
expanding the occurrence time to a preset number of adjacent times;
and determining the preset number of adjacent times as the key values corresponding to the Redis key.
5. The method according to claim 2, wherein the determining whether the data to be deduplicated is duplicate data according to the returned result of the Redis server pair includes:
judging whether the returned result of the Redis server pair is equal to the number of key values included in the Redis key value pair;
if so, determining that the data to be deduplicated is not repeated data;
if not, determining that the data to be deduplicated is repeated data, and discarding the data to be deduplicated.
6. A big data deduplication apparatus, the apparatus comprising:
a receiving module, configured to receive data to be deduplicated, wherein the data to be deduplicated comprises an occurrence time and a data character string;
a generating module, configured to generate a Redis key value pair corresponding to the data to be deduplicated according to the occurrence time and the data character string, and to calculate a first boundary coefficient and a second boundary coefficient according to the occurrence time of the data to be deduplicated and a preset period length, by the following formulas (4) and (5):
$$R_1 = t \bmod \mathrm{CL} \tag{4}$$
$$R_2 = \mathrm{CL} - (t \bmod \mathrm{CL}) \tag{5}$$
in the above formulas (4) and (5), $R_1$ is the first boundary coefficient, $R_2$ is the second boundary coefficient, $t$ is the occurrence time, and $\mathrm{CL}$ is the preset period length;
a determining module, configured to insert the Redis key value pair into a Redis server pair and to determine whether the data to be deduplicated is repeated data according to a return result of the Redis server pair; if the first boundary coefficient is smaller than or equal to a preset threshold, to generate a first boundary key value pair corresponding to the Redis key value pair and insert the first boundary key value pair into the Redis server pair; and if the second boundary coefficient is smaller than or equal to the preset threshold, to generate a second boundary key value pair corresponding to the Redis key value pair and insert the second boundary key value pair into the Redis server pair.
7. The apparatus of claim 6, wherein the generating module comprises:
a generating unit, configured to generate a Redis key corresponding to the data to be deduplicated according to the occurrence time and the data character string, and to generate a key value corresponding to the Redis key according to the occurrence time; and
a composition unit, configured to combine the Redis key and the key value into a Redis key value pair corresponding to the data to be deduplicated.
8. A big data deduplication apparatus, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs are executed by the one or more processors such that the one or more processors implement the method of any of claims 1-5.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201811488881.0A 2018-12-06 2018-12-06 Big data deduplication method and device Active CN109522305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811488881.0A CN109522305B (en) 2018-12-06 2018-12-06 Big data deduplication method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811488881.0A CN109522305B (en) 2018-12-06 2018-12-06 Big data deduplication method and device

Publications (2)

Publication Number Publication Date
CN109522305A CN109522305A (en) 2019-03-26
CN109522305B true CN109522305B (en) 2021-02-02

Family

ID=65794946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811488881.0A Active CN109522305B (en) 2018-12-06 2018-12-06 Big data deduplication method and device

Country Status (1)

Country Link
CN (1) CN109522305B (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant