CN109522305A - Big data deduplication method and device - Google Patents

Publication number
CN109522305A
Authority
CN
China
Legal status
Granted
Application number
CN201811488881.0A
Other languages
Chinese (zh)
Other versions
CN109522305B (en)
Inventor
郭冰
程广艺
罗天成
夏曙东
Current Assignee
Beijing Polytron Technologies Inc
Original Assignee
Beijing Polytron Technologies Inc
Application filed by Beijing Polytron Technologies Inc
Priority to CN201811488881.0A
Publication of CN109522305A
Application granted
Publication of CN109522305B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data deduplication method and device. The method comprises: receiving data to be deduplicated, the data comprising an occurrence time and a data string; generating, according to the occurrence time and the data string, a Redis key-value pair corresponding to the data to be deduplicated; and inserting the Redis key-value pair into a Redis server pair and determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data. The invention performs big data deduplication on a server cluster and distributes the computation across the different nodes of the cluster environment as far as possible. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies minimal system resources in terms of both space and time. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. Deduplication accuracy and precision are high, and the method is general-purpose: it can be applied to big data scenarios in which the data exhibit time continuity.

Description

Big data deduplication method and device
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a big data deduplication method and device.
Background
Big data technology is now widely applied in many fields. In some big data application scenarios the data exhibit a certain time continuity. In traffic big data, for example, when a vehicle passes the reader at a checkpoint, the reader uploads the vehicle's passage record to the big data platform. Vehicle passage records exhibit time continuity: if the vehicle moves slowly or remains stationary at the checkpoint, the reader repeatedly uploads passage records for that vehicle within a short time, so the big data platform stores many duplicate or approximate records. The platform therefore needs to deduplicate the data it receives.
The related art provides a data deduplication method in which, each time a piece of data is received within a deduplication cycle, a preset number of keywords are determined from the data, and it is judged whether any other data received within the same deduplication cycle contains those keywords. If so, the data is deleted; if not, the data is stored.
However, the simple keyword deduplication of the related art cannot eliminate approximate data, and its accuracy is poor. A large amount of redundant data remains after deduplication, which wastes storage space and creates information pollution that buries the genuinely valuable information.
Summary of the invention
To solve the above problems, the present invention provides a big data deduplication method and device. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively, with high accuracy and precision. The invention addresses these problems through the following aspects.
In a first aspect, an embodiment of the invention provides a big data deduplication method, the method comprising:
receiving data to be deduplicated, the data to be deduplicated comprising an occurrence time and a data string;
generating, according to the occurrence time and the data string, a Redis key-value pair corresponding to the data to be deduplicated;
inserting the Redis key-value pair into a Redis server pair, and determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
With reference to the first aspect, an embodiment of the invention provides a first possible implementation of the first aspect, wherein generating the Redis key-value pair corresponding to the data to be deduplicated according to the occurrence time and the data string comprises:
generating, according to the occurrence time and the data string, a Redis key corresponding to the data to be deduplicated;
generating, according to the occurrence time, the values corresponding to the Redis key;
forming the Redis key-value pair corresponding to the data to be deduplicated from the Redis key and the values.
With reference to the first possible implementation of the first aspect, an embodiment of the invention provides a second possible implementation of the first aspect, wherein generating the Redis key corresponding to the data to be deduplicated according to the occurrence time and the data string comprises:
calculating, according to the occurrence time and a preset cycle length, a cycle identifier corresponding to the data to be deduplicated;
generating, according to the data string and the cycle identifier, the Redis key corresponding to the data to be deduplicated.
With reference to the first possible implementation of the first aspect, an embodiment of the invention provides a third possible implementation of the first aspect, wherein generating the values corresponding to the Redis key according to the occurrence time comprises:
expanding the occurrence time into a preset number of neighboring times;
determining the preset number of neighboring times as the values corresponding to the Redis key.
With reference to the first possible implementation of the first aspect, an embodiment of the invention provides a fourth possible implementation of the first aspect, wherein determining whether the data to be deduplicated is duplicate data according to the result returned by the Redis server pair comprises:
judging whether the result returned by the Redis server pair equals the number of values contained in the Redis key-value pair;
if so, determining that the data to be deduplicated is not duplicate data;
if not, determining that the data to be deduplicated is duplicate data and discarding it.
With reference to the first aspect, an embodiment of the invention provides a fifth possible implementation of the first aspect, wherein before inserting the Redis key-value pair into the Redis server pair, the method further comprises:
calculating a first boundary coefficient and a second boundary coefficient according to the occurrence time and a preset cycle length;
if the first boundary coefficient is less than or equal to a preset threshold, generating a first boundary key-value pair corresponding to the Redis key-value pair and inserting the first boundary key-value pair into the Redis server pair;
if the second boundary coefficient is less than or equal to the preset threshold, generating a second boundary key-value pair corresponding to the Redis key-value pair and inserting the second boundary key-value pair into the Redis server pair.
In a second aspect, an embodiment of the invention provides a big data deduplication device, the device comprising:
a receiving module, configured to receive data to be deduplicated, the data to be deduplicated comprising an occurrence time and a data string;
a generation module, configured to generate, according to the occurrence time and the data string, a Redis key-value pair corresponding to the data to be deduplicated;
a determining module, configured to insert the Redis key-value pair into a Redis server pair and to determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
With reference to the second aspect, an embodiment of the invention provides a first possible implementation of the second aspect, wherein the generation module comprises:
a generation unit, configured to generate, according to the occurrence time and the data string, the Redis key corresponding to the data to be deduplicated, and to generate, according to the occurrence time, the values corresponding to the Redis key;
a composition unit, configured to form the Redis key-value pair corresponding to the data to be deduplicated from the Redis key and the values.
In a third aspect, an embodiment of the invention provides big data deduplication equipment, comprising:
one or more processors;
a storage device for storing one or more programs;
the one or more programs being executed by the one or more processors so that the one or more processors implement the method of the first aspect or of any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method of the first aspect or of any possible implementation of the first aspect.
In the embodiments of the invention, data to be deduplicated is received, the data comprising an occurrence time and a data string; a Redis key-value pair corresponding to the data to be deduplicated is generated according to the occurrence time and the data string; the Redis key-value pair is inserted into a Redis server pair, and whether the data to be deduplicated is duplicate data is determined according to the result returned by the Redis server pair. Big data deduplication is performed on a server cluster, and the computation is distributed across the different nodes of the cluster environment as far as possible. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies minimal system resources in terms of both space and time. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. Deduplication accuracy and precision are high, and the method is general-purpose: it can be applied to big data scenarios in which the data exhibit time continuity.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a schematic diagram of the network architecture on which the big data deduplication method provided by Embodiment 1 of the present invention is based;
Fig. 2 shows a flow diagram of a big data deduplication method provided by Embodiment 1 of the present invention;
Fig. 3 shows a flow diagram of another big data deduplication method provided by Embodiment 1 of the present invention;
Fig. 4 shows a flow diagram of vehicle passage record deduplication provided by Embodiment 1 of the present invention;
Fig. 5 shows a structural schematic diagram of a big data deduplication device provided by Embodiment 2 of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thoroughly understood and its scope fully conveyed to those skilled in the art.
Embodiment 1
The embodiment of the invention provides a big data deduplication method. Referring to Fig. 1, the network architecture on which the method is based comprises data acquisition equipment, a server cluster, and Redis server pairs. The data acquisition equipment collects the data to be deduplicated and uploads it to the server cluster. The server cluster comprises multiple servers, and the method is executed by a server; the cluster distributes the deduplication work across its nodes as far as possible to obtain maximum computing capacity. Multiple Redis server pairs are provided in this embodiment, each comprising a Redis primary server and a Redis standby server. The Redis server pairs store the intermediate data produced during deduplication, and the primary/standby arrangement improves the stability of that storage. Redis is a key-value storage system; performing atomic operations on a key-value database that supports highly concurrent access keeps resource usage to a minimum in terms of both space and time.
The embodiment can be applied to any big data scenario in which the data exhibit time continuity. In the traffic big data scenario, for example, vehicle passage records exhibit time continuity. The reader at a checkpoint uploads the passage record of each vehicle that passes the checkpoint to the server cluster. A passage record comprises the vehicle's license plate number and plate type, the reader's device identifier and device type, and the occurrence time at which the reader recognized the vehicle. The plate type may be, for example, large vehicle or small vehicle. When a vehicle moves slowly or remains stationary at the checkpoint, the reader uploads multiple passage records for the vehicle within a short time. These records differ only in their occurrence times and are approximate data of one another. The method provided by this embodiment identifies such approximate data and deduplicates it: only the first passage record uploaded by the checkpoint's reader for the vehicle is retained, and the subsequently uploaded records are removed. Duplicate and approximate data are thus filtered effectively, and the server cluster avoids storing excess redundant data.
Referring to Fig. 2, the method specifically comprises the following steps:
Step 101: receive data to be deduplicated, the data comprising an occurrence time and a data string.
The server receives the data to be deduplicated uploaded by the data acquisition equipment; the data comprises an occurrence time and a data string. For example, in traffic big data the server receives a vehicle passage record uploaded by a checkpoint reader. The record comprises the occurrence time at which the reader recognized the vehicle (the 241st second), the vehicle's license plate number and plate type, and the reader's device identifier and device type. The string composed of the license plate number, plate type, device identifier, and device type is the data string of the data to be deduplicated.
In this embodiment the occurrence time may be UNIX time, that is, the total number of seconds from 00:00:00 on January 1, 1970 to the moment the data acquisition equipment collects the data to be deduplicated. The occurrence time may also be the total number of seconds from some other preset moment to the moment of collection; the preset moment may be, for example, 00:00:00 on January 1, 2018 or 00:00:00 on November 1, 2018.
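The two ways of expressing the occurrence time can be sketched in Python as follows. This is an illustrative sketch only: the function name `occurrence_time`, the use of UTC, and the sample timestamp are assumptions and do not come from the patent.

```python
from datetime import datetime, timezone

# The UNIX epoch; UTC is an assumption made for this sketch.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def occurrence_time(collected_at, epoch=UNIX_EPOCH):
    """Whole seconds elapsed from `epoch` to the moment of collection."""
    return int((collected_at - epoch).total_seconds())


# Occurrence time counted from a preset moment (00:00:00 on January 1, 2018).
preset_moment = datetime(2018, 1, 1, tzinfo=timezone.utc)
sample = datetime(2018, 1, 1, 0, 4, 1, tzinfo=timezone.utc)  # hypothetical reading
print(occurrence_time(sample, preset_moment))  # 241
```

Counting from a recent preset moment keeps the numbers small, which is convenient for worked examples such as the 241st second used in this embodiment.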
Step 102: generate the Redis key-value pair corresponding to the data to be deduplicated according to the occurrence time and the data string.
In this embodiment the Redis key-value pair corresponding to the data to be deduplicated is generated through operations A1-A3 below:
A1: generate the Redis key corresponding to the data to be deduplicated according to the occurrence time and the data string.
Specifically, the cycle identifier corresponding to the data to be deduplicated is calculated from its occurrence time and the preset cycle length by formula (1):
CI = INT(t / CL) ... (1)
In formula (1), CI is the cycle identifier, t is the occurrence time of the data to be deduplicated, CL is the preset cycle length, and INT() is the function that rounds to an integer.
The preset cycle length may be, for example, one day or two days, and is expressed in seconds. If the preset cycle length is one day, its value is 86400, the total number of seconds in a day.
After the cycle identifier has been calculated by formula (1), the Redis key corresponding to the data to be deduplicated is generated from the data string and the cycle identifier by formula (2):
Key = "{a}-{CI}" ... (2)
In formula (2), Key is the Redis key, {a} is the data string of the data to be deduplicated, and {CI} is its cycle identifier. The cycle identifier identifies the current deduplication cycle to which the data to be deduplicated belongs.
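Formulas (1) and (2) can be sketched in Python as follows. The function names and the sample data string are hypothetical, and INT() is assumed here to round down, which the patent does not state explicitly.

```python
def cycle_identifier(t, cycle_length):
    """Formula (1): CI = INT(t / CL), with INT assumed to round down."""
    return t // cycle_length


def redis_key(data_string, t, cycle_length):
    """Formula (2): Key = "{a}-{CI}"."""
    return f"{data_string}-{cycle_identifier(t, cycle_length)}"


# Hypothetical data string: plate number, plate type, device id, device type.
record = "A12345|small|dev1|reader"
print(redis_key(record, 241, 100))  # A12345|small|dev1|reader-2
```

Because the cycle identifier is part of the key, records with the same data string but from different cycles land in different Redis sets.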
A2: generate the values corresponding to the Redis key according to the occurrence time of the data to be deduplicated.
The occurrence time is expanded into a preset number of neighboring times, and these neighboring times are taken as the values corresponding to the Redis key.
The preset number may be, for example, 3 or 5. To simplify computation, the occurrence time of the data to be deduplicated is first generalized to a time unit: formula (3) converts the occurrence time into a time expressed in that unit:
M = INT(t / c) ... (3)
In formula (3), M is the occurrence time after generalization, t is the occurrence time before generalization, and c is the preset time unit, whose value may be, for example, 10 seconds or 20 seconds. In this embodiment the deduplication cycle may be an integer multiple of the time unit c, for example 3c or 5c.
After the occurrence time has been converted into the time M expressed in the time unit c, it is expanded into the preset number of neighboring times. If the preset number is 3, the occurrence time is expanded into the 3 neighboring times (M-1), M, and (M+1). If the preset number is 5, it is expanded into the 5 neighboring times (M-2), (M-1), M, (M+1), and (M+2).
The preset number of neighboring times obtained in this way serve as the values of the Redis key corresponding to the data to be deduplicated.
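The generalization of formula (3) and the expansion into neighboring times can be sketched as follows. The function names are hypothetical, INT() is again assumed to round down, and the sketch assumes an odd preset number because both examples in the text (3 and 5) are symmetric around M.

```python
def generalize(t, unit):
    """Formula (3): M = INT(t / c), with INT assumed to round down."""
    return t // unit


def neighboring_times(t, unit, count=3):
    """Expand t into `count` neighboring times centered on M."""
    if count < 1 or count % 2 == 0:
        raise ValueError("count is assumed to be a positive odd number")
    m = generalize(t, unit)
    half = count // 2
    return [m + offset for offset in range(-half, half + 1)]


print(neighboring_times(241, 10, 3))  # [23, 24, 25]
print(neighboring_times(241, 10, 5))  # [22, 23, 24, 25, 26]
```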
A3: form the Redis key-value pair corresponding to the data to be deduplicated from the Redis key and the values generated above.
If Key denotes the Redis key and the values generated in step A2 are (M-1), M, and (M+1), the Redis key-value pair corresponding to the data to be deduplicated is (Key, ((M-1), M, (M+1))).
Step 103: insert the Redis key-value pair corresponding to the data to be deduplicated into the Redis server pair, and determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
The Redis key-value pair is inserted into the Redis server pair with the Redis insertion function SADD(). SADD() inserts one or more elements into a set: elements that already exist in the set are ignored, and only elements not yet present are inserted. Its return value is the number of elements actually inserted into the set. The server cluster therefore receives the result returned by the Redis server pair and judges whether it equals the number of values in the Redis key-value pair. If so, the data to be deduplicated is determined not to be duplicate data and is subsequently stored. If not, that is, if the return result is less than the number of values in the Redis key-value pair, the data to be deduplicated is determined to be duplicate data and is discarded.
For example, if the Redis key-value pair corresponding to the data to be deduplicated is (Key, ((M-1), M, (M+1))), the insertion operation SADD(Key, ((M-1), M, (M+1))) is executed on the Redis server pair, that is, the three elements (M-1), M, and (M+1) are inserted into the set Key. If none of the three elements exists in the set Key, all three are inserted and a return result of 3 is sent to the server cluster, which determines from this result that the data to be deduplicated is not duplicate data and stores it. If one or more of the three elements already exist in the set Key, the existing elements are ignored as duplicates, the remaining elements are inserted into the set Key, and the return result sent to the server cluster is less than 3; the server cluster determines from this result that the data to be deduplicated is duplicate data and discards it.
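The SADD-based judgment of step 103 can be sketched without a running Redis server by using an in-memory stand-in. The class `InMemorySets` and the function `is_duplicate` are hypothetical names introduced for illustration; the stand-in only mimics the SADD behavior described above (ignore existing members, return the count of newly added ones). With a real server, a client call such as redis-py's `sadd(name, *values)` would presumably take its place.

```python
class InMemorySets:
    """Hypothetical stand-in that mimics the SADD semantics described above."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        # Existing members are ignored; the return value counts new members only.
        target = self._sets.setdefault(key, set())
        before = len(target)
        target.update(members)
        return len(target) - before


def is_duplicate(store, key, values):
    """Duplicate if fewer elements were added than values were supplied."""
    added = store.sadd(key, *values)
    return added != len(values)


store = InMemorySets()
print(is_duplicate(store, "a-2", [23, 24, 25]))  # False
print(is_duplicate(store, "a-2", [24, 25, 26]))  # True
```

Note that the check and the insertion are a single operation: even when a record is judged a duplicate, its not-yet-present neighboring times are still added to the set.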
Because the data to be deduplicated are continuous in time, a piece of data may straddle a cycle boundary. If its occurrence time is close to midnight (00:00) at the start of a day, it may duplicate or approximate data recorded just before 24:00 of the previous day, so the Redis key-value pair for the deduplication cycle preceding the current one must also be inserted into the Redis server pair. If its occurrence time is close to 24:00 at the end of a day, it may duplicate or approximate data recorded just after midnight of the following day, so the Redis key-value pair for the deduplication cycle following the current one must also be inserted into the Redis server pair.
Specifically, as shown in Fig. 3, before step 103 is executed, boundary processing is carried out through operations S1-S5 below:
S1: calculate the first boundary coefficient and the second boundary coefficient according to the occurrence time of the data to be deduplicated and the preset cycle length.
The first and second boundary coefficients are calculated from the occurrence time of the data to be deduplicated and the preset cycle length by formulas (4) and (5) respectively:
R1 = t % CL ... (4)
R2 = CL - t % CL ... (5)
In formulas (4) and (5), R1 is the first boundary coefficient, R2 is the second boundary coefficient, t is the occurrence time, and CL is the preset cycle length.
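Formulas (4) and (5) measure how far the occurrence time lies from the start and from the end of its cycle. A minimal Python sketch, with a hypothetical function name and sample times chosen for illustration:

```python
def boundary_coefficients(t, cycle_length):
    """Formulas (4) and (5): R1 = t % CL and R2 = CL - t % CL."""
    r1 = t % cycle_length   # seconds elapsed since the cycle started
    r2 = cycle_length - r1  # seconds remaining until the cycle ends
    return r1, r2


one_day = 86400
print(boundary_coefficients(3 * one_day + 5, one_day))  # (5, 86395)
print(boundary_coefficients(4 * one_day - 7, one_day))  # (86393, 7)
```

A small R1 means the record sits just after the start of its cycle; a small R2 means it sits just before the end.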
S2: judge whether the first boundary coefficient is less than or equal to the preset threshold; if so, execute step S3; if not, execute step S4.
The preset threshold may be, for example, c or 2c, where c is the preset time unit described above. If the first boundary coefficient is less than or equal to the preset threshold, the data to be deduplicated may duplicate or approximate data recorded just before 24:00 of the previous day.
S3: generate the first boundary key-value pair corresponding to the Redis key-value pair, insert the first boundary key-value pair into the Redis server pair, and then execute step 103.
The Redis key of the Redis key-value pair generated in step 102 is adjusted to the Redis key of the deduplication cycle preceding the current one. Specifically, the cycle identifier in that Redis key is decreased by one, so the new Redis key is "{a}-{CI-1}". The new Redis key and the values of the Redis key-value pair generated in step 102 form the first boundary key-value pair. If those values are (M-1), M, and (M+1), the first boundary key-value pair is ("{a}-{CI-1}", ((M-1), M, (M+1))), and it is inserted into the Redis server pair: the insertion operation SADD("{a}-{CI-1}", ((M-1), M, (M+1))) is executed on the Redis server pair, inserting the three elements (M-1), M, and (M+1) into the set "{a}-{CI-1}".
Because the set "{a}-{CI-1}" already underwent deduplication in the previous deduplication cycle, in the current cycle the values of the data to be deduplicated are merely inserted into the set "{a}-{CI-1}", and the Redis server pair does not need to send a return result to the server cluster.
S4: judge whether the second boundary coefficient is less than or equal to the preset threshold; if so, execute step S5; if not, execute step 103.
The preset threshold may be, for example, c or 2c, where c is the preset time unit described above. If the second boundary coefficient is less than or equal to the preset threshold, the data to be deduplicated may duplicate or approximate data recorded just after midnight of the following day.
S5: generate the second boundary key-value pair corresponding to the Redis key-value pair, insert the second boundary key-value pair into the Redis server pair, and then execute step 103.
The Redis key of the Redis key-value pair generated in step 102 is adjusted to the Redis key of the deduplication cycle following the current one. Specifically, the cycle identifier in that Redis key is increased by one, so the new Redis key is "{a}-{CI+1}". The new Redis key and the values of the Redis key-value pair generated in step 102 form the second boundary key-value pair. If those values are (M-1), M, and (M+1), the second boundary key-value pair is ("{a}-{CI+1}", ((M-1), M, (M+1))), and it is inserted into the Redis server pair: the insertion operation SADD("{a}-{CI+1}", ((M-1), M, (M+1))) is executed on the Redis server pair, inserting the three elements (M-1), M, and (M+1) into the set "{a}-{CI+1}".
Because the set "{a}-{CI+1}" will undergo deduplication when the following deduplication cycle arrives, in the current cycle the values of the data to be deduplicated are merely inserted into the set "{a}-{CI+1}", and the Redis server pair does not need to send a return result to the server cluster.
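The branching of operations S1-S5 can be sketched as a function that returns the adjacent-cycle keys into which the values should also be inserted before step 103. The function name `boundary_keys` is hypothetical, and INT() is assumed to round down. The sketch follows the flow as written, in which a positive answer at S2 leads directly to S3 and then to step 103 without evaluating S4.

```python
def boundary_keys(data_string, t, cycle_length, threshold):
    """Keys of adjacent cycles that should also receive the values (S1-S5)."""
    ci = t // cycle_length
    r1 = t % cycle_length          # S1: first boundary coefficient
    r2 = cycle_length - r1         # S1: second boundary coefficient
    if r1 <= threshold:            # S2 yes -> S3: previous cycle
        return [f"{data_string}-{ci - 1}"]
    if r2 <= threshold:            # S4 yes -> S5: following cycle
        return [f"{data_string}-{ci + 1}"]
    return []                      # no boundary processing needed


print(boundary_keys("a", 205, 100, 10))  # ['a-1']
print(boundary_keys("a", 295, 100, 10))  # ['a-3']
print(boundary_keys("a", 241, 100, 10))  # []
```

The values inserted under a boundary key are the same neighboring times as for the current key; the count returned by SADD for a boundary key is simply ignored.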
The values corresponding to the data to be deduplicated are the preset number of neighboring times expanded from its occurrence time. The characteristics of the Redis storage system are used to identify duplicate and approximate data among neighboring times, so that approximate data in big data are filtered effectively, deduplication precision is high, large amounts of redundant data are not stored, and storage space is saved.
To aid understanding of the method provided by this embodiment, it is explained below with reference to the drawings, taking vehicle passage record deduplication as an example. Let the time unit c = 10, the deduplication cycle 3*c = 30 seconds, and the preset cycle length CL = 100. Suppose vehicle a stops in front of device No. 1 from the 241st second, so that device No. 1 keeps reporting passage records for vehicle a, at the 241st, 256th, and 271st seconds in turn. The deduplication process is then as shown in Fig. 4. In the end only the first passage record is saved to the database; the other two are removed as duplicates.
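The worked example can be traced with a small self-contained Python sketch. It is an illustration under stated assumptions rather than the patented implementation: a plain dictionary of sets stands in for the Redis server pair, INT() is assumed to round down, the data string is reduced to the single character "a", and boundary processing is left out because with a threshold of c = 10 none of the three times lies within 10 seconds of a cycle boundary (R1 and R2 are 41/59, 56/44, and 71/29).

```python
def sadd(sets, key, members):
    """Mimic SADD on a dict of sets: add members, return how many were new."""
    target = sets.setdefault(key, set())
    before = len(target)
    target.update(members)
    return len(target) - before


def deduplicate(records, unit=10, cycle_length=100, count=3):
    """Return the records kept after deduplication, in arrival order."""
    sets, kept = {}, []
    half = count // 2
    for data_string, t in records:
        key = f"{data_string}-{t // cycle_length}"         # formulas (1), (2)
        m = t // unit                                      # formula (3)
        values = [m + d for d in range(-half, half + 1)]   # neighboring times
        if sadd(sets, key, values) == len(values):         # step 103
            kept.append((data_string, t))
    return kept


records = [("a", 241), ("a", 256), ("a", 271)]
print(deduplicate(records))  # [('a', 241)]
```

The record at second 241 inserts 23, 24, and 25 and is kept. The record at second 256 offers 24, 25, and 26; only 26 is new, so it is dropped. The record at second 271 offers 26, 27, and 28; because 26 was added by the previous record, only two elements are new, so it is dropped as well.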
In this embodiment, big data deduplication is performed on a server cluster, and the computation is distributed across the different nodes of the cluster environment as far as possible. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies minimal system resources in terms of both space and time. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. Deduplication accuracy and precision are high, and the method is general-purpose: it can be applied to big data scenarios in which the data exhibit time continuity.
Embodiment 2
Referring to Fig. 5, an embodiment of the present invention provides a big data deduplication device for executing the big data deduplication method provided in Embodiment 1 above. The device includes:
a receiving module 20, configured to receive data to be deduplicated, the data to be deduplicated including a time of origin and a data character string;
a generation module 21, configured to generate, according to the time of origin and the data character string, the Redis key-value pair corresponding to the data to be deduplicated;
a determining module 22, configured to insert the Redis key-value pair into the Redis server pair and to determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is repeated data.
The generation module 21 includes:
a generation unit, configured to generate, according to the time of origin and the data character string, the Redis key corresponding to the data to be deduplicated, and to generate, according to the time of origin, the key values corresponding to the Redis key;
a composing unit, configured to combine the Redis key and the key values into the Redis key-value pair corresponding to the data to be deduplicated.
The generation unit is configured to calculate, according to the time of origin and the predetermined period length, the period indication corresponding to the data to be deduplicated, and to generate, according to the data character string and the period indication, the Redis key corresponding to the data to be deduplicated. It is further configured to expand the time of origin into a preset number of neighboring times and to take the preset number of neighboring times as the key values corresponding to the Redis key.
The determining module 22 is configured to judge whether the result returned by the Redis server pair is equal to the number of key values included in the Redis key-value pair; if so, to determine that the data to be deduplicated is not repeated data; if not, to determine that the data to be deduplicated is repeated data and to discard the data to be deduplicated.
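The comparison performed by the determining module can be sketched as follows. The sketch assumes that the key values are stored as members of a Redis set and inserted with the SADD command, whose reply is the number of members that were not already present; the text here speaks only of a returned result, so this mapping is an assumption. `SetClient`, `FakeClient`, and `check_and_insert` are hypothetical names, and any client object exposing an `sadd(name, *values)` method (as the redis-py client does) could be passed in instead of the in-memory stand-in.

```python
from typing import Iterable, Protocol


class SetClient(Protocol):
    """Minimal interface needed by the sketch (hypothetical name)."""
    def sadd(self, name: str, *values: str) -> int: ...


def check_and_insert(client: SetClient, key: str, values: Iterable[str]) -> bool:
    """Return True if the record is new, False if it is repeated data."""
    members = list(dict.fromkeys(values))   # drop repeated values, keep order
    added = client.sadd(key, *members)
    # New only when the reply equals the number of key values sent.
    return added == len(members)


class FakeClient:
    """In-memory stand-in used only to make the sketch runnable."""
    def __init__(self):
        self.sets = {}

    def sadd(self, name, *values):
        bucket = self.sets.setdefault(name, set())
        new = set(values) - bucket
        bucket |= new
        return len(new)


client = FakeClient()
first = check_and_insert(client, "k:2", ["23", "24", "25"])
second = check_and_insert(client, "k:2", ["24", "25", "26"])
print(first, second)
```

Because the insertion and the count come back from a single command, the check needs no separate read before the write, which is one plausible reason for comparing the reply with the number of key values.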
In the embodiments of the present invention, the device further includes a module that acts before the determining module 22 inserts the Redis key-value pair into the Redis server pair:
a cross-border processing module, configured to calculate a first boundary coefficient and a second boundary coefficient according to the time of origin and the predetermined period length; if the first boundary coefficient is less than or equal to a preset threshold, to generate the first boundary key-value pair corresponding to the Redis key-value pair and insert the first boundary key-value pair into the Redis server pair; and if the second boundary coefficient is less than or equal to the preset threshold, to generate the second boundary key-value pair corresponding to the Redis key-value pair and insert the second boundary key-value pair into the Redis server pair.
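The formulas for the two boundary coefficients are not given in this passage. The sketch below assumes, purely for illustration, that the first coefficient is the distance from the time of origin to the start of its period and the second is the distance to the end of its period, and that a boundary key-value pair carries the same neighboring times under the key of the adjacent period. `THRESHOLD`, `boundary_pairs`, `neighbor_times`, and the key format are hypothetical.

```python
C = 10           # time unit, in seconds (value from the worked example)
CL = 100         # predetermined period length, in seconds
THRESHOLD = C    # hypothetical preset threshold


def neighbor_times(t):
    """Assumed expansion: the time unit of t and its two neighbors."""
    unit = t // C
    return [unit - 1, unit, unit + 1]


def boundary_pairs(data_string, t):
    """Return extra (key, values) pairs to insert before the main pair."""
    period = t // CL
    first = t - period * CL            # assumed first boundary coefficient
    second = (period + 1) * CL - t     # assumed second boundary coefficient
    values = neighbor_times(t)
    pairs = []
    if first <= THRESHOLD:
        # Close to the start of the period: also write to the previous period.
        pairs.append((f"{data_string}:{period - 1}", values))
    if second <= THRESHOLD:
        # Close to the end of the period: also write to the next period.
        pairs.append((f"{data_string}:{period + 1}", values))
    return pairs


near_start = boundary_pairs("device1|vehicle_a", 203)
mid_period = boundary_pairs("device1|vehicle_a", 241)
near_end = boundary_pairs("device1|vehicle_a", 295)
print(near_start, mid_period, near_end)
```

With these assumed definitions, a record at second 295 also writes its neighboring times under the key of the next period, so a later record at second 302, whose key belongs to that next period, would find overlapping values there and be treated as repeated data.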
In the embodiments of the present invention, big data deduplication is performed by a server cluster, and data operations are distributed as far as possible across the different nodes of the cluster environment. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies minimal system resources in terms of both space and time. By expanding the time of origin of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. Deduplication accuracy and precision are high and the method is general: it can be applied to any big data scenario in which the data are continuous in time.
Embodiment 3
An embodiment of the present invention provides big data deduplication equipment. The equipment includes one or more processors and one or more storage devices. One or more programs are stored in the one or more storage devices, and when the one or more programs are loaded and executed by the one or more processors, the big data deduplication method provided in Embodiment 1 above is implemented.
In the embodiments of the present invention, big data deduplication is performed by a server cluster, and data operations are distributed as far as possible across the different nodes of the cluster environment. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies minimal system resources in terms of both space and time. By expanding the time of origin of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. Deduplication accuracy and precision are high and the method is general: it can be applied to any big data scenario in which the data are continuous in time.
Embodiment 4
An embodiment of the present invention provides a computer-readable storage medium. An executable program is stored in the storage medium, and when the executable program is loaded and executed by a processor, the big data deduplication method provided in Embodiment 1 above is implemented.
In the embodiments of the present invention, big data deduplication is performed by a server cluster, and data operations are distributed as far as possible across the different nodes of the cluster environment. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies minimal system resources in terms of both space and time. By expanding the time of origin of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. Deduplication accuracy and precision are high and the method is general: it can be applied to any big data scenario in which the data are continuous in time.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other equipment. Various general-purpose devices may also be used with the teachings herein. From the description above, the structure required to construct such devices is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description given above of a specific language is intended to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the equipment of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments may be combined into one module, unit, or component, and they may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the virtual machine creation device according to embodiments of the invention. The invention may also be implemented as a device or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words may be interpreted as names.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A big data deduplication method, characterized in that the method comprises:
receiving data to be deduplicated, the data to be deduplicated comprising a time of origin and a data character string;
generating, according to the time of origin and the data character string, a Redis key-value pair corresponding to the data to be deduplicated;
inserting the Redis key-value pair into a Redis server pair, and determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is repeated data.
2. The method according to claim 1, characterized in that generating, according to the time of origin and the data character string, the Redis key-value pair corresponding to the data to be deduplicated comprises:
generating, according to the time of origin and the data character string, a Redis key corresponding to the data to be deduplicated;
generating, according to the time of origin, key values corresponding to the Redis key;
combining the Redis key and the key values into the Redis key-value pair corresponding to the data to be deduplicated.
3. The method according to claim 2, characterized in that generating, according to the time of origin and the data character string, the Redis key corresponding to the data to be deduplicated comprises:
calculating, according to the time of origin and a predetermined period length, a period indication corresponding to the data to be deduplicated;
generating, according to the data character string and the period indication, the Redis key corresponding to the data to be deduplicated.
4. The method according to claim 2, characterized in that generating, according to the time of origin, the key values corresponding to the Redis key comprises:
expanding the time of origin into a preset number of neighboring times;
taking the preset number of neighboring times as the key values corresponding to the Redis key.
5. The method according to claim 2, characterized in that determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is repeated data comprises:
judging whether the result returned by the Redis server pair is equal to the number of key values included in the Redis key-value pair;
if so, determining that the data to be deduplicated is not repeated data;
if not, determining that the data to be deduplicated is repeated data, and discarding the data to be deduplicated.
6. The method according to any one of claims 1-5, characterized in that, before inserting the Redis key-value pair into the Redis server pair, the method further comprises:
calculating a first boundary coefficient and a second boundary coefficient according to the time of origin and the predetermined period length;
if the first boundary coefficient is less than or equal to a preset threshold, generating a first boundary key-value pair corresponding to the Redis key-value pair, and inserting the first boundary key-value pair into the Redis server pair;
if the second boundary coefficient is less than or equal to the preset threshold, generating a second boundary key-value pair corresponding to the Redis key-value pair, and inserting the second boundary key-value pair into the Redis server pair.
7. A big data deduplication device, characterized in that the device comprises:
a receiving module, configured to receive data to be deduplicated, the data to be deduplicated comprising a time of origin and a data character string;
a generation module, configured to generate, according to the time of origin and the data character string, a Redis key-value pair corresponding to the data to be deduplicated;
a determining module, configured to insert the Redis key-value pair into a Redis server pair and to determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is repeated data.
8. The device according to claim 7, characterized in that the generation module comprises:
a generation unit, configured to generate, according to the time of origin and the data character string, a Redis key corresponding to the data to be deduplicated, and to generate, according to the time of origin, key values corresponding to the Redis key;
a composing unit, configured to combine the Redis key and the key values into the Redis key-value pair corresponding to the data to be deduplicated.
9. Big data deduplication equipment, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein the one or more programs are executed by the one or more processors, so that the one or more processors implement the method according to any one of claims 1-6.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN201811488881.0A 2018-12-06 2018-12-06 Big data deduplication method and device Active CN109522305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811488881.0A CN109522305B (en) 2018-12-06 2018-12-06 Big data deduplication method and device

Publications (2)

Publication Number Publication Date
CN109522305A true CN109522305A (en) 2019-03-26
CN109522305B CN109522305B (en) 2021-02-02

Family

ID=65794946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811488881.0A Active CN109522305B (en) 2018-12-06 2018-12-06 Big data deduplication method and device

Country Status (1)

Country Link
CN (1) CN109522305B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant