CN109522305A - Big data deduplication method and device
- Publication number
- CN109522305A (application number CN201811488881.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a big data deduplication method and device. The method comprises: receiving data to be deduplicated, which includes an occurrence time and a data string; generating a Redis key-value pair for the data to be deduplicated from the occurrence time and the data string; and inserting the Redis key-value pair into a Redis server pair and determining, from the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data. The invention performs big data deduplication on a server cluster, distributing the computation across different nodes of the cluster as far as possible. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies as few system resources as possible in both space and time. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. The method offers high deduplication accuracy and precision and good generality, and can be applied to any big data scenario in which the data exhibit temporal continuity.
Description
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a big data deduplication method and device.
Background
Big data technology is now widely used in many fields. In some big data scenarios the data exhibit a degree of temporal continuity. In traffic big data, for example, when a vehicle passes the reader at a checkpoint, the reader uploads a vehicle passage record to the big data platform. Passage records are continuous in time: if a vehicle moves slowly or stops at the checkpoint, the reader uploads passage records for that vehicle repeatedly within a short interval, and the platform ends up storing many duplicate or approximate records. The platform therefore needs to deduplicate the data it receives.
The related art currently provides one data deduplication method: within a deduplication period, each time a piece of data is received, a preset number of keywords is determined from it, and the other data received in that period are checked for any item containing those keywords. If such an item exists, the new data is deleted; if not, it is stored.
However, this simple keyword-based deduplication cannot eliminate approximate data, and its accuracy is poor. A large amount of redundant data remains after deduplication, which wastes storage space and also pollutes the information, obscuring the data that is actually valuable.
Summary of the invention
To solve the above problem, the invention provides a big data deduplication method and device. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively, with high deduplication accuracy and precision. The invention addresses the problem through the following aspects.
In a first aspect, an embodiment of the invention provides a big data deduplication method, the method comprising:
receiving data to be deduplicated, the data to be deduplicated including an occurrence time and a data string;
generating, from the occurrence time and the data string, a Redis key-value pair corresponding to the data to be deduplicated;
inserting the Redis key-value pair into a Redis server pair, and determining, from the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
With reference to the first aspect, an embodiment of the invention provides a first possible implementation of the first aspect, in which generating the Redis key-value pair corresponding to the data to be deduplicated from the occurrence time and the data string comprises:
generating, from the occurrence time and the data string, a Redis key corresponding to the data to be deduplicated;
generating, from the occurrence time, the key values corresponding to the Redis key;
combining the Redis key and the key values into the Redis key-value pair corresponding to the data to be deduplicated.
With reference to the first possible implementation of the first aspect, an embodiment of the invention provides a second possible implementation of the first aspect, in which generating the Redis key corresponding to the data to be deduplicated from the occurrence time and the data string comprises:
calculating, from the occurrence time and a preset period length, a period identifier corresponding to the data to be deduplicated;
generating, from the data string and the period identifier, the Redis key corresponding to the data to be deduplicated.
With reference to the first possible implementation of the first aspect, an embodiment of the invention provides a third possible implementation of the first aspect, in which generating the key values corresponding to the Redis key from the occurrence time comprises:
expanding the occurrence time into a preset number of neighboring times;
taking the preset number of neighboring times as the key values corresponding to the Redis key.
With reference to the first possible implementation of the first aspect, an embodiment of the invention provides a fourth possible implementation of the first aspect, in which determining whether the data to be deduplicated is duplicate data from the result returned by the Redis server pair comprises:
judging whether the result returned by the Redis server pair equals the number of key values contained in the Redis key-value pair;
if so, determining that the data to be deduplicated is not duplicate data;
if not, determining that the data to be deduplicated is duplicate data, and discarding it.
With reference to the first aspect, an embodiment of the invention provides a fifth possible implementation of the first aspect, in which, before the Redis key-value pair is inserted into the Redis server pair, the method further comprises:
calculating a first boundary coefficient and a second boundary coefficient from the occurrence time and the preset period length;
if the first boundary coefficient is less than or equal to a preset threshold, generating a first boundary key-value pair corresponding to the Redis key-value pair and inserting the first boundary key-value pair into the Redis server pair;
if the second boundary coefficient is less than or equal to the preset threshold, generating a second boundary key-value pair corresponding to the Redis key-value pair and inserting the second boundary key-value pair into the Redis server pair.
In a second aspect, an embodiment of the invention provides a big data deduplication device, the device comprising:
a receiving module, configured to receive data to be deduplicated, the data to be deduplicated including an occurrence time and a data string;
a generation module, configured to generate, from the occurrence time and the data string, a Redis key-value pair corresponding to the data to be deduplicated;
a determining module, configured to insert the Redis key-value pair into a Redis server pair and to determine, from the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
With reference to the second aspect, an embodiment of the invention provides a first possible implementation of the second aspect, in which the generation module comprises:
a generation unit, configured to generate, from the occurrence time and the data string, a Redis key corresponding to the data to be deduplicated, and to generate, from the occurrence time, the key values corresponding to the Redis key;
a composition unit, configured to combine the Redis key and the key values into the Redis key-value pair corresponding to the data to be deduplicated.
In a third aspect, an embodiment of the invention provides big data deduplication equipment, comprising:
one or more processors;
a storage device for storing one or more programs;
the one or more programs being executed by the one or more processors, so that the one or more processors implement the method of the first aspect or of any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of the first aspect or of any possible implementation of the first aspect.
In the embodiments of the invention, data to be deduplicated is received, the data including an occurrence time and a data string; a Redis key-value pair corresponding to the data is generated from the occurrence time and the data string; the Redis key-value pair is inserted into a Redis server pair, and whether the data is duplicate data is determined from the result returned by the Redis server pair. Big data deduplication is performed on a server cluster, with the computation distributed across different nodes of the cluster as far as possible. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies as few system resources as possible in both space and time. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. The method offers high deduplication accuracy and precision and good generality, and can be applied to any big data scenario in which the data exhibit temporal continuity.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a schematic diagram of the network architecture on which the big data deduplication method provided by Embodiment 1 of the invention is based;
Fig. 2 shows a flow diagram of a big data deduplication method provided by Embodiment 1 of the invention;
Fig. 3 shows a flow diagram of another big data deduplication method provided by Embodiment 1 of the invention;
Fig. 4 shows a flow diagram of vehicle passage record deduplication provided by Embodiment 1 of the invention;
Fig. 5 shows a structural diagram of a big data deduplication device provided by Embodiment 2 of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood thoroughly and its scope conveyed fully to those skilled in the art.
Embodiment 1
An embodiment of the invention provides a big data deduplication method. Referring to Fig. 1, the network architecture on which the method is based includes a data acquisition device, a server cluster, and Redis server pairs. The data acquisition device collects the data to be deduplicated and uploads it to the server cluster. The server cluster contains multiple servers, and the method of this embodiment is executed by a server; the cluster distributes the deduplication work across its nodes as far as possible to obtain the greatest computing capacity. Multiple Redis server pairs are provided in this embodiment, each comprising a Redis primary server and a Redis standby server. The Redis server pairs store the intermediate data produced during deduplication, and the primary/standby arrangement within each pair improves the stability of that storage. Redis is a key-value storage system; performing atomic operations on a key-value database that supports highly concurrent access keeps the resources occupied during deduplication to a minimum in both space and time.
The embodiment can be applied to any big data scenario in which the data are continuous in time. Vehicle passage records in a traffic big data scenario are one example. The reader at a checkpoint uploads a passage record for each vehicle that passes the checkpoint to the server cluster. A passage record includes the vehicle's license plate number and plate type, the reader's device identifier and device type, and the occurrence time at which the reader recognized the vehicle. The plate type may be, for example, large vehicle or small vehicle. When a vehicle moves slowly or stops at the checkpoint, the reader uploads passage records for it repeatedly within a short interval. These records differ from one another only in occurrence time and are approximate data of one another. The method provided by this embodiment identifies such approximate data and deduplicates it, keeping only the first passage record the reader uploaded for the vehicle and removing the records uploaded afterwards. Duplicate and approximate data are thereby filtered out effectively, and the server cluster avoids storing excessive redundant data.
Referring to Fig. 2, the method specifically includes the following steps:
Step 101: receive data to be deduplicated, the data including an occurrence time and a data string.
The server receives the data to be deduplicated uploaded by the data acquisition device; the data includes an occurrence time and a data string. For example, in traffic big data the server receives a passage record for a vehicle uploaded by the reader at a checkpoint. The record includes the occurrence time at which the reader recognized the vehicle (say, second 241), the vehicle's license plate number and plate type, and the reader's device identifier and device type. The string formed from the license plate number, plate type, device identifier and device type is the data string of the data to be deduplicated.
In this embodiment the occurrence time may be UNIX time, that is, the total number of seconds from 00:00:00 on January 1, 1970 to the moment the data acquisition device collected the data. The occurrence time may also be the total number of seconds from some other predetermined moment to the moment of collection; the predetermined moment may be, for example, 00:00:00 on January 1, 2018 or 00:00:00 on November 1, 2018.
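As a minimal sketch of the two time bases described above, the following Python computes the occurrence time as whole seconds since a chosen starting moment. The function name `seconds_since`, the two constants, and the use of UTC are assumptions made for illustration; the patent does not specify a time zone or an implementation.

```python
from datetime import datetime, timezone

# Illustrative starting moments; UTC is an assumption, not stated in the source.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
CUSTOM_EPOCH = datetime(2018, 1, 1, tzinfo=timezone.utc)


def seconds_since(moment: datetime, epoch: datetime) -> int:
    """Return the whole number of seconds elapsed from `epoch` to `moment`."""
    return int((moment - epoch).total_seconds())


# A record collected 4 minutes and 1 second after the custom starting moment.
collected_at = datetime(2018, 1, 1, 0, 4, 1, tzinfo=timezone.utc)
occurrence_time = seconds_since(collected_at, CUSTOM_EPOCH)
```

A starting moment closer to the present keeps the occurrence times, and therefore the values stored in Redis, smaller than full UNIX timestamps would be.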
Step 102: generate, from the occurrence time and the data string, the Redis key-value pair corresponding to the data to be deduplicated.
In this embodiment the Redis key-value pair is generated through operations A1 to A3 below:
A1: generate, from the occurrence time and the data string, the Redis key corresponding to the data to be deduplicated.
Specifically, the period identifier corresponding to the data to be deduplicated is calculated from its occurrence time and the preset period length by formula (1):
CI = INT(t / CL)    (1)
In formula (1), CI is the period identifier, t is the occurrence time of the data to be deduplicated, CL is the preset period length, and INT() is the integer (rounding-down) function.
In this embodiment the preset period length may be one day, two days, and so on, and is expressed in seconds. If the preset period length is one day, for example, its value is 86400, the number of seconds in a day.
After the period identifier has been calculated by formula (1), the Redis key corresponding to the data to be deduplicated is generated from the data string and the period identifier by formula (2):
Key = "{a}-{CI}"    (2)
In formula (2), Key is the Redis key, {a} is the data string of the data to be deduplicated, and {CI} is the corresponding period identifier. The period identifier identifies the current deduplication period to which the data belongs.
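Formulas (1) and (2) can be sketched in a few lines of Python. The function names and the sample data string below are hypothetical, and INT() is read here as rounding down, which matches integer division for the non-negative occurrence times used in the text.

```python
def period_identifier(t: int, period_length: int) -> int:
    """Formula (1): CI = INT(t / CL), using floor division for non-negative t."""
    return t // period_length


def redis_key(data_string: str, t: int, period_length: int) -> str:
    """Formula (2): Key = "{a}-{CI}", the data string joined to the period identifier."""
    return f"{data_string}-{period_identifier(t, period_length)}"


# Hypothetical data string: plate number, plate type, device id, device type.
sample_key = redis_key("A12345|small|dev1|reader", 241, 100)
```

Because the period identifier is part of the key, records with the same data string that fall in different periods land in different Redis sets.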
A2: generate, from the occurrence time of the data to be deduplicated, the key values corresponding to the Redis key.
The occurrence time is expanded into a preset number of neighboring times, and these neighboring times are taken as the key values corresponding to the Redis key.
The preset number may be 3, 5, and so on. To simplify computation, this embodiment first generalizes the occurrence time to a coarser time unit, converting it into a time expressed in that unit by formula (3):
M = INT(t / c)    (3)
In formula (3), M is the occurrence time after generalization, t is the occurrence time before generalization, and c is the preset time unit. The value of c may be 10 seconds, 20 seconds, and so on. In this embodiment the deduplication period may be an integer multiple of the time unit c, for example 3c or 5c.
After the occurrence time has been converted into M, its expression in time unit c, it is expanded into the preset number of neighboring times. If the preset number is 3, for example, the occurrence time is expanded into the 3 neighboring times (M-1), M and (M+1). If the preset number is 5, it is expanded into the 5 neighboring times (M-2), (M-1), M, (M+1) and (M+2).
The neighboring times obtained by this expansion serve as the key values of the Redis key corresponding to the data to be deduplicated.
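The generalization in formula (3) and the expansion into neighboring times might look as follows. This is a sketch under the assumption that the preset number is odd, so that the neighbors are symmetric around M as in the two examples in the text; the function name is hypothetical.

```python
def neighboring_times(t: int, unit: int, count: int = 3) -> list[int]:
    """Generalize t to M = INT(t / c), then expand M into `count` neighboring times.

    `count` is assumed to be a positive odd number (3, 5, ...), so the result
    is centered on M.
    """
    if count < 1 or count % 2 == 0:
        raise ValueError("count must be a positive odd number")
    m = t // unit          # formula (3)
    half = count // 2
    return [m + offset for offset in range(-half, half + 1)]
```

With c = 10, an occurrence time of 241 generalizes to M = 24, so two records less than about two time units apart will share at least one neighboring time.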
A3: combine the Redis key and the key values generated above into the Redis key-value pair corresponding to the data to be deduplicated.
Suppose the Redis key is denoted Key and the key values generated in step A2 are (M-1), M and (M+1). The Redis key-value pair corresponding to the data to be deduplicated is then (Key, ((M-1), M, (M+1))).
Step 103: insert the Redis key-value pair corresponding to the data to be deduplicated into the Redis server pair, and determine from the result returned by the Redis server pair whether the data is duplicate data.
The Redis key-value pair is inserted into the Redis server pair with the insertion function SADD() of the Redis storage system. SADD() inserts one or more elements into a set; elements that already exist in the set are ignored, and only elements not yet present are inserted. The return value of SADD() is the number of elements actually inserted into the set. The server cluster therefore receives the result returned by the Redis server pair and judges whether it equals the number of key values contained in the Redis key-value pair. If it does, the data to be deduplicated is determined not to be duplicate data and is subsequently stored. If it does not, that is, if the returned result is smaller than the number of key values contained in the Redis key-value pair, the data is determined to be duplicate data and is discarded.
For example, suppose the Redis key-value pair corresponding to the data to be deduplicated is (Key, ((M-1), M, (M+1))). The insertion operation SADD(Key, ((M-1), M, (M+1))) is executed on the Redis server pair, that is, the three elements (M-1), M and (M+1) are inserted into the set Key. If none of the three elements exists in the set Key, all three are inserted and a result of 3 is returned to the server cluster; from this result the server cluster determines that the data is not duplicate data and stores it. If one or more of the three elements already exist in the set Key, the existing elements are ignored, the elements not yet present are inserted, and the result returned to the server cluster is less than 3; from this result the server cluster determines that the data is duplicate data and discards it.
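The decision rule in step 103 can be illustrated without a running Redis server. The class below is an in-memory stand-in, written only for this sketch, that mimics the one SADD property the method relies on: the return value counts the elements that were newly inserted. The names `InMemorySetStore` and `is_duplicate` are hypothetical; with a real client such as redis-py the equivalent call would be `client.sadd(key, *members)`.

```python
from collections import defaultdict


class InMemorySetStore:
    """Stand-in for a Redis server pair; only SADD-like behavior is modeled."""

    def __init__(self) -> None:
        self._sets: dict[str, set[int]] = defaultdict(set)

    def sadd(self, key: str, *members: int) -> int:
        """Insert members into the set at `key`; return how many were new."""
        target = self._sets[key]
        before = len(target)
        target.update(members)
        return len(target) - before


def is_duplicate(store: InMemorySetStore, key: str, members: list[int]) -> bool:
    """A record is a duplicate when fewer elements were inserted than submitted."""
    inserted = store.sadd(key, *members)
    return inserted != len(set(members))


store = InMemorySetStore()
first = is_duplicate(store, "a-2", [23, 24, 25])   # nothing stored yet
second = is_duplicate(store, "a-2", [24, 25, 26])  # 24 and 25 already present
```

Note that the duplicate record still inserts its unseen neighboring times (26 here), which is what lets a long run of closely spaced records keep being recognized as duplicates.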
Because the data to be deduplicated are continuous in time, a record may straddle a period boundary. If the occurrence time of the data is close to 00:00 of a day, the data may be duplicate or approximate data of records from just before 24:00 of the previous day, so the Redis key-value pair for the deduplication period preceding the data's current period must also be inserted into the Redis server pair. If the occurrence time is close to 24:00 of a day, the data may be duplicate or approximate data of records from just after 00:00 of the following day, so the Redis key-value pair for the deduplication period following the data's current period must also be inserted into the Redis server pair.
Specifically, as shown in Fig. 3, before step 103 is executed, cross-boundary processing is carried out through operations S1 to S5 below:
S1: calculate the first boundary coefficient and the second boundary coefficient from the occurrence time of the data to be deduplicated and the preset period length.
The first and second boundary coefficients are calculated from the occurrence time and the preset period length by formulas (4) and (5) respectively:
R1 = t % CL    (4)
R2 = CL - t % CL    (5)
In formulas (4) and (5), R1 is the first boundary coefficient, R2 is the second boundary coefficient, t is the occurrence time, and CL is the preset period length.
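Formulas (4) and (5) measure how far the occurrence time lies from the start and from the end of its period. A minimal Python sketch, with a hypothetical function name:

```python
def boundary_coefficients(t: int, period_length: int) -> tuple[int, int]:
    """Return (R1, R2): seconds since the period started and seconds until it ends."""
    r1 = t % period_length            # formula (4)
    r2 = period_length - r1           # formula (5)
    return r1, r2
```

For t = 241 and CL = 100 this gives R1 = 41 and R2 = 59, so the record is well inside its period; the two coefficients always sum to CL.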
S2: judge whether the first boundary coefficient is less than or equal to the preset threshold; if so, execute step S3; if not, execute step S4.
The preset threshold may be c, 2c, and so on, where c is the preset time unit above. If the first boundary coefficient is less than or equal to the preset threshold, the data to be deduplicated may be duplicate or approximate data of records from just before 24:00 of the previous day.
S3: generate the first boundary key-value pair corresponding to the Redis key-value pair, insert the first boundary key-value pair into the Redis server pair, and then execute step 103.
The Redis key contained in the Redis key-value pair generated in step 102 is adjusted to the Redis key of the deduplication period preceding the current one. Specifically, the period identifier in that Redis key is decreased by one, so the new Redis key is "{a}-{CI-1}". The new Redis key and the key values of the Redis key-value pair generated in step 102 form the first boundary key-value pair. Suppose the key values are (M-1), M and (M+1); the first boundary key-value pair is then ("{a}-{CI-1}", ((M-1), M, (M+1))), and it is inserted into the Redis server pair. The insertion operation SADD("{a}-{CI-1}", ((M-1), M, (M+1))) is executed on the Redis server pair, that is, the three elements (M-1), M and (M+1) are inserted into the set "{a}-{CI-1}".
Because deduplication for the set "{a}-{CI-1}" was already carried out in the previous deduplication period, in the current period the key values of the data to be deduplicated are merely inserted into the set "{a}-{CI-1}", and the Redis server pair does not need to send the result back to the server cluster.
S4: judge whether the second boundary coefficient is less than or equal to the preset threshold; if so, execute step S5; if not, execute step 103.
The preset threshold may be c, 2c, and so on, where c is the preset time unit above. If the second boundary coefficient is less than or equal to the preset threshold, the data to be deduplicated may be duplicate or approximate data of records from just after 00:00 of the following day.
S5: generate the second boundary key-value pair corresponding to the Redis key-value pair, insert the second boundary key-value pair into the Redis server pair, and then execute step 103.
The Redis key contained in the Redis key-value pair generated in step 102 is adjusted to the Redis key of the deduplication period following the current one. Specifically, the period identifier in that Redis key is increased by one, so the new Redis key is "{a}-{CI+1}". The new Redis key and the key values of the Redis key-value pair generated in step 102 form the second boundary key-value pair. Suppose the key values are (M-1), M and (M+1); the second boundary key-value pair is then ("{a}-{CI+1}", ((M-1), M, (M+1))), and it is inserted into the Redis server pair. The insertion operation SADD("{a}-{CI+1}", ((M-1), M, (M+1))) is executed on the Redis server pair, that is, the three elements (M-1), M and (M+1) are inserted into the set "{a}-{CI+1}".
Because deduplication for the set "{a}-{CI+1}" will be carried out when the following deduplication period arrives, in the current period the key values of the data to be deduplicated are merely inserted into the set "{a}-{CI+1}", and the Redis server pair does not need to send the result back to the server cluster.
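The branching in S2 to S5 can be sketched as a function that returns the extra Redis key, if any, into which the key values should also be inserted before step 103. The function name and the list return type are choices made for this illustration; the order of the checks follows the text, where S4 is reached only when the test in S2 fails.

```python
def boundary_keys(data_string: str, t: int, period_length: int, threshold: int) -> list[str]:
    """Return the neighboring-period keys that also receive the key values.

    R1 <= threshold selects the previous period (S3); otherwise
    R2 <= threshold selects the following period (S5); otherwise none.
    """
    ci = t // period_length
    r1 = t % period_length
    r2 = period_length - r1
    if r1 <= threshold:
        return [f"{data_string}-{ci - 1}"]
    if r2 <= threshold:
        return [f"{data_string}-{ci + 1}"]
    return []
```

For a record at t = 295 with CL = 100 and a threshold of 10, the key values are also written to the set for period 3, so a record arriving at t = 301 finds them there and is recognized as a duplicate.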
The key values corresponding to the data to be deduplicated are the preset number of neighboring times expanded from its occurrence time. Using the set semantics of the Redis storage system to identify duplicate and approximate data among neighboring times filters approximate data out of big data effectively, gives high deduplication precision, avoids storing large amounts of redundant data, and saves storage space.
To make the method provided by this embodiment easier to understand, deduplication of vehicle passage records is described below with reference to the drawings. For example, let the time unit c = 10, the deduplication period be 3*c = 30 seconds, and the preset period length CL = 100. Suppose vehicle a stops in front of device No. 1 from second 241 onward, so that device No. 1 keeps reporting passage records for vehicle a, at seconds 241, 256 and 271 in turn. The deduplication process is then as shown in Fig. 4. In the end only the first passage record is saved to the database; the other two are removed as duplicates.
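The worked example can be traced with a short, self-contained simulation. This is one reading of the steps described in the text, not a reproduction of Fig. 4: plain Python sets stand in for the Redis sets, the threshold is assumed to equal c, and the data string is a made-up placeholder.

```python
from collections import defaultdict

UNIT = 10             # time unit c
PERIOD_LENGTH = 100   # preset period length CL
THRESHOLD = UNIT      # assumed preset threshold (the text allows c or 2c)


def deduplicate(times: list[int], data_string: str = "vehicle-a|device-1") -> list[int]:
    """Return the occurrence times of the records that would be stored."""
    sets: dict[str, set[int]] = defaultdict(set)
    kept = []
    for t in times:
        m = t // UNIT
        members = {m - 1, m, m + 1}              # three neighboring times
        ci = t // PERIOD_LENGTH
        r1 = t % PERIOD_LENGTH
        r2 = PERIOD_LENGTH - r1
        # Cross-boundary processing: insert only, the result is not checked.
        if r1 <= THRESHOLD:
            sets[f"{data_string}-{ci - 1}"] |= members
        elif r2 <= THRESHOLD:
            sets[f"{data_string}-{ci + 1}"] |= members
        # Step 103: count newly inserted elements, as SADD would report.
        current = sets[f"{data_string}-{ci}"]
        inserted = len(members - current)
        current |= members
        if inserted == len(members):
            kept.append(t)
    return kept


stored = deduplicate([241, 256, 271])
```

Tracing the three records: second 241 inserts {23, 24, 25} and all three are new; second 256 submits {24, 25, 26} and only 26 is new; second 271 submits {26, 27, 28} and only 27 and 28 are new. Only the first record is kept, which agrees with the outcome stated in the text.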
In this embodiment of the invention, big data deduplication is performed on a server cluster, with the computation distributed across different nodes of the cluster as far as possible. Deduplication uses Redis, a key-value database that supports highly concurrent access, so that the deduplication operation occupies as few system resources as possible in both space and time. By expanding the occurrence time of the data to be deduplicated into multiple neighboring times, approximate data that are close in time can be filtered out effectively. The method offers high deduplication accuracy and precision and good generality, and can be applied to any big data scenario in which the data exhibit temporal continuity.
Embodiment 2
Referring to Fig. 5, this embodiment of the present invention provides a big data deduplication device for executing the big data deduplication method provided in Embodiment 1 above. The device includes:
a receiving module 20, configured to receive data to be deduplicated, the data to be deduplicated including a time of origin and a data character string;
a generation module 21, configured to generate, according to the time of origin and the data character string, the Redis key-value pair corresponding to the data to be deduplicated;
a determining module 22, configured to insert the Redis key-value pair into the Redis server pair and to determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
The generation module 21 includes:
a generation unit, configured to generate the Redis key corresponding to the data to be deduplicated according to the time of origin and the data character string, and to generate the key value corresponding to the Redis key according to the time of origin;
a composition unit, configured to combine the Redis key and the key value into the Redis key-value pair corresponding to the data to be deduplicated.
The generation unit is configured to calculate the period identifier corresponding to the data to be deduplicated according to the time of origin and the preset period length, and to generate the Redis key corresponding to the data to be deduplicated according to the data character string and the period identifier. It is further configured to expand the time of origin into a preset number of neighboring times and to take the preset number of neighboring times as the key value corresponding to the Redis key.
The determining module 22 is configured to judge whether the result returned by the Redis server pair is equal to the number of key values included in the Redis key-value pair. If so, it determines that the data to be deduplicated is not duplicate data; if not, it determines that the data to be deduplicated is duplicate data and discards the data to be deduplicated.
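The division of work between the generation module and the determining module can be sketched as follows. This is a minimal illustration, not the device itself: the class names, the key layout, the period formula `time_of_origin // period_length`, and the symmetric expansion around `time_of_origin // time_unit` are assumptions, and the Redis server pair is modelled as any callable that inserts values under a key and returns how many of them were new (the contract of the Redis `SADD` command).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# Assumed contract of the "server pair": insert values, return the count of new ones.
InsertFn = Callable[[str, List[int]], int]


@dataclass
class Record:
    time_of_origin: int   # seconds
    data: str             # data character string


@dataclass
class GenerationModule:   # corresponds to module 21
    period_length: int    # preset period length
    time_unit: int        # time unit
    neighbors: int = 3    # preset number of neighboring times (assumed odd)

    def generate(self, rec: Record) -> Tuple[str, List[int]]:
        period_id = rec.time_of_origin // self.period_length   # assumed formula
        key = f"{rec.data}:{period_id}"                         # assumed key layout
        unit = rec.time_of_origin // self.time_unit
        half = self.neighbors // 2
        values = [unit + d for d in range(-half, half + 1)]
        return key, values


@dataclass
class DeterminingModule:  # corresponds to module 22
    insert: InsertFn

    def is_duplicate(self, key: str, values: List[int]) -> bool:
        # New data only if the returned count equals the number of key values.
        return self.insert(key, values) != len(values)


def make_memory_insert() -> InsertFn:
    """Hypothetical in-memory replacement for the Redis server pair."""
    store: Dict[str, Set[int]] = {}

    def insert(key: str, values: List[int]) -> int:
        members = store.setdefault(key, set())
        new = set(values) - members
        members |= new
        return len(new)

    return insert


gen = GenerationModule(period_length=100, time_unit=10)
det = DeterminingModule(insert=make_memory_insert())
key, values = gen.generate(Record(241, "vehicle-a"))
first = det.is_duplicate(key, values)
second = det.is_duplicate(*gen.generate(Record(256, "vehicle-a")))
```

Keeping the insert operation behind a callable is a design choice of this sketch: it lets the comparison rule be exercised without a running server. With a real client, the callable could wrap a set-insert command that returns the number of newly added members.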
In this embodiment of the present invention, the device further includes the following module, which operates before the determining module 22 inserts the Redis key-value pair into the Redis server pair:
a cross-boundary processing module, configured to calculate a first boundary coefficient and a second boundary coefficient according to the time of origin and the preset period length; if the first boundary coefficient is less than or equal to a preset threshold, to generate the first boundary key-value pair corresponding to the Redis key-value pair and insert the first boundary key-value pair into the Redis server pair; and if the second boundary coefficient is less than or equal to the preset threshold, to generate the second boundary key-value pair corresponding to the Redis key-value pair and insert the second boundary key-value pair into the Redis server pair.
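One way to picture the cross-boundary processing is sketched below. The definitions used here are assumptions made for illustration, since this passage does not restate them: the first boundary coefficient is taken as the distance of the time of origin from the start of its period (`t % period_length`), the second as the distance to the end of the period, and the boundary key-value pairs reuse the same values under the key of the previous or next period. The threshold of 10, the key layout, and the dictionary-based `sadd` helper are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple


def boundary_pairs(t: int, data: str, values: List[int],
                   period_length: int, threshold: int) -> List[Tuple[str, List[int]]]:
    period_id = t // period_length
    first = t % period_length        # assumed: distance from the period start
    second = period_length - first   # assumed: distance to the period end
    pairs = []
    if first <= threshold:
        pairs.append((f"{data}:{period_id - 1}", values))   # previous period key
    if second <= threshold:
        pairs.append((f"{data}:{period_id + 1}", values))   # next period key
    return pairs


def sadd(store: Dict[str, Set[int]], key: str, values: List[int]) -> int:
    # Mimics Redis SADD: returns the number of newly added members.
    members = store.setdefault(key, set())
    new = set(values) - members
    members |= new
    return len(new)


def process(store, data, t, c=10, period_length=100, threshold=10) -> bool:
    unit = t // c
    values = [unit - 1, unit, unit + 1]
    # Boundary pairs are inserted first; their return values are not used.
    for key, vals in boundary_pairs(t, data, values, period_length, threshold):
        sadd(store, key, vals)
    key = f"{data}:{t // period_length}"
    return sadd(store, key, values) != len(values)   # True means duplicate


store: Dict[str, Set[int]] = {}
outcome = [process(store, "vehicle-a", t) for t in (298, 303)]
print(outcome)
```

In this sketch the record at second 298 lies within the threshold of the end of its period, so its values are also written under the key of the next period. The record at second 303 falls into that next period and is therefore still recognized as a duplicate, even though the two records carry different period identifiers.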
Embodiment 3
This embodiment of the present invention provides a big data deduplication apparatus. The apparatus includes one or more processors and one or more storage devices. One or more programs are stored in the one or more storage devices, and when the one or more programs are loaded and executed by the one or more processors, the big data deduplication method provided in Embodiment 1 above is implemented.
Embodiment 4
This embodiment of the present invention provides a computer-readable storage medium. An executable program is stored in the storage medium, and when the executable program is loaded and executed by a processor, the big data deduplication method provided in Embodiment 1 above is implemented.
It should be understood that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other apparatus. Various general-purpose devices may also be used with the teachings herein. From the description above, the structure required to construct such devices is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given in order to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is to be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the virtual machine creation device according to embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for executing part or all of the method described herein. Such programs implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (10)
1. A big data deduplication method, characterized in that the method comprises:
receiving data to be deduplicated, the data to be deduplicated comprising a time of origin and a data character string;
generating, according to the time of origin and the data character string, a Redis key-value pair corresponding to the data to be deduplicated;
inserting the Redis key-value pair into a Redis server pair, and determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
2. The method according to claim 1, characterized in that generating, according to the time of origin and the data character string, the Redis key-value pair corresponding to the data to be deduplicated comprises:
generating, according to the time of origin and the data character string, a Redis key corresponding to the data to be deduplicated;
generating, according to the time of origin, a key value corresponding to the Redis key;
combining the Redis key and the key value into the Redis key-value pair corresponding to the data to be deduplicated.
3. The method according to claim 2, characterized in that generating, according to the time of origin and the data character string, the Redis key corresponding to the data to be deduplicated comprises:
calculating, according to the time of origin and a preset period length, a period identifier corresponding to the data to be deduplicated;
generating, according to the data character string and the period identifier, the Redis key corresponding to the data to be deduplicated.
4. The method according to claim 2, characterized in that generating, according to the time of origin, the key value corresponding to the Redis key comprises:
expanding the time of origin into a preset number of neighboring times;
taking the preset number of neighboring times as the key value corresponding to the Redis key.
5. The method according to claim 2, characterized in that determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data comprises:
judging whether the result returned by the Redis server pair is equal to the number of key values included in the Redis key-value pair;
if so, determining that the data to be deduplicated is not duplicate data;
if not, determining that the data to be deduplicated is duplicate data, and discarding the data to be deduplicated.
6. The method according to any one of claims 1 to 5, characterized in that, before inserting the Redis key-value pair into the Redis server pair, the method further comprises:
calculating a first boundary coefficient and a second boundary coefficient according to the time of origin and a preset period length;
if the first boundary coefficient is less than or equal to a preset threshold, generating a first boundary key-value pair corresponding to the Redis key-value pair, and inserting the first boundary key-value pair into the Redis server pair;
if the second boundary coefficient is less than or equal to the preset threshold, generating a second boundary key-value pair corresponding to the Redis key-value pair, and inserting the second boundary key-value pair into the Redis server pair.
7. A big data deduplication device, characterized in that the device comprises:
a receiving module, configured to receive data to be deduplicated, the data to be deduplicated comprising a time of origin and a data character string;
a generation module, configured to generate, according to the time of origin and the data character string, a Redis key-value pair corresponding to the data to be deduplicated;
a determining module, configured to insert the Redis key-value pair into a Redis server pair and to determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.
8. The device according to claim 7, characterized in that the generation module comprises:
a generation unit, configured to generate, according to the time of origin and the data character string, a Redis key corresponding to the data to be deduplicated, and to generate, according to the time of origin, a key value corresponding to the Redis key;
a composition unit, configured to combine the Redis key and the key value into the Redis key-value pair corresponding to the data to be deduplicated.
9. A big data deduplication apparatus, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein the one or more programs are executed by the one or more processors, so that the one or more processors implement the method according to any one of claims 1 to 6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811488881.0A CN109522305B (en) | 2018-12-06 | 2018-12-06 | Big data deduplication method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109522305A (en) | 2019-03-26 |
CN109522305B (en) | 2021-02-02 |
Family
ID=65794946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811488881.0A Active CN109522305B (en) | 2018-12-06 | 2018-12-06 | Big data deduplication method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109522305B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101350869A (en) * | 2007-07-19 | 2009-01-21 | 中国电信股份有限公司 | Method and apparatus for removing repeat of telecom charging based on index and hash |
CN103064908A (en) * | 2012-12-18 | 2013-04-24 | 北京讯鸟软件有限公司 | Method for rapidly removing repeated list through a memory |
CN105354246A (en) * | 2015-10-13 | 2016-02-24 | 华南理工大学 | Distributed memory calculation based data deduplication method |
CN106649646A (en) * | 2016-12-09 | 2017-05-10 | 北京锐安科技有限公司 | Method and device for deleting duplicated data |
US20170286527A1 (en) * | 2015-11-06 | 2017-10-05 | Wangsu Science & Technology Co., Ltd. | Redis key management method and system |
CN107832406A (en) * | 2017-11-03 | 2018-03-23 | 北京锐安科技有限公司 | Duplicate removal storage method, device, equipment and the storage medium of massive logs data |
CN108063957A (en) * | 2016-11-08 | 2018-05-22 | 北京国双科技有限公司 | A kind of statistical method and device of network television user state |
US20180157713A1 (en) * | 2016-12-02 | 2018-06-07 | Cisco Technology, Inc. | Automated log analysis |
CN108241615A (en) * | 2016-12-23 | 2018-07-03 | 中国电信股份有限公司 | Data duplicate removal method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111625523A (en) * | 2020-04-20 | 2020-09-04 | 沈阳派客动力科技有限公司 | Data synthesis method, device and equipment |
CN111625523B (en) * | 2020-04-20 | 2023-08-08 | 沈阳派客动力科技有限公司 | Method, device and equipment for synthesizing data |
CN112306998A (en) * | 2020-10-13 | 2021-02-02 | 武汉中科通达高新技术股份有限公司 | Commission data duplicate removal method, device and server |
CN112306998B (en) * | 2020-10-13 | 2023-11-24 | 武汉中科通达高新技术股份有限公司 | Method, device and server for de-duplication of traffic and delegation data |
CN116846893A (en) * | 2023-09-01 | 2023-10-03 | 北京钱安德胜科技有限公司 | Vehicle-road-oriented cooperative automatic driving traffic big data verification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN109522305B (en) | 2021-02-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||