CN110489405A - The method, apparatus and server of data processing - Google Patents


Info

Publication number
CN110489405A
Authority
CN
China
Prior art keywords
data
bucket
bit
value
record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910628181.5A
Other languages
Chinese (zh)
Other versions
CN110489405B (en)
Inventor
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910628181.5A priority Critical patent/CN110489405B/en
Priority to PCT/CN2019/116646 priority patent/WO2021008024A1/en
Publication of CN110489405A publication Critical patent/CN110489405A/en
Application granted granted Critical
Publication of CN110489405B publication Critical patent/CN110489405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the present invention relate to the technical field of data processing and provide a data processing method, apparatus, and server. The method includes: bucketing the data records in a database and setting an expiration time for each data bucket; determining the bit corresponding to each data record and setting the bit value of that bit to a second value; when a storage request for new data is received, identifying the target data bucket and the bit corresponding to the new data in the target data bucket, and determining whether to write the new data into the target data bucket by judging whether the bit value of that bit is the second value; and, if no new data is written into a data bucket before its expiration time is reached, deleting the data bucket and its data records when the expiration time is reached, and otherwise extending the expiration time. By bucketing, this embodiment narrows the range over which duplicates are checked. The data in some buckets gradually expires because it is no longer active, which achieves deletion of data and solves the problem of a rising false positive rate during duplicate checking.

Description

The method, apparatus and server of data processing
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a data processing method, a data processing apparatus, a server, and a computer-readable storage medium.
Background
Duplicate checking of data determines whether a duplicate record already exists in a data set, and is widely used across business scenarios. For example, when a user registers an account, duplicate checking can confirm whether that account has already been registered.
At present, duplicate checking is mainly implemented in two ways. The first is to query the database directly to see whether a given record exists. For example, when a user registers an account with the user name "Zhang San", the database is queried for an identical record. If a record for "Zhang San" already exists, the new user name is not allowed to be stored; if it does not exist, it can be stored. This approach is mainly used when the data volume is small. If the amount of data to be stored is very large, the number of queries becomes very large, and so does the database overhead. Duplicate checking for big data is therefore mainly implemented in local memory, for example with a Bloom filter. However, because the bits in a Bloom filter must not be reset, the filter cannot shrink, and data cannot be deleted under this approach. As the data grows, the memory occupied grows with it, and the false positive rate gradually increases.
Summary of the invention
In view of this, embodiments of the present invention provide a data processing method, apparatus, and server, to solve the problem in the prior art that, when a Bloom filter is used for duplicate checking of big data, data cannot be deleted, so memory usage grows and the false positive rate increases.
A first aspect of the embodiments of the present invention provides a data processing method, comprising:
obtaining the existing data records in a database;
bucketing the data records and setting an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of each of those bits is a first value;
determining the bit corresponding to each data record in its data bucket, and setting the bit value of that bit to a second value;
when a storage request for new data is received, identifying the target data bucket into which the new data would be written and the bit corresponding to the new data in the target data bucket, and determining whether to write the new data into the target data bucket by judging whether the bit value of the bit corresponding to the new data is the second value;
if no new data is written into a data bucket before its expiration time is reached, deleting the data bucket and the data records in it when the expiration time is reached; otherwise, extending the expiration time of the data bucket.
A second aspect of the embodiments of the present invention provides a data processing apparatus, comprising:
an obtaining module, configured to obtain the existing data records in a database;
a bucketing module, configured to bucket the data records and set an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of each of those bits is a first value;
a changing module, configured to determine the bit corresponding to each data record in its data bucket, and to set the bit value of that bit to a second value;
a determining module, configured to, when a storage request for new data is received, identify the target data bucket into which the new data would be written and the bit corresponding to the new data in the target data bucket, and to determine whether to write the new data into the target data bucket by judging whether the bit value of the bit corresponding to the new data is the second value;
a processing module, configured to delete a data bucket and the data records in it when its expiration time is reached if no new data has been written into the data bucket before then, and otherwise to extend the expiration time of the data bucket.
A third aspect of the embodiments of the present invention provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the data processing method described in the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the data processing method described in the first aspect.
Compared with the prior art, the embodiments of the present invention have the following advantages:
By obtaining the existing data records in the database, the embodiments of the present invention can bucket the data records and set an expiration time for each data bucket. When a storage request for new data is received, the target data bucket for the new data and the bit corresponding to the new data in that bucket are identified, and whether the new data needs to be written into the target data bucket is determined by judging whether the bit value of that bit is a specific value, which implements duplicate checking. Meanwhile, if no new data is written into a data bucket before its expiration time is reached, the data bucket and its data records can be deleted at the expiration time, reducing memory usage; otherwise, the expiration time of the data bucket can be extended. Bucketing narrows the range over which duplicates are checked, and the data in some buckets gradually expires because it is no longer active. Data is thereby deleted rather than remaining in the Bloom filter permanently as time passes, which achieves deletion of data from a Bloom filter. Because data records keep growing, an ordinary Bloom filter needs more and more bits for duplicate checking. With the method of this embodiment, the bitmaps of the Bloom filter can shrink as well as grow, so the Bloom filter scales elastically, memory usage is reduced, and the problem of a rising false positive rate during duplicate checking is solved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the steps of a data processing method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of the steps of another data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a server according to an embodiment of the present invention.
Detailed description
In the following description, specific details such as particular system structures and techniques are set forth for purposes of illustration rather than limitation, to give a thorough understanding of the embodiments of the present invention. It will be apparent to those skilled in the art, however, that the present invention can also be practiced in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
The technical solutions of the present invention are illustrated below through specific embodiments.
To aid understanding, the Bloom filter is first briefly introduced.
1. Analogy
A Bloom filter can be likened to a bank of electric switches, each of which is numbered. If there are 10000 switches, they can be numbered from 0 to 9999. A switch in the Bloom filter corresponds to a bit. "On" means the bit is 1, and "off" means the bit is 0. By default all switches are off, that is, all bits are 0.
2. Insertion process
Suppose there is a character string A. Hashing it can be understood as mapping the string to a number below 10000, and n different hash methods yield n different hash values. For example, hashing string A might yield the set of hash values {5, 300, 891, 2999, 7821}. The bits numbered 5, 300, 891, 2999, and 7821 are then set to 1.
3. Duplicate-checking process
To check string A for duplicates, the same hash methods used at insertion are applied to obtain the same set of hash values {5, 300, 891, 2999, 7821}. Duplicate checking then only needs to judge whether the bits with these numbers are all 1.
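The insertion and duplicate-checking steps above can be sketched as follows. This is a minimal illustration rather than the implementation described in the patent: the salted SHA-256 hashing, the choice of 5 hash values, and the names `bit_positions` and `BloomFilter` are all assumptions made for the example.

```python
import hashlib

NUM_BITS = 10000  # number of "switches", as in the analogy above
NUM_HASHES = 5    # assumed value; the text only speaks of "n" hash values


def bit_positions(item: str) -> list[int]:
    """Map an item to NUM_HASHES bit numbers in [0, NUM_BITS)."""
    positions = []
    for seed in range(NUM_HASHES):
        # Salting with the seed stands in for "different hash methods".
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return positions


class BloomFilter:
    def __init__(self) -> None:
        # One byte per bit keeps the sketch readable; all switches start off.
        self.bits = bytearray(NUM_BITS)

    def add(self, item: str) -> None:
        """Insertion: turn on every bit numbered by the item's hash values."""
        for pos in bit_positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """Duplicate check: True only if every corresponding bit is 1."""
        return all(self.bits[pos] for pos in bit_positions(item))
```

A `True` result from `might_contain` means "possibly present", while `False` means "definitely not present".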
4. Why false positives occur
The hash values obtained for a character string may collide. A collision means that different strings can yield the same hash values. For example, suppose the hash methods give string A the set of hash values {5, 300, 891, 2999, 7821} and give string B the set {300, 2999, 5, 7821, 891}. Although the order differs, the two strings are stored identically in the Bloom filter. When string B is checked for duplicates, it does not actually exist in the database, but the Bloom filter judges that it does, because all the bits numbered by its hash values are already set to 1. This is a false positive: the item falsely appears to be present. Since such collisions occur only with a certain probability, the term false positive rate is used. Once most bits in the Bloom filter have been set to 1, the false positive rate becomes higher and higher.
5. Why data cannot be deleted
From the description above, one might think that string A can be deleted from the Bloom filter simply by setting the five bits numbered {5, 300, 891, 2999, 7821} back to 0. This is not allowed.
Suppose another string C has the hash values {5, 300, 1000, 4521, 123}. Because of hash collisions, this set partly overlaps with the set for string A. If string A is deleted by setting all of its corresponding bits to 0, a later duplicate check on string C finds that the bits numbered 5 and 300 are 0, and string C is wrongly judged to be absent. The bits in a Bloom filter therefore must not be reset, which means data cannot be deleted.
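The deletion hazard can be reproduced with the bit numbers quoted above. The sketch below is only an illustration: the hash values for strings A and C are taken directly from the text rather than computed, and the helper names are hypothetical.

```python
NUM_BITS = 10000
bits = [0] * NUM_BITS

# Hash values for strings A and C, as given in the text.
A_POSITIONS = [5, 300, 891, 2999, 7821]
C_POSITIONS = [5, 300, 1000, 4521, 123]


def insert(positions: list[int]) -> None:
    for pos in positions:
        bits[pos] = 1


def is_present(positions: list[int]) -> bool:
    return all(bits[pos] for pos in positions)


insert(A_POSITIONS)
insert(C_POSITIONS)
c_found_before_delete = is_present(C_POSITIONS)

# Naive "delete" of A: reset all of A's bits to 0.
for pos in A_POSITIONS:
    bits[pos] = 0

# Bits 5 and 300 are shared with C, so C now looks absent.
c_found_after_delete = is_present(C_POSITIONS)
```

After the naive delete, `c_found_after_delete` is `False` even though C was inserted and never removed, which is the misjudgment described above.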
Based on these characteristics of the Bloom filter, the core idea of the data processing method of this embodiment is to use redis, a memory-based caching tool. Redis provides a bitmap structure that can be used as a Bloom filter. Many small Bloom filters with the bitmap structure are constructed, and the expiration time of each is set to 30 days. When no data has been written to one of these Bloom filters for more than 30 days, redis deletes it on expiry. This saves memory and lets the stored data grow and shrink automatically.
Referring to Fig. 1, a flow diagram of the steps of a data processing method according to an embodiment of the present invention is shown. The method may specifically include the following steps:
S101: obtain the existing data records in the database.
It should be noted that this method can be applied to duplicate checking of big data.
The existing data records in the database in this embodiment may be data records of any type stored in a database on the server, for example user IDs. The data records may also be of other types, such as email accounts or URLs; this embodiment does not limit the type of data record.
Before duplicate checking can be performed according to this method, the existing data records in the database must first be processed. The existing data records in the database are therefore obtained first.
S102: bucket the data records and set an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of each of those bits is a first value.
After the existing data records in the database are obtained, they are first divided among multiple data buckets.
In the embodiments of the present invention, each data bucket is a Bloom filter with a bitmap structure, designed around the key expiration feature of redis.
Redis is a memory-based caching tool that provides a bitmap structure. The basic principle of a bitmap is to use one bit to mark the value associated with an element, with the element serving as the key, so a bitmap structure can be used as a Bloom filter. Because each item of data is stored in units of bits, a great deal of storage space is saved.
In the embodiments of the present invention, multiple Bloom filters with the bitmap structure can be constructed, each of which is a data bucket subsequently used to hold data records. Each data bucket therefore includes multiple bits, and the initial bit value of all of those bits can be the first value. For example, the value of every bit is initially 0. At the same time, to make the data buckets easy to tell apart, a number can be assigned to each data bucket, for example 0 to 9999.
The obtained data records can then be distributed relatively evenly among the data buckets according to a given rule.
As an example of the present invention, a preset random function can first be used to generate an integer value for each data record. The remainder of dividing each record's integer value by the number of data buckets is then calculated. Since each data bucket has its own bucket number, each data record can be placed in the data bucket whose bucket number equals the calculated remainder, which distributes the data records relatively evenly.
In a specific implementation, the crc32 function converts a character string into a long integer value through a fixed algorithm, so the crc32 function can be used to bucket the existing data records.
Take user IDs as the existing data records. If there are 2 billion user IDs, the crc32 function can be used to divide them among 200000 data buckets, so that each data bucket holds about 10000 user IDs.
That is, the crc32 function is first used to generate an integer value for each user ID. That value is then divided by the number of data buckets, 200000, and the remainder is taken as the data bucket number. The user ID is placed in the corresponding data bucket.
Because the integer values that the crc32 function generates for different character strings are effectively random, bucketing with this function distributes the data records relatively evenly among the data buckets.
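The bucketing rule above (integer value from crc32, then the remainder modulo the number of buckets) can be sketched in a few lines. The sketch assumes Python's standard `zlib.crc32` as the crc32 implementation and UTF-8 encoding of the record; the function name `bucket_number` is hypothetical.

```python
import zlib

NUM_BUCKETS = 200_000  # number of data buckets in the example above


def bucket_number(record: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the number of the data bucket that a record belongs to."""
    # zlib.crc32 returns an unsigned 32-bit integer for a bytes input.
    integer_value = zlib.crc32(record.encode("utf-8"))
    # The remainder after dividing by the bucket count is the bucket number.
    return integer_value % num_buckets
```

Because the same record always maps to the same bucket number, the identical function can be reused later to locate the target data bucket for new data.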
The bucketing approach above is only one example of this embodiment. Those skilled in the art can bucket the obtained data records in other ways according to actual needs. For example, based on the total number of data records, each record can be placed at random so that the records are spread relatively evenly across the data buckets. This embodiment does not limit this.
The expiration time of each data bucket is the expiration time of its bitmap, that is, the TTL (Time To Live) of the key in redis. Once a TTL has been set, the countdown runs unless the user resets the TTL. When the countdown reaches 0, redis automatically deletes the record, and it no longer occupies memory.
In the embodiments of the present invention, the expiration time of each data bucket can be determined according to business needs. For example, the expiration time of each data bucket can be set to 30 days.
S103: determine the bit corresponding to each data record in its data bucket, and set the bit value of that bit to a second value.
In the embodiments of the present invention, a number can be assigned to each bit in each data bucket. For example, the bits can be numbered from 0 to 9999, so that each data record in the data bucket corresponds to one or more bits.
In a specific implementation, a preset algorithm can be used to calculate the hash value of each data record, and the bit whose bit number equals the hash value is then set to the second value. For example, the bit value is changed from the initial value 0 to 1.
Hashing can be understood as mapping a character string to a number below 10000, and n different hash methods yield n different hash values. A data record can therefore correspond to multiple hash values.
After multiple hash values of a data record have been calculated with multiple hash methods, the bits whose bit numbers equal those hash values are all set to 1.
S104: when a storage request for new data is received, identify the target data bucket into which the new data would be written and the bit corresponding to the new data in the target data bucket, and determine whether to write the new data into the target data bucket by judging whether the bit value of the bit corresponding to the new data is the second value.
After the existing data records in the database have been processed according to the preceding steps, newly received data can be checked for duplicates against the data buckets.
In the embodiments of the present invention, the target data bucket for the new data and the bit corresponding to the new data in the target data bucket are identified first. Whether the newly received data already exists in the database is then determined by judging whether the bit value of the corresponding bit in the target data bucket is the second value.
In a specific implementation, the target data bucket of the new data can be determined in the same way the existing data records were bucketed, that is, by determining which data bucket the new data should be written into.
For example, the crc32 function can be used to convert the new data into an integer value. That value is then divided by the number of data buckets, and the remainder is taken as the number of the target data bucket.
After the target data bucket has been identified, the next step is to judge whether the bit value of each bit corresponding to the new data in that data bucket is the second value.
As an example of the present invention, to improve the accuracy of duplicate checking, multiple hash values can be calculated for each existing data record with multiple hash methods when the records are bucketed, and the bits corresponding to those hash values are set to the second value. When newly received data is checked for duplicates, the same hash methods should therefore be used to calculate its hash values, and duplicate checking is completed by determining whether the bit value of the bit corresponding to each hash value is the second value.
If the bit values of the bits corresponding to all of the hash values calculated for the new data are the second value, the new data can be judged to be a data record that already exists in the database. In that case, the storage request for the new data can be rejected.
If any one of the bit values of the bits corresponding to those hash values is not the second value, the new data is confirmed not to be an existing data record in the database, and the new data can be written into the current data bucket.
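The decision in step S104 can be sketched end to end as follows. This is a simplified in-memory illustration, not the redis-based design of the embodiment: the bucket count of 8, the 5 salted SHA-256 hash values, and the names `positions` and `try_store` are assumptions made for the example, and in-process byte arrays stand in for redis bitmaps.

```python
import hashlib
import zlib

NUM_BUCKETS = 8          # assumed small value to keep the sketch light
BITS_PER_BUCKET = 10000  # bits numbered 0..9999, as in the text
NUM_HASHES = 5           # assumed number of hash methods

# Data buckets, created lazily; every bit starts at the first value (0).
buckets: dict[int, bytearray] = {}


def positions(record: str) -> list[int]:
    """Bit numbers for a record; salted SHA-256 stands in for n hash methods."""
    result = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{record}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % BITS_PER_BUCKET)
    return result


def try_store(record: str) -> bool:
    """Return True if the record was written, False if rejected as a duplicate."""
    # Identify the target data bucket the same way existing records were bucketed.
    bucket_id = zlib.crc32(record.encode("utf-8")) % NUM_BUCKETS
    bits = buckets.setdefault(bucket_id, bytearray(BITS_PER_BUCKET))
    record_bits = positions(record)
    if all(bits[p] for p in record_bits):
        # Every corresponding bit is already the second value (1): reject.
        return False
    for p in record_bits:
        # At least one bit was still 0: treat as new and write it.
        bits[p] = 1
    return True
```

For instance, calling `try_store("alice")` twice writes the record on the first call and rejects it on the second. A rejection may occasionally be a false positive, as discussed in the introduction to Bloom filters.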
S105: if no new data is written into a data bucket before its expiration time is reached, delete the data bucket and the data records in it when the expiration time is reached; otherwise, extend the expiration time of the data bucket.
Once the expiration time of a Bloom filter with the bitmap structure has been set, the countdown starts. When the countdown reaches 0, redis automatically deletes the record. Because each data bucket in this embodiment is a Bloom filter with the bitmap structure, the data bucket and the data records stored in it are deleted together when the expiration time of the data bucket is reached.
In the embodiments of the present invention, whether to keep a data bucket and the data records in it can therefore be decided by whether the expiration time of the data bucket is changed.
In a specific implementation, if no new data is written into a data bucket between the time its expiration time is set and the time the expiration time is reached, the data bucket and the data records in it can be deleted directly when the expiration time is reached, which reduces memory usage.
If new data is written into the data bucket during this period, the expiration time of the data bucket can be extended.
It should be noted that the expiration time can be extended at the moment new data is written into the data bucket, and the extended expiration time can have the same duration as the one originally set. For example, if the expiration time of a data bucket is 30 days and new data is written into the bucket on the 3rd day after the countdown starts, the expiration time of the data bucket can be extended to 30 days at that moment. That is, the expiration time of the data bucket is reset when new data is written into it.
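The expire-or-extend behaviour can be simulated without a redis server, as sketched below. The class `ExpiringBuckets` and its methods are hypothetical and only mimic the effect of a key TTL; in redis itself the corresponding commands would be SETBIT to set a bit and EXPIRE to reset the TTL of the bucket's key. Time is passed in explicitly so the behaviour is easy to follow.

```python
DAY = 24 * 3600
THIRTY_DAYS = 30 * DAY  # expiration time used in the example above


class ExpiringBuckets:
    """In-memory stand-in for bitmap keys with a TTL (a simulation, not redis)."""

    def __init__(self, ttl_seconds: int = THIRTY_DAYS) -> None:
        self.ttl = ttl_seconds
        self.deadline: dict[int, float] = {}  # bucket number -> expiry timestamp
        self.bits: dict[int, set[int]] = {}   # bucket number -> bit numbers set to 1

    def purge(self, now: float) -> None:
        """Delete every bucket whose expiration time has been reached."""
        expired = [b for b, d in self.deadline.items() if d <= now]
        for bucket_id in expired:
            del self.deadline[bucket_id]
            del self.bits[bucket_id]

    def write(self, bucket_id: int, bit_numbers: list[int], now: float) -> None:
        """Set bits in a bucket and reset its expiration time to the full TTL."""
        self.purge(now)
        self.bits.setdefault(bucket_id, set()).update(bit_numbers)
        self.deadline[bucket_id] = now + self.ttl

    def exists(self, bucket_id: int, now: float) -> bool:
        self.purge(now)
        return bucket_id in self.bits
```

In this sketch, a bucket written on day 0 and again on day 3 survives until day 33, while a bucket written only on day 0 is gone by day 30.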
The way of extending the expiration time described above is only an example. Those skilled in the art may extend the expiration time of a data bucket in other ways, and this embodiment does not limit this.
In the embodiments of the present invention, by obtaining the existing data records in the database, the data records can be bucketed and an expiration time can be set for each data bucket. When a storage request for new data is received, the target data bucket for the new data and the bit corresponding to the new data in that bucket are identified, and whether the new data needs to be written into the target data bucket is determined by judging whether the bit value of that bit is a specific value, which implements duplicate checking. Meanwhile, if no new data is written into a data bucket before its expiration time is reached, the data bucket and its data records can be deleted at the expiration time, reducing memory usage; otherwise, the expiration time of the data bucket can be extended. Bucketing narrows the range over which duplicates are checked, and the data in some buckets gradually expires because it is no longer active. Data is thereby deleted rather than remaining in the Bloom filter permanently as time passes, which achieves deletion of data from a Bloom filter. Because data records keep growing, an ordinary Bloom filter needs more and more bits for duplicate checking. With the method of this embodiment, the bitmaps of the Bloom filter can shrink as well as grow, so the Bloom filter scales elastically, memory usage is reduced, and the problem of a rising false positive rate during duplicate checking is solved.
Referring to Fig. 2, a flow diagram of the steps of another data processing method according to an embodiment of the present invention is shown. The method may specifically include the following steps:
S201: obtain the existing data records in the database.
It should be noted that this method can be applied to duplicate checking of big data.
The existing data records in the database in this embodiment may be data records of any type stored in a database on the server, for example user IDs, email accounts, or URLs. This embodiment does not limit the type of data record.
To aid understanding, the rest of this embodiment is described using user IDs as the data records in the database.
S202, a point bucket is carried out to the data record, and the expired time of each data bucket is set, each data bucket In include multiple bits, the initial bit value of the multiple bit is the first numerical value;
In getting database after existing data record, these data records can be divided to multiple data first In bucket.Each data bucket in the present embodiment is the bitmap structure designed based on redis timeout expirations characteristic Bloom filter.
Therefore, the Bloom filter of multiple bitmap structures can be constructed as the data for being subsequently used for storing data record Bucket, and identical bucket number is set for each data bucket.It include multiple bits in each data bucket, it can be understood as Then the electric switch of one group of band number can give each switch number, obtain the bit bit number of each bit.For example, If there is 10,000 switches, then can be numbered from 0, until 9999.In Bloom filter, opening for switch can refer to Bit value for bit is that the bit value of 1, Guan Zhidai bit is 0.
In embodiments of the present invention, crc32 random function can be used, a point bucket is carried out to existing data record.For example, Each User ID can be generated into an integer numerical value using crc32 random function first, then using the numerical value divided by data The User ID is placed in corresponding data bucket by the quantity of bucket then using obtained remainder as the number of data bucket.
Since crc32 random function has randomness to the integer numerical value that kinds of characters is concatenated, divide bucket using the function Data record can be made to be distributed relatively uniformly among in each data bucket.
In embodiments of the present invention, the expiration time of each data bucket can be determined according to business requirements. For example, the expiration time of each data bucket can be set to 30 days.
S203, calculating the hash values of any data record using a preset algorithm;
In embodiments of the present invention, once the data records have been bucketed, a hash calculation can first be performed on each record, and the bit value of each bit is then set according to the result of the hash calculation.
A hash calculation can be understood as mapping a character string to a value within 10,000; applying n different hash methods yields n different hash values. Therefore, one data record can correspond to multiple hash values.
Assume there is a user ID. When hash calculation is performed on it, n different hash methods yield n different hash values. For example, the resulting group of hash values might be {5, 300, 891, 2999, 7821}.
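One way to obtain n hash values from a single record is sketched below. The text does not name the hash methods, so salting SHA-256 with an index is purely an assumption, as are the function name and the sample user ID. Note that this scheme does not guarantee that the n values are distinct.

```python
import hashlib


def hash_values(record: str, n: int, bit_count: int) -> list:
    """Return n bit numbers in the range [0, bit_count) for one record."""
    values = []
    for index in range(n):
        # Prefixing the record with the index stands in for "n different
        # hash methods"; this salting scheme is an assumption.
        digest = hashlib.sha256(f"{index}:{record}".encode("utf-8")).digest()
        values.append(int.from_bytes(digest[:8], "big") % bit_count)
    return values


# Five hash values within 10,000 for a hypothetical user ID.
print(hash_values("user-42", n=5, bit_count=10000))
```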
S204, setting the bit values of the multiple bits whose bit numbers are identical to the multiple hash values to a second value;
In embodiments of the present invention, the first value and the second value represent the "off" and "on" states of the switch that each bit stands for. Therefore, after the hash values of a user ID have been calculated, the bit values of the bits corresponding to those hash values can be changed from the initial state 0 to 1.
For example, for the above group of hash values {5, 300, 891, 2999, 7821}, the bits numbered 5, 300, 891, 2999, and 7821 in the current data bucket can be set to 1.
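The bit-setting step can be illustrated with a small in-memory bitmap. This is a minimal sketch: the `Bitmap` class and its method names are hypothetical, and a `bytearray` stands in for the Redis bitmap that the embodiment describes.

```python
class Bitmap:
    """A fixed-size bitmap; every bit starts at 0 (the first value)."""

    def __init__(self, bit_count: int):
        self.bit_count = bit_count
        self.bits = bytearray((bit_count + 7) // 8)

    def set_bit(self, number: int) -> None:
        """Set the bit with the given bit number to 1 (the second value)."""
        self.bits[number // 8] |= 1 << (number % 8)

    def get_bit(self, number: int) -> int:
        """Return the bit value (0 or 1) of the given bit number."""
        return (self.bits[number // 8] >> (number % 8)) & 1


# A data bucket with 10,000 bits, numbered 0 to 9999.
bucket = Bitmap(10000)
for number in (5, 300, 891, 2999, 7821):
    bucket.set_bit(number)
```

After the loop, only the five bits named in the example are 1; all other bits keep their initial value of 0.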
It should be noted that, according to the characteristics of a Bloom filter, to guarantee a false positive rate below 0.00001, each data bucket can use 239627 bits for duplicate checking, which allows each data bucket to check 10000 user IDs for duplicates.
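The figure of 239627 bits is consistent with the standard Bloom filter sizing formula m = -n ln(p) / (ln 2)^2, with n = 10000 records and p = 0.00001. The sketch below evaluates that formula; the function names are hypothetical, and the hash-count formula k = (m / n) ln 2 is the usual textbook companion rather than something stated in the text.

```python
import math


def bloom_bit_count(capacity: int, false_positive_rate: float) -> int:
    """Bits needed so that `capacity` records stay under the target rate."""
    bits = -capacity * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits)


def optimal_hash_count(bit_count: int, capacity: int) -> int:
    """Textbook optimum for the number of hash values per record."""
    return max(1, round(bit_count / capacity * math.log(2)))


m = bloom_bit_count(10000, 0.00001)
k = optimal_hash_count(m, 10000)
print(m, k)
```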
S205, when a storage request for new data is received, identifying the target data bucket into which the new data is to be written;
After the existing data records in the database have been processed according to the preceding steps, newly received data can be checked for duplicates against the data buckets.
In embodiments of the present invention, the target data bucket is the data bucket into which the received new data should be written once the duplicate check is complete. Which data bucket is the target can be determined in the same way the existing data records were bucketed.
For example, the crc32 function can convert the new data into an integer value, which is then divided by the number of data buckets; the remainder is taken as the number of the target data bucket.
S206, calculating multiple target hash values of the new data using the preset algorithm;
When the received new data is checked for duplicates, the multiple hash values should be calculated with the same hash methods that were used during bucketing. The bit values of the bits in the target data bucket that correspond to these hash values are then examined to confirm whether the new data is an existing data record in the target data bucket.
S207, judging whether the bit values of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values are all the second value;
In embodiments of the present invention, the duplicate check on new data can be carried out by determining whether the bit values of the bits corresponding to each hash value of the new data are all the second value.
If the bit values of the bits corresponding to the multiple hash values obtained by hashing the new data are all the second value, step S208 can be executed, identifying the new data as an existing data record in the target data bucket. In this case, the storage request for the new data can be rejected.
If any of the bit values of the bits corresponding to the above multiple hash values is not the second value, it can be confirmed that the new data is not an existing data record in the database, so step S210 can be executed and the new data written into the current target data bucket.
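Steps S205 to S210 can be combined into one duplicate-check routine, sketched below. The bucket count, the number of hash values, the salted SHA-256 hashing, and all names are assumptions made for illustration; only the bit count per bucket comes from the text. Here "writing" the new data is modeled as setting its bits, and storing the record in the database itself is left out.

```python
import hashlib
import zlib

BUCKET_COUNT = 16    # assumed number of data buckets
BIT_COUNT = 239627   # bits per data bucket, the figure given in the text
HASH_COUNT = 5       # assumed number of hash values per record

# One zero-filled bitmap per data bucket (0 is the first value).
buckets = [bytearray((BIT_COUNT + 7) // 8) for _ in range(BUCKET_COUNT)]


def target_hash_values(record: str) -> list:
    """Assumed hash scheme: SHA-256 salted with an index, reduced to a bit number."""
    return [
        int.from_bytes(
            hashlib.sha256(f"{i}:{record}".encode("utf-8")).digest()[:8], "big"
        ) % BIT_COUNT
        for i in range(HASH_COUNT)
    ]


def store_if_new(record: str) -> bool:
    """Return True if the record was written, False if rejected as existing."""
    # S205: the crc32 remainder selects the target data bucket.
    target = buckets[zlib.crc32(record.encode("utf-8")) % BUCKET_COUNT]
    # S206: compute the target hash values.
    bit_numbers = target_hash_values(record)
    # S207: are all corresponding bits already the second value (1)?
    if all((target[b // 8] >> (b % 8)) & 1 for b in bit_numbers):
        return False  # S208: treated as an existing data record
    # S210: write the new data by setting its bits.
    for b in bit_numbers:
        target[b // 8] |= 1 << (b % 8)
    return True


print(store_if_new("user-42"))  # first request for this ID
print(store_if_new("user-42"))  # repeated request for the same ID
```

Because a Bloom filter can report false positives, a rejected record is only very likely, not certain, to be a true duplicate; a record that passes the check is certainly new to the bucket.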
S208, identifying the new data as an existing data record in the target data bucket;
S209, if no new data is written into the data bucket before the expiration time of the data bucket is reached, deleting the data bucket and the data records in it when the expiration time is reached;
In embodiments of the present invention, whether to retain a data bucket and the data records in it can be decided by whether the expiration time of the data bucket is changed.
In a specific implementation, if no new data is written into a data bucket between the time its expiration time is set and the time that expiration time is reached, the data bucket and the data records in it can be deleted directly when the expiration time is reached, reducing memory usage.
S210, writing the new data into the target data bucket;
S211, extending the expiration time of the data bucket;
If new data is written into a data bucket after its expiration time has been set and before that expiration time is reached, the expiration time of the data bucket can be extended. For example, if the expiration time of a data bucket is 30 days and new data is written into it on the 3rd day after the countdown starts, the expiration time of the data bucket can be reset to 30 days from that moment.
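The expiration behavior of steps S209 to S211 can be simulated in memory as shown below. This is a sketch rather than the embodiment's implementation: the `ExpiringBuckets` class, its method names, and the injectable clock are hypothetical. With Redis, a comparable effect would presumably come from calling `EXPIRE` on the bitmap key after each `SETBIT`, although the text does not name the commands.

```python
import time

DAY = 24 * 60 * 60  # seconds


class ExpiringBuckets:
    """In-memory stand-in for bitmap buckets with a sliding expiration time."""

    def __init__(self, ttl_seconds, bit_count=10000, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.bit_count = bit_count
        self.clock = clock  # injectable so the countdown can be simulated
        self.buckets = {}   # bucket number -> (bitmap, expires_at)

    def purge_expired(self):
        """S209: delete every bucket whose expiration time has been reached."""
        now = self.clock()
        for number in [n for n, (_, exp) in self.buckets.items() if exp <= now]:
            del self.buckets[number]

    def write(self, number, bit_numbers):
        """S210 and S211: set the bits, then restart the countdown."""
        self.purge_expired()
        empty = bytearray((self.bit_count + 7) // 8)
        bitmap, _ = self.buckets.get(number, (empty, None))
        for bit in bit_numbers:
            bitmap[bit // 8] |= 1 << (bit % 8)
        self.buckets[number] = (bitmap, self.clock() + self.ttl)

    def exists(self, number):
        self.purge_expired()
        return number in self.buckets


now = [0]
store = ExpiringBuckets(ttl_seconds=30 * DAY, clock=lambda: now[0])
store.write(0, [5, 300])   # countdown starts; the bucket would expire on day 30
now[0] = 3 * DAY
store.write(0, [891])      # written on day 3; the bucket now expires on day 33
now[0] = 32 * DAY
print(store.exists(0))     # past the original deadline, inside the extended one
now[0] = 34 * DAY
print(store.exists(0))     # past the extended deadline
```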
S212, when the data records in any data bucket exceed a preset quantity, bucketing the data records in that data bucket again and setting the expiration time of each data bucket obtained after bucketing.
For a data bucket into which new data is written, the number of user IDs in the bucket grows as data continues to be written. The false positive rate of the data bucket then rises as well, that is, most of the bits end up set.
At this point, the data records in the data bucket can be bucketed again, following the bucketing process described in step S202. By deleting and rebuilding the bitmap, active data is written back in and zombie users are excluded.
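One possible way to re-bucket a single over-full data bucket is sketched below. The text only says that the records are bucketed again as in step S202; doubling the bucket count is an assumption made here, as are the function names and sample data. With a doubled count, every record from bucket n lands in bucket n or bucket n + bucket_count. A Bloom filter cannot list its members, so the active records are assumed to be supplied from elsewhere, and routing later lookups to the new buckets is not covered.

```python
import zlib


def crc(record: str) -> int:
    return zlib.crc32(record.encode("utf-8"))


def split_bucket(records, bucket_number: int, bucket_count: int) -> dict:
    """Re-bucket the records of one bucket using 2 * bucket_count buckets.

    Every record passed in must satisfy crc(record) % bucket_count ==
    bucket_number, so its new bucket number is either bucket_number or
    bucket_number + bucket_count. Each new bucket would then receive its
    own fresh expiration time.
    """
    new_count = 2 * bucket_count
    result = {bucket_number: [], bucket_number + bucket_count: []}
    for record in records:
        result[crc(record) % new_count].append(record)
    return result


# Hypothetical active records that currently live in bucket 1 of 4.
active = [r for r in (f"user-{i}" for i in range(2000)) if crc(r) % 4 == 1]
halves = split_bucket(active, bucket_number=1, bucket_count=4)
print({number: len(ids) for number, ids in halves.items()})
```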
In embodiments of the present invention, certain data records can be deleted by setting an expiration time on the bitmap (Bloom filter). A bitmap is deleted when it expires, meaning that the data in it was not written or updated again within the 30-day window; the bitmap is then deleted and its space freed. When new data is later written to this bitmap, the system creates it again. Setting an expiration time on the bitmap (Bloom filter) is significant in several respects:
First, some users have become zombie users. After expiration the bitmap is deleted, so these zombie users no longer exist in the bitmap, which achieves the purpose of deleting them.
Second, bitmap expiration can save a great deal of memory, since it is equivalent to emptying the Bloom filter. During the short period after a bitmap is emptied, some active-user data that already exists in the database has to be stored again. However, the emptying is not concentrated in time; that is, not all bitmaps are emptied at the same moment.
Third, besides no longer being written or updated, a bitmap may also have a very high false positive rate, that is, most of its bits are set. In this case, deleting and rebuilding the bitmap causes active data to be written back in and zombie users to be excluded.
It should be noted that the step numbers in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Referring to Fig. 3, a schematic diagram of a data processing device according to one embodiment of the present invention is shown, which may specifically include the following modules:
Acquisition module 301, configured to obtain the existing data records in a database;
Bucketing module 302, configured to bucket the data records and set an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of the multiple bits is a first value;
Change module 303, configured to determine the bits in each data bucket that correspond to any data record, and to set the bit values of the bits corresponding to each data record to a second value;
Determining module 304, configured to, when a storage request for new data is received, identify the target data bucket into which the new data is to be written and the bits in the target data bucket that correspond to the new data, and to determine whether to write the new data into the target data bucket by judging whether the bit values of the bits corresponding to the new data are the second value;
Processing module 305, configured to delete the data bucket and the data records in it when the expiration time is reached if no new data is written into the data bucket before the expiration time of the data bucket is reached, and otherwise to extend the expiration time of the data bucket.
In embodiments of the present invention, the bucketing module 302 may specifically include the following submodules:
Integer value generation submodule, configured to generate an integer value from each data record using a preset random function;
Remainder calculation submodule, configured to calculate the remainder obtained by dividing the integer value of each data record by the number of data buckets, each data bucket having a corresponding bucket number;
Data record bucketing submodule, configured to place each data record in the data bucket whose bucket number is identical to the remainder.
In embodiments of the present invention, the multiple bits each have a corresponding bit number, and the change module 303 may specifically include the following submodules:
Hash value calculation submodule, configured to calculate the hash value of any data record using a preset algorithm;
Bit value change submodule, configured to set the bit value of the bit whose bit number is identical to the hash value to the second value.
In embodiments of the present invention, multiple hash values are calculated for any data record, and the bit value change submodule may specifically include the following unit:
Bit value changing unit, configured to set the bit values of the multiple bits whose bit numbers are identical to the multiple hash values to the second value.
In embodiments of the present invention, the determining module 304 may specifically include the following submodules:
Integer value generation submodule, configured to generate an integer value from the new data using a preset random function when a storage request for the new data is received;
Remainder calculation submodule, configured to calculate the remainder obtained by dividing the integer value by the number of data buckets, each data bucket having a corresponding bucket number;
Target data bucket identification submodule, configured to identify the data bucket whose bucket number is identical to the remainder as the target data bucket.
In embodiments of the present invention, the determining module 304 may further include the following submodules:
Target hash value calculation submodule, configured to calculate multiple target hash values of the new data using the preset algorithm;
Bit value judging submodule, configured to judge whether the bit values of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values are all the second value;
Data record identification submodule, configured to identify the new data as an existing data record in the target data bucket if the bit values of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values are all the second value;
Data record writing submodule, configured to write the new data into the target data bucket if any bit value of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values is not the second value.
In embodiments of the present invention, the bucketing module 302 is further configured to, when the data records in any data bucket exceed a preset quantity, bucket the data records in that data bucket again and set the expiration time of each data bucket obtained after bucketing.
Because the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for related details, refer to the description of the method embodiment.
Referring to Fig. 4, a schematic diagram of a server according to one embodiment of the present invention is shown. As shown in Fig. 4, the server 400 of this embodiment includes a processor 410, a memory 420, and a computer program 421 stored in the memory 420 and executable on the processor 410. When executing the computer program 421, the processor 410 implements the steps in each embodiment of the above data processing method, for example steps S101 to S105 shown in Fig. 1. Alternatively, when executing the computer program 421, the processor 410 implements the functions of the modules/units in each of the above device embodiments, for example the functions of modules 301 to 305 shown in Fig. 3.
Illustratively, the computer program 421 can be divided into one or more modules/units, which are stored in the memory 420 and executed by the processor 410 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program 421 in the server 400. For example, the computer program 421 can be divided into an acquisition module, a bucketing module, a change module, a determining module, and a processing module, whose specific functions are as follows:
Acquisition module, configured to obtain the existing data records in a database;
Bucketing module, configured to bucket the data records and set an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of the multiple bits is a first value;
Change module, configured to determine the bits in each data bucket that correspond to any data record, and to set the bit values of the bits corresponding to each data record to a second value;
Determining module, configured to, when a storage request for new data is received, identify the target data bucket into which the new data is to be written and the bits in the target data bucket that correspond to the new data, and to determine whether to write the new data into the target data bucket by judging whether the bit values of the bits corresponding to the new data are the second value;
Processing module, configured to delete the data bucket and the data records in it when the expiration time is reached if no new data is written into the data bucket before the expiration time of the data bucket is reached, and otherwise to extend the expiration time of the data bucket.
The server 400 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The server 400 may include, but is not limited to, the processor 410 and the memory 420. Those skilled in the art will understand that Fig. 4 is only an example of the server 400 and does not limit it; the server 400 may include more or fewer components than shown, combine certain components, or use different components. For example, the server 400 may further include input/output devices, network access devices, a bus, and the like.
The processor 410 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 420 may be an internal storage unit of the server 400, such as a hard disk or internal memory of the server 400. The memory 420 may also be an external storage device of the server 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the server 400. Further, the memory 420 may include both an internal storage unit and an external storage device of the server 400. The memory 420 is used to store the computer program 421 and the other programs and data required by the server 400. The memory 420 may also be used to temporarily store data that has been output or is about to be output.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

1. A data processing method, characterized by comprising:
obtaining the existing data records in a database;
bucketing the data records and setting an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of the multiple bits is a first value;
determining the bits in each data bucket that correspond to any data record, and setting the bit values of the bits corresponding to each data record to a second value;
when a storage request for new data is received, identifying the target data bucket into which the new data is to be written and the bits in the target data bucket that correspond to the new data, and determining whether to write the new data into the target data bucket by judging whether the bit values of the bits corresponding to the new data are the second value;
if no new data is written into the data bucket before the expiration time of the data bucket is reached, deleting the data bucket and the data records in it when the expiration time is reached; otherwise, extending the expiration time of the data bucket.
2. The method according to claim 1, characterized in that the step of bucketing the data records comprises:
generating an integer value from each data record using a preset random function;
calculating the remainder obtained by dividing the integer value of each data record by the number of data buckets, each data bucket having a corresponding bucket number;
placing each data record in the data bucket whose bucket number is identical to the remainder.
3. The method according to claim 1, characterized in that the multiple bits each have a corresponding bit number, and the step of setting the bit values of the bits corresponding to each data record to a second value comprises:
calculating the hash value of any data record using a preset algorithm;
setting the bit value of the bit whose bit number is identical to the hash value to the second value.
4. The method according to claim 3, characterized in that multiple hash values are calculated for any data record, and the step of setting the bit value of the bit whose bit number is identical to the hash value to the second value comprises:
setting the bit values of the multiple bits whose bit numbers are identical to the multiple hash values to the second value.
5. The method according to claim 1, characterized in that the step of identifying, when a storage request for new data is received, the target data bucket into which the new data is to be written comprises:
when a storage request for new data is received, generating an integer value from the new data using a preset random function;
calculating the remainder obtained by dividing the integer value by the number of data buckets, each data bucket having a corresponding bucket number;
identifying the data bucket whose bucket number is identical to the remainder as the target data bucket.
6. The method according to claim 4, characterized in that the step of determining whether to write the new data into the target data bucket by judging whether the bit values of the bits corresponding to the new data are the second value comprises:
calculating multiple target hash values of the new data using the preset algorithm;
judging whether the bit values of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values are all the second value;
if the bit values of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values are all the second value, identifying the new data as an existing data record in the target data bucket;
if any bit value of the multiple bits in the target data bucket whose bit numbers are identical to the multiple target hash values is not the second value, writing the new data into the target data bucket.
7. The method according to claim 1, characterized by further comprising:
when the data records in any data bucket exceed a preset quantity, bucketing the data records in that data bucket again and setting the expiration time of each data bucket obtained after bucketing.
8. A data processing device, characterized by comprising:
an acquisition module, configured to obtain the existing data records in a database;
a bucketing module, configured to bucket the data records and set an expiration time for each data bucket, where each data bucket includes multiple bits and the initial bit value of the multiple bits is a first value;
a change module, configured to determine the bits in each data bucket that correspond to any data record, and to set the bit values of the bits corresponding to each data record to a second value;
a determining module, configured to, when a storage request for new data is received, identify the target data bucket into which the new data is to be written and the bits in the target data bucket that correspond to the new data, and to determine whether to write the new data into the target data bucket by judging whether the bit values of the bits corresponding to the new data are the second value;
a processing module, configured to delete the data bucket and the data records in it when the expiration time is reached if no new data is written into the data bucket before the expiration time of the data bucket is reached, and otherwise to extend the expiration time of the data bucket.
9. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the data processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the data processing method according to any one of claims 1 to 7.
CN201910628181.5A 2019-07-12 2019-07-12 Data processing method, device and server Active CN110489405B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910628181.5A CN110489405B (en) 2019-07-12 2019-07-12 Data processing method, device and server
PCT/CN2019/116646 WO2021008024A1 (en) 2019-07-12 2019-11-08 Data processing method and apparatus, and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910628181.5A CN110489405B (en) 2019-07-12 2019-07-12 Data processing method, device and server

Publications (2)

Publication Number Publication Date
CN110489405A true CN110489405A (en) 2019-11-22
CN110489405B CN110489405B (en) 2024-01-12

Family

ID=68547033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910628181.5A Active CN110489405B (en) 2019-07-12 2019-07-12 Data processing method, device and server

Country Status (2)

Country Link
CN (1) CN110489405B (en)
WO (1) WO2021008024A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035479A (en) * 2020-08-31 2020-12-04 平安医疗健康管理股份有限公司 Medicine database access method and device and computer equipment
CN112487177A (en) * 2020-12-17 2021-03-12 杭州火石数智科技有限公司 Reverse de-duplication method for self-adaptive bucket separation of massive short texts
CN113590890A (en) * 2021-08-04 2021-11-02 拉卡拉支付股份有限公司 Information storage method, information storage device, electronic apparatus, storage medium, and program product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510474A (en) * 2022-02-18 2022-05-17 中兴通讯股份有限公司 Sample deleting method based on time attenuation, device thereof and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101666758B1 (en) * 2015-08-03 2016-10-17 성균관대학교산학협력단 Method for searching data using enhanced bloom filter
WO2017016423A1 (en) * 2015-07-29 2017-02-02 阿里巴巴集团控股有限公司 Real-time new data update method and device
CN106570025A (en) * 2015-10-10 2017-04-19 北京国双科技有限公司 Data filtering method and device
CN107391034A (en) * 2017-07-07 2017-11-24 华中科技大学 A kind of duplicate data detection method based on local optimization
CN108121810A (en) * 2017-12-26 2018-06-05 北京锐安科技有限公司 A kind of data duplicate removal method, system, central server and distributed server
CN109656901A (en) * 2018-10-15 2019-04-19 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402394B (en) * 2010-09-13 2014-10-22 腾讯科技(深圳)有限公司 Hash algorithm-based data storage method and device
CN107291746B (en) * 2016-03-31 2021-08-17 阿里巴巴集团控股有限公司 Method and equipment for storing and reading data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035479A (en) * 2020-08-31 2020-12-04 平安医疗健康管理股份有限公司 Medicine database access method and device and computer equipment
CN112487177A (en) * 2020-12-17 2021-03-12 杭州火石数智科技有限公司 Reverse de-duplication method for self-adaptive bucket separation of massive short texts
CN112487177B (en) * 2020-12-17 2022-05-10 杭州火石数智科技有限公司 Reverse de-duplication method for self-adaptive bucket separation of massive short texts
CN113590890A (en) * 2021-08-04 2021-11-02 拉卡拉支付股份有限公司 Information storage method, information storage device, electronic apparatus, storage medium, and program product
CN113590890B (en) * 2021-08-04 2024-03-26 拉卡拉支付股份有限公司 Information storage method, apparatus, electronic device, storage medium, and program product

Also Published As

Publication number Publication date
CN110489405B (en) 2024-01-12
WO2021008024A1 (en) 2021-01-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant