CN105468699B

CN105468699B - Deduplicated data statistics method and device

Info

Publication number: CN105468699B
Application number: CN201510794267.7A
Authority: CN
Inventors: 罗文�
Original assignee: ZHUHAI DUOWAN INFORMATION TECHNOLOGY Ltd
Current assignee: ZHUHAI DUOWAN INFORMATION TECHNOLOGY Ltd
Priority date: 2015-11-18
Filing date: 2015-11-18
Publication date: 2019-06-18
Anticipated expiration: 2035-11-18
Also published as: CN105468699A

Abstract

The invention discloses a deduplicated data statistics method and device, belonging to the field of Internet technologies. The method includes: obtaining multiple pieces of user behavior data collected within a specified time period, and obtaining the target data set identifier and target data fragment identifier to which the multiple pieces of user behavior data belong; determining the storage state of the multiple pieces of user behavior data based on the user behavior data, the target data set identifier, and the target data fragment identifier, the storage state being either a stored state or a not-stored state; counting, among the multiple pieces of user behavior data, the number of pieces in the not-stored state; and determining that number as the deduplicated data count for the specified time period. The invention can quickly determine the deduplicated data count for a specified time period with a small amount of computation and a relatively simple determination process, which can substantially save the server's processing resources and improve the efficiency of deduplicated data statistics.

Description

Deduplicated data statistics method and device

Technical field

The present invention relates to the field of Internet technologies, and in particular to a deduplicated data statistics method and device.

Background art

With the rapid development of services, the user behavior data generated by those services has also grown explosively. To discover service problems in time, it is usually necessary to determine the number of pieces of user behavior data newly added within a certain period. To do so, deduplicated data statistics can be performed on the user behavior data generated in that period to obtain the deduplicated data count for the period, which is the number of pieces of user behavior data newly added within it.

Currently, when deduplicated data statistics are performed on the large amount of user behavior data generated within a certain period, the generated data is usually stored in real time, and the user behavior data generated in that period is then compared one by one with the user behavior data stored before that period. This determines the user behavior data newly added in the period and, in turn, the number of such pieces.

Because the amount of generated user behavior data is large, comparing the user behavior data of the period one by one with the user behavior data stored before it requires a large amount of computation and a cumbersome comparison process, which makes deduplicated data statistics inefficient.

Summary of the invention

To solve the problems in the prior art, embodiments of the present invention provide a deduplicated data statistics method and device. The technical solutions are as follows:

In a first aspect, a deduplicated data statistics method is provided, the method comprising:

obtaining multiple pieces of user behavior data collected within a specified time period, and obtaining the target data set identifier and target data fragment identifier to which the multiple pieces of user behavior data belong;

determining the storage state of the multiple pieces of user behavior data based on the multiple pieces of user behavior data, the target data set identifier, and the target data fragment identifier, the storage state being a stored state or a not-stored state;

counting, among the multiple pieces of user behavior data, the number of pieces of user behavior data in the not-stored state;

determining the counted number as the deduplicated data count for the specified time period.

Optionally, determining the storage state of the multiple pieces of user behavior data based on the multiple pieces of user behavior data, the target data set identifier, and the target data fragment identifier comprises:

for each piece of user behavior data among the multiple pieces of user behavior data, determining, by a first random generator, the target data partition identifier to which the user behavior data belongs;

determining, based on the target data partition identifier, multiple different second random generators, and determining a third random generator and a target data bitmap;

determining, based on the user behavior data, the corresponding multiple data positions in the target data bitmap by means of the multiple different second random generators and the third random generator;

determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions.

Optionally, determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions comprises:

determining the target data partition in which the user behavior data is located based on the target data set identifier, the target data fragment identifier, and the target data partition identifier;

determining the storage state of the user behavior data based on the multiple data positions and the target data partition.

Optionally, determining the storage state of the user behavior data based on the multiple data positions and the target data partition comprises:

for each data segment among the multiple data segments included in the target data partition, obtaining the multiple values stored at the multiple data positions in the data bitmap of the data segment;

if the multiple values are all the specified value, determining that the storage state of the user behavior data is the stored state;

if the multiple values are not all the specified value, determining that the user behavior data is not stored in the data segment;

if the user behavior data is not stored in any of the multiple data segments, determining that the storage state of the user behavior data is the not-stored state.

Optionally, after determining that the storage state of the user behavior data is the not-stored state, the method further comprises:

obtaining a target data segment from the multiple data segments, the target data segment being a data segment that is not yet full;

obtaining the multiple target values stored at the multiple data positions in the data bitmap of the target data segment;

setting the multiple target values to the specified value, so as to store the user behavior data.

In a second aspect, a deduplicated data statistics device is provided, the device comprising:

an obtaining module, configured to obtain multiple pieces of user behavior data collected within a specified time period, and to obtain the target data set identifier and target data fragment identifier to which the multiple pieces of user behavior data belong;

a first determining module, configured to determine the storage state of the multiple pieces of user behavior data based on the multiple pieces of user behavior data, the target data set identifier, and the target data fragment identifier, the storage state being a stored state or a not-stored state;

a statistics module, configured to count, among the multiple pieces of user behavior data, the number of pieces of user behavior data in the not-stored state;

a second determining module, configured to determine the counted number as the deduplicated data count for the specified time period.

Optionally, the first determining module includes:

a first determining unit, configured to determine, for each piece of user behavior data among the multiple pieces of user behavior data, the target data partition identifier to which the user behavior data belongs by means of a first random generator;

a second determining unit, configured to determine multiple different second random generators based on the target data partition identifier, and to determine a third random generator and a target data bitmap;

a third determining unit, configured to determine, based on the user behavior data, the corresponding multiple data positions in the target data bitmap by means of the multiple different second random generators and the third random generator;

a fourth determining unit, configured to determine the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions.

Optionally, the fourth determining unit includes:

a first determining subunit, configured to determine the target data partition in which the user behavior data is located based on the target data set identifier, the target data fragment identifier, and the target data partition identifier;

a second determining subunit, configured to determine the storage state of the user behavior data based on the multiple data positions and the target data partition.

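Optionally, the second determining subunit is configured to:

for each data segment among the multiple data segments included in the target data partition, obtain the multiple values stored at the multiple data positions in the data bitmap of the data segment;

if the multiple values are all the specified value, determine that the storage state of the user behavior data is the stored state;

if the multiple values are not all the specified value, determine that the user behavior data is not stored in the data segment;

if the user behavior data is not stored in any of the multiple data segments, determine that the storage state of the user behavior data is the not-stored state.

Optionally, the second determining subunit is further configured to:

obtain a target data segment from the multiple data segments, the target data segment being a data segment that is not yet full;

obtain the multiple target values stored at the multiple data positions in the data bitmap of the target data segment;

set the multiple target values to the specified value, so as to store the user behavior data.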

In the embodiments of the present invention, the server obtains multiple pieces of user behavior data collected within a specified time period, together with the target data set identifier and target data fragment identifier to which they belong. Based on the user behavior data, the target data set identifier, and the target data fragment identifier, the server can quickly determine the storage state of each piece of user behavior data, that is, quickly determine which pieces are newly added. The amount of computation is small and the determination process is relatively simple, which can substantially save the server's processing resources. After determining the storage states, the server can also count the pieces of user behavior data in the not-stored state and determine that count as the deduplicated data count for the specified time period, so that the number of pieces of user behavior data newly added within the specified time period is determined quickly and the efficiency of deduplicated data statistics is improved.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present invention.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the implementation environment involved in a deduplicated data statistics method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a deduplicated data statistics method provided by an embodiment of the present invention;

Fig. 3 is a flowchart of another deduplicated data statistics method provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of determining multiple data positions provided by an embodiment of the present invention;

Fig. 5 is a flowchart of yet another deduplicated data statistics method provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a deduplicated data statistics device provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a first determining module provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a fourth determining unit provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

Fig. 1 is a schematic diagram of the implementation environment involved in a deduplicated data statistics method provided by an embodiment of the present invention. Referring to Fig. 1, the implementation environment may include a server 101, which may be a single server or a server cluster composed of several node servers. The server 101 is provided with a client module, a meta storage module, a core module, and a service module. The client module is used to obtain user behavior data, and to obtain the data set identifier and data fragment identifier to which the user behavior data belongs. The meta storage module is used to store the correspondence between data set identifiers and port addresses. The core module is used to determine, based on user behavior data, the multiple data positions corresponding to that user behavior data. The service module is used to store multiple data sets; each data set includes multiple data fragments, each data fragment includes multiple data partitions, and each data partition in turn includes multiple data segments.

When the server 101 is a server cluster composed of several node servers, the service module can send a heartbeat message to the meta storage module. The heartbeat message carries the data set identifiers of the data sets that the service module holds and the port address of the node server on which the service module runs. Later, the client module can send an address acquisition request to the meta storage module, carrying the data set identifier to which the user behavior data obtained by the client module belongs. After receiving the address acquisition request, the meta storage module can obtain the port address corresponding to that data set identifier and send it to the client module. After receiving the port address, the client module can establish a connection with the service module of the corresponding node server based on the port address. Once that service module and the client module are connected, the core module can be called to judge whether the user behavior data is stored in the service module, thereby implementing deduplicated data statistics.
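The heartbeat registration and address lookup described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `MetaStore`, its method names, and the example addresses are all hypothetical, and network transport is left out entirely.

```python
from typing import Dict, Iterable, Optional


class MetaStore:
    """Hypothetical stand-in for the meta storage module."""

    def __init__(self) -> None:
        # Correspondence between data set identifiers and port addresses.
        self._address_by_set: Dict[str, str] = {}

    def on_heartbeat(self, data_set_ids: Iterable[str], port_address: str) -> None:
        # A heartbeat carries the data set identifiers held by a service module
        # and the port address of the node server it runs on.
        for set_id in data_set_ids:
            self._address_by_set[set_id] = port_address

    def lookup(self, data_set_id: str) -> Optional[str]:
        # Answer a client module's address acquisition request.
        return self._address_by_set.get(data_set_id)


meta = MetaStore()
meta.on_heartbeat(["game", "bank"], "10.0.0.1:9000")  # example address, assumed
meta.on_heartbeat(["video"], "10.0.0.2:9000")
address = meta.lookup("game")
```

With the returned address, a client module would then connect to the service module on that node server; an unknown identifier yields `None` in this sketch.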

Fig. 2 is a flowchart of a deduplicated data statistics method provided by an embodiment of the present invention. Referring to Fig. 1, the method is used in a server and includes the following steps:

Step 201: Obtain multiple pieces of user behavior data collected within a specified time period, and obtain the target data set identifier and target data fragment identifier to which the multiple pieces of user behavior data belong.

Step 202: Determine the storage state of the multiple pieces of user behavior data based on the multiple pieces of user behavior data, the target data set identifier, and the target data fragment identifier, the storage state being a stored state or a not-stored state.

Step 203: Count, among the multiple pieces of user behavior data, the number of pieces of user behavior data in the not-stored state.

Step 204: Determine the counted number as the deduplicated data count for the specified time period.
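Steps 201 to 204 can be sketched as a single counting loop. The sketch below is only an illustration: `count_deduplicated` is a hypothetical name, and a plain Python `set` stands in for the bitmap-backed store that the embodiments describe. It also stores each new record as it is counted, which corresponds to the optional storing step rather than to steps 201 to 204 themselves.

```python
from typing import Callable, Iterable


def count_deduplicated(
    records: Iterable[str],
    is_stored: Callable[[str], bool],
    store: Callable[[str], None],
) -> int:
    """Count the records that are in the not-stored state."""
    new_count = 0
    for record in records:
        if not is_stored(record):   # step 202: determine the storage state
            new_count += 1          # step 203: count not-stored records
            store(record)           # optional follow-up: store the new record
    return new_count                # step 204: the deduplicated data count


seen = set()  # assumed stand-in for the bitmap-backed store
window = ["u1:login", "u2:login", "u1:login", "u3:pay"]
dedup_count = count_deduplicated(window, seen.__contains__, seen.add)
```

Because the repeated `"u1:login"` is already stored when it is seen the second time, only three records are counted.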

Optionally, determining the storage state of the multiple pieces of user behavior data based on the multiple pieces of user behavior data, the target data set identifier, and the target data fragment identifier comprises:

for each piece of user behavior data among the multiple pieces of user behavior data, determining, by a first random generator, the target data partition identifier to which the user behavior data belongs;

determining, based on the target data partition identifier, multiple different second random generators, and determining a third random generator and a target data bitmap;

determining, based on the user behavior data, the corresponding multiple data positions in the target data bitmap by means of the multiple different second random generators and the third random generator;

determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions.

Optionally, determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions comprises:

determining the target data partition in which the user behavior data is located based on the target data set identifier, the target data fragment identifier, and the target data partition identifier;

determining the storage state of the user behavior data based on the multiple data positions and the target data partition.

Optionally, determining the storage state of the user behavior data based on the multiple data positions and the target data partition comprises:

for each data segment among the multiple data segments included in the target data partition, obtaining the multiple values stored at the multiple data positions in the data bitmap of the data segment;

if the multiple values are all the specified value, determining that the storage state of the user behavior data is the stored state;

if the multiple values are not all the specified value, determining that the user behavior data is not stored in the data segment;

if the user behavior data is not stored in any of the multiple data segments, determining that the storage state of the user behavior data is the not-stored state.
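The per-segment check above can be sketched as follows. This is an illustration under stated assumptions: each data bitmap is modeled as a list of 0/1 integers, the specified value is assumed to be 1, data positions are treated as zero-based indexes, and the function names are hypothetical.

```python
from typing import List


def is_stored_in_segment(bitmap: List[int], positions: List[int], specified: int = 1) -> bool:
    """True if every one of the data positions holds the specified value."""
    return all(bitmap[p] == specified for p in positions)


def storage_state(segment_bitmaps: List[List[int]], positions: List[int]) -> str:
    """Check each data segment of the partition in turn."""
    for bitmap in segment_bitmaps:
        if is_stored_in_segment(bitmap, positions):
            return "stored"
    return "not stored"


positions = [2, 7, 9, 10, 14]
segment_a = [int(c) for c in "001000001000000"]  # only some positions are set
segment_b = [0] * 15
for p in positions:
    segment_b[p] = 1                             # all positions are set

state_one_segment = storage_state([segment_a], positions)
state_two_segments = storage_state([segment_a, segment_b], positions)
```

A check of this kind resembles a Bloom filter membership test: a match on all positions can in principle be produced by other records that happen to set the same bits, a property the source text does not discuss.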

Optionally, after determining that the storage state of the user behavior data is the not-stored state, the method further comprises:

obtaining a target data segment from the multiple data segments, the target data segment being a data segment that is not yet full;

obtaining the multiple target values stored at the multiple data positions in the data bitmap of the target data segment;

setting the multiple target values to the specified value, so as to store the user behavior data.

All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
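Storing a record in a data segment that is not yet full can be sketched as below. The passage does not say how fullness is measured, so the `capacity` and `stored` fields are assumptions made for illustration, as are the names `Segment` and `store_record` and the specified value of 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    bitmap: List[int]
    capacity: int      # assumed: number of records after which the segment is full
    stored: int = 0    # assumed: number of records stored so far


def store_record(segments: List[Segment], positions: List[int], specified: int = 1) -> Optional[Segment]:
    """Set the data positions in the first segment that is not yet full."""
    for segment in segments:
        if segment.stored < segment.capacity:
            for p in positions:
                segment.bitmap[p] = specified
            segment.stored += 1
            return segment
    return None  # every segment is full


segments = [Segment([0] * 15, capacity=0), Segment([0] * 15, capacity=2)]
target = store_record(segments, [2, 7, 9, 10, 14])
```

Here the first segment is treated as full, so the record is written to the second one.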

Fig. 3 is a flowchart of a deduplicated data statistics method provided by an embodiment of the present invention. Referring to Fig. 2, the method includes the following steps:

Step 301: The server obtains multiple pieces of user behavior data collected within a specified time period, and obtains the target data set identifier and target data fragment identifier to which the multiple pieces of user behavior data belong.

It should be noted that a client module may be provided on the server. The client module can obtain the multiple pieces of user behavior data collected within the specified time period, and obtain the target data set identifier and target data fragment identifier to which they belong.

It should be noted that the specified time period can be preset. For example, it may be the 5 minutes immediately before the current time, the 1 hour immediately before the current time, and so on. Of course, the specified time period may also be a specific period, such as 23:25~23:40 or 17:10~18:10, which is not specifically limited in the embodiments of the present invention.
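The two kinds of specified time period mentioned above, a trailing period ending at the current time and a fixed clock period, can be sketched as follows. The helper names and the example date are hypothetical and chosen only for illustration.

```python
from datetime import datetime, time, timedelta


def trailing_window(now: datetime, length: timedelta):
    """Return (start, end) for the period of the given length ending at `now`."""
    return now - length, now


def fixed_window(day: datetime, start: time, end: time):
    """Return (start, end) for a fixed clock period on the given day."""
    return datetime.combine(day.date(), start), datetime.combine(day.date(), end)


def in_window(timestamp: datetime, window) -> bool:
    start, end = window
    return start <= timestamp <= end


now = datetime(2015, 11, 18, 23, 45)                          # example "current time"
last_five_minutes = trailing_window(now, timedelta(minutes=5))
late_evening = fixed_window(now, time(23, 25), time(23, 40))  # 23:25~23:40
```

A record timestamped 23:42 falls inside `last_five_minutes` but outside `late_evening`.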

In addition, the target data set identifier uniquely identifies the target data set. For example, the target data set identifier may be the service type to which the target data set belongs; the service type may include game services, banking services, and so on, which is not specifically limited in the embodiments of the present invention.

Furthermore, the target data fragment identifier uniquely identifies the target data fragment, and may be the sub-service type to which the target data fragment belongs. For example, when the service type is a game service, the sub-service type may be a recharge sub-service, a chat sub-service, and so on, which is not specifically limited in the embodiments of the present invention.

When obtaining the multiple pieces of user behavior data collected within the specified time period, the server may do so once every first specified duration, or may do so when it receives a data acquisition instruction.

It should be noted that the first specified duration can be preset; for example, it may be 1 hour, 2 hours, and so on, which is not specifically limited in the embodiments of the present invention.

In addition, the data acquisition instruction instructs the server to obtain the multiple pieces of user behavior data collected within the specified time period. The instruction may be triggered by a user through the client module of the server, or may be sent to the server by another device, which is not specifically limited in the embodiments of the present invention.

Furthermore, the server may obtain the multiple pieces of user behavior data collected within the specified time period from a database. When the database stores user behavior data, it can also store the attribute information of that data, which may include the service type, sub-service type, and so on to which the user behavior data belongs. Therefore, when obtaining the user behavior data, the server can obtain the target data set identifier and target data fragment identifier to which the data belongs based on its attribute information.

Step 302: The server determines the storage state of the multiple pieces of user behavior data based on the multiple pieces of user behavior data, the target data set identifier, and the target data fragment identifier, the storage state being a stored state or a not-stored state.

Specifically, for each piece of user behavior data among the multiple pieces of user behavior data, the server determines, by a first random generator, the target data partition identifier to which the user behavior data belongs; determines, based on the target data partition identifier, multiple different second random generators, a third random generator, and a target data bitmap; determines, based on the user behavior data, the corresponding multiple data positions in the target data bitmap by means of the multiple different second random generators and the third random generator; and determines the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions.

When determining the target data partition identifier by the first random generator, the server can generate a first value from the user behavior data by means of the first random generator, and determine the first value as the target data partition identifier to which the user behavior data belongs.

It should be noted that the process by which the server generates the first value from the user behavior data by means of the first random generator can be found in the related art and is not elaborated in the embodiments of the present invention.

In addition, the first random generator can be preset, which is not specifically limited in the embodiments of the present invention.

Furthermore, the target data partition identifier uniquely identifies the target data partition, and may be the number, name, or the like of the target data partition, which is not specifically limited in the embodiments of the present invention.
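One possible form of the first random generator is a hash of the record reduced to a partition number. The sketch below is an assumption made for illustration: the source leaves the generator to the related art, so the choice of SHA-256, the modulo reduction, and the partition count of 8 are not taken from it.

```python
import hashlib


def partition_identifier(record: str, num_partitions: int) -> int:
    """Map a piece of user behavior data to a partition number (the first value)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


first_value = partition_identifier("u1:login", num_partitions=8)
```

The same record always maps to the same partition, which is what allows a later lookup to find a record stored earlier.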

When determining the multiple different second random generators, the third random generator, and the target data bitmap based on the target data partition identifier, the server can look up the target data partition identifier in the stored correspondence among data partition identifiers, second random generators, third random generators, and target data bitmaps, and obtain the multiple different second random generators, the third random generator, and the target data bitmap from it.

It should be noted that a service module may be provided on the server, and the service module can store the correspondence among data partition identifiers, second random generators, third random generators, and target data bitmaps.

In addition, a data bitmap may be a bit array in which the value stored at every data position is a binary value; for example, a data bitmap may be 0101010 or 000001. The target data bitmap is the data bitmap corresponding to the target data partition identifier, and it has the same number of bits as the data bitmaps of the multiple data segments in the target data partition.

For example, when the target data partition identifier is ID1, the server can use ID1 to look up the correspondence shown in Table 1 below among data partition identifiers, second random generators, third random generators, and target data bitmaps, and obtain that the multiple different second random generators corresponding to ID1 are A1, A2, A3, A4, and A5, the third random generator is A6, and the target data bitmap is 001000001000000.

Table 1

Data partition identifier	Second random generators	Third random generator	Target data bitmap
ID1	A1, A2, A3, A4, A5	A6	001000001000000
ID2	B1, B2, B3, B4, B5	B6	000100
...	...	...	...

It should be noted that the embodiments of the present invention use the correspondence shown in Table 1 among data partition identifiers, second random generators, third random generators, and target data bitmaps only as an example; Table 1 does not limit the embodiments of the present invention.
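The correspondence in Table 1 can be sketched as a lookup table keyed by data partition identifier. The names `PartitionConfig`, `CORRESPONDENCE`, and `lookup_config` are hypothetical, and the generators appear only as the labels used in the table, not as real functions.

```python
from typing import Dict, List, NamedTuple


class PartitionConfig(NamedTuple):
    second_generators: List[str]   # labels of the second random generators
    third_generator: str           # label of the third random generator
    target_bitmap: str             # target data bitmap as a bit string


# The two rows of Table 1.
CORRESPONDENCE: Dict[str, PartitionConfig] = {
    "ID1": PartitionConfig(["A1", "A2", "A3", "A4", "A5"], "A6", "001000001000000"),
    "ID2": PartitionConfig(["B1", "B2", "B3", "B4", "B5"], "B6", "000100"),
}


def lookup_config(partition_id: str) -> PartitionConfig:
    """Return the generators and bitmap stored for a data partition identifier."""
    return CORRESPONDENCE[partition_id]


config = lookup_config("ID1")
```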

When determining the corresponding multiple data positions in the target data bitmap, the server can generate multiple pieces of data information from the user behavior data by means of the multiple different second random generators, and then, based on the multiple pieces of data information, determine the corresponding multiple data positions in the target data bitmap by means of the third random generator.

It should be noted that a core module may be provided on the server. The core module can determine, based on the user behavior data, the corresponding multiple data positions in the target data bitmap by means of the multiple different second random generators and the third random generator.

In addition, the multiple pieces of data information correspond uniquely to the user behavior data, and the multiple data positions correspond uniquely to the multiple pieces of data information; that is, the user behavior data corresponds uniquely to the multiple data positions.

When generating the multiple pieces of data information, the server can generate multiple different second values from the user behavior data by means of the multiple different second random generators, and determine the multiple different second values as the multiple pieces of data information corresponding to the user behavior data.

It should be noted that the process by which the server generates the multiple different second values by means of the multiple different second random generators can be found in the related art and is not elaborated in the embodiments of the present invention.

When determining the corresponding multiple data positions based on the multiple pieces of data information, the server can use the third random generator to map the multiple pieces of data information onto multiple data positions in the target data bitmap, and determine the data positions mapped to as the multiple data positions corresponding to the user behavior data.

It should be noted that the process by which the server maps the multiple pieces of data information onto multiple data positions in the target data bitmap by means of the third random generator can be found in the related art and is not specifically limited in the embodiments of the present invention.

For example, as shown in Fig. 4, the multiple different second random generators are A1, A2, A3, A4, and A5, the third random generator is A6, and the target data bitmap is 001000001000000. Based on the user behavior data, the server uses A1, A2, A3, A4, and A5 to generate the corresponding pieces of data information a1, a2, a3, a4, and a5, and then, based on these pieces of data information, uses A6 to determine that the corresponding data positions in the target data bitmap 001000001000000 are 2, 7, 9, 10, and 14.

It should be noted that the first random generator, the second random generators, and the third random generator may be hash function generators. Of course, they may also be other types of random generators, which is not specifically limited in the embodiments of the present invention.

When determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier, and the multiple data positions, the server can determine the target data partition in which the user behavior data is located based on the three identifiers, and then determine the storage state of the user behavior data based on the multiple data positions and the target data partition.

Specifically, the server can determine the target data set in which the user behavior data is located based on the target data set identifier; determine, from the target data set, the target data fragment in which the user behavior data is located based on the target data fragment identifier; determine, from the target data fragment, the target data partition in which the user behavior data is located based on the target data partition identifier; and then determine the storage state of the user behavior data based on the multiple data positions and the target data partition.
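The two-stage mapping in this example can be sketched with salted hashes. Everything concrete here is an assumption for illustration: the source leaves the generators to the related art, so the salted SHA-256 construction, the modulo mapping, and the function names are not taken from it, and the positions this sketch produces will generally differ from the 2, 7, 9, 10, and 14 of Fig. 4.

```python
import hashlib
from typing import Callable, List


def make_second_generator(salt: str) -> Callable[[str], int]:
    """Build one second random generator; each salt gives a different generator."""
    def generate(record: str) -> int:
        digest = hashlib.sha256((salt + ":" + record).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return generate


def third_generator(data_information: int, bitmap_length: int) -> int:
    """Map one piece of data information onto a position in the bitmap."""
    return data_information % bitmap_length


def data_positions(record: str, salts: List[str], bitmap_length: int) -> List[int]:
    generators = [make_second_generator(salt) for salt in salts]
    information = [generate(record) for generate in generators]   # a1..a5
    return [third_generator(value, bitmap_length) for value in information]


target_bitmap = "001000001000000"
positions = data_positions("u1:login", ["A1", "A2", "A3", "A4", "A5"], len(target_bitmap))
```

Unlike the example in Fig. 4, this simple modulo mapping does not guarantee five distinct positions; two generators may land on the same bit.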

It should be noted that the service module of the server can also store multiple data sets. Each data set may include multiple data fragments, each data fragment may include multiple data partitions, and each data partition may include multiple data segments whose data bitmaps all have the same number of bits.
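The four-level hierarchy and the lookup by identifiers can be sketched with nested dictionaries. The type aliases, the function name `find_partition`, and the example identifiers `"game"`, `"recharge"`, and `"ID1"` are illustrative assumptions; each data segment is reduced to its data bitmap.

```python
from typing import Dict, List

Bitmap = List[int]
Partition = List[Bitmap]           # a data partition holds several segment bitmaps
Fragment = Dict[str, Partition]    # partition identifier -> data partition
DataSet = Dict[str, Fragment]      # fragment identifier -> data fragment
Store = Dict[str, DataSet]         # data set identifier -> data set


def find_partition(store: Store, set_id: str, fragment_id: str, partition_id: str) -> Partition:
    """Walk data set -> data fragment -> data partition using the three identifiers."""
    data_set = store[set_id]
    fragment = data_set[fragment_id]
    return fragment[partition_id]


store: Store = {
    "game": {
        "recharge": {"ID1": [[0] * 15, [0] * 15]},   # two segments of 15 bits each
        "chat": {"ID1": [[0] * 15]},
    }
}
partition = find_partition(store, "game", "recharge", "ID1")
```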

In addition, in the service module can also include assembled unit, the assembled unit can position multiple data acquisition system, Logical relation between multiple data fragmentations, multiple data subregions and multiple data sectionals realizes seriation interface.

Furthermore, the service module may save the stored data sets to the local storage of the server every second specified duration. This prevents the stored data sets from being lost when the service module shuts down abnormally, achieves data persistence, and makes it easy to migrate the data sets quickly.

It should be noted that the second specified duration may be preset; for example, it may be 10 minutes, 15 minutes, and so on, which is not specifically limited in the embodiments of the present invention.

It should be noted that when the business generates relatively little user behavior data, the server may be a single server; when the business generates a large amount of user behavior data, the server may be a server cluster composed of several node servers, so that the larger volume of user behavior data can be processed in time.

When the server determines the target data set where the user behavior data is located based on the target data set identifier, there are two cases. If the server is a single server, it can determine the target data set directly from the target data set identifier. If the server is a server cluster composed of several node servers, the server can determine a target port address based on the target data set identifier, where the target port address is the port address of the node server in the cluster that stores the target data set. The server can then establish a connection with that node server based on the target port address and, based on the target data set identifier, determine the target data set where the user behavior data is located from that node server.

When the server determines the target port address based on the target data set identifier, it can obtain the target port address from a stored correspondence between data set identifiers and port addresses.

It should be noted that a first memory module may be provided on the server, and the correspondence between data set identifiers and port addresses may be stored in the first memory module.

For example, if the target data set identifier is ID1, the server can obtain the target port address 12344 from the correspondence between data set identifiers and port addresses shown in Table 2 below.

Table 2

Data set identifier	Port address
ID1	12344
ID2	25689
……	……

It should be noted that the embodiments of the present invention are described only by taking the correspondence between data set identifiers and port addresses shown in Table 2 as an example; Table 2 does not limit the embodiments of the present invention.
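The lookup against Table 2 amounts to a keyed read. A minimal sketch, assuming the correspondence is held as an in-memory mapping (the names `PORT_BY_DATA_SET` and `target_port` are hypothetical):

```python
from typing import Optional

# Hypothetical in-memory copy of Table 2: data set identifier -> port address.
PORT_BY_DATA_SET = {"ID1": 12344, "ID2": 25689}


def target_port(data_set_id: str, table: dict = PORT_BY_DATA_SET) -> Optional[int]:
    """Return the port address of the node server holding the data set, if known."""
    return table.get(data_set_id)


print(target_port("ID1"))  # 12344
```

An unknown identifier yields `None` in this sketch; how the embodiment handles that case is not stated.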

Further, for the case where the server is a server cluster composed of several node servers, the process by which the server determines the data set where the user behavior data is located based on the target data set identifier is explained below with reference to the modules that the server includes. As shown in Fig. 5, the client module sends an address acquisition request carrying the target data set identifier to the first memory module. On receiving the request, the first memory module obtains the target port address from the stored correspondence between data set identifiers and port addresses based on the target data set identifier, and sends the target port address to the client module. After receiving the target port address, the client module establishes a connection with the service module of the corresponding node server based on the target port address and, based on the target data set identifier, determines the target data set where the user behavior data is located from that service module.

It should be noted that, as shown in Fig. 5, before the first memory module obtains the target port address from the stored correspondence based on the target data set identifier, the service module may send a heartbeat message to the first memory module every third specified duration. The heartbeat message carries the data set identifiers of the data sets that the service module holds and the port address of the node server where the service module is located. The first memory module can thus establish the correspondence between data set identifiers and port addresses from these identifiers and the port address, and can judge from the heartbeat message whether the service module is running normally. When it determines that the service module is running abnormally, it does not send the target port address to the client module. This prevents the client module from connecting to the abnormal service module, which would make the determination of the target data set where the user behavior data is located inaccurate.
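The heartbeat-driven registration can be sketched as follows. The class name `MetaStore`, its methods, and in particular the rule that a service module counts as abnormal once no heartbeat has arrived within `timeout_s` seconds are assumptions for illustration; the text only states that normal operation is judged from the heartbeat message.

```python
import time


class MetaStore:
    """Hypothetical stand-in for the module that maps data sets to port addresses."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.port_by_set = {}   # data set identifier -> port address
        self.last_seen = {}     # port address -> time of the last heartbeat

    def heartbeat(self, port, data_set_ids, now=None):
        # Record the heartbeat and (re)build the identifier -> port correspondence.
        now = time.monotonic() if now is None else now
        self.last_seen[port] = now
        for data_set_id in data_set_ids:
            self.port_by_set[data_set_id] = port

    def lookup(self, data_set_id, now=None):
        # Return the target port address, or None if it is unknown or the
        # service module behind it has missed its heartbeat (assumed rule).
        now = time.monotonic() if now is None else now
        port = self.port_by_set.get(data_set_id)
        if port is None:
            return None
        if now - self.last_seen[port] > self.timeout_s:
            return None
        return port
```

Passing `now` explicitly keeps the sketch deterministic; a real service module would simply rely on the clock.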

It should be noted that the third specified duration may be preset; for example, it may be 5 minutes, 10 minutes, and so on, which is not specifically limited in the embodiments of the present invention.

When the server determines the storage state of the user behavior data based on the multiple data positions and the target data partition, then for each of the data segments that the target data partition includes, the server obtains the values stored at the multiple data positions in the data bitmap of that data segment. If all of these values are the specified value, the server determines that the storage state of the user behavior data is the stored state. If the values are not all the specified value, the server determines that the user behavior data is not stored in that data segment. If the user behavior data is not stored in any of the data segments that the data partition includes, the server determines that the storage state of the user behavior data is the not-stored state.

The specified value may be preset; for example, it may be 1, which is not specifically limited in the embodiments of the present invention.

For example, suppose the specified value is 1, the data bitmap of the data segment is 110010101100110, and the multiple data positions are 2, 7, 9, 10 and 14. The values that the server reads at data positions 2, 7, 9, 10 and 14 of the data bitmap are 1, 1, 1, 1 and 1 respectively. All of them are the specified value, so the server determines that the storage state of the user behavior data is the stored state.

As another example, suppose the specified value is 1, the data bitmap of the data segment is 110010100000100, and the multiple data positions are 2, 7, 9, 10 and 14. The values that the server reads at data positions 2, 7, 9, 10 and 14 of the data bitmap are 1, 1, 0, 0 and 0 respectively. They are not all the specified value, so the server determines that the user behavior data is not stored in this data segment. If the user behavior data is not stored in any of the data segments that the data partition includes, the server determines that the storage state of the user behavior data is the not-stored state.
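The two worked examples can be reproduced with a few lines of code. This is a minimal sketch that assumes bitmaps are written as strings and that data positions are counted from 1 starting at the left, which is the reading under which both examples come out as stated; the function names are hypothetical.

```python
def segment_contains(bitmap: str, positions, specified: str = "1") -> bool:
    """True if every listed position (1-based, from the left) holds the specified value."""
    return all(bitmap[p - 1] == specified for p in positions)


def storage_state(segment_bitmaps, positions) -> str:
    """'stored' if any segment of the partition contains the record, else 'not stored'."""
    if any(segment_contains(b, positions) for b in segment_bitmaps):
        return "stored"
    return "not stored"


positions = [2, 7, 9, 10, 14]
print(storage_state(["110010101100110"], positions))  # stored
print(storage_state(["110010100000100"], positions))  # not stored
```

A record counts as stored as soon as one segment matches, so a partition with several segments is checked segment by segment.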

It should be noted that the server may determine the storage states of the multiple user behavior data through its service module, based on the multiple user behavior data, the target data set identifier and the target data fragment identifier.

Specifically, for each user behavior data in the multiple user behavior data, the service module determines, through the first random generator, the target data partition identifier to which the user behavior data belongs and, based on the target data partition identifier, determines multiple different second random generators, a third random generator and a target data bitmap. The service module can then call the core module to determine the multiple data positions corresponding to the user behavior data based on the second random generators, the third random generator and the target data bitmap, and determine the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier and the multiple data positions.
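How a record might be mapped to a partition identifier and to several data positions can be sketched with salted hashes. Everything specific here is an assumption: the use of SHA-256, the salts, the modulo mapping, and the omission of the third random generator, whose exact role is not restated in this passage.

```python
import hashlib


def _hash(data: str, salt: str) -> int:
    # Deterministic 64-bit value derived from a salted SHA-256 digest.
    digest = hashlib.sha256((salt + ":" + data).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_id(record: str, num_partitions: int) -> int:
    """Stand-in for the first random generator: pick a partition for the record."""
    return _hash(record, "first") % num_partitions


def data_positions(record: str, k: int, m: int) -> list:
    """Stand-in for k different second random generators: k positions in 1..m."""
    return [_hash(record, f"second-{i}") % m + 1 for i in range(k)]
```

Positions produced this way may repeat for a given record, which only means that fewer distinct bits are touched.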

Further, after the server determines that the storage state of the user behavior data is the not-stored state, it may obtain a target data segment from the data segments that the target data partition includes, where the target data segment is a data segment that is not yet full. The server obtains the target values stored at the multiple data positions in the data bitmap of the target data segment and sets these target values to the specified value, thereby storing the user behavior data.

For example, suppose the specified value is 1, the data bitmap of the target data segment is 110010100000100, and the multiple data positions are 2, 7, 9, 10 and 14. The target values that the server reads at these data positions are 1, 1, 0, 0 and 0. The server sets these target values to the specified value 1; that is, the data bitmap of the target data segment changes from 110010100000100 to 110010101100110, so that the user behavior data is stored in the target data segment.
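The bitmap update in this example can be checked directly. A minimal sketch, assuming string bitmaps and 1-based positions counted from the left (the function name `store_record` is hypothetical):

```python
def store_record(bitmap: str, positions, specified: str = "1") -> str:
    """Return a copy of the bitmap with every listed position set to the specified value."""
    bits = list(bitmap)
    for p in positions:
        bits[p - 1] = specified
    return "".join(bits)


print(store_record("110010100000100", [2, 7, 9, 10, 14]))  # 110010101100110
```

Positions that already hold the specified value are simply left as they are.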

When the server obtains the target data segment from the data segments that the target data partition includes, it may obtain, from the target data partition, the data segment on which a target mark is set, and determine that data segment as the target data segment.

It should be noted that the target mark is used to mark a data segment in the data partition that is not yet full. The target mark may be preset, which is not specifically limited in the embodiments of the present invention.

Further, before the server obtains the data segment on which the target mark is set from the target data partition, then for each of the data segments that the target data partition includes, the server may judge whether the number of user behavior data stored in that data segment is less than a specified number. When the number is less than the specified number, the server determines that the data segment is not full and sets the target mark on it. When the number is equal to the specified number, the server determines that the data segment is full and does not set the target mark on it. When the server determines that all of the data segments in the target data partition are full, it may create a new data segment in the target data partition and set the target mark on the new data segment, thereby realizing automatic dynamic expansion of the target data partition.
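The expansion rule can be sketched as below. The `Partition` class and its field names are hypothetical, and instead of maintaining a separate target mark the sketch derives "not full" on demand by comparing each segment's count with the specified number; that simplification is an assumption, not the embodiment's mechanism.

```python
class Partition:
    """Hypothetical data partition that grows by one segment when all are full."""

    def __init__(self, bitmap_len: int, capacity: int):
        self.bitmap_len = bitmap_len
        self.capacity = capacity    # the "specified number" of records per segment
        self.segments = []          # each segment: {"bits": [...], "count": int}

    def target_segment(self) -> dict:
        # A segment holding fewer records than the capacity is not full.
        for seg in self.segments:
            if seg["count"] < self.capacity:
                return seg
        # Every segment is full (or none exists yet): create a new one.
        seg = {"bits": [0] * self.bitmap_len, "count": 0}
        self.segments.append(seg)
        return seg

    def store(self, positions) -> None:
        seg = self.target_segment()
        for p in positions:
            seg["bits"][p - 1] = 1   # 1-based data positions
        seg["count"] += 1
```

With `capacity=2`, the third call to `store` lands in a newly created second segment.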

It should be noted that the specified number can be obtained based on a specified misjudgment probability.

The specified misjudgment probability may be preset, which is not specifically limited in the embodiments of the present invention.

It should be noted that the misjudgment probability (false-positive probability) is the probability that, when determining the storage state of user behavior data, the server judges user behavior data that is not stored in a data segment to be stored in that data segment. To prevent an excessively high misjudgment probability from reducing the accuracy with which the server determines the storage state of the user behavior data, the misjudgment probability can be limited in advance; that is, a specified misjudgment probability can be preset.

When the number of data positions corresponding to a user behavior data is k, the number of bits in the data bitmap of a data segment is m, and the number of user behavior data stored in the data segment is n, the misjudgment probability f can be determined by the following formula (1):

Therefore, when the server obtains the specified number based on the specified misjudgment probability, f in formula (1) can be set to the specified misjudgment probability, and the number n of user behavior data stored in the data segment can be calculated from formula (1); this n is the specified number.
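Formula (1) itself is not reproduced in this text. As a minimal sketch, assuming it is the standard Bloom-filter estimate f = (1 - e^(-kn/m))^k with the symbols defined above, the specified number follows by solving for n, which gives n = -(m/k) ln(1 - f^(1/k)). The function names are hypothetical.

```python
import math


def misjudgment_probability(k: int, m: int, n: int) -> float:
    """Assumed form of formula (1): the standard Bloom-filter approximation."""
    return (1.0 - math.exp(-k * n / m)) ** k


def specified_number(f: float, k: int, m: int) -> int:
    """Largest n whose estimated misjudgment probability stays at or below f."""
    return math.floor(-(m / k) * math.log(1.0 - f ** (1.0 / k)))
```

If the formula in the original differs, for example by using the exact form (1 - (1 - 1/m)^(kn))^k, the same approach applies with that expression inverted instead.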

It should also be noted that, because a data segment stores user behavior data through a data bitmap, it can hold a large amount of user behavior data while occupying little storage space. Because the server determines the storage states of the multiple user behavior data based on data segments, it can complete this determination with relatively little memory, which improves the memory utilization of the server.

Step 303: The server counts, from the multiple user behavior data, the number of user behavior data in the not-stored state, and determines the counted number as the deduplicated data count in the designated time period.

It should be noted that the deduplicated data count in the designated time period is the number of user behavior data newly added in the designated time period.

Since the user behavior data in the not-stored state are exactly the user behavior data newly added in the designated time period, the server can quickly determine the deduplicated data count in the designated time period by counting the user behavior data in the not-stored state, which improves the efficiency of deduplicated data statistics.
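The counting step can be sketched independently of the bitmap machinery. Here a Python `set` stands in for the stored data; that substitution is an assumption made to keep the sketch short (a set is exact, whereas bitmap segments admit a small misjudgment probability), and the function name is hypothetical.

```python
def dedup_count(records, is_stored, mark_stored) -> int:
    """Count the records that are in the not-stored state, storing each one once seen."""
    new_records = 0
    for record in records:
        if not is_stored(record):
            new_records += 1
            mark_stored(record)
    return new_records


seen = set()  # stand-in for the data segments
print(dedup_count(["u1", "u2", "u1", "u3"], seen.__contains__, seen.add))  # 3
print(dedup_count(["u3", "u4"], seen.__contains__, seen.add))              # 1
```

The second call shows that a record already stored in an earlier period does not count as newly added.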

It should be noted that when the server is a server cluster, as shown in Fig. 5, after the service module of the corresponding node server determines the deduplicated data count, it can send the deduplicated data count to the client module of the server. Of course, the service module can also send the storage states of the multiple user behavior data to the client module of the server.

In the embodiments of the present invention, the server obtains multiple user behavior data collected within a designated time period. For each user behavior data, the server determines, through the first random generator, the target data partition identifier to which the user behavior data belongs and, based on the target data partition identifier, determines the multiple data positions corresponding to the user behavior data. The server then determines the target data partition where the user behavior data is located based on the target data set identifier, the target data fragment identifier and the target data partition identifier to which the user behavior data belongs. Based on the data bitmaps of the data segments that the target data partition includes and on the multiple data positions, the server can quickly determine the storage state of the user behavior data; that is, it can quickly determine which of the multiple user behavior data are newly added. The amount of calculation is small and the determination process is simple, which can save a large amount of the server's processing resources. When the server determines that the storage state of a user behavior data is the not-stored state, it can store the user behavior data in the target data partition. When all of the data segments in the target data partition are full, the server can automatically create a new data segment to store the user behavior data, realizing automatic dynamic expansion of the target data partition. Because a data segment stores user behavior data through a data bitmap, it can hold a large amount of user behavior data while occupying little storage space, so that the server can determine the storage states of the multiple user behavior data with relatively little memory, which improves the memory utilization of the server. After determining the storage states of the multiple user behavior data, the server can also count the number of user behavior data in the not-stored state and determine that number as the deduplicated data count in the designated time period. The number of user behavior data newly added in the designated time period can thus be determined quickly, which improves the efficiency of deduplicated data statistics.

Fig. 6 is a schematic structural diagram of a deduplicated data statistics device provided by an embodiment of the present invention. Referring to Fig. 6, the device includes an obtaining module 601, a first determining module 602, a statistics module 603 and a second determining module 604.

The obtaining module 601 is configured to obtain multiple user behavior data collected within a designated time period, and to obtain the target data set identifier and the target data fragment identifier to which the multiple user behavior data belong;

The first determining module 602 is configured to determine the storage states of the multiple user behavior data based on the multiple user behavior data, the target data set identifier and the target data fragment identifier, where a storage state is either the stored state or the not-stored state;

The statistics module 603 is configured to count, from the multiple user behavior data, the number of user behavior data in the not-stored state;

The second determining module 604 is configured to determine the counted number as the deduplicated data count in the designated time period.

Optionally, referring to Fig. 7, the first determining module 602 includes:

A first determining unit 6021, configured to determine, for each user behavior data in the multiple user behavior data and through a first random generator, the target data partition identifier to which the user behavior data belongs;

A second determining unit 6022, configured to determine, based on the target data partition identifier, multiple different second random generators, a third random generator and a target data bitmap;

A third determining unit 6023, configured to determine, based on the user behavior data, the corresponding multiple data positions from the target data bitmap through the multiple different second random generators and the third random generator;

A fourth determining unit 6024, configured to determine the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier and the multiple data positions.

Optionally, referring to Fig. 8, the fourth determining unit 6024 includes:

A first determining subunit 60241, configured to determine the target data partition where the user behavior data is located based on the target data set identifier, the target data fragment identifier and the target data partition identifier;

A second determining subunit 60242, configured to determine the storage state of the user behavior data based on the multiple data positions and the target data partition.

Optionally, the second determining subunit is configured to:

Optionally, the second determining subunit is further configured to:

It should be understood that, when the deduplicated data statistics device provided by the above embodiment performs deduplicated data statistics, the division into the functional modules described above is only an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the deduplicated data statistics device can be divided into different functional modules to complete all or part of the functions described above. In addition, the deduplicated data statistics device provided by the above embodiment and the deduplicated data statistics method embodiment belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A deduplicated data statistics method, characterized in that the method comprises:

obtaining multiple user behavior data collected within a designated time period, and obtaining the target data set identifier and the target data fragment identifier to which the multiple user behavior data belong;

determining the storage states of the multiple user behavior data based on the multiple user behavior data, the target data set identifier and the target data fragment identifier, where a storage state is either the stored state or the not-stored state;

determining the data count as the deduplicated data count in the designated time period;

wherein determining the storage states of the multiple user behavior data based on the multiple user behavior data, the target data set identifier and the target data fragment identifier comprises:

for each user behavior data in the multiple user behavior data, determining, through a first random generator, the target data partition identifier to which the user behavior data belongs, the first random generator being used to generate a first value;

determining, based on the target data partition identifier, multiple different second random generators, a third random generator and a target data bitmap;

determining, based on the user behavior data, the corresponding multiple data positions from the target data bitmap through the multiple different second random generators and the third random generator;

determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier and the multiple data positions.

2. The method according to claim 1, characterized in that determining the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier and the multiple data positions comprises:

determining the target data partition where the user behavior data is located based on the target data set identifier, the target data fragment identifier and the target data partition identifier;

determining the storage state of the user behavior data based on the multiple data positions and the target data partition.

3. The method according to claim 2, characterized in that determining the storage state of the user behavior data based on the multiple data positions and the target data partition comprises:

for each data segment in the multiple data segments that the target data partition includes, obtaining the multiple values stored at the multiple data positions in the data bitmap of the data segment;

if the multiple values are all the specified value, determining that the storage state of the user behavior data is the stored state;

if the multiple values are not all the specified value, determining that the user behavior data is not stored in the data segment;

if the user behavior data is not stored in any of the multiple data segments, determining that the storage state of the user behavior data is the not-stored state.

4. The method according to claim 3, characterized in that, after determining that the storage state of the user behavior data is the not-stored state, the method further comprises:

obtaining a target data segment from the multiple data segments, the target data segment being a data segment that is not yet full;

5. A deduplicated data statistics device, characterized in that the device comprises:

an obtaining module, configured to obtain multiple user behavior data collected within a designated time period, and to obtain the target data set identifier and the target data fragment identifier to which the multiple user behavior data belong;

a first determining module, configured to determine the storage states of the multiple user behavior data based on the multiple user behavior data, the target data set identifier and the target data fragment identifier, where a storage state is either the stored state or the not-stored state;

a statistics module, configured to count, from the multiple user behavior data, the number of user behavior data in the not-stored state;

a second determining module, configured to determine the counted number as the deduplicated data count in the designated time period;

wherein the first determining module comprises:

a first determining unit, configured to determine, for each user behavior data in the multiple user behavior data and through a first random generator, the target data partition identifier to which the user behavior data belongs;

a second determining unit, configured to determine, based on the target data partition identifier, multiple different second random generators, a third random generator and a target data bitmap;

a third determining unit, configured to determine, based on the user behavior data, the corresponding multiple data positions from the target data bitmap through the multiple different second random generators and the third random generator;

a fourth determining unit, configured to determine the storage state of the user behavior data based on the target data set identifier, the target data fragment identifier, the target data partition identifier and the multiple data positions.

6. The device according to claim 5, characterized in that the fourth determining unit comprises:

a first determining subunit, configured to determine the target data partition where the user behavior data is located based on the target data set identifier, the target data fragment identifier and the target data partition identifier;

a second determining subunit, configured to determine the storage state of the user behavior data based on the multiple data positions and the target data partition.

7. The device according to claim 6, characterized in that

the second determining subunit is configured to:

8. The device according to claim 7, characterized in that

the second determining subunit is further configured to: