CN115114283A

CN115114283A - Data processing method and device, computer readable medium and electronic equipment

Info

Publication number: CN115114283A
Application number: CN202210576032.0A
Authority: CN
Inventors: 张雨春
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-05-25
Filing date: 2022-05-25
Publication date: 2022-09-27

Abstract

The application discloses a data processing method, a data processing apparatus, a computer-readable medium and an electronic device. The method comprises: performing hash operation processing on each piece of data to be processed in a target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed; superposing the data to be processed onto the data at the positions in a preset data list that correspond to the hash algorithms and the hash values, to obtain superposed data corresponding to each hash algorithm; sorting the superposed data corresponding to each hash algorithm in the preset data list, and acquiring target superposed data and non-target data according to the sorting result; and calculating, according to the target superposed data and the non-target data, the data matching a set index among the plurality of data to be processed. In this technical scheme, the data to be processed is converted from data in the original value space into data occupying a smaller storage space, which greatly reduces the storage space required during data processing and saves the resources required for index calculation.

Description

Data processing method and device, computer readable medium and electronic equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, a computer-readable medium, and an electronic device.

Background

With the advent of the big data age, more and more raw data must be processed during data analysis. In general, raw data is stored record by record in the order in which it is generated. When the data is processed, the raw data is read from the corresponding storage locations for calculation, for example accumulated according to certain conditions, and the calculation results are then stored together for subsequent use. However, the larger the amount of raw data, the more storage space this storage method occupies. The number of storage locations that must be accessed during data processing also increases, which may reduce data processing efficiency.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

The present application aims to provide a data processing method, an apparatus, a computer-readable medium and an electronic device, so as to alleviate the problem in the related art that data processing occupies a large storage space.

Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.

According to an aspect of an embodiment of the present application, there is provided a data processing method, including:

performing hash operation processing on each data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each data to be processed;

according to at least one hash value corresponding to the data to be processed, overlapping the data to be processed to data in a preset data list at positions corresponding to the hash algorithm and the hash value to obtain overlapped data corresponding to each hash algorithm;

sorting the superposed data corresponding to each hash algorithm in the preset data list, acquiring target superposed data which corresponds to each hash algorithm and is matched with a set index according to a sorting result, and generating non-target data corresponding to each hash algorithm according to other superposed data except the target superposed data in the superposed data corresponding to each hash algorithm;

and calculating data matched with the set index in the plurality of data to be processed according to the target superposition data and the non-target data so as to obtain a data processing result corresponding to the target scene.

According to an aspect of an embodiment of the present application, there is provided a data processing apparatus including:

the hash operation module is used for carrying out hash operation processing on each data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each data to be processed;

the data superposition module is used for superposing the data to be processed to data at positions corresponding to the hash algorithm and the hash value in a preset data list according to at least one hash value corresponding to the data to be processed to obtain superposed data corresponding to each hash algorithm;

the data calculation module is used for sequencing the superposed data corresponding to each hash algorithm in the preset data list, acquiring target superposed data which correspond to each hash algorithm and are matched with a set index according to a sequencing result, and generating non-target data corresponding to each hash algorithm according to other superposed data except the target superposed data in the superposed data corresponding to each hash algorithm;

and the result generation module is used for calculating data matched with the set index in the plurality of data to be processed according to the target superposition data and the non-target data so as to obtain a data processing result corresponding to the target scene.

In one embodiment of the application, the data to be processed comprises data in the form of key-value pairs; the hash operation module is specifically configured to: carrying out hash operation processing on keys of each data to be processed in the target scene through at least one hash algorithm;

the data superposition module is specifically configured to: and according to at least one hash value corresponding to the data to be processed, superposing the value of the data to be processed to data at a position corresponding to the hash algorithm and the hash value in a preset data list.

In one embodiment of the present application, the hash algorithm includes a hash function operation and a modulo operation; the hash operation module comprises:

the hash calculation unit is used for carrying out hash calculation on the key of each piece of data to be processed in the target scene through at least one hash function operation to obtain at least one hash result of each piece of data to be processed;

the modulo operation unit is used for performing a modulo operation on the at least one hash result of each piece of data to be processed with respect to a preset hash bucket number, and taking the result of the modulo operation as the at least one hash value corresponding to each piece of data to be processed; the preset hash bucket number indicates the size of the storage space occupied by the preset data list.
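The two-stage hash algorithm described above (a hash function operation followed by a modulo operation) might be sketched as follows. This is a minimal illustration only: the function name `hash_value`, the use of SHA-256 and the bucket number 16 are assumptions introduced for the example and are not specified by the application.

```python
import hashlib

def hash_value(key: str, num_buckets: int) -> int:
    """Hash function operation followed by a modulo operation."""
    # Hash function operation. SHA-256 is an illustrative choice only.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hash_result = int.from_bytes(digest[:8], "big")
    # Modulo operation with respect to the preset hash bucket number.
    return hash_result % num_buckets

NUM_BUCKETS = 16  # hypothetical preset hash bucket number
position = hash_value("123", NUM_BUCKETS)
print(position)  # always an index in the range [0, NUM_BUCKETS)
```

Because the modulo result is always smaller than the bucket number, the bucket number bounds the number of storage positions that one hash algorithm can address.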

In an embodiment of the application, the overlay data corresponding to the hash algorithm includes overlay data stored in a plurality of hash buckets, and one hash bucket represents one storage location in the preset data list; the data calculation module comprises:

the target superposed data generating unit is used for acquiring superposed data which correspond to each hash algorithm and are stored in a set number of hash buckets matched with set indexes according to the sequencing result; and summing the superposed data stored in the hash buckets with the set number to obtain target superposed data corresponding to each hash algorithm.
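As a rough sketch of this unit, the superposed data of one hash algorithm can be sorted and the largest buckets summed. The sketch assumes that the sorting is in descending order and that the set number of hash buckets equals the value k of a "top k" set index; the function name and the sample values are hypothetical.

```python
def split_superposed(row: list[int], k: int) -> tuple[int, int]:
    """Sort one hash algorithm's buckets and split them at position k.

    Returns (target superposed data, sum of the other superposed data).
    """
    ordered = sorted(row, reverse=True)  # sorting result, largest first
    target_sum = sum(ordered[:k])        # the k buckets matching the set index
    other_sum = sum(ordered[k:])         # all remaining buckets
    return target_sum, other_sum

# Hypothetical superposed data stored in five hash buckets.
target, other = split_superposed([5, 350, 20, 90, 10], k=3)
print(target, other)  # 460 15
```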

In one embodiment of the present application, the data calculation module comprises:

the numerical expectation calculation unit is used for generating a numerical expectation of non-target data according to other superposed data except the target superposed data in the superposed data corresponding to the Hash algorithm and the data volume to be processed corresponding to the other superposed data;

the quantity expectation calculation unit is used for calculating the quantity expectation of the non-target data according to the data quantity to be processed corresponding to the target superposition data;

and the non-target data calculation unit is used for generating the non-target data corresponding to the hash algorithm according to the product of the numerical value expectation of the non-target data and the quantity expectation of the non-target data.

In one embodiment of the present application, the numerical expectation calculation unit includes:

a non-target superimposed data generating subunit, configured to sum up other superimposed data, except the target superimposed data, in the superimposed data corresponding to the hash algorithm, to obtain non-target superimposed data corresponding to the hash algorithm;

the data duplication removing subunit is used for determining duplication removing numbers of the data to be processed corresponding to the other superposed data according to the data amount to be processed corresponding to the other superposed data;

and the numerical expectation calculating subunit is used for obtaining the numerical expectation of the non-target data according to the ratio of the non-target superposed data to the deduplication number of the data to be processed corresponding to the other superposed data.

In an embodiment of the present application, the data deduplication subunit is specifically configured to:

according to keys of the data to be processed, carrying out duplicate removal processing on a plurality of data to be processed in the target scene to obtain a total number of duplicate removal of the data to be processed;

according to keys of the data to be processed, carrying out duplicate removal processing on a plurality of data to be processed corresponding to the target superposed data to obtain a duplicate removal number of the data to be processed corresponding to the target superposed data;

and obtaining the duplication eliminating numbers of the data to be processed corresponding to other superposition data according to the difference value between the duplication eliminating total number of the data to be processed and the duplication eliminating number of the data to be processed corresponding to the target superposition data.
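The numerical expectation described in the two embodiments above can be illustrated with a short sketch: the deduplication number for the other superposed data is the total deduplication number minus the deduplication number for the target superposed data, and the numerical expectation is the non-target superposed data divided by that difference. The function name, the guard against an empty remainder and the sample keys are assumptions made for the example.

```python
def value_expectation(non_target_sum: float,
                      all_keys: list[str],
                      target_keys: list[str]) -> float:
    """Average value contributed by one key outside the target buckets."""
    total_distinct = len(set(all_keys))       # total deduplication number
    target_distinct = len(set(target_keys))   # deduplication number for target data
    other_distinct = total_distinct - target_distinct
    if other_distinct <= 0:
        return 0.0  # assumption: no other keys means nothing to average
    return non_target_sum / other_distinct

# Hypothetical keys: five distinct keys overall, two of them in target buckets.
all_keys = ["a", "b", "a", "c", "d", "e", "b"]
target_keys = ["a", "b", "a", "b"]
expectation = value_expectation(15, all_keys, target_keys)
print(expectation)  # 5.0
```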

In one embodiment of the present application, the quantity expectation calculation unit includes:

the first calculating subunit is configured to generate an expected amount of the target superimposed data according to a preset hash bucket number, a total deduplication amount of to-be-processed data corresponding to the target scene, and a preset fitting function;

and the second calculating subunit is used for obtaining the quantity expectation of the non-target data according to the difference between the quantity expectation of the target superposed data and the numerical value of the set index.
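The following sketch combines the quantity expectation with the numerical expectation to produce the non-target data as their product. The application does not give the form of the preset fitting function here, so `toy_fit` below is a purely hypothetical linear stand-in, and all names and numbers are illustrative.

```python
from typing import Callable

def non_target_data(value_expectation: float,
                    num_buckets: int,
                    total_distinct: int,
                    index_value: int,
                    fit: Callable[[int, int], float]) -> float:
    """Numerical expectation multiplied by quantity expectation."""
    # Quantity expectation of the target superposed data, from the fitting function.
    expected_in_target = fit(num_buckets, total_distinct)
    # Quantity expectation of the non-target data: subtract the set index value.
    quantity_expectation = expected_in_target - index_value
    return value_expectation * quantity_expectation

def toy_fit(num_buckets: int, total_distinct: int) -> float:
    # Hypothetical fitting function with assumed coefficients 3.0 and 3.0.
    return 3.0 * total_distinct / num_buckets + 3.0

estimate = non_target_data(2.5, num_buckets=10, total_distinct=40,
                           index_value=3, fit=toy_fit)
print(estimate)  # 30.0
```

Passing the fitting function as a parameter keeps the calculation independent of how the fitting coefficients were obtained.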

In one embodiment of the present application, the quantity expectation calculation unit further includes:

the fitting function construction unit is used for constructing, according to the fitting coefficient to be determined, a fitting function of the hash bucket number and the total deduplication number of the data to be processed;

the training unit is used for training the fitting function through sample data to obtain a target numerical value of the fitting coefficient to be determined; the sample data includes the sample hash bucket number and the sample total deduplication number corresponding to the target scene;

and the preset fitting function generating unit is used for generating the preset fitting function according to the target numerical value of the fitting coefficient to be determined.

In an embodiment of the application, the training unit is specifically configured to:

randomly generating an initial value of the fitting coefficient to be determined;

calculating sample data through a fitting function with the undetermined fitting coefficient as an initial value to obtain the expected predicted quantity of the sample data;

and adjusting the initial value of the to-be-determined fitting coefficient according to the difference between the predicted quantity expectation of the sample data and the actual quantity expectation of the sample data until the difference is smaller than a preset threshold value, so as to obtain a target numerical value of the to-be-determined fitting coefficient.
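One possible reading of this training procedure is sketched below. It assumes a linear fitting function of the form a * (total deduplication number / hash bucket number) + b and uses plain gradient descent as the adjustment rule. Both choices are assumptions for illustration, since the application only states that the coefficient is adjusted until the difference falls below a preset threshold. The sample data is synthetic.

```python
import random

def train_fit(samples, lr=0.05, threshold=1e-6, max_steps=100_000, seed=0):
    """Fit coefficients (a, b) of the assumed form a * n / m + b.

    samples: list of (hash bucket number m, total deduplication number n,
    actual quantity expectation y).
    """
    rng = random.Random(seed)
    # Randomly generate initial values of the coefficients to be determined.
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(max_steps):
        # Predicted minus actual quantity expectation for every sample.
        errors = [(a * n / m + b) - y for m, n, y in samples]
        if sum(abs(e) for e in errors) / len(errors) < threshold:
            break  # difference is below the preset threshold
        # Gradient of the mean squared difference (swapped-in adjustment rule).
        grad_a = 2 * sum(e * n / m for e, (m, n, _y) in zip(errors, samples)) / len(samples)
        grad_b = 2 * sum(errors) / len(samples)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Synthetic samples generated from a = 3, b = 3.
samples = [(10, 5, 4.5), (10, 10, 6.0), (10, 20, 9.0), (10, 40, 15.0)]
a, b = train_fit(samples)
print(round(a, 3), round(b, 3))
```

The fixed seed only makes the random initial values reproducible; any starting point in this small example converges to coefficients close to the ones used to generate the samples.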

In one embodiment of the present application, the result generation module comprises:

the data processing unit is used for obtaining data matched with the set index in the plurality of data to be processed corresponding to each hash algorithm according to the difference between the target superposition data corresponding to each hash algorithm and the non-target data;

and the statistical unit is used for performing statistical processing on data matched with the set index in the plurality of data to be processed corresponding to the hash algorithms to obtain a data processing result corresponding to the target scene.

In an embodiment of the present application, the statistical unit is specifically configured to:

and calculating expected values of data matched with the set indexes in the plurality of data to be processed corresponding to the hash algorithms to serve as data processing results corresponding to the target scene.
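A minimal sketch of the result generation follows: for each hash algorithm, the non-target data is subtracted from the target superposed data, and the per-algorithm results are then combined. Taking the arithmetic mean as the "expected value" is an assumption of this sketch, and the numbers are hypothetical.

```python
def combine_results(targets: list[float], non_targets: list[float]) -> float:
    """Difference per hash algorithm, then the mean across hash algorithms."""
    per_algorithm = [t - n for t, n in zip(targets, non_targets)]
    return sum(per_algorithm) / len(per_algorithm)

# Hypothetical values for three hash algorithms.
result = combine_results([460, 455, 470], [30, 25, 34])
print(result)  # 432.0
```

Using several hash algorithms gives several independent estimates of the same quantity, so combining them can reduce the effect of an unlucky collision pattern in any single algorithm.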

According to an aspect of the embodiments of the present application, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements a data processing method as in the above technical solutions.

According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein, the processor executes the executable instruction to make the electronic device execute the data processing method in the technical scheme.

According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method as in the above technical scheme.

In the technical scheme provided by the embodiments of the application, hash operation processing is performed on each piece of data to be processed in a target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed, and the data to be processed is stored in a superposed manner according to the hash algorithm and the hash value. The hash algorithm thereby converts the data to be processed from data in the original value space into data occupying a smaller storage space, which greatly reduces the storage space required during data processing. Then, the superposed data corresponding to each hash algorithm is sorted, target superposed data matching the set index is acquired based on the sorting result, and non-target data is generated based on the superposed data other than the target superposed data. Finally, the data to be processed matching the set index is obtained according to the target superposed data and the non-target data, and a data processing result is generated. This is equivalent to performing large-scale data processing and calculation with a small amount of computing resources to obtain the corresponding index data, which saves the resources required for index calculation.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1A schematically shows an exemplary system architecture block diagram to which the solution of the present application is applied.

Fig. 1B schematically shows a schematic diagram of an application scenario of the technical solution of the present application.

Fig. 1C schematically shows another application scenario of the present technical solution.

Fig. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present application.

Fig. 3 schematically shows a flowchart of a data processing method according to an embodiment of the present application.

Fig. 4 schematically illustrates a diagram of a preset data list provided in an embodiment of the present application.

Fig. 5 schematically shows a diagram of an ordering result of the superimposed data provided by an embodiment of the present application.

Fig. 6 schematically shows a graph for constructing a preset fitting function by a linear fitting manner according to an embodiment of the present application.

Fig. 7 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application.

Fig. 8 schematically illustrates a block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

As shown in fig. 1A, system architecture 100 may include terminal device 110, network 120, and server 130. Terminal device 110 may include a smart phone, a tablet computer, a notebook computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and so on. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, such as a wired communication link or a wireless communication link.

The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, according to implementation needs. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by both the terminal device 110 and the server 130, which is not particularly limited in this application.

In an embodiment of the present application, the data processing method provided in the embodiment of the present application is implemented by the server 130. Specifically, the server 130 performs hash operation processing on each piece of data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed; each hash algorithm yields one hash value for a piece of data to be processed, and the data to be processed may be sent from the terminal device 110 to the server 130. Then, the server 130 superimposes the data to be processed on the data in the preset data list at the positions corresponding to the hash algorithms and the hash values according to the at least one hash value corresponding to the data to be processed, so as to obtain superimposed data corresponding to each hash algorithm; that is, the data to be processed is stored in the preset data list in a superimposed manner. Next, the server 130 sorts the superimposed data corresponding to each hash algorithm in the preset data list, obtains target superimposed data corresponding to each hash algorithm and matching the set index according to the sorting result, and generates non-target data corresponding to each hash algorithm according to the superimposed data other than the target superimposed data among the superimposed data corresponding to each hash algorithm. The set index is generally used to limit the selection range of the data; for example, the set index may indicate that the data ranked in the top five of the sorting result is to be selected. Finally, the server 130 calculates the data matching the set index among the plurality of data to be processed according to the target superimposed data and the non-target data, so as to obtain a data processing result corresponding to the target scene. The technical scheme of the application is thus equivalent to calculating on the superimposed stored form of the data to be processed in order to obtain the data in the original data to be processed that matches the set index. The server 130 may send the data processing result to the terminal device 110, so that the terminal device 110 visually displays the data processing result.

In an embodiment of the present application, the implementation process of the technical scheme is explained by taking a virtual resource transfer scenario as the target scenario, with the set index being the resource transfer amount of the accounts whose virtual resource transfer totals rank in the top three. In the virtual resource transfer scenario, one piece of data to be processed is one transfer record, which represents the amount of virtual resources transferred by one account in one settlement, and may be represented as (account identifier, resource transfer amount). For example, if a certain account identifier is 123 and the resource transfer amount is 100, the data to be processed is represented as (123, 100).

In the virtual resource transfer scenario, one account may perform multiple settlements, that is, it may have multiple transfer records, so one account may correspond to multiple pieces of data to be processed. For example, the data to be processed corresponding to a certain account includes (account identifier, resource transfer amount 1), (account identifier, resource transfer amount 2), (account identifier, resource transfer amount 3), and so on. The data to be processed in the virtual resource transfer scenario is recorded as: (account identifier 1, resource transfer amount 1), (account identifier 2, resource transfer amount 2) … (account identifier n, resource transfer amount n), where the numbering of the account identifiers is mainly used to distinguish transfer records, and account identifiers with different numbers may be the same account identifier. For example, fig. 1B schematically shows an application scenario of the technical solution of the present application. As shown in fig. 1B, on one hand, account 1 transfers a number of virtual resources to account 2, and this resource transfer forms a transfer record (account identifier 1, resource transfer amount 1). On the other hand, account 1 transfers a number of virtual resources to account 3, and this resource transfer forms a transfer record (account identifier 2, resource transfer amount 2). Both transfer records are data to be processed. The two records are distinguished by the labels account identifier 1 and account identifier 2, but account identifier 1 and account identifier 2 are the same account identifier: both are the account identifier of account 1.

First, the server 130 performs hash operation processing on each piece of data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed. Assuming that f(x) represents a hash algorithm, the hash value obtained by performing the hash operation on the data to be processed can be expressed as: f(account identifier) = hash value. If there are n hash algorithms f₁(x), f₂(x), …, fₙ(x), then for the data to be processed (account identifier 1, resource transfer amount 1): f₁(account identifier 1) = hash value 1, f₂(account identifier 1) = hash value 2, …, fₙ(account identifier 1) = hash value n. That is, each hash algorithm performs one hash operation on the data to be processed and yields one hash value, so at least one hash algorithm yields at least one hash value. Illustratively, as shown in fig. 1B, the data to be processed (account identifier 1, resource transfer amount 1) passes through the hash algorithms f₁(x), f₂(x), …, fₙ(x) to obtain n hash values; the data to be processed (account identifier 2, resource transfer amount 2) likewise passes through f₁(x), f₂(x), …, fₙ(x) to obtain n hash values.
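To make the role of several hash algorithms concrete, the sketch below builds n hash algorithms and applies them to two transfer records of the same account. Deriving the algorithms from one hash function with different seeds is an assumption of this sketch, as are the helper name `make_hash_algorithms`, the value n = 4 and the bucket number 16.

```python
import hashlib

def make_hash_algorithms(n: int, num_buckets: int):
    """Return n hash algorithms that differ only by a seed (an assumption)."""
    def make(seed: int):
        def f(key: str) -> int:
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big") % num_buckets
        return f
    return [make(seed) for seed in range(1, n + 1)]

algorithms = make_hash_algorithms(n=4, num_buckets=16)

# Two transfer records of the same account: (account identifier, amount).
record_1 = ("123", 100)
record_2 = ("123", 50)

# Only the key (the account identifier) is hashed, never the amount.
hashes_1 = [f(record_1[0]) for f in algorithms]
hashes_2 = [f(record_2[0]) for f in algorithms]
print(hashes_1 == hashes_2)  # True: same key, same n hash values
```

Since both records carry the same account identifier, every hash algorithm maps them to the same hash value, which is what allows their amounts to accumulate at the same position.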

Then, the server 130 superimposes the data to be processed on the data in the preset data list at the position corresponding to the hash algorithm and the hash value according to at least one hash value corresponding to the data to be processed, so as to obtain superimposed data corresponding to each hash algorithm. Illustratively, as shown in fig. 1B, if the hash value obtained for the data to be processed (account identifier 1, resource transfer amount 1) through the hash algorithm f₁(x) is 1, the resource transfer amount 1 is superimposed at the position indicated by (1, 1) in the preset data list. Assuming that the data stored at (1, 1) is 200 and the resource transfer amount 1 is 100, superimposing means adding the resource transfer amount 100 to the data 200 stored at (1, 1), so the resulting superimposed data is 300. For the data to be processed (account identifier 2, resource transfer amount 2), as shown in fig. 1B, since account identifier 2 is actually equal to account identifier 1, f₁(account identifier 2) = f₁(account identifier 1) = 1, so the resource transfer amount 2 has the same storage location as the resource transfer amount 1; that is, the resource transfer amount 2 is also superimposed at the position indicated by (1, 1) in the preset data list. Since the data at (1, 1) is 300 after the resource transfer amount 1 is stored, and assuming that the resource transfer amount 2 is 50, the superimposed data at (1, 1) is 350 after the resource transfer amount 2 is superimposed and stored.
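The arithmetic of this example can be reproduced with a small sketch. It models the preset data list as a mapping from (hash algorithm index, hash value) to the stored value; that representation and the function name `superpose` are assumptions of the sketch, and the hash value 1 is taken directly from the example rather than computed.

```python
from collections import defaultdict

# Preset data list modelled as {(hash algorithm index, hash value): stored value}.
preset_data_list = defaultdict(int)
preset_data_list[(1, 1)] = 200  # data already stored at position (1, 1)

def superpose(table, algorithm_index: int, hash_value: int, amount: int) -> int:
    """Add the amount to the value stored at the given position."""
    table[(algorithm_index, hash_value)] += amount
    return table[(algorithm_index, hash_value)]

after_first = superpose(preset_data_list, 1, 1, 100)   # resource transfer amount 1
after_second = superpose(preset_data_list, 1, 1, 50)   # resource transfer amount 2
print(after_first, after_second)  # 300 350
```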

Next, the server 130 sorts the superimposed data corresponding to each hash algorithm in the preset data list, obtains target superimposed data corresponding to each hash algorithm and matching the set index according to the sorting result, and generates non-target data corresponding to each hash algorithm according to the superimposed data other than the target superimposed data among the superimposed data corresponding to each hash algorithm. Continuing the previous example, the set index is the resource transfer amount of the accounts whose virtual resource transfer totals rank in the top three. For each hash algorithm, the target superimposed data is generated from the superimposed data ranked in the top three among the superimposed data corresponding to that hash algorithm. Because several accounts may be superimposed at the same position, the target superimposed data includes both the resource transfer amounts of the top-three accounts and the resource transfer amounts of accounts outside the top three. The non-target data is generated from the superimposed data other than the top three, and represents the resource transfer amounts of the accounts outside the top three.

Finally, the server 130 calculates the data matching the set index among the plurality of data to be processed according to the target superimposed data and the non-target data, so as to obtain a data processing result corresponding to the target scene. Subtracting the non-target data from the target superimposed data yields the resource transfer amount of the accounts whose virtual resource transfer totals rank in the top three, which is the data processing result. Illustratively, as shown in fig. 1B, the target superimposed data and the non-target data are acquired from the superimposed data stored in the preset data list, and the data processing result is then generated from the target superimposed data and the non-target data.

In an embodiment of the present application, the target scenario may also be a network access scenario, and the data to be processed may be a record of a website and the number of accesses corresponding to the website, represented as (website identifier, number of accesses). For example, if the IP address (Internet Protocol address) of a certain website is 1.2.3.4 and the number of accesses is 50, the data to be processed is represented as (1.2.3.4, 50). Fig. 1C schematically shows another application scenario of the technical solution of the present application. As shown in fig. 1C, the number of clicks of a website is counted by day. For website 1, the numbers of clicks on two days form two pieces of data to be processed: (website identifier 1, number of accesses 1) and (website identifier 2, number of accesses 2). Similarly, for website 2, the numbers of clicks on two days form two pieces of data to be processed: (website identifier 3, number of accesses 3) and (website identifier 4, number of accesses 4). Although website identifier 1 and website identifier 2 carry different numbers, the identification information they correspond to is the same: both are the identification information of website 1 (for example, both are the IP address of website 1). Similarly, although website identifier 3 and website identifier 4 carry different numbers, both correspond to the identification information of website 2 (for example, both are the IP address of website 2).

Suppose that the set index is the websites ranked in the top five by total number of visits. The server 130 then processes the website access data according to the data processing method provided in the embodiment of the present application to obtain the relevant data of the websites ranked in the top five by total number of visits. For the specific implementation process, refer to the description of the corresponding process in the virtual resource transfer scenario or to the description of the subsequent embodiments, which is not repeated here.

In an embodiment of the present application, the target scenario may also be an information recommendation scenario, and the data to be processed may be a record of a piece of information and the number of clicks corresponding to it; for example, the data to be processed is represented as (information identifier, number of clicks). Illustratively, if a certain information identifier is 111 and the number of clicks is 10, the data to be processed is represented as (111, 10). Suppose that the set index is the information ranked in the top ten by total number of clicks. The server 130 then processes the information click data according to the data processing method provided in the embodiment of the present application to obtain the relevant data of the information ranked in the top ten by total number of clicks. For the specific implementation process, refer to the description of the corresponding process in the virtual resource transfer scenario or to the description of the subsequent embodiments, which is not repeated here.

The data processing method provided by the present application is described in detail below with reference to specific embodiments.

Fig. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present application, and as shown in fig. 2, the method includes steps 210 to 240, which are specifically as follows:

Step 210, performing hash operation processing on each piece of data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed.

Specifically, the target scenario may be any scenario that requires batch data processing or analysis, such as a transfer scenario of virtual resources, a network access scenario, an information recommendation scenario, a market data analysis scenario, and the like. The data to be processed in the target scene is the original record data of the relevant information in the scene, for example, the data to be processed in the virtual resource transfer scene is the record data of each piece of resource transfer information, and the data to be processed in the network access scene is the record data of the access information of each website.

A hash algorithm maps data from a larger value space into a smaller one, so the data processed by the hash algorithm is more compact and occupies less storage space. Performing hash operation processing on each piece of data to be processed through at least one hash algorithm means that every piece of data to be processed is operated on by each of the hash algorithms. Processing a piece of data to be processed with one hash algorithm yields one hash value, so processing it with at least one hash algorithm yields at least one hash value.

In an embodiment of the present application, the data to be processed is data in key-value pair form, represented as (key, value), where key is the key of the data to be processed and value is its value; generally, the corresponding value can be found according to the key. In the embodiment of the application, performing hash operation processing on the data to be processed means applying the hash algorithm to the key of the data to be processed to obtain the corresponding hash value.

Step 220, according to at least one hash value corresponding to the data to be processed, superimposing the data to be processed on data in a preset data list at positions corresponding to the hash algorithm and the hash value to obtain superimposed data corresponding to each hash algorithm.

Specifically, after the hash value corresponding to the data to be processed is obtained, the data to be processed is stored in a preset data list in a superposition storage mode. The superposition storage refers to superposing the data to be processed and the data stored at the corresponding position in the preset list to obtain new stored data at the position.

The storage location of the data to be processed in the preset data list is determined jointly by the hash algorithm and the hash value. The preset data list may be regarded as a data storage table composed of a plurality of rows and a plurality of columns, where the hash algorithm determines one of the row and the column of the storage location of the data to be processed and the hash value determines the other. For example, a row of data in the preset data list corresponds to the same hash algorithm, so the hash algorithm number represents the row number of the storage location of the data to be processed; a column of data in the preset data list corresponds to the same hash value, so the hash value represents the column number of the storage location. Of course, the column number of the storage location may instead be represented by the hash algorithm number, and the row number by the hash value.

In an embodiment of the present application, the data to be processed is data in key-value pair form, and during superposition storage the value of the data to be processed is superimposed on the data at the corresponding storage location in the preset data list.

In one embodiment of the present application, fig. 4 schematically illustrates a preset data list. Assume there are m hash algorithms in total, denoted as $f_1(x), f_2(x), \ldots, f_m(x)$. The hash algorithm number represents the row number of the storage location of the data to be processed, the hash value represents the column number, and the storage location is represented as (row number, column number). The total number of columns of the preset data list is set in advance.

As shown in fig. 4, for the data to be processed (key, value): if $f_1(key) = 4$, one storage location of the data to be processed is (1, 4), and the value is superimposed on the data at row 1, column 4, that is, the data at position (1, 4) becomes its previous data + value. If $f_2(key) = 2$, one storage location is (2, 2), and the value is superimposed on the data at row 2, column 2, that is, the data at position (2, 2) becomes its previous data + value. If $f_m(key) = 5$, one storage location is (m, 5), and the value is superimposed on the data at row m, column 5, that is, the data at position (m, 5) becomes its previous data + value.

It can be seen that the data to be processed (key, value) is respectively stored in a row corresponding to each hash algorithm in the preset data list through the operation of m hash algorithms, and the data corresponding to each hash algorithm in the preset data list all includes the superimposed data (i.e., multiple rows of superimposed data) at multiple storage locations.

Step 230, sorting the superimposed data corresponding to each hash algorithm in the preset data list, acquiring, according to the sorting result, the target superimposed data corresponding to each hash algorithm and matching the set index, and generating the non-target data corresponding to each hash algorithm according to the superimposed data other than the target superimposed data among the superimposed data corresponding to each hash algorithm.

Specifically, the sorting orders the plurality of superimposed data corresponding to each hash algorithm, either from large to small or from small to large. The set index is generally used to limit the selection range of the data; for example, the set index is head index data of certain data. Illustratively, the set index is the sum of the visit amounts of the websites ranked in the Top3 by total number of visits, also expressed as the total visit amount of the Top3 websites by number of visits.

After the ordering, the plurality of superposed data corresponding to each hash algorithm are arranged according to the size, so that the target superposed data matched with the set index can be conveniently obtained from the plurality of superposed data corresponding to each hash algorithm. For example, a plurality of superimposed data corresponding to each hash algorithm are arranged from large to small, and then the superimposed data of the top3 is obtained according to the sorting result, so that the target superimposed data can be obtained.

In the embodiment of the present application, since the superimposed data at each storage location in the preset data list is the result of superimposing a plurality of pieces of data to be processed, the target superimposed data includes not only the data to be processed that matches the set index but also some data to be processed that does not match the set index, and the latter must be discarded during data processing. For example, when the data to be processed is website access data, the target superimposed data corresponding to each hash algorithm includes the sum of the access amounts of the websites ranked in the top 3 by total number of accesses, and also includes the sum of the access amounts of some websites that do not rank in the top 3. Illustratively, for the hash algorithm $f_1(x)$ and two different keys key1 and key2, there may be $f_1(key1) = f_1(key2)$, in which case the corresponding value1 and value2 are superimposed at the same storage location. If key1 is a website ranked in the top 3 by total number of accesses, the data to be processed matching the set index is value1, while the target superimposed data includes the superimposed data of that storage location, i.e., value1 + value2. Therefore, value2 needs to be removed from the target superimposed data to obtain the data to be processed, value1, that matches the set index.

The superimposed data other than the target superimposed data among the superimposed data corresponding to each hash algorithm does not match the set index. This data can therefore be used to estimate the data to be processed in the target superimposed data that does not match the set index, and the estimate is the non-target data.

Step 240, calculating the data matching the set index among the plurality of data to be processed according to the target superimposed data and the non-target data to obtain a data processing result corresponding to the target scene.

According to the above analysis, the non-target data is the data that needs to be discarded from the target superimposed data, so subtracting the non-target data from the target superimposed data yields the target data, which is the data to be processed that matches the set index.

The superposed data corresponding to each hash algorithm is obtained by performing hash operation on all to-be-processed data corresponding to the target scene, that is, the target data obtained according to one hash algorithm is actually to-be-processed data of the target scene matching the set index, that is, a data processing result corresponding to the target scene. Therefore, when a plurality of hash algorithms are used, one of the target data obtained by the respective hash algorithms can be selected as the data processing result corresponding to the target scene. Optionally, in order to improve the data processing accuracy, the target data obtained by each hash algorithm may also be subjected to statistical processing, and then a result of the statistical processing is used as a data processing result corresponding to the target scene, for example, a mean value of the target data obtained by each hash algorithm is used as a data processing result corresponding to the target scene.
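As a minimal sketch of the optional statistical processing described above, the following Python snippet averages the target data obtained by each hash algorithm. The function name `combine_estimates` and the three sample estimates are hypothetical and serve only as illustration; other statistics could be substituted for the mean.

```python
from statistics import mean


def combine_estimates(per_hash_estimates):
    """Combine the target data obtained by each hash algorithm into one result.

    The mean is used here because the text names it as one example of
    statistical processing; it is not the only possible choice.
    """
    if not per_hash_estimates:
        raise ValueError("at least one estimate is required")
    return mean(per_hash_estimates)


# Hypothetical target data produced by three hash algorithms.
estimates = [1180.0, 1215.0, 1205.0]
result = combine_estimates(estimates)
```

Averaging several rows dampens the effect of an unlucky collision pattern in any single hash algorithm.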

Fig. 3 schematically shows a flowchart of a data processing method provided in an embodiment of the present application, which is a further refinement of the above-described embodiment. As shown in fig. 3, the data processing method provided in the embodiment of the present application includes steps 310 to 390, which are specifically as follows:

Step 310, performing hash operation processing on the key of each piece of data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed.

Specifically, the data to be processed includes data in the form of key-value pairs, that is, the data to be processed is composed of a key (key) and a value (value), which is denoted as (key, value).

Illustratively, the data to be processed of the target scene includes:

$(key_1, value_1), (key_2, value_2), \ldots, (key_n, value_n)$

The subscripts of the keys differ, which indicates that the records are different pieces of data to be processed, but the specific values of the keys may be the same. For example, in a network access scenario, a key represents a website identifier and a value represents the number of website accesses; there may then exist $key_1 = key_2$, which means that $(key_1, value_1)$ and $(key_2, value_2)$ are both records for the same website, e.g., $(key_1, value_1)$ records the number of visits to website 1 in the previous hour and $(key_2, value_2)$ records the number of visits to website 1 in the next hour.

Generally, to process the data to be processed of a target scene, the values with the same key are accumulated to obtain an accumulated value, as shown in the following formula (1):

$$s_k = \sum_{i:\,key_i = k} v_i \tag{1}$$

where $v_i$ represents $value_i$ and $s_k$ represents the sum of the values of the data to be processed whose key is k. Formula (1) means that all values whose key equals k are summed.

The sum of the top N accumulated values (abbreviated as the TOPN accumulation) is then taken from the plurality of accumulated values and is recorded as:

$$\sum_{k \in \mathrm{TOPN}} s_k$$

where TOPN denotes the top N. The TOPN accumulation is the finally desired data processing result.
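For reference, the exact computation described above (accumulate values per key, then sum the N largest accumulated values) can be sketched as follows. This is the baseline that stores one accumulated value per key; the function name `exact_topn_sum` and the sample records are hypothetical.

```python
from collections import defaultdict


def exact_topn_sum(records, n):
    """Return the TOPN accumulation and the per-key accumulated values."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value  # accumulate all values that share a key
    largest = sorted(totals.values(), reverse=True)[:n]
    return sum(largest), dict(totals)


# Hypothetical website access records in (key, value) form.
records = [("site1", 50), ("site2", 30), ("site1", 20), ("site3", 5), ("site4", 40)]
topn, totals = exact_topn_sum(records, 2)
```

This baseline needs storage proportional to the number of distinct keys, which is the cost the method in this embodiment is designed to avoid.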

In an embodiment of the present application, a hash algorithm includes a hash function operation and a modulo operation, where the hash function operations of different hash algorithms are different and the modulo operations are the same. The specific calculation process of a hash value includes: performing hash calculation on the key of each piece of data to be processed in the target scene through at least one hash function operation to obtain the hash result corresponding to each piece of data to be processed; and taking the hash result corresponding to each piece of data to be processed modulo the preset hash bucket number to obtain at least one hash value corresponding to each piece of data to be processed.

The hash functions are denoted as $h_1(x), h_2(x), \ldots, h_m(x)$, and the modulo operation with respect to the preset hash bucket number is denoted as mod b, where b represents the preset hash bucket number; for the definition of the preset hash bucket number, refer to the relevant description in step 320. The hash value is calculated as shown in the following formula (2):

$$\delta_{i,j} = h_j(key_i) \bmod b \tag{2}$$

where $h_j(key_i)$ represents the hash result obtained after the j-th hash function performs hash calculation on the key $key_i$ of the i-th data to be processed, and $\delta_{i,j}$ represents the hash value obtained by taking that hash result modulo the preset hash bucket number b; $\delta_{i,j}$ is also called a sketch bit. Therefore, the hash value corresponding to the data to be processed is a non-negative integer smaller than the preset hash bucket number b, so it always identifies one of the b hash buckets.

Since the hash functions of the hash algorithms are different and the modulo operation is the same, different hash algorithms can be represented by their hash functions. For example, the hash function $h_1(x)$ represents the hash algorithm $f_1(x)$.
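The hash value calculation of formula (2) can be sketched in Python as follows. The source does not name concrete hash functions, so the use of salted BLAKE2b digests from `hashlib` as the m different hash functions is an assumption made only for illustration, as are the function names.

```python
import hashlib


def hash_value(key: str, j: int, b: int) -> int:
    """Hash value of `key` under the j-th hash function, modulo b buckets.

    Assumption: the j-th hash function is BLAKE2b with a salt derived from j.
    """
    salt = j.to_bytes(2, "big")  # a distinct salt gives a distinct hash function
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") % b


def hash_values(key: str, m: int, b: int) -> list:
    """Hash values of `key` under hash functions 1..m."""
    return [hash_value(key, j, b) for j in range(1, m + 1)]


values = hash_values("1.2.3.4", m=3, b=16)
```

Each entry of `values` lies in the range 0 to b - 1 and selects the bucket for the corresponding hash algorithm.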

Step 320, according to the at least one hash value corresponding to the data to be processed, superimposing the value of the data to be processed on the data at the position corresponding to the hash algorithm and the hash value in the preset data list to obtain the superimposed data corresponding to each hash algorithm.

Specifically, the preset data list is regarded as a data storage table formed by a plurality of rows and a plurality of columns. The number of rows of the preset data list is determined by the number of hash algorithms, and the number of columns is determined by the preset hash bucket number (of course, the number of columns may instead be determined by the number of hash algorithms and the number of rows by the preset hash bucket number, which is not described in detail here). The preset hash bucket number refers to the number of hash buckets set in advance. One hash bucket is equivalent to a storage linked list corresponding to a fixed hash value, and the storage linked list stores the superimposed data obtained through the processing of each hash algorithm. Since the storage space occupied by one hash bucket is fixed, the preset hash bucket number actually reflects the size of the storage space occupied by the (initially empty) preset data list.

For example, fig. 4 schematically illustrates a preset data list. In this preset data list structure, the preset hash bucket number is b, which means that the preset data list has b columns in total; there are m hash algorithms, denoted as $f_1(x), f_2(x), \ldots, f_m(x)$, which means that the preset data list has m rows.

In one embodiment of the present application, the calculation formula of the superimposed data is shown as the following formula (3):

$$hs_{j,l} = \sum_{i:\,\delta_{i,j} = l} v_i \tag{3}$$

where $v_i$ represents the value of the i-th data to be processed, $\delta_{i,j}$ represents the hash value of its key under the j-th hash algorithm, and $hs_{j,l}$ represents the accumulated data of the j-th hash algorithm for hash value l, which is equivalent to the superimposed data at the j-th row and the l-th column of the preset data list. Formula (3) means that the values of all data to be processed whose hash value under the j-th hash algorithm is l are summed.

For example, for the data to be processed (key, value) shown in fig. 4: $f_1(key) = 4$, so value is added to position (1, 4); $f_2(key) = 2$, so value is added to position (2, 2); $f_m(key) = 5$, so value is added to position (m, 5).

The data storage mode only stores intermediate results (namely, superposed data) without storing all the data to be processed one by one, so that the effect of calculating a large amount of head index data through a small amount of intermediate results (equivalent to limited storage resources) is achieved.
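The superposition storage of step 320 can be sketched as an m-row, b-column table of running sums. The salted BLAKE2b hash functions, the zero-based row and column indices, and the names `bucket` and `build_table` are assumptions made for this illustration; the source only requires m different hash functions followed by the same modulo operation.

```python
import hashlib


def bucket(key: str, j: int, b: int) -> int:
    """Column index of `key` in row j (assumed hash: salted BLAKE2b, mod b)."""
    salt = j.to_bytes(2, "big")
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") % b


def build_table(records, m: int, b: int):
    """Superimpose every value onto one bucket per hash algorithm."""
    table = [[0] * b for _ in range(m)]  # the preset data list, initially empty
    for key, value in records:
        for j in range(m):
            table[j][bucket(key, j, b)] += value
    return table


# Hypothetical website access records in (key, value) form.
records = [("1.2.3.4", 50), ("5.6.7.8", 30), ("1.2.3.4", 20)]
table = build_table(records, m=3, b=8)
```

The table size is fixed at m times b entries regardless of how many records are processed, and every row sums to the total of all values.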

Step 330, sorting the superimposed data corresponding to each hash algorithm in the preset data list, and acquiring, according to the sorting result, the superimposed data stored in the set number of hash buckets corresponding to each hash algorithm and matching the set index.

Specifically, this step mainly extracts the superimposed data in the TOPN hash buckets from the superimposed data corresponding to each hash algorithm. Illustratively, in the preset data list shown in fig. 4, each cell in a row corresponds to one hash bucket of the corresponding hash algorithm. The superimposed data in the TOPN hash buckets is the superimposed data of the first N hash buckets in the sorted order.

For example, fig. 5 schematically shows a diagram of an ordering result of a plurality of superimposed data corresponding to a certain hash algorithm. As shown in fig. 5, each bar represents the overlay data of one hash bucket, and assuming that the TOPN is TOP3, the overlay data of 3 hash buckets matching the set index is the overlay data represented by the 3 bars within the dashed box in fig. 5.

Step 340, summing the superimposed data stored in the set number of hash buckets to obtain the target superimposed data corresponding to each hash algorithm.

Specifically, the superimposed data extracted from the TOPN hash buckets are summed to obtain the target superimposed data, which can be represented as:

$$\sum_{l \in \mathrm{TOPN}} hs_l$$

where $hs_l$ represents the superimposed data in the l-th hash bucket.

Illustratively, in the sorting result shown in fig. 5, the target overlay data corresponding to TOP3 is the sum of the data of 3 bars within the dashed box.

In the bar chart shown in fig. 5, each bar is formed of at least one square. One square represents the sum of all values corresponding to one key; taking the website access scene as an example, one square represents the total number of accesses of one website. Different squares correspond to different keys. In fig. 5, the sum of the data corresponding to the 3 squares with the largest area within the dashed box (shown by the shaded portion in fig. 5) is the finally required total access amount of the TOP3 websites by number of accesses, i.e., the TOPN accumulation. The data corresponding to the remaining squares within the dashed box is the non-target data.
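Steps 330 and 340 for a single hash algorithm can be sketched as sorting one row of the preset data list and splitting it into the N largest buckets and the rest. The function name `split_row` and the bucket values are hypothetical.

```python
def split_row(row, n):
    """Return (target superimposed data, sum of the other superimposed data)."""
    ordered = sorted(row, reverse=True)  # arrange buckets from large to small
    target_superimposed = sum(ordered[:n])  # the TOPN hash buckets
    other_superimposed = sum(ordered[n:])  # all remaining hash buckets
    return target_superimposed, other_superimposed


# Hypothetical superimposed data of one hash algorithm with 8 hash buckets.
row = [12, 95, 7, 60, 3, 81, 10, 4]
target, other = split_row(row, 3)
```

Here the three largest buckets are 95, 81 and 60, so `target` is 236 and `other` is 36.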

Step 350, generating the numerical expectation of the non-target data according to the superimposed data other than the target superimposed data among the superimposed data corresponding to the hash algorithm and the quantity of data to be processed corresponding to the other superimposed data.

Specifically, among the superimposed data corresponding to the hash algorithm, the data other than the target superimposed data is the other superimposed data, such as the data represented by the bars outside the dashed box in fig. 5. In the present application, the numerical expectation of the non-target data is calculated from the other superimposed data and the corresponding quantity of data to be processed; the numerical expectation of the non-target data refers to the expectation of the value in the non-target data.

The following takes a hash algorithm as an example to describe the calculation process of the non-target data. The calculation process of the numerical expectation of the non-target data comprises the following steps: summing other superposed data except the target superposed data in the superposed data corresponding to the Hash algorithm to obtain non-target superposed data corresponding to the Hash algorithm; determining the duplication eliminating number of the data to be processed corresponding to other superposed data according to the data amount to be processed corresponding to other superposed data; and obtaining the numerical expectation of the non-target data according to the ratio of the duplication removal numbers of the data to be processed corresponding to the non-target superposed data and other superposed data.

Specifically, taking the sum of the other superimposed data as the non-target superimposed data, the non-target superimposed data may be expressed as:

$$\sum_{l \notin \mathrm{TOPN}} hs_l$$

where $hs_l$ represents the superimposed data stored in the l-th hash bucket. When the l-th hash bucket does not belong to the TOPN hash buckets, its superimposed data is other superimposed data, and the superimposed data of all hash buckets that do not belong to the TOPN hash buckets are summed to obtain the non-target superimposed data.

The duplication removal number of the data to be processed refers to the number of the data to be processed after the duplication removal processing is performed according to the key, namely the number of the keys after the key duplication removal, and the duplication removal number of the data to be processed is equivalent to the type of the key. The duplication elimination number of the data to be processed corresponding to the other superimposed data is the number of keys obtained after the duplication elimination of the keys in the total amount of the data to be processed corresponding to the other superimposed data. For example, 100 pieces of data to be processed correspond to 100 keys, and 50 keys are obtained after the keys are deduplicated, so that the deduplication number of the data to be processed is 50.

In an embodiment of the present application, a calculation method of the deduplication number of the to-be-processed data corresponding to the other superimposed data is as follows: according to keys of the data to be processed, carrying out duplicate removal processing on a plurality of data to be processed in the target scene to obtain the total duplicate removal amount of the data to be processed; according to the key of the data to be processed, carrying out duplication elimination processing on a plurality of data to be processed corresponding to the target superposition data to obtain duplication elimination numbers of the data to be processed corresponding to the target superposition data; and obtaining the duplication removing number of the data to be processed corresponding to other superposition data according to the difference value between the duplication removing total number of the data to be processed and the duplication removing number of the data to be processed corresponding to the target superposition data.

That is, the deduplication number of the data to be processed corresponding to the target superimposed data, which is the complement of the other superimposed data, is obtained first, and is then subtracted from the total deduplication number of the data to be processed in the target scene to obtain the deduplication number of the data to be processed corresponding to the other superimposed data. The reason is that the deduplication number corresponding to the other superimposed data is usually greater than that corresponding to the target superimposed data (the target superimposed data is the TOPN data, whereas the other superimposed data is everything outside the TOPN). Deduplicating the data to be processed corresponding to the target superimposed data therefore involves less data than directly deduplicating the data to be processed corresponding to the other superimposed data, which increases the data processing speed.

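For example, the numerical expectation of the non-target data is calculated as shown in the following formula (4):

$$E(v) = \frac{\sum_{l \notin \mathrm{TOPN}} hs_l}{K - K_{\mathrm{TOPN}}} \tag{4}$$

where $E(v)$ represents the numerical expectation of the non-target data, and $hs_l$ represents the superimposed data stored in the l-th hash bucket;

K represents the total deduplication number of the data to be processed in the target scene;

$K_{\mathrm{TOPN}}$ represents the deduplication number of the data to be processed corresponding to the target superimposed data, namely the deduplication number of the keys in the TOPN hash buckets;

and $K - K_{\mathrm{TOPN}}$ represents the deduplication number of the data to be processed corresponding to the other superimposed data, namely the deduplication number of the keys in the hash buckets other than the TOPN hash buckets.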
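The ratio described above can be sketched for one hash algorithm as follows. To keep the example deterministic, the key-to-bucket assignment is passed in as a plain dictionary instead of being computed by a hash function; that dictionary, the function name `value_expectation`, and the sample records are hypothetical.

```python
def value_expectation(records, bucket_of, n):
    """Numerical expectation of the non-target data for one hash algorithm."""
    sums = {}
    for key, value in records:
        l = bucket_of[key]
        sums[l] = sums.get(l, 0) + value  # superimposed data per hash bucket
    top_buckets = set(sorted(sums, key=sums.get, reverse=True)[:n])
    other_superimposed = sum(s for l, s in sums.items() if l not in top_buckets)

    all_keys = {key for key, _ in records}
    k_total = len(all_keys)  # total deduplication number
    k_top = len({key for key in all_keys if bucket_of[key] in top_buckets})
    k_other = k_total - k_top  # obtained by subtraction, as in the text
    return other_superimposed / k_other if k_other else 0.0


# Hypothetical records and bucket assignment of one hash algorithm.
records = [("a", 100), ("b", 90), ("c", 4), ("d", 6), ("e", 2), ("a", 10)]
bucket_of = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 3}
expectation = value_expectation(records, bucket_of, n=2)
```

In this example the TOP2 buckets hold keys a, b and c, the other buckets hold d and e with a combined value of 8, and the resulting expectation is 8 / 2 = 4.0.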

Step 360, calculating the quantity expectation of the non-target data according to the quantity of data to be processed corresponding to the target superimposed data.

Specifically, the quantity expectation of the non-target data refers to the expectation of the quantity of data to be processed corresponding to the non-target data. It is obtained as the difference between the quantity expectation of the target superimposed data and the value of the set index, that is, the quantity expectation of the TOPN hash buckets minus N, as shown in the following formula (5):

$$E(n_{\text{non}}) = E(n_{\mathrm{TOPN}}) - N \tag{5}$$

where $E(n_{\text{non}})$ represents the quantity expectation of the non-target data, $E(n_{\mathrm{TOPN}})$ represents the quantity expectation of the target superimposed data, and N represents the specific numerical value of the set index.

In an embodiment of the application, the quantity expectation of the target superimposed data is generated according to the preset hash bucket number, the total deduplication number of the data to be processed corresponding to the target scene, and a preset fitting function. Illustratively, K represents the total deduplication number of the data to be processed corresponding to the target scene and b represents the preset hash bucket number, so the preset fitting function can be expressed as a function of K and b.

Substituting the total deduplication number of the data to be processed and the preset hash bucket number into the preset fitting function yields the quantity expectation of the target superimposed data.

In one embodiment of the present application, the generating process of the preset fitting function includes: according to a preset hash bucket number, the data deduplication sum to be processed corresponding to the target scene and the fitting coefficient to be determined, constructing a fitting function related to the hash bucket number and the data deduplication sum to be processed; training the fitting function through sample data to obtain a numerical value of a fitting coefficient to be determined; and generating a preset fitting function according to the preset Hash bucket number, the data deduplication total number corresponding to the target scene and the numerical value of the fitting coefficient to be determined.

Illustratively, the representation of the preset fitting function is shown in the following equation (6):

where K represents the total deduplication number of the data to be processed corresponding to the target scene; b represents the hash bucket number; and $a_1, a_2, a_3$ are the fitting coefficients to be determined. The fitting coefficients to be determined can be obtained by fitting the fitting function to the sample data.

In one embodiment of the present application, the fitting calculation of the sample data to the fitting function comprises: randomly generating an initial value of a fitting coefficient to be determined; calculating the sample data through a fitting function with the undetermined fitting coefficient as an initial value to obtain the expected predicted quantity of the sample data; and adjusting the initial value of the fitting coefficient to be determined according to the difference between the predicted quantity expectation of the sample data and the actual quantity expectation of the sample data until the difference is smaller than a preset threshold value, and obtaining the numerical value of the fitting coefficient to be determined.

First, initial values of the fitting coefficients to be determined, $a_1, a_2, a_3$, are generated. The sample data is then substituted into the fitting function whose coefficients take these initial values, and the predicted quantity expectation corresponding to the sample data is calculated. The sample data comprises a sample hash bucket number and a sample total deduplication number corresponding to the target scene; that is, the initial values of the fitting coefficients to be determined, the sample hash bucket number, and the sample total deduplication number are substituted into formula (6) to obtain the predicted quantity expectation corresponding to the sample data. In the construction stage of the preset fitting function, the actual quantity expectation corresponding to the sample data is known. The values of the fitting coefficients to be determined are adjusted according to the difference between the predicted quantity expectation and the actual quantity expectation of the sample data; the sample data is then evaluated with the adjusted fitting function to obtain a new predicted quantity expectation, the difference is recalculated, and the coefficients are adjusted again according to that difference. The calculation is repeated until the difference is smaller than a preset threshold, which gives the target values of the fitting coefficients to be determined. The preset fitting function is obtained based on these target values.
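The iterative fitting procedure described above can be sketched as a plain gradient-descent loop. The form of the fitting function is not reproduced in this passage, so the linear form `a1 * x1 + a2 * x2 + a3` used below is purely an assumption, as are the scaled features `x1` and `x2` (standing in for the sample total deduplication number and the sample hash bucket number), the synthetic sample values, the learning rate, and all function names.

```python
import random


def predict(coeffs, x1, x2):
    """Hypothetical linear fitting function with three coefficients."""
    a1, a2, a3 = coeffs
    return a1 * x1 + a2 * x2 + a3


def mean_squared_error(coeffs, samples):
    return sum((predict(coeffs, x1, x2) - y) ** 2 for x1, x2, y in samples) / len(samples)


def fit(samples, threshold=1e-6, learning_rate=0.1, max_iters=20000, seed=0):
    rng = random.Random(seed)
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # random initial values
    initial_error = mean_squared_error(coeffs, samples)
    for _ in range(max_iters):
        if mean_squared_error(coeffs, samples) < threshold:
            break  # predicted and actual expectations are close enough
        grads = [0.0, 0.0, 0.0]
        for x1, x2, y in samples:
            diff = predict(coeffs, x1, x2) - y  # predicted minus actual
            grads[0] += 2 * diff * x1 / len(samples)
            grads[1] += 2 * diff * x2 / len(samples)
            grads[2] += 2 * diff / len(samples)
        coeffs = [c - learning_rate * g for c, g in zip(coeffs, grads)]
    return coeffs, initial_error, mean_squared_error(coeffs, samples)


# Synthetic samples (scaled K, scaled b, actual quantity expectation),
# chosen only for illustration.
samples = [(0.2, 0.1, 5.0), (0.5, 0.1, 8.0), (0.5, 0.2, 6.0), (1.0, 0.2, 9.0), (1.0, 0.5, 7.0)]
coeffs, initial_error, final_error = fit(samples)
```

The loop stops either when the error falls below the threshold or after `max_iters` adjustments; the iteration cap is an addition of this sketch to guarantee termination.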

The target values of the fitting coefficients to be determined, the preset hash bucket number and the deduplication total number of the data to be processed corresponding to the target scene are then substituted into formula (6) to obtain the quantity expectation of the target superimposed data in the embodiment of the application.

Illustratively, fig. 6 schematically shows a graph in which the preset fitting function is constructed by linear fitting. It can be seen that the difference between the fitted values of the preset fitting function and the actual values is very small; therefore, the fitted value of the preset fitting function can be used as the quantity expectation of the target superimposed data.

Step 370, generating the non-target data corresponding to the hash algorithm according to the product of the numerical expectation of the non-target data and the quantity expectation of the non-target data.

Specifically, the numerical expectation of the non-target data is multiplied by the quantity expectation of the non-target data to obtain the non-target data, as shown in the following formula (7):

Here, the non-target data represents the sum of the data to be processed in the target superimposed data that does not match the set index, that is, the sum of the data to be processed that falls into the TOPN hash buckets but does not itself belong to the TOPN.

Step 380, obtaining data matched with the set index in the plurality of data to be processed corresponding to each hash algorithm according to the difference between the target superposition data corresponding to each hash algorithm and the non-target data.

Specifically, the non-target data is subtracted from the target superimposed data to obtain the target data matched with the set index, that is, the sum of the TOPN data to be processed (the TOPN accumulation).
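For a single hash algorithm, steps 370 and 380 can be sketched as below. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, the set index is taken to be the N largest values (so the buckets are sorted in descending order), and the fitted quantity expectation of the target superimposed data (`expected_keys_in_top_buckets`) is also used as the deduplication number of the data falling into the TOPN hash buckets.

```python
def estimate_topn_sum(buckets, n, dedup_total, expected_keys_in_top_buckets):
    """Estimate the TOPN accumulation for one hash algorithm.

    buckets: superimposed data of every hash bucket for this hash algorithm.
    n: value of the set index (the N in TOPN).
    dedup_total: deduplication total number of the data to be processed.
    expected_keys_in_top_buckets: quantity expectation of the target
        superimposed data, e.g. taken from the preset fitting function.
    """
    ordered = sorted(buckets, reverse=True)
    # Target superimposed data: sum over the N largest hash buckets.
    target_superimposed = sum(ordered[:n])
    # Non-target superimposed data: sum over the remaining hash buckets.
    non_target_superimposed = sum(ordered[n:])
    # Deduplication number of the data behind the remaining buckets
    # (assumption: total minus the fitted count for the top buckets).
    other_keys = dedup_total - expected_keys_in_top_buckets
    # Numerical expectation of one non-target datum.
    value_expectation = non_target_superimposed / other_keys
    # Quantity expectation of non-target data inside the top buckets.
    count_expectation = expected_keys_in_top_buckets - n
    # Step 370: non-target data is the product of the two expectations.
    non_target = value_expectation * count_expectation
    # Step 380: subtract the non-target data from the target superimposed data.
    return target_superimposed - non_target
```

With buckets `[100, 50, 10, 6, 4]`, N = 2, a deduplication total of 12 and a fitted count of 4, the two largest buckets sum to 150, the estimated non-target share is 2.5 × 2 = 5, and the sketch returns 145.0.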

Step 390, performing statistical processing on data matched with the set index in the multiple pieces of data to be processed corresponding to each hash algorithm to obtain a data processing result corresponding to the target scene.

Specifically, the expectation is taken over the TOPN accumulations of the individual hash algorithms to obtain the TOPN accumulation of the data to be processed in the target scene, as shown in the following formula (8):

where H represents the total number of hash functions, which is equivalent to the total number of hash algorithms. E_{h∈H} indicates the expectation over the results of all hash algorithms, and the content in the parentheses following E_{h∈H} is the calculation performed for a single hash algorithm.
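The statistical processing of step 390 can be illustrated with a short sketch. It assumes that the expectation over the hash algorithms is approximated by the arithmetic mean of the per-algorithm results; the function name and the list-based inputs are hypothetical.

```python
from statistics import mean


def topn_accumulation(target_superimposed, non_target):
    """Combine the per-hash-algorithm results into one estimate.

    target_superimposed[h] and non_target[h] are the target superimposed
    data and the non-target data of hash algorithm h.
    """
    if len(target_superimposed) != len(non_target):
        raise ValueError("one entry per hash algorithm is required in both lists")
    # Per-algorithm TOPN accumulation, then the mean across all algorithms
    # (assumption: the expectation is approximated by the arithmetic mean).
    return mean(t - u for t, u in zip(target_superimposed, non_target))
```

For example, `topn_accumulation([150, 160, 155], [5, 12, 9])` averages the three per-algorithm differences 145, 148 and 146.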

According to the technical scheme, the data to be processed is stored in superimposed form in a preset number of hash buckets. Because the storage space occupied by the hash buckets is fixed, this storage mode does not need to store the data to be processed one by one; only the superimposed data obtained through intermediate calculation is stored, so that a large amount of head index data can be calculated with limited storage resources. Meanwhile, no data to be processed is discarded while the superimposed data is stored, which is equivalent to preserving the global information of the data to be processed. Data to be processed that is more dispersed but has larger superimposed data can therefore also be taken into account, so that the data used for index calculation is more comprehensive and the calculation precision is not reduced even though the occupation of storage resources is reduced.

It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

The following describes embodiments of an apparatus of the present application, which may be used to perform the data processing method in the above-described embodiments of the present application. Fig. 7 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 7, the data processing apparatus provided in the embodiment of the present application includes:

the hash operation module 710 is configured to perform hash operation processing on each piece of data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed;

a data superposition module 720, configured to superimpose the to-be-processed data onto data in a preset data list at positions corresponding to the hash algorithm and the hash value according to at least one hash value corresponding to the to-be-processed data, so as to obtain superimposed data corresponding to each hash algorithm;

the data calculation module 730 is configured to sort the superimposed data corresponding to each hash algorithm in the preset data list, obtain target superimposed data corresponding to each hash algorithm and matching a set index according to a sorting result, and generate non-target data corresponding to each hash algorithm according to other superimposed data except the target superimposed data in the superimposed data corresponding to each hash algorithm;

and a result generating module 740, configured to calculate, according to the target superposition data and the non-target data, data that is matched with the set index in the multiple pieces of data to be processed, so as to obtain a data processing result corresponding to the target scene.

In one embodiment of the application, the data to be processed comprises data in the form of key-value pairs; the hash operation module 710 is specifically configured to: carrying out hash operation processing on keys of each data to be processed in the target scene through at least one hash algorithm;

the data superposition module 720 is specifically configured to: and according to at least one hash value corresponding to the data to be processed, superposing the value of the data to be processed to data at a position corresponding to the hash algorithm and the hash value in a preset data list.
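The cooperation of the hash operation module 710 and the data superposition module 720 for key-value data can be sketched as follows. This is an illustrative sketch only: the use of SHA-256 with a per-algorithm salt to obtain several hash algorithms, the representation of the preset data list as one row of buckets per hash algorithm, and all names are assumptions rather than details given in the application.

```python
import hashlib


def bucket_index(key, salt, bucket_count):
    # Hash function operation on the key (salted so that each salt acts as a
    # different hash algorithm), followed by a modulo operation on the
    # preset hash bucket number.
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count


def superimpose(records, num_hashes, bucket_count):
    """Superimpose the value of each (key, value) record into the data list.

    Returns a list with one row per hash algorithm; every row holds
    bucket_count accumulated values.
    """
    table = [[0] * bucket_count for _ in range(num_hashes)]
    for key, value in records:
        for h in range(num_hashes):
            # Add the value to the position selected by algorithm h and the hash value.
            table[h][bucket_index(key, h, bucket_count)] += value
    return table
```

Because each record is added exactly once per hash algorithm, every row of the returned table sums to the total of all values, while the storage used depends only on `num_hashes` and `bucket_count`, not on the number of records.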

In one embodiment of the present application, the hash algorithm includes a hash function operation and a modulo operation; the hash operation module 710 includes:

In an embodiment of the application, the overlay data corresponding to the hash algorithm includes overlay data stored in a plurality of hash buckets, and one hash bucket represents one storage location in the preset data list; the data calculation module 730 includes:

In one embodiment of the present application, the data calculation module 730 includes:

and the non-target data calculation unit is used for generating the non-target data corresponding to the hash algorithm according to the product of the numerical expectation of the non-target data and the quantity expectation of the non-target data.

and the numerical expectation calculation subunit is used for obtaining the numerical expectation of the non-target data according to the ratio of the deduplication numbers of the data to be processed corresponding to the non-target superposition data and the other superposition data.

and obtaining the duplication removing number of the data to be processed corresponding to other superposition data according to the difference value between the duplication removing total number of the data to be processed and the duplication removing number of the data to be processed corresponding to the target superposition data.

In one embodiment of the present application, the number expectation calculation unit includes:

the fitting function construction unit is used for constructing a fitting function related to the hash bucket number and the data deduplication total number to be processed according to the fitting coefficient to be determined;

In one embodiment of the present application, the result generation module 740 includes:

The specific details of the data processing apparatus provided in each embodiment of the present application have been described in detail in the corresponding method embodiment, and are not described herein again.

Fig. 8 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the present application.

It should be noted that the computer system 800 of the electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. Various programs and data necessary for system operation are also stored in the RAM 803. The CPU 801, the ROM 802 and the RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

The following components are connected to the input/output interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a local area network card or a modem. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the input/output interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.

In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When executed by the central processing unit 801, the computer program performs various functions defined in the system of the present application.

It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A data processing method, comprising:

2. The data processing method of claim 1, wherein the data to be processed comprises data in the form of key-value pairs;

the hash operation processing of each data to be processed in the target scene through at least one hash algorithm includes: carrying out hash operation processing on keys of each data to be processed in the target scene through at least one hash algorithm;

the superimposing, according to the at least one hash value corresponding to the data to be processed, the data to be processed onto data in a preset data list at a position corresponding to the hash algorithm and the hash value includes: and according to at least one hash value corresponding to the data to be processed, superposing the value of the data to be processed to data at a position corresponding to the hash algorithm and the hash value in a preset data list.

3. The data processing method of claim 2, wherein the hash algorithm comprises a hash function operation and a modulo operation; the performing hash operation processing on the key of each piece of data to be processed in the target scene through at least one hash algorithm to obtain at least one hash value corresponding to each piece of data to be processed includes:

performing hash calculation on the key of each piece of data to be processed in the target scene through at least one hash function operation to obtain at least one hash result of each piece of data to be processed;

performing a modulo operation on at least one hash result of each piece of data to be processed with respect to a preset hash bucket number, and taking the result of the modulo operation as at least one hash value corresponding to each piece of data to be processed; and the preset hash bucket number is used for indicating the size of the storage space occupied by the preset data list.

4. The data processing method according to claim 1, wherein the superimposed data corresponding to the hash algorithm includes superimposed data stored in a plurality of hash buckets, and one hash bucket represents one storage location in the preset data list; acquiring target superposition data which correspond to the hash algorithms and are matched with set indexes according to the sorting result, wherein the target superposition data comprises the following steps:

acquiring the superposed data stored in a set number of hash buckets corresponding to the hash algorithms and matched with set indexes according to the sorting result;

and summing the superposed data stored in the hash buckets with the set number to obtain target superposed data corresponding to each hash algorithm.

5. The data processing method according to claim 1, wherein generating non-target data corresponding to each hash algorithm according to other superimposed data except the target superimposed data in the superimposed data corresponding to each hash algorithm comprises:

generating a numerical expectation of non-target data according to other superimposed data except the target superimposed data in the superimposed data corresponding to the hash algorithm and the data amount to be processed corresponding to the other superimposed data;

calculating the quantity expectation of non-target data according to the data quantity to be processed corresponding to the target superposition data;

and generating the non-target data corresponding to the hash algorithm according to the product of the numerical value expectation of the non-target data and the quantity expectation of the non-target data.

6. The data processing method according to claim 5, wherein generating a numerical expectation of non-target data according to other superimposed data except the target superimposed data in the superimposed data corresponding to the hash algorithm and the amount of data to be processed corresponding to the other superimposed data comprises:

summing other superposed data except the target superposed data in the superposed data corresponding to the hash algorithm to obtain non-target superposed data corresponding to the hash algorithm;

determining the duplication eliminating number of the data to be processed corresponding to the other superposed data according to the data amount to be processed corresponding to the other superposed data;

and obtaining the numerical expectation of the non-target data according to the ratio of the duplication removing numbers of the data to be processed corresponding to the non-target superposed data and the other superposed data.

7. The data processing method of claim 6, wherein the data to be processed comprises data in the form of key-value pairs; determining the duplication eliminating number of the data to be processed corresponding to the other superimposed data according to the data amount to be processed corresponding to the other superimposed data, including:

8. The data processing method of claim 5, wherein calculating the quantity expectation of the non-target data according to the data amount to be processed corresponding to the target superposition data comprises:

generating the quantity expectation of the target superimposed data according to a preset hash bucket number, the data deduplication total number corresponding to the target scene and a preset fitting function;

and obtaining the quantity expectation of the non-target data according to the difference between the quantity expectation of the target superposed data and the value of the set index.

9. The data processing method according to claim 8, wherein before generating the desired amount of target superimposed data according to a preset hash bucket number, a total deduplication count of to-be-processed data corresponding to the target scene, and a preset fitting function, the method further comprises:

according to the fitting coefficient to be determined, constructing a fitting function related to the hash bucket number and the data deduplication total number to be processed;

training the fitting function through sample data to obtain a target numerical value of the fitting coefficient to be determined; the sample data comprises a sample hash bucket number and a sample data deduplication total number corresponding to the target scene;

and generating the preset fitting function according to the target numerical value of the fitting coefficient to be determined.

10. The data processing method of claim 9, wherein training the fitting function with sample data to obtain a target value of the pending fit coefficient comprises:

11. The data processing method according to any one of claims 1 to 10, wherein calculating data matching the set index in the plurality of data to be processed according to the target superposition data and the non-target data to obtain a data processing result corresponding to the target scene includes:

obtaining data matched with the set index in the plurality of data to be processed corresponding to each hash algorithm according to the difference between the target superposition data corresponding to each hash algorithm and the non-target data;

and performing statistical processing on data matched with the set index in the plurality of data to be processed corresponding to each hash algorithm to obtain a data processing result corresponding to the target scene.

12. The data processing method according to claim 11, wherein performing statistical processing on data matching the set index in the plurality of data to be processed corresponding to the respective hash algorithms to obtain a data processing result corresponding to the target scene includes:

13. A data processing apparatus, comprising:

the data superposition module is used for superposing the data to be processed to data at positions corresponding to the hash algorithms and the hash values in a preset data list according to at least one hash value corresponding to the data to be processed to obtain superposed data corresponding to each hash algorithm;

the data calculation module is used for sorting the superposed data corresponding to each hash algorithm in the preset data list, acquiring target superposed data which correspond to each hash algorithm and are matched with a set index according to a sorting result, and generating non-target data corresponding to each hash algorithm according to other superposed data except the target superposed data in the superposed data corresponding to each hash algorithm;

14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 12.

15. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein execution of the executable instructions by the processor causes the electronic device to perform the data processing method of any of claims 1 to 12.