CN111506607A

CN111506607A - Data processing method and device

Info

Publication number: CN111506607A
Application number: CN202010296755.6A
Authority: CN
Inventors: 张毅然; 耿正熙
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2020-04-15
Filing date: 2020-04-15
Publication date: 2020-08-07

Abstract

The invention provides a data processing method and device. The method comprises: acquiring the number of triple data in a target data set; determining, according to the number of triple data, an initial step size for extracting triple data from the target data set; determining, according to the initial step size, a target step size for each subsequent extraction of triple data; extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache; and distributing the target triple data in the sampling cache to a plurality of servers, which process the data. This addresses the problem in the related art that, when the data volume of the data set is large, the data extracted from the data set for server processing cannot represent the result of processing all the data in the data set, so that the processing load on the server is high. The method reduces the load of data processing on the server and improves data processing efficiency.

Description

Data processing method and device

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a data processing method and apparatus.

Background

A data set is a collection of n data elements, each of which has its own identification number (ID).

A skewed data set differs from an ordinary data set in that some of its elements may be invalid and therefore cannot be sampled.

Generally, a data set is partitioned across servers for processing, and the data belonging to the same target object is collected and allocated to one server for processing.

In the related art, representative data is extracted from the data set and distributed to the servers in order to reduce the processing load on the servers. The following sampling modes are used:

1. Simple random sampling: take a random number k that is greater than 0 and less than n, the size of the data set; sample the element whose ID is k; repeat these two steps until the sampling buffer is full.

2. Systematic sampling: with a step size s = x/n, read one element every s elements, until the sampling buffer is full after n reads.

3. Cluster sampling: use some means (e.g., a hash algorithm) to divide the elements of the data set into a number of clusters (the number of clusters equals the norm of the hash value); randomly select a cluster and sample all the elements in it, until the sampling buffer is full.

4. Stratified sampling: divide the elements of the data set uniformly into a number of groups (for example, by the remainder of the element ID divided by 10); then perform simple random sampling or systematic sampling on the elements within each group.
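For comparison with the method proposed later, the first two baseline modes can be sketched as follows. This is only an illustrative sketch: the function names and the `capacity` parameter (the sampling buffer size) are hypothetical, and because the step-size formula in the text is ambiguous, the systematic variant simply assumes a step of the data set size divided by the buffer size.

```python
import random


def simple_random_sample(n, capacity, rng):
    """Draw random IDs k with 0 < k < n until the buffer holds `capacity` IDs."""
    buffer = []
    while len(buffer) < capacity:
        buffer.append(rng.randrange(1, n))  # the same ID may be drawn twice
    return buffer


def systematic_sample(n, capacity):
    """Read one ID every `step` IDs (assumed step: n // capacity)."""
    step = max(1, n // capacity)
    return list(range(0, n, step))[:capacity]
```

Neither function checks whether an element is valid, which is the weakness on skewed data sets that the following paragraphs describe.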

However, the above sampling methods are applicable only to non-skewed data sets and do not work for skewed data sets. In particular, they cannot meet the uniformity requirement.

Therefore, the data obtained by these extraction methods is not very representative, and the result of the server processing the extracted data cannot represent the result of processing all the data in the data set.

No solution has yet been proposed for the problem in the related art that, when the data volume of the data set is large, the data extracted from the data set for server processing cannot represent the result of processing all the data in the data set, so that the processing load on the server is high.

Disclosure of Invention

Embodiments of the invention provide a data processing method and device, which at least solve the problem in the related art that, when the data volume of a data set is large, the data extracted from the data set for server processing cannot represent the result of processing all the data in the data set, so that the processing load on the server is high.

According to an embodiment of the present invention, there is provided a data processing method including:

acquiring the number of triple data in a target data set, wherein each triple data comprises an element, a key K, and data;

determining, according to the number of triple data, an initial step size for extracting triple data from the target data set;

determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

and distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.

Optionally, the allocating the target triple data in the sample buffer to a plurality of servers comprises:

distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers;

distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
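The key-based allocation described above can be sketched as follows. The function name `assign_by_key` and the assignment policy (each new key takes the next server in order of first appearance) are assumptions made for illustration. The text does not say how keys are mapped to servers, and when there are more distinct keys than servers this sketch lets several keys share one server.

```python
from collections import defaultdict


def assign_by_key(triples, num_servers):
    """Group sampled (element_id, key, data) triples so that one key maps to one server."""
    key_to_server = {}
    per_server = defaultdict(list)
    for element_id, key, data in triples:
        if key not in key_to_server:
            # Hypothetical policy: keys take servers in order of first appearance.
            key_to_server[key] = len(key_to_server) % num_servers
        per_server[key_to_server[key]].append((element_id, key, data))
    return dict(per_server)
```

For example, with two servers and the keys "a", "b", "a", both "a" triples land on server 0 and the "b" triple on server 1.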

Optionally, the method further comprises:

while the extracted triple data is being stored in the sampling cache, counting the number of occurrences of each key K in the sampling cache;

and discarding the triple data corresponding to any target key K whose count is larger than a preset number.
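A minimal sketch of this filtering step is given below. It takes one possible reading of the text, namely that every triple of an over-represented key is dropped; an implementation that keeps up to the preset number of triples per key would also be consistent with the wording. The function and parameter names are hypothetical.

```python
from collections import Counter


def drop_overrepresented_keys(buffer, preset_number):
    """Remove all triples whose key occurs more than `preset_number` times in the buffer."""
    counts = Counter(key for _, key, _ in buffer)
    return [triple for triple in buffer if counts[triple[1]] <= preset_number]
```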

Optionally, determining an initial step size for extracting triple data from the target data set according to the number of triple data includes:

obtaining, as a target element, the largest power of 2 among the values smaller than the number of the triple data, in the following manner:

$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$

where x₀ is the target element and n is the number of the triple data;

determining the target element as the initial step size.
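The initial step size can be computed without floating-point logarithms. The sketch below uses Python's `int.bit_length()`; the function name `initial_step` is hypothetical. It follows the formula pow(2, (int) log2(n)) from the worked example in the description, which differs from "the largest power of 2 less than n" only when n is itself a power of 2.

```python
def initial_step(n):
    """Return 2 ** floor(log2(n)) for a positive integer n."""
    if n < 1:
        raise ValueError("the data set must contain at least one triple")
    return 1 << (n.bit_length() - 1)
```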

Optionally, determining a target step size for extracting the triple data each time according to the initial step size includes:

determining, according to the initial step size, the (i-1)-th target step size for extracting the triple data in the following manner:

$$x_{i-1} = \frac{x_0}{2^{\,i-1}}$$

where x_{i-1} is the target step size of the i-th extraction, x₀ is the initial step size, and i is an integer greater than 1.
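Assuming, as in the worked example of the description, that each target step size is half the previous one, the sequence of step sizes can be generated as follows. The generator name `step_sizes` is hypothetical, and the sketch assumes the initial step size is a power of 2.

```python
def step_sizes(x0):
    """Yield x0, x0/2, x0/4, ... down to and including 1 (x0 must be a power of 2)."""
    step = x0
    while True:
        yield step
        if step == 1:
            return
        step //= 2
```

The first value is the initial step size; every later value is a target step size, so the i-th extraction uses x0 / 2**(i-1).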

Optionally, extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1, includes:

extracting triple data from the target data set for the first time with the initial step size, and storing the extracted triple data in the sampling cache;

extracting triple data from the target data set for the second time with the 1st target step size, excluding the elements whose index is exactly divisible by the initial step size, and storing the extracted triple data in the sampling cache;

repeatedly extracting triple data from the target data set with the i-th target step size, excluding the elements whose index is exactly divisible by the (i-1)-th target step size, and storing the extracted triple data in the sampling cache, until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where after each extraction

i = i + 1.
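The extraction loop above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the data set is modelled as a Python list in which an invalid element is `None`, the names `variable_step_sample`, `capacity` and `is_valid` are hypothetical, each step size is taken to be half the previous one, and each pass starts at index `step` (as the worked example in the description does), so index 0 is never visited.

```python
def variable_step_sample(triples, capacity, is_valid=lambda t: t is not None):
    """Fill a sampling buffer by scanning with step sizes x0, x0/2, ..., 1."""
    n = len(triples)
    buffer = []
    if n == 0 or capacity <= 0:
        return buffer
    step = 1 << (n.bit_length() - 1)  # initial step size: 2 ** floor(log2(n))
    previous = None
    while True:
        for index in range(step, n, step):
            # Skip indices already visited in the previous (larger-step) pass.
            if previous is not None and index % previous == 0:
                continue
            if is_valid(triples[index]):
                buffer.append(triples[index])
                if len(buffer) == capacity:
                    return buffer  # remaining space of the buffer is 0
        if step == 1:
            return buffer  # the whole data set has been scanned
        previous, step = step, step // 2
```

With ten valid triples and a buffer of four, the indices sampled are 8, 4, 2 and 6, which are spread across the data set rather than clustered at its start.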

According to another embodiment of the present invention, there is also provided a data processing apparatus including:

an acquisition module, configured to acquire the number of triple data in a target data set, wherein each triple data comprises an element, a key K, and data;

a first determining module, configured to determine, according to the number of triple data, an initial step size for extracting triple data from the target data set;

a second determining module, configured to determine, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

an extraction module, configured to extract triple data from the target data set with the initial step size and the target step size respectively, and to store the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

and an allocation module, configured to distribute the target triple data in the sampling cache to a plurality of servers, the data being processed by the plurality of servers.

Optionally, the allocation module comprises:

a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the multiple servers;

and the second distribution module is used for distributing the data corresponding to the different keys K in the target triple data to different servers in the plurality of servers.

Optionally, the apparatus further comprises:

a counting module, configured to count the number of occurrences of each key K in the sampling cache while the extracted triple data is being stored in the sampling cache;

and a discarding module, configured to discard the triple data corresponding to any target key K whose count is greater than a preset number.

Optionally, the first determining module includes:

an obtaining submodule, configured to obtain, as a target element, the largest power of 2 among the values smaller than the number of the triple data, in the following manner:

$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$

where x₀ is the target element and n is the number of the triple data;

a determining submodule, configured to determine the target element as the initial step size.

Optionally, the second determining module is further configured to determine, according to the initial step size, the (i-1)-th target step size for each extraction of the triple data in the following manner:

$$x_{i-1} = \frac{x_0}{2^{\,i-1}}$$

where x_{i-1} is the target step size of the i-th extraction, x₀ is the initial step size, and i is an integer greater than 1.

Optionally, the extraction module comprises:

a first extraction submodule, configured to extract triple data from the target data set for the first time with the initial step size, and to store the extracted triple data in the sampling cache;

a second extraction submodule, configured to extract triple data from the target data set with the 1st target step size, excluding the elements whose index is exactly divisible by the initial step size, and to store the extracted triple data in the sampling cache;

a repeating submodule, configured to repeatedly extract triple data from the target data set with the i-th target step size, excluding the elements whose index is exactly divisible by the (i-1)-th target step size, and to store the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where after each extraction

i = i + 1.

According to a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the invention, the number of triple data in the target data set is acquired, wherein each triple data comprises an element, a key K, and data; an initial step size for extracting triple data from the target data set is determined according to the number of triple data; a target step size for each extraction of triple data is determined according to the initial step size; triple data is extracted from the target data set with the initial step size and the target step size respectively, and the extracted triple data is stored in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1; and the target triple data in the sampling cache is distributed to a plurality of servers, which process the data. This solves the problem in the related art that, when the data volume of the data set is large, the data extracted from the data set for server processing cannot represent the result of processing all the data in the data set, so that the processing load on the server is high. Extracting with different step sizes makes the extracted target data more representative, so that the result of the server processing the extracted data can represent the result of processing all the data in the data set, which reduces the load of data processing on the server and improves data processing efficiency.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of a hardware configuration of a mobile terminal of a data processing method according to an embodiment of the present invention;

FIG. 2 is a flow diagram of a data processing method according to an embodiment of the invention;

FIG. 3 is a schematic diagram of variable-length sampling according to an embodiment of the present invention;

FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Example 1

The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a hardware structure block diagram of a mobile terminal of a data processing method according to an embodiment of the present invention, as shown in fig. 1, a mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1) (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, and optionally, the mobile terminal may further include a transmission device 106 for communication function and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program or module of application software, such as a computer program corresponding to the data processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

Based on the above mobile terminal or network architecture, this embodiment provides a data processing method, and fig. 2 is a flowchart of the data processing method according to the embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:

step S202, acquiring the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;

step S204, determining the initial step size of extracting the ternary group data from the target data set according to the number of the ternary group data;

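Further, the step S204 may specifically include:

obtaining, as a target element, the largest power of 2 among the values smaller than the number of the triple data, in the following manner:

$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$

where x₀ is the target element and n is the number of the triple data;

determining the target element as the initial step size.

Step S206, determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

Further, the step S206 may specifically include:

determining, according to the initial step size, the (i-1)-th target step size for each extraction of the triple data in the following manner:

$$x_{i-1} = \frac{x_0}{2^{\,i-1}}$$

where x_{i-1} is the target step size of the i-th extraction, x₀ is the initial step size, and i is an integer greater than 1.

Step S208, extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

Further, the step S208 may specifically include:

extracting triple data from the target data set for the first time with the initial step size, and storing the extracted triple data in the sampling cache;

extracting triple data from the target data set for the second time with the 1st target step size, excluding the elements whose index is exactly divisible by the initial step size, and storing the extracted triple data in the sampling cache;

repeatedly extracting triple data from the target data set with the i-th target step size, excluding the elements whose index is exactly divisible by the (i-1)-th target step size, and storing the extracted triple data in the sampling cache, until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where after each extraction

i = i + 1.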

and step S210, distributing the target ternary group data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.

In an embodiment of the present invention, the step S210 may specifically include: distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers; distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.

Through the steps S202 to S210, the problem that in the related art, when the data volume in the data set is large, the data extracted from the data set by the server processing cannot represent the result of processing all the data in the data set, so that the processing intensity of the server is high can be solved.

In an optional embodiment, the number of each key K in the sampling buffer is counted while the extracted triple data is stored in the sampling buffer; and discarding the triple data corresponding to the target keys K with the number larger than the preset number, and further improving the representativeness of the extracted data by filtering the data in the sampling cache.

The following examples illustrate the present invention.

FIG. 3 is a schematic diagram of variable-length sampling according to an embodiment of the present invention. As shown in FIG. 3, assume that a given data set includes n elements, numbered #0 to #(n-1). Let x = pow(2, (int) log2(n)); in other words, let x be the largest power of 2 less than n. The data set is scanned and read with x as the step size, with the result that element #(1 × x) is read. The data set is then scanned and read in steps of x/2, ignoring the elements that are exactly divisible by x, with the result that element #(1 × (x/2)) is read. The data set is scanned and read in steps of x/4, ignoring the elements that are exactly divisible by x/2, with the result that elements #(x × (1/4)), #(x × (3/4)), ... are read. The data set is scanned and read in steps of x/8, ignoring the elements that are exactly divisible by x/4, with the result that elements #(x × (1/8)), #(x × (3/8)), #(x × (5/8)), #(x × (7/8)), ... are read. The loop is repeated until the sampling buffer is full or the step size is 1. When the step size is 1, the entire data set will clearly have been read.

If n = 21793, the data set contains 21793 elements, namely elements #0, 1, 2, ..., 21792. Initialize x = pow(2, (int) log2(21793)) = pow(2, 14) = 16384. Reading the data set in steps of x yields element #16384. Scanning and reading the data set in steps of x/2 = 8192, while ignoring the elements that are exactly divisible by 16384, yields element #8192. Scanning and reading in steps of x/4 = 4096, while ignoring the elements that are exactly divisible by 8192, yields elements #4096 (4096 × 1), #12288 (4096 × 3), and #20480 (4096 × 5). Scanning and reading in steps of x/8 = 2048, while ignoring the elements that are exactly divisible by 4096, yields elements #2048 (2048 × 1), #6144 (2048 × 3), #10240 (2048 × 5), #14336 (2048 × 7), and #18432 (2048 × 9); and so on. If the sampling buffer is still not full (i.e., there are very few valid elements in the data set), a step size of 1 is eventually reached and the entire data set is scanned in steps of 1, while the elements that are exactly divisible by 2 are ignored (i.e., all previously read elements are ignored).
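The indices listed in this example can be reproduced with a few lines of code. The sketch below records, for each step size, which element indices a pass would read. The function name `indices_per_pass` is hypothetical, and validity checks and the buffer limit are left out so that only the index pattern is shown.

```python
def indices_per_pass(n):
    """Map each step size to the list of indices read in that pass (n >= 1)."""
    step = 1 << (n.bit_length() - 1)  # 2 ** floor(log2(n))
    previous = None
    passes = {}
    while step >= 1:
        passes[step] = [
            index
            for index in range(step, n, step)
            if previous is None or index % previous != 0
        ]
        previous, step = step, step // 2
    return passes


passes = indices_per_pass(21793)
print(passes[16384])  # [16384]
print(passes[8192])   # [8192]
print(passes[4096])   # [4096, 12288, 20480]
print(passes[2048])   # [2048, 6144, 10240, 14336, 18432]
```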

For the variable-length sampling method of the embodiment of the invention, the cost is O(x) in the best case (a full data set) and O(n) in the worst case (an empty data set). Whenever the algorithm exits, the data in the sampling buffer is uniformly distributed over the data set. If the data set is empty (or has few valid elements), the algorithm eventually scans the entire data set, ensuring that no valid element is missed. By using different step sizes, the algorithm ensures, at no extra performance cost, that no element is read twice. The data extracted by the variable-length sampling method is therefore highly representative, and the result of the server processing this representative data can represent the result of processing all the data in the data set, which reduces the load of data processing on the server and improves data processing efficiency.
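The claim that no element is read twice, and that a full run reaches every element, can be checked with a short argument. The following is a sketch that assumes the halving step sizes of the worked example, with passes starting at index s_j so that index 0 is not visited; the symbols s_j, m and t are introduced here only for the argument.

```latex
Let $x_0 = 2^{\lfloor \log_2 n \rfloor}$ and let pass $j$ use the step size
$s_j = x_0 / 2^{j}$ for $j = 0, 1, \dots, \log_2 x_0$.

For $j \ge 1$, pass $j$ visits the multiples of $s_j$ below $n$ and skips the
multiples of $s_{j-1} = 2 s_j$, so it reads exactly the odd multiples of $s_j$:
\[
  R_j = \{\, m\, s_j : m \text{ odd},\ m\, s_j < n \,\}.
\]
For $j = 0$, since $x_0 \le n < 2 x_0$, the only possible index is $x_0 = 1 \cdot x_0$,
which is also an odd multiple of $s_0$.

Every index $k$ with $1 \le k < n$ factors uniquely as $k = m \cdot 2^{t}$ with $m$ odd.
Because $2^{t} \le k < 2 x_0$, we have $2^{t} \le x_0$, so $2^{t} = s_j$ for exactly one $j$.
Hence $k \in R_j$ for exactly one pass: the sets $R_j$ are pairwise disjoint and
\[
  \bigcup_{j} R_j = \{1, 2, \dots, n-1\}.
\]
```

In other words, an index is read in the single pass whose step size equals the largest power of 2 dividing it, which is why no bookkeeping of already-read elements is needed.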

Example 2

According to another embodiment of the present invention, there is also provided a data processing apparatus, and fig. 4 is a block diagram of the data processing apparatus according to the embodiment of the present invention, as shown in fig. 4, including:

an obtaining module 42, configured to obtain the number of triple data in a target dataset, where the triple data includes an element, a key K, and data;

a first determining module 44, configured to determine an initial step size for extracting triple-component data from the target data set according to the number of triple-component data;

a second determining module 46, configured to determine a target step size for extracting triple-group data each time according to the initial step size, where the target step size is not equal to the initial step size;

an extracting module 48, configured to extract triple data from the target data set according to the initial step size and the target step size, and store the extracted triple data in a sample buffer until a remaining space of the sample buffer is 0 or the target step size is 1;

and an allocating module 410, configured to allocate the target triple-tuple data in the sample buffer to multiple servers, where the multiple servers process the data.

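Optionally, the allocating module 410 includes:

a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the multiple servers;

and a second distribution module, configured to distribute data corresponding to different keys K in the target triple data to different servers in the multiple servers.

Optionally, the apparatus further comprises:

a counting module, configured to count the number of occurrences of each key K in the sampling cache while the extracted triple data is being stored in the sampling cache;

and a discarding module, configured to discard the triple data corresponding to any target key K whose count is greater than a preset number.

Optionally, the first determining module 44 includes:

an obtaining submodule, configured to obtain, as a target element, the largest power of 2 among the values smaller than the number of the triple data, in the following manner:

$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$

where x₀ is the target element and n is the number of the triple data;

a determining submodule, configured to determine the target element as the initial step size.

Optionally, the second determining module 46 is further configured to determine, according to the initial step size, the (i-1)-th target step size for each extraction of the triple data in the following manner:

$$x_{i-1} = \frac{x_0}{2^{\,i-1}}$$

where x_{i-1} is the target step size of the i-th extraction, x₀ is the initial step size, and i is an integer greater than 1.

Optionally, the extraction module 48 includes:

a first extraction submodule, configured to extract triple data from the target data set for the first time with the initial step size, and to store the extracted triple data in the sampling cache;

a second extraction submodule, configured to extract triple data from the target data set with the 1st target step size, excluding the elements whose index is exactly divisible by the initial step size, and to store the extracted triple data in the sampling cache;

a repeating submodule, configured to repeatedly extract triple data from the target data set with the i-th target step size, excluding the elements whose index is exactly divisible by the (i-1)-th target step size, and to store the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where after each extraction

i = i + 1.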

It should be noted that the above modules may be implemented by software or hardware. In the latter case, they may be implemented in, but are not limited to, the following ways: the modules are all located in the same processor; or the modules are located, in any combination, in different processors.

Example 3

Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

S1, acquiring the number of triple data in the target data set, wherein each triple data comprises an element, a key K, and data;

S2, determining, according to the number of triple data, the initial step size for extracting triple data from the target data set;

S3, determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

S4, extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

and S5, distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.

Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Example 4

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

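Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

S1, acquiring the number of triple data in the target data set, wherein each triple data comprises an element, a key K, and data;

S2, determining, according to the number of triple data, the initial step size for extracting triple data from the target data set;

S3, determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

S4, extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

and S5, distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.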

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

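1. A data processing method, comprising:

acquiring the number of triple data in a target data set, wherein each triple data comprises an element, a key K, and data;

determining, according to the number of triple data, an initial step size for extracting triple data from the target data set;

determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

and distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.

2. The method of claim 1, wherein distributing the target triple data in the sampling cache to a plurality of servers comprises:

distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers;

distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.

3. The method of claim 1, further comprising:

while the extracted triple data is being stored in the sampling cache, counting the number of occurrences of each key K in the sampling cache;

and discarding the triple data corresponding to any target key K whose count is larger than a preset number.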

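4. The method of any one of claims 1 to 3, wherein determining an initial step size for extracting triple data from the target data set according to the number of triple data comprises:

obtaining, as a target element, the largest power of 2 among the values smaller than the number of the triple data, in the following manner:

$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$

where x₀ is the target element and n is the number of the triple data;

determining the target element as the initial step size.

5. The method of claim 4, wherein determining the target step size for each extraction of triple data according to the initial step size comprises:

determining, according to the initial step size, the (i-1)-th target step size for each extraction of the triple data in the following manner:

$$x_{i-1} = \frac{x_0}{2^{\,i-1}}$$

where x_{i-1} is the target step size of the i-th extraction, x₀ is the initial step size, and i is an integer greater than 1.

6. The method of claim 4, wherein extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1, comprises:

extracting triple data from the target data set for the first time with the initial step size, and storing the extracted triple data in the sampling cache;

extracting triple data from the target data set for the second time with the 1st target step size, excluding the elements whose index is exactly divisible by the initial step size, and storing the extracted triple data in the sampling cache;

repeatedly extracting triple data from the target data set with the i-th target step size, excluding the elements whose index is exactly divisible by the (i-1)-th target step size, and storing the extracted triple data in the sampling cache, until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where after each extraction

i = i + 1.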

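7. A data processing apparatus, comprising:

an acquisition module, configured to acquire the number of triple data in a target data set, wherein each triple data comprises an element, a key K, and data;

a first determining module, configured to determine, according to the number of triple data, an initial step size for extracting triple data from the target data set;

a second determining module, configured to determine, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;

an extraction module, configured to extract triple data from the target data set with the initial step size and the target step size respectively, and to store the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;

and an allocation module, configured to distribute the target triple data in the sampling cache to a plurality of servers, the data being processed by the plurality of servers.

8. The apparatus of claim 7, wherein the allocation module comprises:

a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the plurality of servers;

and a second distribution module, configured to distribute data corresponding to different keys K in the target triple data to different servers in the plurality of servers.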

9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.

10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.