CN111506607A - Data processing method and device - Google Patents

Info

Publication number
CN111506607A
Authority
CN
China
Prior art keywords
data
target
step size
triple
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010296755.6A
Other languages
Chinese (zh)
Inventor
张毅然
耿正熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202010296755.6A priority Critical patent/CN111506607A/en
Publication of CN111506607A publication Critical patent/CN111506607A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and device. The method comprises: acquiring the number of triple data in a target data set; determining, according to the number of triple data, an initial step size for extracting triple data from the target data set; determining, according to the initial step size, a target step size for each extraction of triple data; extracting triple data from the target data set by the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache; and distributing the target triple data in the sampling cache to a plurality of servers, which process the data. This can solve the problem in the related art that, when the data volume in the data set is large, the data a server extracts from the data set and processes cannot represent the result of processing all the data in the data set, so that the processing intensity of the server is high. The method reduces the intensity of data processing on the servers and improves data processing efficiency.

Description

Data processing method and device
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data processing method and apparatus.
Background
A Data Set (Data Set) is a collection of n Data elements (elements), each Element having a separate identification number (ID).
A Skewed Data Set differs from an ordinary Data Set in that some of its elements may be invalid and therefore cannot be sampled.
Generally, a data set is divided among servers for processing, and the data of the same target object is collected and allocated to one server for processing.
In the related art, representative data is extracted from a data set and distributed to servers for processing, which reduces the processing pressure on the servers. The related art uses the following sampling modes:
1. Simple random sampling: take a random number k that is greater than 0 and less than the number n of elements in the data set; sample the element whose ID is k; repeat the first two steps until the sampling buffer is full.
2. Systematic sampling: sample with step size s = x/n, that is, read one element every s elements until the sampling buffer is full after n reads.
3. Cluster sampling: use some means (e.g., a hash algorithm) to divide the elements of the data set into a number of clusters (the number of clusters equals the norm of the hash value); randomly select a cluster and sample all elements in it, until the sampling buffer is full.
4. Stratified sampling: divide the elements of the data set uniformly into several groups (for example, by the remainder of the element ID divided by 10); then perform simple random sampling or systematic sampling on the elements within each group.
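As a point of comparison, the first two sampling modes listed above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names, the `capacity` and `step` parameters, and the fixed random seed are assumptions made for the example, and the random index is drawn from 0 to n-1.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def simple_random_sample(data: Sequence[T], capacity: int, seed: int = 0) -> List[T]:
    """Pick random IDs k until the buffer holds `capacity` items."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    buffer: List[T] = []
    while len(buffer) < capacity:
        k = rng.randrange(len(data))  # the same ID may be drawn more than once
        buffer.append(data[k])
    return buffer


def systematic_sample(data: Sequence[T], capacity: int, step: int) -> List[T]:
    """Read one element every `step` elements until the buffer is full."""
    buffer: List[T] = []
    for k in range(0, len(data), step):
        if len(buffer) == capacity:
            break
        buffer.append(data[k])
    return buffer
```

Neither function checks whether an element is valid, which is one way to see why these modes fit poorly when many elements of the data set cannot be sampled.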
However, the above sampling methods are only applicable to non-skewed data sets and cannot handle skewed data sets. In particular, they cannot meet the uniformity requirement.
Therefore, the data obtained by these extraction methods is not very representative, so the result of a server processing the extracted data cannot represent the result of processing all the data in the data set.
No solution has yet been proposed for the problem in the related art that, when the data volume in the data set is large, the data a server extracts from the data set and processes cannot represent the result of processing all the data in the data set, so that the processing intensity of the server is high.
Disclosure of Invention
The embodiment of the invention provides a data processing method and device, which are used for at least solving the problem that in the related art, when the data volume in a data set is large, the data extracted from the data set by a server processing cannot represent the result of processing all data in the data set, so that the processing intensity of the server is high.
According to an embodiment of the present invention, there is provided a data processing method including:
acquiring the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;
determining an initial step size for extracting triple data from the target data set according to the number of triple data;
determining a target step size for extracting triple data each time according to the initial step size, wherein the target step size is not equal to the initial step size;
extracting triple data from the target data set by the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
and distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.
Optionally, the allocating the target triple data in the sample buffer to a plurality of servers comprises:
distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers;
distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
Optionally, the method further comprises:
when the extracted triple data is stored in the sampling cache, counting the number of each key K in the sampling cache;
and discarding the triple data corresponding to the target keys K with the number larger than the preset number.
Optionally, determining an initial step size for extracting triple data from the target data set according to the number of triple data includes:
obtaining, as a target element, the largest power of 2 among the elements smaller than the number of triple data, by:
$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$
where $x_0$ is the target element and $n$ is the number of triple data;
determining the target element as the initial step size.
Optionally, determining a target step size for extracting the triple data each time according to the initial step size includes:
determining the (i-1)-th target step size for each extraction of triple data according to the initial step size by:
$$x_{i-1} = \frac{x_0}{2^{i-1}}$$
where $x_{i-1}$ is the target step size of the i-th extraction and i is an integer greater than 1.
(Formula image BDA0002452473970000033 is not reproduced.)
Optionally, extracting triple data from the target data set by the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1, includes:
extracting triple data from the target data set for the first time by the initial step size, and storing the extracted triple data in the sampling cache;
extracting triple data from the target data set for the second time by the 1st target step size, excluding elements that are exactly divisible by the initial step size, and storing the extracted triple data in the sampling cache;
repeatedly extracting triple data from the target data set by the i-th target step size, excluding elements that are exactly divisible by the (i-1)-th target step size, and storing the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where i = i + 1 after each extraction.
according to another embodiment of the present invention, there is also provided a data processing apparatus including:
an acquisition module, configured to acquire the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;
a first determining module, configured to determine an initial step size for extracting triple data from the target data set according to the number of triple data;
a second determining module, configured to determine, according to the initial step size, a target step size for extracting triple data each time, wherein the target step size is not equal to the initial step size;
an extracting module, configured to extract triple data from the target data set by the initial step size and the target step size respectively, and store the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
and a distribution module, configured to distribute the target triple data in the sampling cache to a plurality of servers, the data being processed by the plurality of servers.
Optionally, the allocation module comprises:
a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the multiple servers;
and the second distribution module is used for distributing the data corresponding to the different keys K in the target triple data to different servers in the plurality of servers.
Optionally, the apparatus further comprises:
a counting module, configured to count the number of each key K in the sampling cache while the extracted triple data is stored in the sampling cache;
and a discarding module, configured to discard the triple data corresponding to the target keys K whose number is greater than the preset number.
Optionally, the first determining module includes:
an obtaining submodule, configured to obtain, as a target element, the largest power of 2 among the elements smaller than the number of triple data, by:
$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$
where $x_0$ is the target element and $n$ is the number of triple data;
a determining submodule, configured to determine the target element as the initial step size.
Optionally, the second determining module is further configured to
determine the (i-1)-th target step size for each extraction of triple data according to the initial step size by:
$$x_{i-1} = \frac{x_0}{2^{i-1}}$$
where $x_{i-1}$ is the target step size of the i-th extraction and i is an integer greater than 1.
(Formula image BDA0002452473970000053 is not reproduced.)
Optionally, the extraction module comprises:
a first extraction submodule, configured to extract triple data from the target data set for the first time by the initial step size, and store the extracted triple data in the sampling cache;
a second extraction submodule, configured to extract triple data from the target data set by the 1st target step size, excluding elements that are exactly divisible by the initial step size, and store the extracted triple data in the sampling cache;
a repeating submodule, configured to repeatedly extract triple data from the target data set by the i-th target step size, excluding elements that are exactly divisible by the (i-1)-th target step size, and store the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where i = i + 1 after each extraction.
according to a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, the number of triple data in the target data set is acquired, wherein the triple data comprises elements, keys K and data; an initial step size for extracting triple data from the target data set is determined according to the number of triple data; a target step size for each extraction is determined according to the initial step size; triple data is extracted from the target data set by the initial step size and the target step size respectively and stored in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1; and the target triple data in the sampling cache is distributed to a plurality of servers, which process the data. This can solve the problem in the related art that, when the data volume in the data set is large, the data a server extracts from the data set and processes cannot represent the result of processing all the data in the data set, so that the processing intensity of the server is high. Extracting with different step sizes improves the representativeness of the extracted target data, so that the result of the servers processing the extracted data can represent the result of processing all the data in the data set, which reduces the data processing intensity of the servers and improves data processing efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a mobile terminal of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a data processing method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of variable length sampling according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a hardware structure block diagram of a mobile terminal of a data processing method according to an embodiment of the present invention, as shown in fig. 1, a mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1) (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, and optionally, the mobile terminal may further include a transmission device 106 for communication function and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the data processing method in the embodiment of the present invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Based on the above mobile terminal or network architecture, this embodiment provides a data processing method, and fig. 2 is a flowchart of the data processing method according to the embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, acquiring the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;
step S204, determining the initial step size for extracting triple data from the target data set according to the number of triple data;
further, the step S204 may specifically include:
obtaining, as a target element, the largest power of 2 among the elements smaller than the number of triple data, by:
$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$
where $x_0$ is the target element and $n$ is the number of triple data;
determining the target element as the initial step size.
step S206, determining a target step size for extracting triple data each time according to the initial step size, wherein the target step size is not equal to the initial step size;
further, the step S206 may specifically include:
determining the (i-1)-th target step size for each extraction of triple data according to the initial step size by:
$$x_{i-1} = \frac{x_0}{2^{i-1}}$$
where $x_{i-1}$ is the target step size of the i-th extraction and i is an integer greater than 1.
(Formula image BDA0002452473970000083 is not reproduced.)
step S208, extracting triple data from the target data set by the initial step size and the target step size respectively, and storing the extracted triple data into a sample cache until the residual space of the sample cache is 0 or the target step size is 1;
further, the step S208 may specifically include:
extracting triple data from the target data set for the first time by the initial step size, and storing the extracted triple data in the sampling cache;
extracting triple data from the target data set for the second time by the 1st target step size, excluding elements that are exactly divisible by the initial step size, and storing the extracted triple data in the sampling cache;
repeatedly extracting triple data from the target data set by the i-th target step size, excluding elements that are exactly divisible by the (i-1)-th target step size, and storing the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where i = i + 1 after each extraction.
step S210, distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.
In an embodiment of the present invention, the step S210 may specifically include: distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers; distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
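The key-based distribution of step S210 can be sketched as below. This is only an illustration under stated assumptions: the `(element ID, key K, data)` tuple layout, the function name `distribute_by_key`, and the round-robin assignment of new keys are all hypothetical. The text does not say what happens when there are more distinct keys than servers; in this sketch servers are then reused, while triples with the same key still always go to the same server.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# Assumed layout of one triple: (element ID, key K, data).
Triple = Tuple[int, Hashable, object]


def distribute_by_key(triples: List[Triple], num_servers: int) -> Dict[int, List[Triple]]:
    """Group sampled triples by server so that one key maps to one server."""
    key_to_server: Dict[Hashable, int] = {}
    assignment: Dict[int, List[Triple]] = defaultdict(list)
    for triple in triples:
        key = triple[1]
        if key not in key_to_server:
            # New key: give it the next server in round-robin order (assumption).
            key_to_server[key] = len(key_to_server) % num_servers
        assignment[key_to_server[key]].append(triple)
    return dict(assignment)
```

With at least as many servers as distinct keys, every key gets its own server, which matches the wording of the step most closely.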
Through the steps S202 to S210, the problem that in the related art, when the data volume in the data set is large, the data extracted from the data set by the server processing cannot represent the result of processing all the data in the data set, so that the processing intensity of the server is high can be solved.
In an optional embodiment, the number of each key K in the sampling buffer is counted while the extracted triple data is stored in the sampling buffer; and discarding the triple data corresponding to the target keys K with the number larger than the preset number, and further improving the representativeness of the extracted data by filtering the data in the sampling cache.
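One possible reading of this optional filtering is sketched below: a per-key counter is kept while the cache is filled, and a triple is discarded once its key already has the preset number of entries. The function name, the `max_per_key` parameter, and the choice to drop only the excess triples (rather than every triple of an over-represented key) are assumptions, since the text leaves this open.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple

# Assumed layout of one triple: (element ID, key K, data).
Triple = Tuple[int, Hashable, object]


def store_with_key_cap(
    triples: Iterable[Triple], capacity: int, max_per_key: int
) -> List[Triple]:
    """Fill the sampling cache while counting keys and dropping excess ones."""
    buffer: List[Triple] = []
    key_counts: Counter = Counter()
    for triple in triples:
        if len(buffer) >= capacity:
            break  # remaining space of the cache is 0
        key = triple[1]
        if key_counts[key] >= max_per_key:
            continue  # this key is already at the preset number: discard
        buffer.append(triple)
        key_counts[key] += 1
    return buffer
```

Capping each key keeps a single frequent key from filling the cache, which is consistent with the stated goal of making the sampled data more representative.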
The following examples illustrate the present invention.
FIG. 3 is a schematic diagram of variable-length sampling according to an embodiment of the present invention. As shown in FIG. 3, assume that a given data set includes n elements, numbered #0, 1, 2, ..., n-1. Let x = pow(2, (int) log2(n)); in other words, let x be the largest power of 2 less than n. The data set is scanned and read with x as the step size, with the result that element #x is read. The data set is then scanned and read in steps of x/2, ignoring elements that are exactly divisible by x, with the result that element #(x × (1/2)) is read. The data set is scanned and read in steps of x/4, ignoring elements that are exactly divisible by x/2, with the result that elements #(x × (1/4)), #(x × (3/4)), ... are read. The data set is scanned and read in steps of x/8, ignoring elements that are exactly divisible by x/4, with the result that elements #(x × (1/8)), #(x × (3/8)), #(x × (5/8)), #(x × (7/8)), ... are read. The loop is repeated until the sampling buffer is full or the step size is 1. When the step size is 1, the entire data set will clearly have been read.
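The power-of-two rule and the halving of the step size described above can be sketched as follows. The function names are illustrative assumptions. Integer bit operations stand in for the floating-point logarithm, and when n is itself a power of 2 the rule returns n.

```python
from typing import List


def initial_step(n: int) -> int:
    """Largest power of 2 not exceeding n, i.e. 2 to the power floor(log2(n))."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n.bit_length() - 1 equals floor(log2(n)) for positive integers.
    return 1 << (n.bit_length() - 1)


def step_sequence(n: int) -> List[int]:
    """Initial step x followed by x/2, x/4, ... down to 1."""
    steps: List[int] = []
    x = initial_step(n)
    while x >= 1:
        steps.append(x)
        x //= 2
    return steps
```

For example, `step_sequence(10)` starts at 8 and halves until it reaches 1.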
For example, if n = 21793, the data set contains 21793 elements, namely elements #0, 1, 2, ..., 21792. Initialize x = pow(2, (int) log2(21793)) = pow(2, 14) = 16384. Reading the data set with step size x reads element #16384. Scanning in steps of x/2 = 8192, ignoring elements exactly divisible by 16384, reads element #8192. Scanning in steps of x/4 = 4096, ignoring elements exactly divisible by 8192, reads elements #4096 (4096 × 1), #12288 (4096 × 3) and #20480 (4096 × 5). Scanning in steps of x/8 = 2048, ignoring elements exactly divisible by 4096, reads elements #2048 (2048 × 1), #6144 (2048 × 3), #10240 (2048 × 5), #14336 (2048 × 7) and #18432 (2048 × 9); and so on. If the sampling buffer is still not full (i.e., there are very few valid elements in the data set), a step size of 1 is eventually reached and the entire data set is scanned in steps of 1, ignoring elements that are exactly divisible by 2 (i.e., all previously read elements).
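The element numbers in this worked example can be checked with a short sketch that lists which indices each pass reads. The function name and the dictionary return shape are assumptions. Each scan starts at the step size itself, as the example does, so index 0 is not visited in this sketch.

```python
from typing import Dict, List, Optional


def indices_per_pass(n: int) -> Dict[int, List[int]]:
    """Map each step size to the element indices read in that pass."""
    if n < 1:
        return {}
    step = 1 << (n.bit_length() - 1)  # largest power of 2 not exceeding n
    previous: Optional[int] = None
    passes: Dict[int, List[int]] = {}
    while step >= 1:
        passes[step] = [
            k
            for k in range(step, n, step)
            # Skip indices that an earlier, larger step already read.
            if previous is None or k % previous != 0
        ]
        previous = step
        step //= 2
    return passes
```

Calling `indices_per_pass(21793)` gives one list per step size, and the lists for 16384, 8192, 4096 and 2048 can be compared with the element numbers quoted in the example.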
For the variable-length sampling method in the embodiment of the invention, the efficiency is O(x) in the best case (a full data set) and O(n) in the worst case (an empty data set). Whenever the algorithm exits, the data in the sampling buffer is uniform. If the data set is empty (or has few valid elements), the algorithm eventually scans the entire data set, ensuring that no valid element is missed. By using different step sizes, the algorithm ensures that no element is read twice, without paying a performance penalty. Therefore, the data extracted from the data set by the variable-length sampling method is highly representative, and the result of the servers processing this representative data can represent the result of processing all the data in the data set, which reduces the pressure of data processing on the servers and improves data processing efficiency.
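Putting the pieces together, the sampling loop with a bounded buffer might look like the sketch below. It is an illustration under assumptions, not the patent's implementation: invalid elements are modelled as `None`, the triple layout `(element ID, key K, data)` and the names `variable_step_sample` and `is_valid` are hypothetical, and scans start at the step size itself as in the worked example, so index 0 is not visited.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed layout of one triple: (element ID, key K, data).
Triple = Tuple[int, str, object]


def variable_step_sample(
    data: Sequence[Optional[Triple]],
    capacity: int,
    is_valid: Callable[[Optional[Triple]], bool] = lambda t: t is not None,
) -> List[Triple]:
    """Scan with steps x, x/2, ..., 1 until the buffer is full or the step is 1."""
    n = len(data)
    buffer: List[Triple] = []
    if n == 0 or capacity <= 0:
        return buffer
    step = 1 << (n.bit_length() - 1)  # largest power of 2 not exceeding n
    previous: Optional[int] = None
    while step >= 1:
        for k in range(step, n, step):
            if previous is not None and k % previous == 0:
                continue  # already read in an earlier pass
            if is_valid(data[k]):
                buffer.append(data[k])
                if len(buffer) == capacity:
                    return buffer  # remaining space of the buffer is 0
        previous = step
        step //= 2
    return buffer
```

On a dense data set the function returns after a few coarse passes, while on a sparse one it falls through to the step-1 pass and so inspects every index from 1 to n-1.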
Example 2
According to another embodiment of the present invention, there is also provided a data processing apparatus, and fig. 4 is a block diagram of the data processing apparatus according to the embodiment of the present invention, as shown in fig. 4, including:
an obtaining module 42, configured to obtain the number of triple data in a target dataset, where the triple data includes an element, a key K, and data;
a first determining module 44, configured to determine an initial step size for extracting triple data from the target data set according to the number of triple data;
a second determining module 46, configured to determine a target step size for extracting triple data each time according to the initial step size, where the target step size is not equal to the initial step size;
an extracting module 48, configured to extract triple data from the target data set by the initial step size and the target step size respectively, and store the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
and an allocating module 410, configured to allocate the target triple data in the sampling cache to multiple servers, where the multiple servers process the data.
Optionally, the allocating module 410 includes:
a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the multiple servers;
and the second distribution module is used for distributing the data corresponding to the different keys K in the target triple data to different servers in the plurality of servers.
Optionally, the apparatus further comprises:
a counting module, configured to count the number of each key K in the sampling cache while the extracted triple data is stored in the sampling cache;
and a discarding module, configured to discard the triple data corresponding to the target keys K whose number is greater than the preset number.
Optionally, the first determining module 44 includes:
an obtaining submodule, configured to obtain, as a target element, the largest power of 2 among the elements smaller than the number of triple data, by:
$$x_0 = 2^{\lfloor \log_2 n \rfloor}$$
where $x_0$ is the target element and $n$ is the number of triple data;
a determining submodule, configured to determine the target element as the initial step size.
Optionally, the second determining module 46 is further configured to
determine the (i-1)-th target step size for each extraction of triple data according to the initial step size by:
$$x_{i-1} = \frac{x_0}{2^{i-1}}$$
where $x_{i-1}$ is the target step size of the i-th extraction and i is an integer greater than 1.
(Formula image BDA0002452473970000123 is not reproduced.)
Optionally, the extraction module 48 includes:
a first extraction submodule, configured to extract triple data from the target data set for the first time by the initial step size, and store the extracted triple data in the sampling cache;
a second extraction submodule, configured to extract triple data from the target data set by the 1st target step size, excluding elements that are exactly divisible by the initial step size, and store the extracted triple data in the sampling cache;
a repeating submodule, configured to repeatedly extract triple data from the target data set by the i-th target step size, excluding elements that are exactly divisible by the (i-1)-th target step size, and store the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0, where i = i + 1 after each extraction.
It should be noted that the above modules may be implemented by software or hardware. In the latter case, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring the number of triple data in the target data set, wherein the triple data comprises elements, keys K and data;
S2, determining the initial step size for extracting triple data from the target data set according to the number of triple data;
S3, determining a target step size for extracting triple data each time according to the initial step size, wherein the target step size is not equal to the initial step size;
S4, extracting triple data from the target data set by the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
S5, distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Example 4
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by means of a computer program:
S1, acquiring the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;
S2, determining, according to the number of triple data, an initial step size for extracting triple data from the target data set;
S3, determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;
S4, extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
S5, distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.
Optionally, for specific examples of this embodiment, reference may be made to the examples described in the above embodiments and optional implementations; details are not repeated here.
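Step S5 can be illustrated by a key-based partitioning sketch. The application does not specify how a key K is mapped to a server, so the sketch below assumes a CRC32 hash of the key taken modulo the number of servers; `server_for_key`, `distribute` and `num_servers` are hypothetical names. This mapping sends all data with the same key K to the same server, but two different keys can still share a server when there are more keys than servers.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A triple is (element, key K, data); keys are assumed to be strings here.
Triple = Tuple[object, str, object]


def server_for_key(key: str, num_servers: int) -> int:
    """Map a key K to a server index with a process-independent hash."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def distribute(sample: Iterable[Triple], num_servers: int) -> Dict[int, List[Triple]]:
    """Group sampled triples by destination server, keeping each key on one server."""
    shards: Dict[int, List[Triple]] = defaultdict(list)
    for element, key, data in sample:
        shards[server_for_key(key, num_servers)].append((element, key, data))
    return dict(shards)
```

CRC32 is used instead of the built-in `hash` because Python randomizes string hashes per process, which would send the same key to different servers on different runs.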
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;
determining, according to the number of triple data, an initial step size for extracting triple data from the target data set;
determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;
extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
distributing the target triple data in the sampling cache to a plurality of servers, and processing the data through the plurality of servers.
2. The method of claim 1, wherein distributing the target triple data in the sampling cache to a plurality of servers comprises:
distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers;
distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
3. The method of claim 1, further comprising:
when the extracted triple data is stored in the sampling cache, counting the number of occurrences of each key K in the sampling cache;
discarding the triple data corresponding to any target key K whose count is greater than a preset number.
4. The method of any one of claims 1 to 3, wherein determining an initial step size for extracting triple data from the target dataset according to the number of triple data comprises:
obtaining, from among the elements smaller than the number of triple data, the target element corresponding to the largest exponent of 2, by the following formula:
Figure FDA0002452473960000021
wherein x_0 is the target element and n is the number of triple data;
determining the target element as the initial step size.
5. The method of claim 4, wherein determining the target step size for each decimation of triple data according to the initial step size comprises:
determining, according to the initial step size, the (i-1)-th target step size for each extraction of triple data by the following formula:
Figure FDA0002452473960000022
wherein x_{i-1} is the target step size of the i-th extraction, and i is an integer greater than 1,
Figure FDA0002452473960000023
6. The method of claim 4, wherein extracting triple data from the target data set with the initial step size and the target step size respectively, and storing the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1, comprises:
extracting triple data from the target data set for the first time with the initial step size, and storing the extracted triple data in the sampling cache;
extracting triple data from the target data set for the second time with the 1st target step size, excluding the elements that are exactly divisible by the initial step size, and storing the extracted triple data in the sampling cache;
repeatedly extracting triple data from the target data set with the i-th target step size, excluding the elements that are exactly divisible by the (i-1)-th target step size, and storing the extracted triple data in the sampling cache until the i-th target step size is 1 or the remaining space of the sampling cache is 0;
i = i + 1.
7. A data processing apparatus, comprising:
an acquisition module, configured to acquire the number of triple data in a target data set, wherein the triple data comprises elements, keys K and data;
a first determining module, configured to determine, according to the number of triple data, an initial step size for extracting triple data from the target data set;
a second determining module, configured to determine, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;
an extraction module, configured to extract triple data from the target data set with the initial step size and the target step size respectively, and to store the extracted triple data in a sampling cache until the remaining space of the sampling cache is 0 or the target step size is 1;
a distribution module, configured to distribute the target triple data in the sampling cache to a plurality of servers, and to process the data through the plurality of servers.
8. The apparatus of claim 7, wherein the distribution module comprises:
a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to the same server in the plurality of servers;
a second distribution module, configured to distribute data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
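
As an illustration of the counting and discarding described in claim 3, the following Python sketch counts how often each key K occurs in the sampling cache and drops the triples of any key whose count exceeds a preset number. The names `drop_hot_keys` and `max_per_key` are hypothetical, and applying the filter after the cache has been filled (rather than while triples are being stored) is a simplifying assumption.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

# A triple is (element, key K, data).
Triple = Tuple[object, Hashable, object]


def drop_hot_keys(cache: Sequence[Triple], max_per_key: int) -> List[Triple]:
    """Discard all triples whose key K occurs more than max_per_key times."""
    counts = Counter(key for _, key, _ in cache)   # number of each key K
    return [t for t in cache if counts[t[1]] <= max_per_key]
```

For example, if key "a" occurs three times and key "b" once, a preset number of 2 leaves only the triple with key "b".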
CN202010296755.6A 2020-04-15 2020-04-15 Data processing method and device Pending CN111506607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010296755.6A CN111506607A (en) 2020-04-15 2020-04-15 Data processing method and device

Publications (1)

Publication Number Publication Date
CN111506607A (en) 2020-08-07

Family

ID=71877554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010296755.6A Pending CN111506607A (en) 2020-04-15 2020-04-15 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111506607A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013050494A1 (en) * 2011-10-05 2013-04-11 St-Ericsson Sa Simd memory circuit and methodology to support upsampling, downsampling and transposition
CN108989383A (en) * 2018-05-31 2018-12-11 阿里巴巴集团控股有限公司 Data processing method and client
US20190236479A1 (en) * 2018-01-31 2019-08-01 The Johns Hopkins University Method and apparatus for providing efficient testing of systems by using artificial intelligence tools
CN110730000A (en) * 2018-07-17 2020-01-24 珠海格力电器股份有限公司 Method and device for extracting key data from sampling data
US10599621B1 (en) * 2015-02-02 2020-03-24 Amazon Technologies, Inc. Distributed processing framework file system fast on-demand storage listing

Similar Documents

Publication Publication Date Title
CN107667503B (en) Resource management techniques for heterogeneous resource clouds
CN111683144B (en) Method and device for processing access request, computer equipment and storage medium
CN104424331A (en) Data sampling method and device
CN109597804B (en) Customer merging method and device based on big data, electronic equipment and storage medium
CN115039091A (en) Multi-key-value command processing method and device, electronic equipment and storage medium
CN110569129A (en) Resource allocation method and device, storage medium and electronic device
CN111506607A (en) Data processing method and device
CN111833276A (en) Image median filtering processing method and device
CN112332854A (en) Hardware implementation method and device of Huffman coding and storage medium
CN111324621A (en) Event processing method, device, equipment and storage medium
US20140337572A1 (en) Noncontiguous representation of an array
CN105634999A (en) Aging method and device for medium access control address
CN106156169B (en) Discrete data processing method and device
CN113590322A (en) Data processing method and device
CN110046040B (en) Distributed task processing method and system and storage medium
CN112788768A (en) Communication resource allocation method and device
CN112835932A (en) Batch processing method and device of service table and nonvolatile storage medium
CN111340114A (en) Image matching method and device, storage medium and electronic device
CN111143161A (en) Log file processing method and device, storage medium and electronic equipment
CN110751204A (en) Data fusion method and device, storage medium and electronic device
CN112711588A (en) Multi-table connection method and device
CN110276212B (en) Data processing method and device, storage medium and electronic device
CN110874308A (en) Method and device for generating unique value
CN118193604A (en) Blockchain transaction statistics method, device, electronic equipment and storage medium
CN107729058A (en) A kind of method of automatic parsing VAT invoice recognition result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20231027