CN111506607A - Data processing method and device - Google Patents
Data processing method and device
- Publication number
- CN111506607A (application number CN202010296755.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- target
- step size
- triple
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
Abstract
The invention provides a data processing method and apparatus, the method comprising: acquiring the number of triple data in a target data set; determining, according to that number, an initial step size for extracting triple data from the target data set; determining, according to the initial step size, a target step size for each subsequent extraction; extracting triple data from the target data set at the initial step size and the target step sizes respectively, and storing the extracted triple data in a sample buffer; and distributing the target triple data in the sample buffer to a plurality of servers, which process the data. This solves the problem in the related art that, when the data volume in a data set is large, the data extracted from the data set and processed by a server cannot represent the result of processing all the data in the set, which leaves the server heavily loaded; the processing load on the servers is thereby reduced and data processing efficiency is improved.
Description
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data processing method and apparatus.
Background
A Data Set is a collection of n data elements (Elements), each element having a unique identifier (ID).
A Skewed Data Set differs from an ordinary data set in that some of its elements may be invalid and therefore cannot be sampled.
Generally, a data set is divided among servers for processing, and the data belonging to the same target object is gathered and assigned to a single server for processing.
In the related art, representative data is extracted from the data set and distributed to the servers for processing, which reduces the processing load on the servers. The related art adopts the following sampling modes:
1. Simple random sampling: take a random number k with 0 < k < n, where n is the size of the data set; sample the element with ID k; repeat the first two steps until the sample buffer is full.
2. Systematic sampling: with a step size s = n/x (x being the sample buffer capacity), read one element every s elements until the sample buffer is full.
3. Cluster sampling: use some means (e.g., a hash algorithm) to divide the elements of the data set into a number of clusters (for example, a number of clusters equal to the range of the hash value); randomly select a cluster and sample all of its elements, until the sample buffer is full.
4. Stratified sampling: uniformly divide the elements of the data set into several groups (for example, using the remainder of the element ID divided by 10 as the criterion); then perform simple random sampling or systematic sampling on the elements within each group.
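The first two of these modes can be sketched as follows; a minimal Python illustration, assuming the data set is an indexable sequence of elements (the function names and the buffer-capacity parameter are ours, not the patent's):

```python
import random

def simple_random_sampling(dataset, buffer_size):
    """Mode 1: repeatedly draw a random ID k (0 <= k < n) and sample
    the element with that ID until the sample buffer is full."""
    n = len(dataset)
    buffer, seen = [], set()
    while len(buffer) < buffer_size:
        k = random.randrange(n)
        if k not in seen:          # avoid sampling the same ID twice
            seen.add(k)
            buffer.append(dataset[k])
    return buffer

def systematic_sampling(dataset, buffer_size):
    """Mode 2: with step size s = n / buffer_size, read one element
    every s elements until the sample buffer is full."""
    n = len(dataset)
    s = max(1, n // buffer_size)
    return [dataset[i] for i in range(0, n, s)][:buffer_size]
```

Both sketches assume every element is valid; on a skewed data set they can return invalid elements, which is exactly the limitation the embodiments below address.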
However, the above sampling methods are only applicable to non-skewed data sets and cannot handle skewed data sets. In particular, they cannot meet the uniformity requirement.
Therefore, the data acquired by these extraction methods is not highly representative, so a server processing the extracted data cannot represent the result of processing all the data in the data set.
In the prior art, when the data volume in a data set is large, the data extracted from the data set cannot represent the result of processing all the data in the set, so the processing load on the server is high; no solution to this problem has been proposed.
Disclosure of Invention
Embodiments of the invention provide a data processing method and apparatus, to at least solve the problem in the related art that, when the data volume in a data set is large, the data extracted from the data set and processed by a server cannot represent the result of processing all the data in the set, leaving the server heavily loaded.
According to an embodiment of the present invention, there is provided a data processing method including:
acquiring the number of triple data in a target data set, wherein each triple comprises an element, a key K, and data;
determining an initial step size for extracting triple data from the target data set according to the number of the triple data;
determining a target step size for each extraction of triple data according to the initial step size, wherein the target step size is not equal to the initial step size;
extracting triple data from the target data set at the initial step size and the target step sizes respectively, and storing the extracted triple data into a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1;
and distributing the target triple data in the sample buffer to a plurality of servers, and processing the data through the plurality of servers.
Optionally, the allocating the target triple data in the sample buffer to a plurality of servers comprises:
distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers;
distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
Optionally, the method further comprises:
when the extracted triple data is stored in the sample buffer, counting the number of occurrences of each key K in the sample buffer;
and discarding the triple data corresponding to the target keys K with the number larger than the preset number.
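A minimal sketch of this count-and-discard step, assuming triples are (element, K, data) tuples and the threshold is passed in (the function name and parameter are illustrative, not from the patent):

```python
from collections import Counter

def store_with_key_limit(triples, max_per_key):
    """Store extracted triples into the sample buffer while counting each
    key K; triples of a key that has already reached the preset number are
    discarded, so no single key can dominate the buffer."""
    counts = Counter()
    sample_buffer = []
    for element, key, data in triples:
        counts[key] += 1
        if counts[key] <= max_per_key:
            sample_buffer.append((element, key, data))
    return sample_buffer
```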
Optionally, determining an initial step size for extracting triple data from the target data set according to the number of triple data includes:
obtaining a target element that is the largest power of 2 smaller than the number of the triple data, i.e. x = 2^⌊log2(n)⌋, where n is the number of triples;
determining the target element as the initial step size.
Optionally, determining a target step size for extracting the triple data each time according to the initial step size includes:
determining the (i−1)-th target step size for each extraction of the triple data according to the initial step size, each target step size being half of the preceding one (so the (i−1)-th target step size is the initial step size divided by 2^(i−1));
Optionally, extracting triple data from the target data set at the initial step size and the target step sizes respectively, and storing the extracted triple data in a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1, includes:
extracting triple data for the first time from the target data set at the initial step size, and storing the extracted triple data into the sample buffer;
extracting triple data from the target data set for the second time at the 1st target step size, skipping the elements exactly divisible by the initial step size, and storing the extracted triple data into the sample buffer;
repeatedly extracting triple data from the target data set at the i-th target step size, skipping the elements exactly divisible by the (i−1)-th target step size, and storing the extracted triple data into the sample buffer, until the i-th target step size is 1 or the remaining space of the sample buffer is 0;
i = i + 1.
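Taken together, the claimed extraction loop can be sketched as follows. This is a hedged Python reading of the claims: we assume element IDs are list indices, the initial step size is the largest power of 2 below n, and each target step size halves the previous one (as the worked example in the description suggests):

```python
import math

def extract_triples(dataset, buffer_size):
    """Extract at the initial step size x, then at target step sizes
    x/2, x/4, ..., skipping IDs divisible by the previous step size,
    until the sample buffer is full or the target step size reaches 1."""
    n = len(dataset)
    sample_buffer = []
    step = 2 ** int(math.log2(n))   # initial step size x
    prev = None                     # step size of the previous pass
    while step >= 1 and len(sample_buffer) < buffer_size:
        for idx in range(step, n, step):
            if prev is not None and idx % prev == 0:
                continue            # element already read in an earlier pass
            sample_buffer.append(dataset[idx])
            if len(sample_buffer) == buffer_size:
                break
        prev, step = step, step // 2
    return sample_buffer
```

Because each pass keeps only the odd multiples of its step size, no element is ever read twice, and stopping at any point leaves a uniformly spread sample.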
According to another embodiment of the present invention, there is also provided a data processing apparatus including:
an acquisition module, configured to acquire the number of triple data in a target data set, wherein each triple comprises an element, a key K, and data;
a first determining module, configured to determine an initial step size for extracting triple data from the target data set according to the number of the triple data;
a second determining module, configured to determine, according to the initial step size, a target step size for extracting triple-group data each time, where the target step size is not equal to the initial step size;
the extracting module is used for respectively extracting triple data from the target data set by the initial step size and the target step size and storing the extracted triple data into a sampling cache until the residual space of the sampling cache is 0 or the target step size is 1;
and a distribution module, configured to distribute the target triple data in the sample buffer to a plurality of servers, and to process the data through the plurality of servers.
Optionally, the allocation module comprises:
a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the multiple servers;
and the second distribution module is used for distributing the data corresponding to the different keys K in the target triple data to different servers in the plurality of servers.
Optionally, the apparatus further comprises:
a counting module, configured to count the number of occurrences of each key K in the sample buffer while storing the extracted triple data in the sample buffer;
and the discarding module is used for discarding the triple data corresponding to the target keys K of which the number is greater than the preset number.
Optionally, the first determining module includes:
an obtaining submodule, configured to obtain a target element that is the largest power of 2 smaller than the number of the triple data, i.e. x = 2^⌊log2(n)⌋;
a determining submodule, configured to determine the target element as the initial step size.
Optionally, the second determining module is further configured to
determine the (i−1)-th target step size for each extraction of the triple data according to the initial step size, each target step size being half of the preceding one.
optionally, the extraction module comprises:
a first extraction submodule, configured to extract triple data for the first time from the target data set at the initial step size and store the extracted triple data into the sample buffer;
a second extraction submodule, configured to extract, at the 1st target step size, triple data from the target data set, skipping the elements exactly divisible by the initial step size, and store the extracted triple data in the sample buffer;
a repeating submodule, configured to repeatedly extract triple data from the target data set at the i-th target step size, skipping the elements exactly divisible by the (i−1)-th target step size, and store the extracted triple data in the sample buffer until the i-th target step size is 1 or the remaining space of the sample buffer is 0;
i = i + 1.
According to a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the method and apparatus, the number of triple data in the target data set is acquired, each triple comprising an element, a key K, and data; an initial step size for extracting triple data from the target data set is determined from that number; a target step size for each extraction is determined from the initial step size; triple data is extracted from the target data set at the initial step size and the target step sizes respectively and stored in a sample buffer, until the remaining space of the sample buffer is 0 or the target step size is 1; and the target triple data in the sample buffer is distributed to a plurality of servers, which process the data. This solves the problem in the related art that, when the data volume in a data set is large, the data extracted from it and processed by a server cannot represent the result of processing all the data in the set, leaving the server heavily loaded. Extraction with different step sizes improves the representativeness of the extracted target data, so that the server's processing of the extracted data can represent the result of processing all the data in the set, which reduces the processing load on the server and improves data processing efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a mobile terminal of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a data processing method according to an embodiment of the invention;
FIG. 3 is a schematic diagram of variable step-size sampling according to an embodiment of the present invention;
fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a hardware structure block diagram of a mobile terminal of a data processing method according to an embodiment of the present invention, as shown in fig. 1, a mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1) (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, and optionally, the mobile terminal may further include a transmission device 106 for communication function and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the data processing method in the embodiment of the present invention; the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Based on the above mobile terminal or network architecture, this embodiment provides a data processing method, and fig. 2 is a flowchart of the data processing method according to the embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
Step S202, acquiring the number of triple data in a target data set, wherein each triple comprises an element, a key K, and data;
Step S204, determining an initial step size for extracting triple data from the target data set according to the number of the triple data;
Further, the step S204 may specifically include:
obtaining a target element that is the largest power of 2 smaller than the number of the triple data, i.e. x = 2^⌊log2(n)⌋;
determining the target element as the initial step size.
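The initial step size of step S204 can be written as a one-liner; a minimal sketch assuming n > 1 (`pot` in the figure description appears to be `pow`):

```python
import math

def initial_step_size(n):
    """x = pow(2, (int)log2(n)): the largest power of 2 not exceeding
    the number n of triples in the target data set."""
    return 2 ** int(math.log2(n))
```

Note that for n an exact power of 2 this returns n itself; the description says "less than n", so an exact-power input may need x/2, an edge case the patent does not spell out.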
Step S206, determining a target step size for each extraction of the triple data according to the initial step size, wherein the target step size is not equal to the initial step size;
Further, the step S206 may specifically include:
determining the (i−1)-th target step size for each extraction of the triple data according to the initial step size, each target step size being half of the preceding one;
Step S208, extracting triple data from the target data set at the initial step size and the target step sizes respectively, and storing the extracted triple data into a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1;
Further, the step S208 may specifically include:
extracting triple data for the first time from the target data set at the initial step size, and storing the extracted triple data into the sample buffer;
extracting triple data from the target data set for the second time at the 1st target step size, skipping the elements exactly divisible by the initial step size, and storing the extracted triple data into the sample buffer;
repeatedly extracting triple data from the target data set at the i-th target step size, skipping the elements exactly divisible by the (i−1)-th target step size, and storing the extracted triple data into the sample buffer, until the i-th target step size is 1 or the remaining space of the sample buffer is 0;
i = i + 1.
Step S210, distributing the target triple data in the sample buffer to a plurality of servers, and processing the data through the plurality of servers.
In an embodiment of the present invention, the step S210 may specifically include: distributing the data corresponding to the same key K in the target triple data to the same server in the plurality of servers; distributing data corresponding to different keys K in the target triple data to different servers in the plurality of servers.
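A minimal sketch of the allocation rule of step S210, assuming a hash of key K selects the server; the patent fixes only the same-key/same-server property, not the mapping itself, so the hashing scheme here is our assumption:

```python
def distribute_by_key(sample_buffer, num_servers):
    """Triples sharing a key K always land on the same server; triples
    with different keys are spread across the servers."""
    servers = [[] for _ in range(num_servers)]
    for element, key, data in sample_buffer:
        # same key K -> same hash -> same server index
        servers[hash(key) % num_servers].append((element, key, data))
    return servers
```

(Python's `hash` of strings is randomized per process; for a reproducible assignment across machines a stable hash such as `zlib.crc32` would be substituted.)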
Through the steps S202 to S210, the problem that in the related art, when the data volume in the data set is large, the data extracted from the data set by the server processing cannot represent the result of processing all the data in the data set, so that the processing intensity of the server is high can be solved.
In an optional embodiment, the number of occurrences of each key K in the sample buffer is counted while the extracted triple data is stored in the sample buffer, and the triple data corresponding to any target key K whose count exceeds the preset number is discarded; filtering the data in the sample buffer in this way further improves the representativeness of the extracted data.
The following examples illustrate the present invention.
FIG. 3 is a schematic diagram of variable step-size sampling according to an embodiment of the present invention. As shown in FIG. 3, assume a given data set contains n elements: #0, #1, ..., #(n−1). Let x = pow(2, (int)log2(n)); in other words, x is the largest power of 2 less than n. Scan and read the data set with step size x; the result is that element #x is read. Scan and read the data set with step size x/2, ignoring the elements divisible by x; the result is that element #(x/2) is read. Scan and read the data set with step size x/4, ignoring the elements divisible by x/2; the result is that elements #(x/4) and #(3x/4) are read. Scan and read the data set with step size x/8, ignoring the elements divisible by x/4; the result is that elements #(x/8), #(3x/8), #(5x/8), and #(7x/8) are read. Repeat the loop until the sample buffer is full or the step size is 1. When the step size is 1, the entire data set will eventually have been read.
If n = 21793, the data set contains 21793 elements, namely elements #0, #1, #2, ..., #21792. Initialize x = pow(2, (int)log2(21793)) = pow(2, 14) = 16384. Scan and read the data set at step size x; element #16384 is read. Scan and read the data set at step size x/2 = 8192, ignoring the elements divisible by 16384; element #8192 is read. Scan and read the data set at step size x/4 = 4096, ignoring the elements divisible by 8192; elements #4096 (4096 × 1), #12288 (4096 × 3), and #20480 (4096 × 5) are read. Scan and read the data set at step size x/8 = 2048, ignoring the elements divisible by 4096; elements #2048 (2048 × 1), #6144 (2048 × 3), #10240 (2048 × 5), #14336 (2048 × 7), and #18432 (2048 × 9) are read; and so on. If the sample buffer still has not been filled (i.e., there are very few valid elements in the data set), a step size of 1 is eventually reached and the entire data set is scanned in steps of 1, ignoring the elements divisible by 2 (i.e., ignoring all previously read elements).
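The worked example above can be checked mechanically; this sketch reproduces the scan-and-skip loop and lists the order in which element IDs are read for n = 21793 (the function name is ours):

```python
import math

def reading_order(n, count):
    """First `count` element IDs read by variable step-size sampling
    over a data set with IDs 0..n-1."""
    out, step, prev = [], 2 ** int(math.log2(n)), None
    while step >= 1 and len(out) < count:
        for idx in range(step, n, step):
            if prev is not None and idx % prev == 0:
                continue   # skip IDs divisible by the previous step size
            out.append(idx)
            if len(out) == count:
                break
        prev, step = step, step // 2
    return out

# reading_order(21793, 10)
# -> [16384, 8192, 4096, 12288, 20480, 2048, 6144, 10240, 14336, 18432]
```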
The variable step-size sampling method in the embodiment of the invention has efficiency O(x) in the best case (a full data set) and O(n) in the worst case (an empty data set). Whenever the algorithm exits, the data in the sample buffer is uniform. If the data set is empty (or has very few valid elements), the algorithm eventually scans the entire data set, ensuring that no valid element is missed. By using different step sizes, the algorithm guarantees that no element is read twice, without paying a performance penalty. Therefore, the data extracted by the variable step-size sampling method is highly representative, and a server processing this representative data can represent the result of processing all the data in the data set, thereby reducing the pressure of processing data on the server and improving data processing efficiency.
Example 2
According to another embodiment of the present invention, there is also provided a data processing apparatus, and fig. 4 is a block diagram of the data processing apparatus according to the embodiment of the present invention, as shown in fig. 4, including:
an obtaining module 42, configured to obtain the number of triple data in a target dataset, where the triple data includes an element, a key K, and data;
a first determining module 44, configured to determine an initial step size for extracting triple data from the target data set according to the number of the triple data;
a second determining module 46, configured to determine a target step size for each extraction of triple data according to the initial step size, where the target step size is not equal to the initial step size;
an extracting module 48, configured to extract triple data from the target data set according to the initial step size and the target step size, and store the extracted triple data in a sample buffer until a remaining space of the sample buffer is 0 or the target step size is 1;
and an allocating module 410, configured to allocate the target triple data in the sample buffer to multiple servers, where the multiple servers process the data.
Optionally, the allocating module 410 includes:
a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to a same server in the multiple servers;
and the second distribution module is used for distributing the data corresponding to the different keys K in the target triple data to different servers in the plurality of servers.
Optionally, the apparatus further comprises:
a counting module, configured to count the number of occurrences of each key K in the sample buffer while storing the extracted triple data in the sample buffer;
and the discarding module is used for discarding the triple data corresponding to the target keys K of which the number is greater than the preset number.
Optionally, the first determining module 44 includes:
an obtaining submodule, configured to obtain a target element that is the largest power of 2 smaller than the number of the triple data, i.e. x = 2^⌊log2(n)⌋;
a determining submodule, configured to determine the target element as the initial step size.
Optionally, the second determining module 46 is further configured to
determine the (i−1)-th target step size for each extraction of the triple data according to the initial step size, each target step size being half of the preceding one.
Optionally, the extraction module 48 includes:
a first extraction submodule, configured to extract triple data for the first time from the target data set at the initial step size and store the extracted triple data into the sample buffer;
a second extraction submodule, configured to extract, at the 1st target step size, triple data from the target data set, skipping the elements exactly divisible by the initial step size, and store the extracted triple data in the sample buffer;
a repeating submodule, configured to repeatedly extract triple data from the target data set at the i-th target step size, skipping the elements exactly divisible by the (i−1)-th target step size, and store the extracted triple data in the sample buffer until the i-th target step size is 1 or the remaining space of the sample buffer is 0;
i = i + 1.
It should be noted that the above modules may be implemented by software or hardware; for the latter, this may be achieved by, but is not limited to, the following: the modules are all located in the same processor; alternatively, the modules are located, in any combination, in different processors.
Example 3
Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring the number of triple data in the target data set, wherein each triple comprises an element, a key K, and data;
S2, determining an initial step size for extracting triple data from the target data set according to the number of the triple data;
S3, determining a target step size for each extraction of the triple data according to the initial step size, wherein the target step size is not equal to the initial step size;
S4, extracting triple data from the target data set at the initial step size and the target step sizes respectively, and storing the extracted triple data into the sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1;
S5, distributing the target triple data in the sample buffer to a plurality of servers, and processing the data through the plurality of servers.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Example 4
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring the number of triple data items in a target data set, wherein each triple comprises an element, a key K and data;
S2, determining, according to the number of triple data items, an initial step size for extracting triple data from the target data set;
S3, determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;
S4, extracting triple data from the target data set at the initial step size and the target step size respectively, and storing the extracted triple data in a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1;
and S5, distributing the target triple data in the sample buffer to a plurality of servers, and processing the data through the plurality of servers.
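Step S5's key-based routing (detailed in claim 2) can be sketched as below. The mapping from a key K to a server is not specified in this text, so the stable-hash-modulo scheme and the names `distribute`, `triples`, and `num_servers` are assumptions for illustration.

```python
import hashlib

def distribute(triples, num_servers):
    """Hypothetical sketch of S5: route each (element, key K, data)
    triple so that all triples sharing a key land on the same server.
    The hash-modulo mapping is an assumed choice, not the patent's.
    """
    servers = [[] for _ in range(num_servers)]
    for element, key, data in triples:
        # A stable digest keeps the same key on the same server across
        # runs, unlike Python's per-process randomized built-in hash().
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        servers[int(digest, 16) % num_servers].append((element, key, data))
    return servers
```

Note that under this scheme the first half of claim 2 (same key, same server) always holds, while the second half (different keys, different servers) holds only when there are at least as many servers as distinct keys and no hash collisions occur modulo the server count.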
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be separately fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the principle of the present invention shall fall within its protection scope.
Claims (10)
1. A data processing method, comprising:
acquiring the number of triple data items in a target data set, wherein each triple comprises an element, a key K and data;
determining, according to the number of triple data items, an initial step size for extracting triple data from the target data set;
determining, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;
extracting triple data from the target data set at the initial step size and the target step size respectively, and storing the extracted triple data in a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1;
and distributing the target triple data in the sample buffer to a plurality of servers, and processing the data through the plurality of servers.
2. The method of claim 1, wherein distributing the target triple data in the sample buffer to a plurality of servers comprises:
distributing data corresponding to the same key K in the target triple data to the same server among the plurality of servers;
and distributing data corresponding to different keys K in the target triple data to different servers among the plurality of servers.
3. The method of claim 1, further comprising:
counting, when the extracted triple data is stored in the sample buffer, the number of occurrences of each key K in the sample buffer;
and discarding the triple data corresponding to any target key K whose count exceeds a preset number.
4. The method of any one of claims 1 to 3, wherein determining the initial step size for extracting triple data from the target data set according to the number of triple data items comprises:
obtaining, as a target element, the power of 2 with the largest exponent among the elements smaller than the number of triple data items, by the following formula:
determining the target element as the initial step size.
5. The method of claim 4, wherein determining the target step size for each extraction of triple data according to the initial step size comprises:
determining the (i-1)-th target step size for each extraction of triple data according to the initial step size by the following formula:
6. The method of claim 4, wherein extracting triple data from the target data set at the initial step size and the target step size respectively, and storing the extracted triple data in a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1, comprises:
extracting triple data from the target data set for the first time at the initial step size, and storing the extracted triple data in the sample buffer;
extracting, for the second time at the 1st target step size, the triple data other than the elements exactly divisible by the initial step size from the target data set, and storing the extracted triple data in the sample buffer;
repeatedly extracting, at the i-th target step size, the triple data other than the elements exactly divisible by the (i-1)-th target step size from the target data set, and storing the extracted triple data in the sample buffer, until the i-th target step size is 1 or the remaining space of the sample buffer is 0;
i = i + 1.
7. A data processing apparatus, comprising:
an acquisition module, configured to acquire the number of triple data items in a target data set, wherein each triple comprises an element, a key K and data;
a first determination module, configured to determine, according to the number of triple data items, an initial step size for extracting triple data from the target data set;
a second determination module, configured to determine, according to the initial step size, a target step size for each extraction of triple data, wherein the target step size is not equal to the initial step size;
an extraction module, configured to extract triple data from the target data set at the initial step size and the target step size respectively, and store the extracted triple data in a sample buffer until the remaining space of the sample buffer is 0 or the target step size is 1;
and a distribution module, configured to distribute the target triple data in the sample buffer to a plurality of servers, and process the data through the plurality of servers.
8. The apparatus of claim 7, wherein the distribution module comprises:
a first distribution module, configured to distribute data corresponding to the same key K in the target triple data to the same server among the plurality of servers;
and a second distribution module, configured to distribute data corresponding to different keys K in the target triple data to different servers among the plurality of servers.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010296755.6A CN111506607A (en) | 2020-04-15 | 2020-04-15 | Data processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010296755.6A CN111506607A (en) | 2020-04-15 | 2020-04-15 | Data processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111506607A true CN111506607A (en) | 2020-08-07 |
Family
ID=71877554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010296755.6A Pending CN111506607A (en) | 2020-04-15 | 2020-04-15 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111506607A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013050494A1 (en) * | 2011-10-05 | 2013-04-11 | St-Ericsson Sa | Simd memory circuit and methodology to support upsampling, downsampling and transposition |
CN108989383A (en) * | 2018-05-31 | 2018-12-11 | 阿里巴巴集团控股有限公司 | Data processing method and client |
US20190236479A1 (en) * | 2018-01-31 | 2019-08-01 | The Johns Hopkins University | Method and apparatus for providing efficient testing of systems by using artificial intelligence tools |
CN110730000A (en) * | 2018-07-17 | 2020-01-24 | 珠海格力电器股份有限公司 | Method and device for extracting key data from sampling data |
US10599621B1 (en) * | 2015-02-02 | 2020-03-24 | Amazon Technologies, Inc. | Distributed processing framework file system fast on-demand storage listing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107667503B (en) | Resource management techniques for heterogeneous resource clouds | |
CN111683144B (en) | Method and device for processing access request, computer equipment and storage medium | |
CN104424331A (en) | Data sampling method and device | |
CN109597804B (en) | Customer merging method and device based on big data, electronic equipment and storage medium | |
CN115039091A (en) | Multi-key-value command processing method and device, electronic equipment and storage medium | |
CN110569129A (en) | Resource allocation method and device, storage medium and electronic device | |
CN111506607A (en) | Data processing method and device | |
CN111833276A (en) | Image median filtering processing method and device | |
CN112332854A (en) | Hardware implementation method and device of Huffman coding and storage medium | |
CN111324621A (en) | Event processing method, device, equipment and storage medium | |
US20140337572A1 (en) | Noncontiguous representation of an array | |
CN105634999A (en) | Aging method and device for medium access control address | |
CN106156169B (en) | Discrete data processing method and device | |
CN113590322A (en) | Data processing method and device | |
CN110046040B (en) | Distributed task processing method and system and storage medium | |
CN112788768A (en) | Communication resource allocation method and device | |
CN112835932A (en) | Batch processing method and device of service table and nonvolatile storage medium | |
CN111340114A (en) | Image matching method and device, storage medium and electronic device | |
CN111143161A (en) | Log file processing method and device, storage medium and electronic equipment | |
CN110751204A (en) | Data fusion method and device, storage medium and electronic device | |
CN112711588A (en) | Multi-table connection method and device | |
CN110276212B (en) | Data processing method and device, storage medium and electronic device | |
CN110874308A (en) | Method and device for generating unique value | |
CN118193604A (en) | Blockchain transaction statistics method, device, electronic equipment and storage medium | |
CN107729058A (en) | A kind of method of automatic parsing VAT invoice recognition result |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
Effective date of abandoning: 20231027 |