CN116089367A - Dynamic bucketing method, device, electronic equipment and medium


Info

Publication number
CN116089367A
CN116089367A
Authority
CN
China
Prior art keywords
bucket
data
classes
cluster
Prior art date
Legal status: Pending
Application number
CN202310318919.4A
Other languages
Chinese (zh)
Inventor
曾钊创
徐文玉
彭顺求
林嘉文
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310318919.4A
Publication of CN116089367A


Classifications

    • G06F 16/137 Physics; Computing; Electric digital data processing; Information retrieval; File systems; File access structures, e.g. distributed indices; Hash-based
    • G06F 16/182 Physics; Computing; Electric digital data processing; Information retrieval; File systems; File system types; Distributed file systems
    • G06F 9/5027 Physics; Computing; Electric digital data processing; Arrangements for program control; Multiprogramming arrangements; Allocation of resources, e.g. of the central processing unit [CPU]; the resource being a machine, e.g. CPUs, servers, terminals
    • Y02D 10/00 Climate change mitigation technologies in information and communication technologies [ICT]; Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure provides a dynamic bucketing method and apparatus applicable to the fields of distributed computing and big data. The method comprises the following steps: obtaining an original data table to be bucketed, the amount of data to be bucketed, and the cluster resources; obtaining hash values of the n data values of the designated bucket column of the original data table to form a hash data set; performing cluster analysis on the n hash values in the hash data set to divide them into k1 classes and determining k1 as a first bucket number, where k1 is a positive integer and k1 is smaller than n; determining a second bucket number k2 based on the amount of data to be bucketed and the cluster resources; determining a target bucket number based on the first bucket number k1 and the second bucket number k2; and performing the bucketing operation on the original data table.

Description

Dynamic bucketing method, device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of distributed computing technology and the field of big data technology, and more particularly to a dynamic bucketing method, apparatus, electronic device, and medium.
Background
With the continuous development of computer technology, more and more enterprises, banks included, use the data bucketing function of big data platforms to query data efficiently.
Traditional big data platforms (e.g. Hive) support only static bucketing, i.e. the bucket number can only be hard-coded at the time the program is written. For example, a temporary bucketed table must be defined directly in a data analysis program, with the bucket number specified in the table creation statement, say a temporary table with 4 buckets. Once such a program goes to production it generally runs at a fixed frequency, and the fixed bucket number cannot be adjusted dynamically to the data volume of each batch or to the cluster's CPU, memory, and other resources.
That is, in the current technology the bucket number is generally fixed into the program when the big data analysis program is written, for example a temporary table with 4 buckets is hard-coded in a piece of code, and adjusting the bucket number while the program runs is not supported. Under dynamically changing data volumes and elastic clusters (i.e. the total CPU, memory, and other resources change dynamically), the bucketing parameters may fail to match the computational demands of the actual data volume or the actual cluster resources, causing low operating efficiency and wasted or insufficient cluster resources, and ultimately creating a risk of program interruption.
Disclosure of Invention
In view of the foregoing, according to a first aspect of the present disclosure, an embodiment of the present disclosure provides a dynamic bucketing method, the method comprising:
obtaining an original data table to be bucketed, wherein the original data table comprises a designated bucket column, the bucket column comprises n data values, and n is a positive integer greater than or equal to 2;
obtaining the amount of data to be bucketed according to the original data table;
obtaining cluster resources, wherein the cluster resources comprise the computing and storage resources that can be allocated to serve the bucketing operation on the original data table;
obtaining hash values of the n data values of the bucket column of the original data table, so as to form a hash data set;
performing cluster analysis on the n hash values in the hash data set to divide them into k1 classes, and determining k1 as a first bucket number, wherein k1 is a positive integer and k1 is smaller than n;
determining a second bucket number k2 based on the amount of data to be bucketed and the cluster resources, wherein k2 is the maximum bucket number for that amount of data under the cluster resources, k2 is a positive integer and k2 is smaller than n;
determining a target bucket number based on the first bucket number k1 and the second bucket number k2, wherein the target bucket number is either k1 or k2; and
performing the bucketing operation on the original data table according to the determined target bucket number, so as to form a plurality of bucket files.
According to some exemplary embodiments, performing cluster analysis on the n hash values in the hash data set to divide them into k1 classes specifically comprises:
calculating the Euclidean distance between every two of the n hash values in the hash data set to obtain a distance matrix of n rows and n columns; and
a clustering step: determining a plurality of cluster centers; for each datum in the hash data set, reading its distances to the cluster centers from the distance matrix and assigning it to the class of the nearest cluster center, so as to form a plurality of classes.
According to some exemplary embodiments, the cluster analysis further specifically comprises:
comparing the number of data in the j-th class with a preset in-class sample count threshold, wherein j is a positive integer and j is less than or equal to k1; and
in response to the number of data in the j-th class being less than the threshold, removing the j-th class and redistributing its data among the remaining classes.
According to some exemplary embodiments, the cluster analysis further specifically comprises:
recalculating the cluster center of each formed class from the data it contains; and
iteratively executing the clustering step based on the recalculated cluster centers.
According to some exemplary embodiments, the cluster analysis further specifically comprises:
when the clustering step is executed for the first time, randomly selecting k0 data from the hash data set as initial cluster centers, wherein k0 is a preset number of cluster centers, k0 is a positive integer and k0 is smaller than k1.
According to some exemplary embodiments, while iteratively executing the clustering step, a splitting sub-step is executed if the number of formed classes is less than or equal to one half of k0; and/or,
a merging sub-step is executed if the number of formed classes is greater than or equal to twice k0.
According to some exemplary embodiments, the splitting sub-step comprises:
calculating the variance of each of the plurality of classes to form a plurality of variances;
selecting the maximum variance from the plurality of variances; and
if the maximum variance is greater than a preset variance threshold and the class corresponding to the maximum variance contains at least twice the in-class sample count threshold, splitting that class into 2 classes.
According to some exemplary embodiments, splitting the class corresponding to the maximum variance into 2 classes comprises: determining the cluster centers of the 2 resulting classes from the cluster center of the class before splitting and the maximum variance.
According to some exemplary embodiments, the merging sub-step comprises:
comparing the Euclidean distance between the cluster centers of every two of the plurality of classes with a preset distance threshold; and
if the Euclidean distance between the cluster centers of two classes is smaller than the distance threshold, merging the two classes into a new class.
According to some exemplary embodiments, merging the two classes into a new class comprises: determining the cluster center of the new class from the cluster centers of the two classes and the numbers of data in the two classes.
According to some example embodiments, obtaining the cluster resources comprises:
obtaining the CPU resources that can be allocated to the bucketing operation on the original data table, wherein the CPU resources comprise dominant CPU resources and contendable CPU resources;
obtaining the memory resources that can be allocated to the bucketing operation, wherein the memory resources comprise dominant memory resources and contendable memory resources; and
obtaining the disk resources that can be allocated to the bucketing operation, wherein the disk resources comprise available disk resources.
According to some example embodiments, acquiring the data amount comprises:
obtaining the size of one row of data in the original data table;
obtaining the number of data rows of the original data table; and
multiplying the size of one row of data by the number of rows to obtain the data amount.
According to some exemplary embodiments, determining the second bucket number k2 based on the amount of data to be bucketed and the cluster resources specifically comprises:
determining the second bucket number k2 with a multiple linear regression model from the amount of data to be bucketed and the CPU, memory, and disk resources, wherein the model is trained in advance on historical stress test data.
According to some exemplary embodiments, determining the target bucket number based on the first bucket number k1 and the second bucket number k2 specifically comprises:
if the first bucket number k1 is greater than the second bucket number k2, determining k2 as the target bucket number; and
if k1 is less than or equal to k2, determining k1 as the target bucket number.
According to a second aspect of the present disclosure, there is also provided a dynamic bucketing apparatus, the apparatus comprising:
a first acquisition module for obtaining an original data table to be bucketed, wherein the original data table comprises a designated bucket column, the bucket column comprises n data values, and n is a positive integer greater than or equal to 2;
a second acquisition module for obtaining the amount of data to be bucketed according to the original data table;
a third acquisition module for obtaining cluster resources, wherein the cluster resources comprise the computing and storage resources that can be allocated to serve the bucketing operation on the original data table;
a fourth acquisition module for obtaining hash values of the n data values of the bucket column of the original data table, so as to form a hash data set;
an analysis module for performing cluster analysis on the n hash values in the hash data set to divide them into k1 classes, and determining k1 as a first bucket number, wherein k1 is a positive integer and k1 is smaller than n;
a first determination module for determining a second bucket number k2 based on the amount of data to be bucketed and the cluster resources, wherein k2 is the maximum bucket number for that amount of data under the cluster resources, k2 is a positive integer and k2 is smaller than n;
a second determination module for determining a target bucket number based on the first bucket number k1 and the second bucket number k2, wherein the target bucket number is either k1 or k2; and
an execution module for performing the bucketing operation on the original data table according to the determined target bucket number, so as to form a plurality of bucket files.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as described above.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method as described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
Fig. 1 schematically illustrates an application scenario diagram of a dynamic bucketing method according to an embodiment of the present disclosure.
Fig. 2 schematically illustrates a flow chart of a dynamic bucketing method according to an embodiment of the present disclosure.
Fig. 3 is a flow chart of acquiring the data amount in a method according to some example embodiments of the present disclosure.
Fig. 4 is a flow chart of acquiring cluster resources in a method according to some example embodiments of the present disclosure.
Fig. 5 is a flow chart of cluster analysis in a method according to some exemplary embodiments of the present disclosure.
Fig. 6 schematically illustrates a block diagram of a dynamic bucketing apparatus according to an embodiment of the present disclosure.
Fig. 7 schematically illustrates a block diagram of the second acquisition module in a dynamic bucketing apparatus according to an embodiment of the present disclosure.
Fig. 8 schematically illustrates a block diagram of the third acquisition module in a dynamic bucketing apparatus according to an embodiment of the present disclosure.
Fig. 9 schematically illustrates a block diagram of the analysis module in the dynamic bucketing apparatus according to an embodiment of the present disclosure.
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a dynamic bucketing method according to an embodiment of the present disclosure.
Fig. 11 schematically illustrates the bucketing operation in a dynamic bucketing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted in the sense commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solution of the present disclosure, the acquisition, storage, and use of any personal information involved comply with the relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
First, the technical terms used herein are explained and illustrated as follows.
Data bucketing: bucketing divides the whole data content into a number of buckets, e.g. 3, according to the hash value of a chosen column: the hash value of the designated bucket column is taken modulo 3, and records are distributed by the result. Records whose modulo result is 0 are stored in one file, those whose result is 1 in a second file, those whose result is 2 in a third, and so on.
The bucket column is a designated column of the bucketed table; its data are distributed evenly across the bucket files by the hash-modulo method. Because the bucketing operation requires a hash-modulo computation over concrete column data, the designated bucket column must be based on an actual column (field) of the table. Bucketing changes the way data are stored: rows with the same hash-modulo result (or within a certain interval) are placed in the same bucket file, as the sketch below illustrates.
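For illustration only, a minimal Python sketch of this hash-modulo assignment (hash_func and the sample rows are assumptions, not part of the disclosure):

```python
# Toy sketch of hash-modulo bucketing with 3 buckets; hash_func stands in
# for the platform's actual hash function, and the rows are illustrative.
def hash_func(value):
    return hash(str(value))

rows = [{"id": 1, "branch": "A"}, {"id": 2, "branch": "B"}, {"id": 3, "branch": "C"}]
buckets = {0: [], 1: [], 2: []}
for row in rows:
    bucket_num = hash_func(row["branch"]) % 3   # modulo the bucket count
    buckets[bucket_num].append(row)             # each bucket maps to one file
```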
Bucketing parameter: mainly the bucket number; this parameter affects the data volume of each bucket and thus the subsequent computation speed, resource utilization, and so on.
K-Means algorithm: a parameter k is set, k initial center points are randomly selected, and the n points are divided into k clusters by computing the Euclidean distance between each of the n points and the k centers, each point belonging to the cluster of the nearest center (i.e. the cluster mean). After each round, the center of each cluster is recomputed (generally as the mean of the cluster), and after several rounds of iteration a division of the n points into k clusters is obtained. The algorithm is commonly used in unsupervised learning, which aims to reveal the intrinsic properties and regularities of data by learning from unlabeled training samples, thereby providing a basis for further data analysis.
Multiple linear regression algorithm (multiple linear regression): multiple linear regression is an extension of univariate linear regression, which can be expressed as f(x_i) = ω·x_i + b, the dependent variable f(x_i) being determined by the single independent variable x_i, the weight ω, and the constant term b. The purpose of a multiple linear regression model is to construct a regression equation that estimates the dependent variable from multiple independent variables, in order to interpret and predict its value. It can be expressed as f(X) = ω_0 + ω_1·x_1 + ω_2·x_2 + ... + ω_k·x_k + b, the dependent variable f(X) being determined by the independent variables x_1, x_2, ..., x_k, the weights ω_0, ω_1, ..., ω_k, and the constant term b.
Embodiments of the present disclosure provide a dynamic bucketing method, the method comprising: obtaining an original data table to be bucketed, wherein the original data table comprises a designated bucket column, the bucket column comprises n data values, and n is a positive integer greater than or equal to 2; obtaining the amount of data to be bucketed according to the original data table; obtaining cluster resources, the cluster resources comprising the computing and storage resources that can be allocated to serve the bucketing operation on the original data table; obtaining hash values of the n data values of the bucket column to form a hash data set; performing cluster analysis on the n hash values in the hash data set to divide them into k1 classes and determining k1 as the first bucket number, where k1 is a positive integer and k1 is smaller than n; determining a second bucket number k2 based on the amount of data to be bucketed and the cluster resources, where k2 is the maximum bucket number for that amount of data under the cluster resources, k2 is a positive integer and k2 is smaller than n; determining a target bucket number based on the first bucket number k1 and the second bucket number k2, where the target bucket number is either k1 or k2; and performing the bucketing operation on the original data table according to the determined target bucket number so as to form a plurality of bucket files.
In the embodiments of the present disclosure, the dynamic bucketing method dynamically evaluates an appropriate bucket number at run time from the amount of data to be processed in the current round, adapting it to the current cluster resources (e.g. CPU, memory, disk) to achieve efficient computation. This solves the problems of the traditional bucketing approach, where a statically specified bucket number leads to low operating efficiency and to cluster resources being wasted or insufficient, and avoids the risk of program interruption.
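For illustration, the eight operations can be outlined as the following Python sketch; the function and parameter names are assumptions, and cluster_analysis and regression_max_buckets stand for the steps detailed in the sections below:

```python
def dynamic_bucketing(hashes, data_amount, cpu, mem, disk,
                      cluster_analysis, regression_max_buckets):
    # Skeleton of operations S250-S270; S210-S240 supply the inputs
    # (hash data set, data amount, and allocatable cluster resources).
    k1 = cluster_analysis(hashes)                             # S250: first bucket number
    k2 = regression_max_buckets(cpu, mem, disk, data_amount)  # S260: second bucket number
    k_final = min(k1, k2)                                     # S270: target bucket number
    return k_final                                            # S280 then buckets with k_final
```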
Fig. 1 schematically illustrates an application scenario diagram of a dynamic bucketing method according to an embodiment of the present disclosure. It should be noted that fig. 1 is merely an example of a scenario to which embodiments of the present disclosure may be applied, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, the application scenario 100 according to this embodiment may include a plurality of application terminals and application servers. For example, the plurality of application terminals includes an application terminal 101, an application terminal 102, an application terminal 103, and the like. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the application server 105 via the network 104 using the application terminal devices 101, 102, 103 to receive or send messages or the like. Various application programs such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the application terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the dynamic bucketing method provided by the embodiments of the present disclosure may generally be performed by the server 105. Accordingly, the dynamic bucketing apparatus provided by the embodiments of the present disclosure may generally be disposed in the server 105. The dynamic bucketing method may also be performed by a server or server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the dynamic bucketing apparatus may also be disposed in a server or server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The dynamic bucketing method provided by the embodiments of the present disclosure will now be described in detail with reference to the scenario of fig. 1 and to figs. 2 to 5.
Fig. 2 schematically illustrates a flow chart of a dynamic bucketing method according to an embodiment of the present disclosure.
As shown in fig. 2, the dynamic bucketing method 200 of this embodiment may include operations S210 to S280.
It should be noted that in some exemplary embodiments the operations may be performed in the order illustrated. However, embodiments of the present disclosure are not limited thereto; in other embodiments, some steps of the dynamic bucketing method may be performed in parallel, or in a different order than illustrated, where no conflict arises.
In operation S210, an original data table to be bucketed is obtained, where the original data table includes a designated bucket column, the bucket column includes n data values, and n is a positive integer greater than or equal to 2.
In operation S220, the amount of data to be bucketed is obtained from the original data table.
Fig. 3 is a flow chart of acquiring the data amount in a method according to some example embodiments of the present disclosure.
In an embodiment of the present disclosure, acquiring the data amount 300 includes operations S310 to S330.
In operation S310, the size of one row of data in the original data table is acquired.
For example, the size of each field is obtained: the int type occupies 4 bytes, the double type 8 bytes, the varchar(10) type 10 bytes, and so on. The sizes occupied by all the fields of one row are summed to obtain the size of one row of data.
In operation S320, the number of data rows of the original data table is acquired.
For example, a counter is initialized to 0 and the full table is scanned, incrementing the counter by 1 for each row; when the last row has been scanned, the count is complete.
In operation S330, the size of one row of data is multiplied by the number of rows to obtain the data amount.
For example, after the acquisition instruction, the row size, and the row count are received, the total data size is obtained by multiplying the row size by the row count.
In the embodiments of the present disclosure, obtaining the data amount allows the target bucket number to adapt to the data volume.
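For illustration, a minimal sketch of operations S310 to S330, assuming a simple textual schema and the per-type byte sizes of the example above:

```python
# Sketch: total data volume = row size x row count (S310-S330).
TYPE_SIZES = {"int": 4, "double": 8}             # bytes per fixed-width type

def field_size(col_type: str) -> int:
    if col_type.startswith("varchar("):          # e.g. varchar(10) -> 10 bytes
        return int(col_type[len("varchar("):-1])
    return TYPE_SIZES[col_type]

def data_volume(schema: list[str], rows) -> int:
    row_size = sum(field_size(t) for t in schema)   # S310: sum the field sizes
    row_count = sum(1 for _ in rows)                # S320: scan and count rows
    return row_size * row_count                     # S330: multiply
```

For example, data_volume(["int", "double", "varchar(10)"], range(1000)) returns 22 * 1000 = 22000 bytes.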
Referring back to fig. 2, in operation S230, cluster resources are acquired, including the computing and storage resources that can be allocated to serve the bucketing operation on the original data table.
Fig. 4 is a flow chart of acquiring cluster resources in a method according to some example embodiments of the present disclosure.
In an embodiment of the present disclosure, obtaining the cluster resources 400 includes operations S410 to S430.
In operation S410, the CPU resources that can be allocated to the bucketing operation on the original data table are acquired, including dominant CPU resources and contendable CPU resources.
In operation S420, the memory resources that can be allocated to the bucketing operation are acquired, including dominant memory resources and contendable memory resources.
In operation S430, the disk resources that can be allocated to the bucketing operation are acquired, including available disk resources.
In the embodiments of the present disclosure, the cluster resources are acquired so that the target bucket number can adapt to the CPU, memory, and disk resources.
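How these resources are queried depends on the cluster manager (e.g. YARN), so the sketch below only fixes the shape of the result; the probe and the numbers are placeholders, not a real API:

```python
from dataclasses import dataclass

@dataclass
class ClusterResources:
    cpu_cores: float    # allocatable CPU (dominant + contendable), in cores
    memory_mb: float    # allocatable memory, in MB
    disk_mb: float      # available disk, in MB

def get_cluster_resources() -> ClusterResources:
    # Placeholder values: a real implementation would query the cluster
    # manager's metrics interface (implementation-specific, not shown).
    return ClusterResources(cpu_cores=32.0, memory_mb=65536.0, disk_mb=1048576.0)
```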
Referring back to fig. 2, in operation S240, hash values of the n data values of the bucket column of the original data table are acquired to form a hash data set.
For example, the K-Means-based algorithm sets three initial thresholds: the minimum number of samples per class n_min, the maximum variance threshold σ, and the minimum allowed distance between two cluster centers d_min.
The data values of the attribute column to be bucketed are converted to hash values, for example by applying hash_func(bucket_column), to obtain a hash data set Data.
In operation S250, cluster analysis is performed on the n hash values in the hash data set to divide them into k1 classes, and k1 is determined as the first bucket number, where k1 is a positive integer and k1 is smaller than n.
Fig. 5 is a flow chart of cluster analysis in a method according to some exemplary embodiments of the present disclosure.
In an embodiment of the present disclosure, performing cluster analysis 500 on the n hash values in the hash data set to divide them into k1 classes specifically includes:
In operation S510, the Euclidean distance between every two of the n hash values in the hash data set is calculated to obtain a distance matrix, where the distance matrix is a matrix of n rows and n columns.
For example, the pairwise distances between hash values are computed, e.g. by distance_func(hash_value_i, hash_value_j), to obtain an n×n distance matrix D, where D(i, i) = 0.
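For illustration, a NumPy sketch of operation S510; since the hash values are scalars, the Euclidean distance reduces to an absolute difference:

```python
import numpy as np

def distance_matrix(hashes: np.ndarray) -> np.ndarray:
    # Pairwise Euclidean distance between scalar hash values:
    # D[i, j] = |h_i - h_j|, an n x n matrix with D[i, i] = 0.
    h = hashes.astype(float)
    return np.abs(h[:, None] - h[None, :])
```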
In operation S520, the clustering step: a plurality of cluster centers are determined; for each datum in the hash data set, its distances to the cluster centers are read from the distance matrix and it is assigned to the class of the nearest cluster center, forming a plurality of classes.
For example, for each sample hash_value_i in the data set, its distances to the k0 cluster centers are read from the distance matrix D, and the sample is assigned to the class of the nearest center.
In some exemplary embodiments, the cluster analysis further specifically includes:
when the clustering step is executed for the first time, randomly selecting k0 data from the hash data set as initial cluster centers, where k0 is a preset number of cluster centers, k0 is a positive integer and k0 is smaller than k1.
For example, k0 samples are randomly selected from the data set Data as initial cluster centers, forming the set of cluster centers C = {c_1, c_2, ..., c_k0}.
Presetting the number of initial cluster centers reduces the number of iterations as far as possible and helps to set a reasonable termination condition for the iteration.
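A hedged sketch of the initial-center selection and the assignment of S520; for simplicity it recomputes distances to the centers directly rather than reading the precomputed sample-to-sample matrix D (once centers are recomputed as means they are no longer samples), and the seed is illustrative:

```python
import numpy as np

def init_centers(hashes: np.ndarray, k0: int, seed: int = 0) -> np.ndarray:
    # Randomly select k0 samples from the hash data set as initial centers.
    rng = np.random.default_rng(seed)
    return hashes[rng.choice(len(hashes), size=k0, replace=False)].astype(float)

def assign(hashes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    # S520: index of the nearest cluster center for each hash value.
    d = np.abs(hashes[:, None].astype(float) - centers[None, :])
    return np.argmin(d, axis=1)
```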
In some exemplary embodiments, in iteratively performing the clustering step, if the number of classes formed is less than or equal to one half of k0, then a splitting sub-step is performed.
In the process of iteratively executing the clustering step, if the number of formed classes is greater than or equal to twice k0, executing a merging sub-step.
For example, if the current number of classes p1 satisfies p1 ≤ k0/2, the number of classes is too small and the splitting sub-algorithm is executed. If the current p1 ≥ 2·k0, the number of classes is too large and the merging sub-algorithm is executed.
In some exemplary embodiments, the splitting sub-step includes:
the variance of each of the plurality of classes is calculated to form a plurality of variances.
For example, the variance σ of all samples under each dimension for each category is calculated.
The maximum variance is obtained from the plurality of variances.
For example, the maximum variance σ_max is selected from all the class variances.
If the maximum variance is greater than a preset variance threshold and the number of data in the class corresponding to the maximum variance is greater than or equal to 2 times of the sample number threshold in the class, splitting the class corresponding to the maximum variance into 2 classes.
In some exemplary embodiments, the splitting the class corresponding to the largest variance into 2 classes includes: and determining the cluster centers of the split 2 classes based on the cluster center of the class corresponding to the maximum variance and the maximum variance.
For example, if a class has σ_max > σ and contains n_i ≥ 2·n_min samples, a splitting operation may be performed: the class meeting the condition is split into two subclasses and p1 = p1 + 1, where the centers of the two subclasses are derived from the center m_i of the class before splitting and from σ_max (in the usual ISODATA formulation, m_i ± γ·σ_max for a splitting coefficient 0 < γ ≤ 1). If the above condition is not satisfied, the splitting operation exits.
The splitting step avoids classes whose data are too widely dispersed.
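A hedged sketch of the splitting sub-step; the offset m_i ± γ·σ_max and the membership split at the old center are assumptions following the usual ISODATA formulation, since the disclosure leaves these details to the figures:

```python
import numpy as np

def try_split(classes, centers, var_thresh, n_min, gamma=0.5):
    # classes: list of 1-D arrays of hash values; centers: list of floats.
    variances = [float(np.var(c)) for c in classes]
    i = int(np.argmax(variances))                 # class with the largest variance
    if variances[i] > var_thresh and len(classes[i]) >= 2 * n_min:
        m = centers[i]
        spread = gamma * variances[i] ** 0.5      # assumed ISODATA-style offset
        left = classes[i][classes[i] <= m]        # membership split is an assumption
        right = classes[i][classes[i] > m]
        classes[i:i + 1] = [left, right]          # p1 = p1 + 1
        centers[i:i + 1] = [m - spread, m + spread]
    return classes, centers
```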
In some exemplary embodiments, the merging sub-step includes:
and comparing the Euclidean distance between the clustering centers of any two classes in the plurality of classes with a preset distance threshold.
If the Euclidean distance between the cluster centers of two classes is smaller than the distance threshold value, combining the two classes into a new class.
In some exemplary embodiments, merging the two classes into a new class comprises: determining the cluster center of the new class based on the cluster centers of the two classes and the numbers of data in the two classes.
For example, any two classes i and j (i ≠ j) with D(i, j) < d_min in the distance matrix need to be merged into a new class, whose cluster center is m_new = (n_i·m_i + n_j·m_j) / (n_i + n_j), where n_i and n_j are the numbers of samples in the two classes and m_i and m_j their centers; the new cluster center is thus a weighted mean of the two old centers. If one class contains more samples, the new center is biased toward it.
The merging step avoids the situation where closely concentrated data are spread over two classes.
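A matching sketch of the merging sub-step, using the weighted-mean formula above; it merges one qualifying pair per call:

```python
import numpy as np

def try_merge(classes, centers, d_min):
    # Merge the first pair of classes whose centers are closer than d_min.
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if abs(centers[i] - centers[j]) < d_min:
                n_i, n_j = len(classes[i]), len(classes[j])
                # Weighted mean of the two old centers (formula above).
                m_new = (n_i * centers[i] + n_j * centers[j]) / (n_i + n_j)
                merged = np.concatenate([classes[i], classes[j]])
                del classes[j], centers[j]        # remove j first, i stays valid
                classes[i], centers[i] = merged, m_new
                return classes, centers
    return classes, centers
```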
In some exemplary embodiments, the splitting and merging steps help keep the amount of data in each class within a reasonable range, and keep the distance of the data in each class from its cluster center within a reasonable range.
Through the above operations, the number of clusters can be determined from the clustering characteristics of the bucket column data, so that the computed bucket number suits those characteristics.
In an embodiment of the present disclosure, the cluster analysis further specifically includes:
in operation S530, the number of data in the j-th class of the plurality of classes is compared with a preset threshold of the number of samples in the class, j being a positive integer and j being less than or equal to k1.
For example, it is determined whether the number of data in each class is smaller than the sample count threshold n_min.
In operation S540, in response to the number of data in the j-th class being less than the sample number threshold in the class, the j-th class is removed and the data in the j-th class is repartitioned into the remaining classes.
For example, if the count is less than the sample count threshold n_min, the class is discarded, k1 = k1 - 1, and the samples of the discarded class are each reassigned to the nearest of the remaining classes.
Through this operation, classes with too few data are avoided, and the amount of data in each class is kept within a reasonable range.
In an embodiment of the present disclosure, the cluster analysis further specifically includes:
in operation S550, the cluster centers of the respective classes are recalculated based on the data in each of the formed respective classes.
For example, for each class c_i, its cluster center is recalculated as m_i = (1/|c_i|) Σ_{x∈c_i} x, i.e. the mean of all samples belonging to the class, where x ranges over the hash values of the data in the class.
In operation S560, the clustering step is iteratively performed based on the recalculated cluster centers of the respective classes.
For example, if the recalculated cluster centers no longer change (m_i' = m_i for every class i), the process terminates; otherwise execution returns to S520.
Through this operation, the computed cluster centers gradually fit the clustering characteristics of the bucket column data by iteration.
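Putting the pieces together, one possible iteration loop (a simplified sketch reusing init_centers, assign, try_split, and try_merge from above; degenerate cases such as empty classes are not handled):

```python
import numpy as np

def cluster_hashes(hashes, k0, n_min, var_thresh, d_min, max_iter=50):
    # hashes: 1-D np.ndarray of hash values. Returns the surviving class
    # count, which the method takes as the first bucket number k1.
    centers = [float(c) for c in init_centers(hashes, k0)]
    for _ in range(max_iter):
        labels = assign(hashes, np.array(centers))
        classes = [hashes[labels == i] for i in range(len(centers))]
        # S530-S540: drop classes with fewer than n_min samples, reassign.
        keep = [i for i, c in enumerate(classes) if len(c) >= n_min]
        centers = [centers[i] for i in keep]
        labels = assign(hashes, np.array(centers))
        classes = [hashes[labels == i] for i in range(len(centers))]
        # Split when there are too few classes, merge when too many.
        if len(classes) <= k0 // 2:
            classes, centers = try_split(classes, centers, var_thresh, n_min)
        elif len(classes) >= 2 * k0:
            classes, centers = try_merge(classes, centers, d_min)
        # S550: recompute centers as class means; S560: stop when stable.
        new_centers = [float(np.mean(c)) for c in classes if len(c)]
        if len(new_centers) == len(centers) and np.allclose(new_centers, centers):
            break
        centers = new_centers
    return len(centers)
```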
Referring back to fig. 2, in operation S260, a second bucket number k2 is determined based on the amount of data to be bucketed and the cluster resources, where k2 is the maximum bucket number for that amount of data under the cluster resources, k2 is a positive integer and k2 is smaller than n.
In an embodiment of the present disclosure, determining the second bucket number k2 based on the amount of data to be bucketed and the cluster resources specifically includes:
determining the second bucket number k2 with a multiple linear regression model from the amount of data to be bucketed and the CPU, memory, and disk resources, where the model has been trained in advance on historical stress test data.
For example: the first step determines whether the algorithm model has already been built; if so, skip to the fourth step for evaluation, otherwise execute the second step to build the model.
In the second step, algorithm modeling is performed: a four-variable linear regression model y = f(x) = ω_0 + ω_1·x_1 + ω_2·x_2 + ω_3·x_3 + ω_4·x_4 is built, where x_1 is the cluster CPU, x_2 the memory, x_3 the disk, and x_4 the data volume; ω is the weight matrix, with ω_0 a constant term and ω_1, ω_2, ω_3, ω_4 the weights of CPU, memory, disk, and data volume respectively.
In the third step, the maximum bucket number matrix Y and the independent variable matrix X (cluster CPU, memory, disk, data volume) are assembled from the historical stress test data of past bucketing runs; the weight matrix is obtained as ω = (XᵀX)⁻¹XᵀY, which determines the regression model y = f(x) = ω_0 + ω_1·x_1 + ω_2·x_2 + ω_3·x_3 + ω_4·x_4.
In the fourth step, the CPU, memory, and disk currently usable by the cluster and the amount of data to be bucketed in the current round are obtained and taken as x_1, x_2, x_3, x_4 respectively, and the fitted regression model is used to calculate the maximum bucket number k2 for this round.
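For illustration, a NumPy sketch of the modeling and evaluation steps; lstsq solves the same normal equation ω = (XᵀX)⁻¹XᵀY but is numerically more stable:

```python
import numpy as np

def fit_max_bucket_model(X_hist, y_hist):
    # X_hist: one row [cpu, memory, disk, data_volume] per historical stress
    # test; y_hist: the observed maximum bucket numbers Y.
    X = np.hstack([np.ones((len(X_hist), 1)), np.asarray(X_hist, float)])
    omega, *_ = np.linalg.lstsq(X, np.asarray(y_hist, float), rcond=None)
    return omega                           # [w0, w1, w2, w3, w4]

def predict_k2(omega, cpu, mem, disk, volume):
    y = float(omega @ np.array([1.0, cpu, mem, disk, volume]))
    return max(1, int(y))                  # bucket number is a positive integer
```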
In operation S270, a target bucket number is determined based on the first bucket number k1 and the second bucket number k2, where the target bucket number is either k1 or k2.
In an embodiment of the disclosure, determining the target bucket number based on the first bucket number k1 and the second bucket number k2 specifically includes:
if the first bucket number k1 is greater than the second bucket number k2, determining k2 as the target bucket number; and
if k1 is less than or equal to k2, determining k1 as the target bucket number.
For example, it is determined whether the optimal bucket number k1 is greater than the maximum bucket number k2; if so, K_final = k2 is taken, otherwise K_final = k1.
In the embodiments of the present disclosure, the clustering characteristics of the bucketed data yield the optimized bucket number, while the cluster resources and the data volume constrain the maximum bucket number; balancing the two allows the target bucket number to be computed efficiently while adapting to the current cluster resources (e.g. CPU and memory) and data volume.
In operation S280, the bucketing operation is performed on the original data table according to the determined target bucket number, so as to form a plurality of bucket files.
Fig. 11 schematically illustrates the bucketing operation in a dynamic bucketing method according to an embodiment of the present disclosure.
Specifically: the data values of the attribute column to be bucketed are hashed, e.g. by hash_func(bucket_column), to obtain hash_value.
The bucket number to which each record is distributed is calculated as bucket_num = hash_value % K_final, as shown in fig. 11.
The data to be computed are then distributed into the data bucket with that bucket number.
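For illustration, the distribution step reduces to one modulo per record; a minimal sketch (the in-memory lists stand in for the bucket files):

```python
def bucketize(records, bucket_column, k_final, hash_func=hash):
    # bucket_num = hash_value % K_final, one output file per bucket.
    buckets = [[] for _ in range(k_final)]
    for rec in records:
        hash_value = hash_func(rec[bucket_column])
        buckets[hash_value % k_final].append(rec)
    return buckets   # in practice each list i is written to bucket file i
```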
In the embodiments of the present disclosure, an appropriate bucket number is evaluated and injected into the program before computation, taking the volume of the data to be computed and the available cluster resources as references. This solves the problems of low running efficiency and wasted or insufficient cluster resources caused by bucketing parameters that do not match the computational demands of the actual data volume or the actual cluster resources, avoids the potential risk of program interruption, and lets the bucketing program use the current cluster's CPU, memory, and other resources to achieve the most reasonable bucket division and run at the highest efficiency.
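On a Hive-like platform, the computed K_final would then be injected into the table DDL at run time rather than hard-coded; a hedged sketch (table and column names are illustrative, and the exact DDL syntax depends on the platform):

```python
def bucketed_table_ddl(table: str, bucket_column: str, k_final: int) -> str:
    # K_final is a runtime value, not a literal hard-coded in the source.
    # Column definitions are elided; the names here are illustrative.
    return (
        f"CREATE TABLE {table}_bucketed (...) "
        f"CLUSTERED BY ({bucket_column}) INTO {k_final} BUCKETS"
    )
```

This is precisely the difference from the static scheme criticized in the background: the bucket count becomes a runtime parameter instead of a constant in the source code.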
Based on the above dynamic bucketing method, an embodiment of the present disclosure also provides a dynamic bucketing apparatus, described in detail below in connection with fig. 6.
Fig. 6 schematically illustrates a block diagram of a dynamic bucketing apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the dynamic bucketing apparatus 600 of this embodiment includes a first acquisition module 610, a second acquisition module 620, a third acquisition module 630, a fourth acquisition module 640, an analysis module 650, a first determination module 660, a second determination module 670, and an execution module 680.
The first acquisition module 610 may be configured to obtain an original data table to be bucketed, wherein the original data table comprises a designated bucket column, the bucket column comprises n data values, and n is a positive integer greater than or equal to 2. In an embodiment, the first acquisition module 610 may be configured to perform operation S210 described above, which is not described herein.
The second acquisition module 620 may be configured to obtain the amount of data to be bucketed according to the original data table. In an embodiment, the second acquisition module 620 may be configured to perform operation S220 described above, which is not described herein.
The third acquisition module 630 may be configured to obtain cluster resources, the cluster resources comprising the computing and storage resources that can be allocated to serve the bucketing operation on the original data table. In an embodiment, the third acquisition module 630 may be configured to perform operation S230 described above, which is not described herein.
The fourth acquisition module 640 may be configured to obtain hash values of the n data values of the bucket column of the original data table, so as to form a hash data set. In an embodiment, the fourth acquisition module 640 may be configured to perform operation S240 described above, which is not described herein.
The analysis module 650 may be configured to perform cluster analysis on the n hash values in the hash data set to divide them into k1 classes, and determine k1 as the first bucket number, where k1 is a positive integer and k1 is smaller than n. In an embodiment, the analysis module 650 may be configured to perform operation S250 described above, which is not described herein.
The first determination module 660 may be configured to determine a second bucket number k2 based on the amount of data to be bucketed and the cluster resources, where k2 is the maximum bucket number for that amount of data under the cluster resources, k2 is a positive integer and k2 is smaller than n. In an embodiment, the first determination module 660 may be configured to perform operation S260 described above, which is not described herein.
The second determination module 670 may be configured to determine a target bucket number based on the first bucket number k1 and the second bucket number k2, where the target bucket number is either k1 or k2. In an embodiment, the second determination module 670 may be configured to perform operation S270 described above, which is not described herein.
The execution module 680 may be configured to perform the bucketing operation on the original data table according to the determined target bucket number, so as to form a plurality of bucket files. In an embodiment, the execution module 680 may be configured to perform operation S280 described above, which is not described herein.
Fig. 7 schematically illustrates a block diagram of the second acquisition module in a dynamic bucketing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the second acquisition module 620 includes a fifth acquisition module 710, a sixth acquisition module 720, and a seventh acquisition module 730.
The fifth acquisition module 710 may be configured to obtain the size of one row of data in the original data table. In an embodiment, the fifth acquisition module 710 may be configured to perform operation S310 described above, which is not described herein.
The sixth acquisition module 720 may be configured to obtain the number of data rows of the original data table. In an embodiment, the sixth acquisition module 720 may be configured to perform operation S320 described above, which is not described herein.
The seventh acquisition module 730 may be configured to multiply the size of one row of data by the number of rows to obtain the data amount. In an embodiment, the seventh acquisition module 730 may be configured to perform operation S330 described above, which is not described herein.
Fig. 8 schematically illustrates a block diagram of the third acquisition module in a dynamic bucketing apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the third acquisition module 630 includes an eighth acquisition module 810, a ninth acquisition module 820, and a tenth acquisition module 830.
The eighth acquisition module 810 may be configured to obtain the CPU resources that can be allocated to the bucketing operation on the original data table, including dominant CPU resources and contendable CPU resources. In an embodiment, the eighth acquisition module 810 may be configured to perform operation S410 described above, which is not described herein.
The ninth acquisition module 820 may be configured to obtain the memory resources that can be allocated to the bucketing operation, including dominant memory resources and contendable memory resources. In an embodiment, the ninth acquisition module 820 may be configured to perform operation S420 described above, which is not described herein.
The tenth acquisition module 830 may be configured to obtain the disk resources that can be allocated to the bucketing operation, including available disk resources. In an embodiment, the tenth acquisition module 830 may be configured to perform operation S430 described above, which is not described herein.
Fig. 9 schematically illustrates a block diagram of the analysis module in a dynamic bucketing apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the analysis module 650 includes a first calculation module 910, a clustering module 920, a comparison module 930, a partitioning module 940, a second calculation module 950, and an iteration module 960.
The first calculation module 910 may be configured to calculate the Euclidean distance between any two of the n hash values in the hash data set to obtain a distance matrix, where the distance matrix has n rows and n columns. In an embodiment, the first calculation module 910 may be configured to perform operation S510 described above, which is not repeated here.
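A minimal sketch of building the n-by-n distance matrix, treating each hash value as a numeric vector (the sample values are illustrative):

```python
import numpy as np

def distance_matrix(hashes: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between hash values (n rows, n columns)."""
    h = hashes.reshape(len(hashes), -1).astype(float)
    diff = h[:, None, :] - h[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

hashes = np.array([0.12, 0.87, 0.45, 0.91, 0.33])  # illustrative hash values
dist = distance_matrix(hashes)                     # shape (5, 5)
```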
The clustering module 920 may be configured to perform the clustering step: determine a plurality of cluster centers and, for each datum in the hash data set, read its distances to the cluster centers from the distance matrix and assign it to the class of the nearest center, so as to form a plurality of classes. In an embodiment, the clustering module 920 may be configured to perform operation S520 described above, which is not repeated here.
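A minimal sketch of this assignment step, reusing the precomputed matrix from the sketch above (the choice of center rows is illustrative):

```python
import numpy as np

def assign_to_classes(dist: np.ndarray, center_idx: list) -> np.ndarray:
    """Assign each datum to the class of its nearest cluster center.

    `dist` is the n-by-n distance matrix; `center_idx` lists the rows that
    currently act as cluster centers. Returns, for each datum, the position
    within `center_idx` of its nearest center.
    """
    return dist[:, center_idx].argmin(axis=1)

# Illustrative usage with three data points and centers at rows 0 and 2:
dist = np.array([[0.0, 0.5, 0.9],
                 [0.5, 0.0, 0.4],
                 [0.9, 0.4, 0.0]])
labels = assign_to_classes(dist, [0, 2])   # -> array([0, 1, 1])
```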
The comparison module 930 may be configured to compare the number of data in the j-th of the plurality of classes with a preset in-class sample number threshold, where j is a positive integer smaller than or equal to k1. In an embodiment, the comparison module 930 may be configured to perform operation S530 described above, which is not repeated here.
The partitioning module 940 may be configured to, in response to the number of data in the j-th class being less than the in-class sample number threshold, remove the j-th class and redistribute its data into the remaining classes. In an embodiment, the partitioning module 940 may be configured to perform operation S540 described above, which is not repeated here.
The second calculation module 950 may be configured to recalculate the cluster center of each class based on the data in that class. In an embodiment, the second calculation module 950 may be configured to perform operation S550 described above, which is not repeated here.
The iteration module 960 may be configured to iteratively perform the clustering step based on the recalculated cluster centers. In an embodiment, the iteration module 960 may be used to perform operation S560 described above, which is not repeated here.
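Taken together, modules 920 through 960 describe an ISODATA-style loop: assign, prune undersized classes, recompute centers, and, per the splitting and merging sub-steps of claims 6-10 below, adjust the class count. A hedged single-pass sketch over scalar hashes follows; every threshold value is an assumption, and the sketch assumes at least one class survives pruning:

```python
import numpy as np

def refine_once(hashes, centers, min_samples=5, var_threshold=0.05,
                merge_dist=0.01, k0=8):
    """One ISODATA-style pass; `hashes` and `centers` are 1-D float arrays."""
    def assign(c):
        return np.abs(hashes[:, None] - c[None, :]).argmin(axis=1)

    labels = assign(centers)
    # Prune classes smaller than the in-class sample threshold, then
    # redistribute their members among the surviving centers.
    keep = [c for c in range(len(centers)) if (labels == c).sum() >= min_samples]
    centers = centers[keep]
    labels = assign(centers)
    # Recompute each center as the mean of its current members.
    centers = np.array([hashes[labels == c].mean() if (labels == c).any()
                        else centers[c] for c in range(len(centers))])
    if len(centers) <= k0 // 2:          # too few classes: try to split one
        variances = np.array([hashes[labels == c].var() if (labels == c).any()
                              else 0.0 for c in range(len(centers))])
        c = variances.argmax()
        if variances[c] > var_threshold and (labels == c).sum() >= 2 * min_samples:
            std = hashes[labels == c].std()
            centers = np.append(np.delete(centers, c),
                                [centers[c] - std, centers[c] + std])
    elif len(centers) >= 2 * k0:         # too many classes: try to merge two
        d = np.abs(centers[:, None] - centers[None, :])
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(d.argmin(), d.shape)
        if d[i, j] < merge_dist:
            ni, nj = (labels == i).sum(), (labels == j).sum()
            merged = (ni * centers[i] + nj * centers[j]) / (ni + nj)
            centers = np.append(np.delete(centers, [i, j]), merged)
    return centers
```

The weighted merge mirrors claim 10 (new center from the two centers and their member counts), and the split mirrors claim 8 (new centers derived from the old center and its spread).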
According to an embodiment of the present disclosure, any of the first acquisition module 610, the second acquisition module 620, the third acquisition module 630, the fourth acquisition module 640, the analysis module 650, the first determination module 660, the second determination module 670, and the execution module 680 may be combined into a single module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the modules listed above may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on chip, a system on substrate, a system on package, or an Application Specific Integrated Circuit (ASIC), or by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by software, hardware, firmware, or any suitable combination of the three. Alternatively, at least one of these modules may be implemented at least partially as a computer program module that, when executed, performs the corresponding functions.
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement the dynamic bucketing method according to an embodiment of the present disclosure.
As shown in fig. 10, an electronic device 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1001 may also include on-board memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to one another through a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the programs may also be stored in one or more memories other than the ROM 1002 and the RAM 1003; the processor 1001 may likewise perform various operations of the method flow by executing programs stored in those memories.
According to an embodiment of the disclosure, the electronic device 1000 may further include an input/output (I/O) interface 1005, which is also connected to the bus 1004. The electronic device 1000 may further include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1002 and/or the RAM 1003 and/or one or more memories other than the ROM 1002 and the RAM 1003 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1001. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, and downloaded and installed via the communication section 1009, and/or installed from the removable medium 1011. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to wireless or wired media, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code of the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, and similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure are described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that measures from different embodiments cannot be advantageously used in combination. The scope of the disclosure is defined by the appended claims and their equivalents. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (18)

1. A dynamic bucketing method, the method comprising:
obtaining an original data table to be bucketed, wherein the original data table comprises a designated bucket column, the bucket column comprises n data values, and n is a positive integer greater than or equal to 2;
acquiring the amount of data to be bucketed according to the original data table to be bucketed;
obtaining cluster resources, wherein the cluster resources comprise computing resources and storage resources that can be allocated to serve the performance of a bucketing operation on the original data table;
obtaining hash values of the n data values of the bucket column of the original data table to form a hash data set;
performing cluster analysis on the n hash values in the hash data set by using a cluster analysis method to divide the n hash values into k1 classes, and determining k1 as a first bucket number, wherein k1 is a positive integer and k1 is smaller than n;
determining a second bucket number k2 based on the amount of data to be bucketed and the cluster resources, wherein the second bucket number k2 is the maximum bucket number that the cluster resources can support for that amount of data, and k2 is a positive integer and k2 is smaller than n;
determining a target bucket number based on the first bucket number k1 and the second bucket number k2, wherein the target bucket number is the first bucket number k1 or the second bucket number k2; and
performing the bucketing operation on the original data table according to the determined target bucket number, so as to form a plurality of bucket files.
2. The dynamic bucketing method according to claim 1, wherein the performing cluster analysis on the n hash values in the hash data set by using a cluster analysis method to divide the n hash values into k1 classes specifically comprises:
calculating the Euclidean distance between any two of the n hash values in the hash data set to obtain a distance matrix, wherein the distance matrix is a matrix of n rows and n columns; and
a clustering step: determining a plurality of cluster centers, and, for each datum in the hash data set, acquiring the distances between the datum and the plurality of cluster centers from the distance matrix and assigning the datum to the class corresponding to the cluster center with the smallest distance, so as to form a plurality of classes.
3. The dynamic bucketing method according to claim 2, wherein the performing cluster analysis on the n hash values in the hash data set by using a cluster analysis method to divide the n hash values into k1 classes further specifically comprises:
comparing the number of data in the j-th of the plurality of classes with a preset in-class sample number threshold, wherein j is a positive integer smaller than or equal to k1; and
in response to the number of data in the j-th class being less than the in-class sample number threshold, removing the j-th class and redistributing the data in the j-th class into the remaining classes.
4. The dynamic bucketing method according to claim 2 or 3, wherein the performing cluster analysis on the n hash values in the hash data set by using a cluster analysis method to divide the n hash values into k1 classes further specifically comprises:
recalculating the cluster center of each class based on the data in that class; and
iteratively performing the clustering step based on the recalculated cluster centers of the classes.
5. The dynamic bucketing method according to claim 4, wherein the performing cluster analysis on the n hash values in the hash data set by using a cluster analysis method to divide the n hash values into k1 classes further specifically comprises:
when the clustering step is performed for the first time, randomly selecting k0 data from the hash data set as initial cluster centers, wherein k0 is a preset number of cluster centers, k0 is a positive integer, and k0 is smaller than k1.
6. The dynamic bucketing method according to claim 5, wherein, in the process of iteratively performing the clustering step, if the number of formed classes is less than or equal to one half of k0, a splitting sub-step is performed; and/or,
in the process of iteratively performing the clustering step, if the number of formed classes is greater than or equal to twice k0, a merging sub-step is performed.
7. The dynamic bucketing method according to claim 6, wherein the splitting sub-step comprises:
calculating the variance of each of the plurality of classes to form a plurality of variances;
acquiring the maximum variance from the plurality of variances; and
if the maximum variance is greater than a preset variance threshold and the number of data in the class corresponding to the maximum variance is greater than or equal to twice the in-class sample number threshold, splitting the class corresponding to the maximum variance into two classes.
8. The dynamic bucketing method according to claim 7, wherein the splitting the class corresponding to the maximum variance into two classes comprises: determining the cluster centers of the two split classes based on the cluster center of the class corresponding to the maximum variance and the maximum variance.
9. The dynamic bucketing method according to claim 6, wherein the merging sub-step comprises:
comparing the Euclidean distance between the cluster centers of any two of the plurality of classes with a preset distance threshold; and
if the Euclidean distance between the cluster centers of two classes is smaller than the distance threshold, merging the two classes into a new class.
10. The dynamic bucketing method according to claim 9, wherein the merging the two classes into a new class comprises: determining the cluster center of the new class based on the cluster centers of the two classes and the number of data in each of the two classes.
11. The dynamic bucketing method according to any one of claims 1 to 3 and 5 to 10, wherein the obtaining cluster resources comprises:
obtaining CPU resources that can be allocated to perform the bucketing operation on the original data table, wherein the CPU resources comprise used CPU resources and available CPU resources;
obtaining memory resources that can be allocated to perform the bucketing operation on the original data table, wherein the memory resources comprise used memory resources and available memory resources; and
obtaining disk resources that can be allocated to perform the bucketing operation on the original data table, wherein the disk resources comprise available disk resources.
12. The dynamic bucketing method according to any one of claims 1 to 3 and 5 to 10, wherein the acquiring of the data volume comprises:
acquiring the size of one row of data in the original data table;
acquiring the number of data rows in the original data table; and
multiplying the size of one row of data by the number of data rows to obtain the data volume.
13. The dynamic bucketing method according to claim 11, wherein the determining a second bucket number k2 based on the amount of data to be bucketed and the cluster resources specifically comprises:
determining the second bucket number k2 by using a multiple linear regression model based on the amount of data to be bucketed, the CPU resources, the memory resources, and the disk resources, wherein the multiple linear regression model is trained in advance on historical load-test data.
14. The dynamic bucketing method according to any one of claims 1 to 3 and 5 to 10, wherein the determining a target bucket number based on the first bucket number k1 and the second bucket number k2 specifically comprises:
if the first bucket number k1 is greater than the second bucket number k2, determining the second bucket number k2 as the target bucket number; and
if the first bucket number k1 is less than or equal to the second bucket number k2, determining the first bucket number k1 as the target bucket number.
15. A dynamic bucketing apparatus, the apparatus comprising:
a first acquisition module, configured to acquire an original data table to be bucketed, wherein the original data table comprises a designated bucket column, the bucket column comprises n data values, and n is a positive integer greater than or equal to 2;
a second acquisition module, configured to acquire the amount of data to be bucketed according to the original data table to be bucketed;
a third acquisition module, configured to acquire cluster resources, wherein the cluster resources comprise computing resources and storage resources that can be allocated to serve the performance of a bucketing operation on the original data table;
a fourth acquisition module, configured to obtain hash values of the n data values of the bucket column of the original data table to form a hash data set;
an analysis module, configured to perform cluster analysis on the n hash values in the hash data set by using a cluster analysis method to divide the n hash values into k1 classes, and to determine k1 as a first bucket number, wherein k1 is a positive integer and k1 is smaller than n;
a first determination module, configured to determine a second bucket number k2 based on the amount of data to be bucketed and the cluster resources, wherein the second bucket number k2 is the maximum bucket number that the cluster resources can support for that amount of data, and k2 is a positive integer and k2 is smaller than n;
a second determination module, configured to determine a target bucket number based on the first bucket number k1 and the second bucket number k2, wherein the target bucket number is the first bucket number k1 or the second bucket number k2; and
an execution module, configured to perform the bucketing operation on the original data table according to the determined target bucket number, so as to form a plurality of bucket files.
16. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-14.
17. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 14.
18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 14.
CN202310318919.4A 2023-03-29 2023-03-29 Dynamic barrel dividing method, device, electronic equipment and medium Pending CN116089367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310318919.4A CN116089367A (en) 2023-03-29 2023-03-29 Dynamic barrel dividing method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310318919.4A CN116089367A (en) 2023-03-29 2023-03-29 Dynamic barrel dividing method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN116089367A true CN116089367A (en) 2023-05-09

Family

ID=86206727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310318919.4A Pending CN116089367A (en) 2023-03-29 2023-03-29 Dynamic barrel dividing method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116089367A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383390A (en) * 2023-06-05 2023-07-04 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform
CN116383390B (en) * 2023-06-05 2023-08-08 南京数策信息科技有限公司 Unstructured data storage method for management information and cloud platform

Similar Documents

Publication Publication Date Title
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
CN108536650B (en) Method and device for generating gradient lifting tree model
US20170083384A1 (en) Systems for parallel processing of datasets with dynamic skew compensation
CN111427971B (en) Business modeling method, device, system and medium for computer system
CN110807129B (en) Method and device for generating multi-layer user relation graph set and electronic equipment
CN107392259B (en) Method and device for constructing unbalanced sample classification model
WO2022043798A1 (en) Automated query predicate selectivity prediction using machine learning models
CN116089367A (en) Dynamic barrel dividing method, device, electronic equipment and medium
Adrian et al. Analysis of K-means algorithm for VM allocation in cloud computing
US20180260361A1 (en) Distributed random binning featurization with hybrid two-level parallelism
CN112016797B (en) KNN-based resource quota adjustment method and device and electronic equipment
US11372846B2 (en) Generating and utilizing pre-allocated storage space
WO2021062219A1 (en) Clustering data using neural networks based on normalized cuts
CN112231299A (en) Method and device for dynamically adjusting feature library
CN111210109A (en) Method and device for predicting user risk based on associated user and electronic equipment
JP2021516812A (en) Judgment of query recognition resiliency in a virtual agent system
US20230061902A1 (en) Intelligent dataset slicing during microservice handshaking
CN110852078A (en) Method and device for generating title
US11238044B2 (en) Candidate data record prioritization for match processing
US11204923B2 (en) Performance for query execution
CN113760550A (en) Resource allocation method and resource allocation device
CN114139059A (en) Resource recommendation model training method, resource recommendation method and device
CN114020469A (en) Edge node-based multi-task learning method, device, medium and equipment
CN113312075A (en) Configuration information issuing method and device, storage medium and processor
CN110764907A (en) Cloud computing resource map construction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination