CN111966495B

CN111966495B - Data processing method and device

Info

Publication number: CN111966495B
Application number: CN202010849131.2A
Authority: CN
Inventors: 李雷孝; 邓丹; 王慧; 王洪彬; 李�杰; 王永生
Original assignee: Inner Mongolia University of Technology
Current assignee: Inner Mongolia University of Technology
Priority date: 2020-08-21
Filing date: 2020-08-21
Publication date: 2022-02-01
Anticipated expiration: 2040-08-21
Also published as: CN111966495A

Abstract

The invention discloses a data processing method and a data processing device. The data processing method comprises the following steps: acquiring a data distribution algorithm of a distributed system for realizing load balance based on particle swarm optimization; calculating the optimal positions of the load-balanced particles in the distributed system according to a data distribution algorithm for realizing load balance based on particle swarm optimization; acquiring a data distribution algorithm for realizing optimized storage based on particle swarm optimization of a distributed system; and optimizing the storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for load balancing. The invention solves the technical problems of long calculation time and low efficiency caused by the fact that the existing full-comparison calculation research adopts a branch-and-bound method to complete data distribution of full-comparison calculation.

Description

Data processing method and device

Technical Field

The invention relates to the field of computer technology application, in particular to a data processing method and device.

Background

Full comparison computation is a typical computation model for problems in which a comparison must be carried out between every pair of data files. As a special class of computation, it occurs frequently in many disciplines, such as bioinformatics, biometrics, traditional machine learning, natural language processing, and traffic big data. In bioinformatics, nucleic acid sequence alignment and protein sequence alignment are typical full comparison calculations. In biometrics, a common task is face recognition, and full comparison calculation also appears in fingerprint comparison. In traditional machine learning, full comparison calculations appear in classification and clustering algorithms in the form of similarity matrices. In recent years natural language processing has attracted wide attention in artificial intelligence; semantic similarity calculation is a key step in natural language processing, and its computation mode is also full comparison calculation. In traffic big data, path planning is a long-standing research hotspot, and computing the distance between any two places in a data set to form a distance matrix is another typical full comparison calculation.

Scholars at home and abroad have long studied full comparison calculation, and it remains a research hotspot. Some researchers abroad replicate all of the data required for the full comparison task to every compute node in the distributed cluster. This distribution method suits small data volumes, but causes serious network congestion and wasted storage space when faced with massive data. Others use the Hadoop Distributed File System (HDFS) to store the data needed by the full comparison task. HDFS adopts a distributed replica storage scheme with a default replication factor of 3. Although this storage method saves storage space, it cannot guarantee complete data localization when comparison tasks are executed. Chaudhary et al. built a heterogeneous computing platform for analyzing biological sequences. To balance the load of the whole system they distributed tasks according to the hardware configuration of the nodes, and for data distribution they partitioned a database and distributed the parts to the nodes; although the computation ran on the heterogeneous platform, requesting data from other nodes in the cluster remained unavoidable. In the related art, a graph covering approach has been used for data distribution of full comparison calculation, but it cannot be applied when the number of data files differs from the number of nodes. Another prior art uses a branch-and-bound method to complete data distribution of full comparison calculation; although an optimized data distribution scheme can be obtained, it comes at the cost of additional solution time.

No effective solution has yet been proposed for the problems of long calculation time and low efficiency that arise because existing full comparison calculation research uses a branch-and-bound method to complete data distribution of full comparison calculation.

Disclosure of Invention

The embodiment of the invention provides a data processing method and a data processing device, which are used for at least solving the technical problems of long calculation time and low efficiency caused by the fact that the existing full-comparison calculation research adopts a branch-and-bound method to complete data distribution of full-comparison calculation.

According to an aspect of an embodiment of the present invention, there is provided a data processing method including: acquiring a data distribution algorithm of a distributed system for realizing load balance based on particle swarm optimization; calculating the optimal positions of the load-balanced particles in the distributed system according to a data distribution algorithm for realizing load balance based on particle swarm optimization; acquiring a data distribution algorithm for realizing optimized storage based on particle swarm optimization of a distributed system; and optimizing the storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for load balancing.

Optionally, acquiring the data distribution algorithm of the distributed system for realizing load balance based on particle swarm optimization includes: acquiring initialized particle swarm parameters; and optimizing the preset calculation formula according to the initialized particle swarm parameters to obtain a data distribution algorithm for realizing load balance based on particle swarm optimization.

Further, optionally, the initialized particle swarm parameters include: maximum iteration number, particle population size, particle dimension, inertia weight, first acceleration coefficient, second acceleration coefficient and particle velocity.

Optionally, the particle dimension is a product of the number of nodes in the distributed system and the number of comparison tasks.

Optionally, the first acceleration coefficient and the second acceleration coefficient are respectively used for adjusting the maximum step length of the flight in the direction of the individual optimal position and the global optimal position.

Optionally, calculating the optimal positions of the load balancing particles in the distributed system according to a data distribution algorithm for realizing load balancing based on particle swarm optimization includes: performing iterative computation on the data distribution algorithm for realizing load balancing based on particle swarm optimization according to a preset update rule to obtain the optimal positions of the load balancing particles in the distributed system; wherein the preset update rule includes: an inertia weight update rule, a particle flight velocity update rule, and a particle position update rule.

Optionally, optimizing the storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and a particle optimal position for load balancing includes: initializing according to the optimal particle position, the number of nodes, a file size list, a task list and a node calculation amount of load balancing to obtain initialized particle swarm parameters; performing iterative optimization according to initialized particle swarm parameters, and updating the positions of the particles when a position updating rule is met; and recording the fitness corresponding to the optimal position obtained by each iteration, and coding the iteration records to obtain an optimization scheme of the storage space required by each node in the distributed system.

According to an aspect of an embodiment of the present invention, there is provided a data processing apparatus including: the first acquisition module is used for acquiring a data distribution algorithm of a distributed system for realizing load balance based on particle swarm optimization; the first calculation module is used for calculating the optimal positions of the load-balanced particles in the distributed system according to a data distribution algorithm for realizing load balance based on particle swarm optimization; the second acquisition module is used for acquiring a data distribution algorithm of the distributed system for realizing optimized storage based on particle swarm optimization; and the second calculation module is used for optimizing the storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for load balancing.

Optionally, the first obtaining module includes: the acquisition unit is used for acquiring initialization particle swarm parameters; and the optimization unit is used for optimizing the preset calculation formula according to the initialized particle swarm parameters to obtain a data distribution algorithm for realizing load balance based on particle swarm optimization.

Optionally, the first calculation module includes: the first computing unit is used for performing iterative computation on a data distribution algorithm for realizing load balancing based on particle swarm optimization according to a preset update rule to obtain the optimal position of the load balancing particles in the distributed system; wherein the preset update rule includes: an inertia weight update rule, a particle flight velocity update rule, and a particle position update rule.

Optionally, the second calculating module includes: the initialization unit is used for initializing according to the optimal particle position, the number of nodes, a file size list, a task list and node calculation amount of load balancing to obtain initialized particle swarm parameters; the updating unit is used for carrying out iterative optimization according to the initialized particle swarm parameters and updating the positions of the particles when a position updating rule is met; and the second calculation unit is used for recording the fitness corresponding to the optimal position obtained by each iteration and obtaining the optimization scheme of the storage space required by each node in the distributed system by encoding the iteration records.

In the embodiment of the invention, a data distribution algorithm for realizing load balance based on particle swarm optimization of a distributed system is obtained; calculating the optimal positions of the load-balanced particles in the distributed system according to a data distribution algorithm for realizing load balance based on particle swarm optimization; acquiring a data distribution algorithm for realizing optimized storage based on particle swarm optimization of a distributed system; according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for load balancing, the storage space required by each node in a distributed system is optimized, the technical effects of shortening the calculation time and improving the calculation efficiency are realized, and the technical problems of long calculation time and low efficiency caused by the fact that the existing full-comparison calculation research adopts a branch-and-bound method to complete data distribution of full-comparison calculation are solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of a full comparison calculation of 5 files in accordance with an embodiment of the present invention;

FIG. 2 is a flow diagram of a data processing method according to an embodiment of the invention;

FIG. 3 is a schematic diagram of Hadoop experimental storage savings in a data processing method according to an embodiment of the invention;

FIG. 4 is a schematic diagram of a Hadoop experimental data localization case in a data processing method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a scenario 1 load balancing scenario in a data processing method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a scenario 2 load balancing scenario in a data processing method according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a scenario 3 load balancing scenario in a data processing method according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a scenario 4 load balancing scenario in a data processing method according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the embodiment of the application, a comparison must occur exactly once between every two data files in the data set of the full comparison calculation. In a distributed cluster with m data files and n nodes, the specific comparison algorithm is denoted C(i, j), where i and j are two data files in the data set. The full comparison calculation can be formally described as:

M_i,j＝{C(i,j)|i＜j,i＝1,2,...,m-1,j＝2,3,...,m} (1)

where M_i,j is the result of the comparison operation C(i, j), and all M_i,j together constitute the final result of the full comparison calculation. The relationship between the comparison tasks and the related data files when m = 5 is shown in FIG. 1. The files are numbered f₁, f₂, f₃, f₄, f₅ in sequence. Each marked element in the figure corresponds to one comparison task of the full comparison calculation; the column number is the number of the first input file of the comparison, and the row number is the number of the second input file. The circled element in the figure represents the first comparison task of the full comparison calculation, whose first input file is f₁ and whose second input file is f₂; the result of this comparison operation is denoted M_1,2.
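The task set of formula (1) can be enumerated directly. The following is a minimal Python sketch, not part of the original disclosure; the function name `full_comparison_tasks` is a hypothetical helper chosen for illustration, and files are numbered 1 to m as in FIG. 1.

```python
from itertools import combinations

def full_comparison_tasks(m):
    """Return every pair (i, j) with 1 <= i < j <= m, one per comparison task C(i, j)."""
    return list(combinations(range(1, m + 1), 2))

tasks = full_comparison_tasks(5)
print(tasks[0])    # (1, 2): the first task compares f1 with f2
print(len(tasks))  # 10 comparison tasks for m = 5
```

Each unordered pair appears once, which matches the requirement that every two data files are compared exactly once.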

The embodiment of the application addresses data distribution in a homogeneous distributed system, in which the software and hardware configuration of every node is assumed to be identical. In nucleic acid sequence alignment, the data files have the same or approximately the same size. Assume that the data set contains m data files and that data file i has size s_i, i = 1, 2, ..., m. The condition that the data files have the same or approximately the same size is formally described as:

s_1≈s_2≈...≈s_m (2)

The data distribution work of the full comparison calculation needs to balance the load among the nodes and reduce the overall storage space used by the distributed system. At the same time, every comparison task must use only local data, and the storage space required on each computing node should be reduced as far as possible.

When two data files are compared, the amount of computation of the comparison is proportional to the sum of the sizes of the two files. Let c_ij denote the amount of computation of the comparison task C(i, j). Then c_ij is related to data file i and data file j as follows:

s_i+s_j∝c_ij i＜j,i＝1,2,...,m-1,j＝2,3,...,m (3)

From formula (2) and formula (3), the values of all c_ij are the same or approximately the same. Suppose that K comparison tasks are distributed to node p and that the calculation amount of each task is c_ij, where i and j are the numbers of the data files distributed to node p. The task amount p_c on node p is then:

p_c = Σ c_ij, summed over the K comparison tasks C(i, j) on node p (4)

Let t_count be the total number of tasks in the full comparison calculation. Each comparison task is associated with two different data files, so when the data set contains m data files, t_count is:

t_count = m(m-1)/2 (5)

Let x_kp indicate whether comparison task k is assigned to node p. From formula (1), x_kp is unique: during task assignment, each comparison task must be assigned to exactly one computing node. x_kp takes the value 0 or 1, where 1 means that comparison task k is assigned to computing node p and 0 means that it is not.

x_kp＝0 or 1 k＝1,2,...,t_count,p＝1,2,...,n (6)

Let c_k denote the size of the k-th task. The task amount p_c on node p in formula (4) can then be redefined as:

p_c = Σ_{k=1}^{t_count} x_kp·c_k (7)

In a distributed system with n computing nodes, with c_k denoting the size of task k, the average node calculation amount c_avg of the full comparison calculation is:

c_avg = (1/n)·Σ_{k=1}^{t_count} c_k (8)

From formula (7) and formula (8), the degree of deviation f_p of the calculation amount of node p from c_avg is obtained:

f_p＝|p_c-c_avg| (9)

where |·| denotes the absolute value. The load balancing of the distributed system can then be described as:

Let w_p denote the total size of the files distributed to node p. Suppose that m' files (m' ≤ m) are distributed to node p and that the i-th of these files has size r_i. Then:

w_p = Σ_{i=1}^{m'} r_i (11)

Let u_p (p = 1, 2, ..., n) be the upper limit of the storage space of node p; the sum of the sizes of the files distributed to node p cannot exceed u_p.

w_p≤u_p (12)
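The quantities defined so far can be computed from a task assignment matrix. The following minimal Python sketch is an illustration only: the function names, the 0-based indexing and the example data are assumptions, `x[k][p]` plays the role of x_kp, `c[k]` of c_k, `r[i]` of the file sizes, and `u[p]` of u_p.

```python
def node_loads(x, c):
    """p_c for each node: sum of c[k] over tasks k with x[k][p] == 1."""
    n = len(x[0])
    return [sum(c[k] * x[k][p] for k in range(len(c))) for p in range(n)]

def deviations(loads, c):
    """f_p = |p_c - c_avg|, with c_avg = sum(c) / n."""
    c_avg = sum(c) / len(loads)
    return [abs(load - c_avg) for load in loads]

def storage_ok(file_sets, r, u):
    """Check w_p <= u_p on every node, where w_p sums the sizes of the files on node p."""
    return all(sum(r[i] for i in files) <= u[p] for p, files in enumerate(file_sets))

# Hypothetical example: 6 tasks of equal size 2 spread unevenly over 3 nodes.
c = [2, 2, 2, 2, 2, 2]
x = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
loads = node_loads(x, c)
print(loads)                 # [6, 4, 2]
print(deviations(loads, c))  # [2.0, 0.0, 2.0]
print(storage_ok([{0, 1, 2}, {2, 3}, {0, 3}], r=[1, 1, 1, 1], u=[3, 3, 3]))  # True
```

A perfectly balanced assignment would give a deviation of 0 on every node.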

Taking the formula (10) as a solving target, a load balance solving model of the full comparison calculation in the distributed system is obtained as follows.

The scheduling of the full comparison task in the distributed system according to the formula (13) enables load balancing among the computing nodes, and data locality of data files required by each comparison task can be guaranteed. On this basis, the model is optimized, and a data distribution scheme which minimizes the required storage space in the distributed system is expected to be found.

A many-to-many relationship exists between the data files and the comparison tasks: the same data file is related to multiple comparison tasks, and each comparison task needs two data files. Based on this many-to-many relationship, the task allocation is recombined, the computing nodes that execute each comparison task k are adjusted, and the data file distribution scheme is modified to obtain a data distribution scheme and a task scheduling scheme.

The calculation amount of each node in the load-balanced state can be obtained from formula (13) and is denoted g_i (i = 1, 2, ..., n). The values g_i may differ from one another. In the optimization algorithm, the calculation amount on each computing node does not exceed g_i. Using the description of the node calculation amount given by formula (7), the relationship between the task allocation on node p at a given time and its calculation amount is:

In a distributed system with n computing nodes, m data files are distributed. Each file has at most one copy on any computing node. Let y_jp indicate whether the j-th data file is distributed to the p-th node: y_jp = 1 means that the j-th data file is distributed to the p-th node, and y_jp = 0 means that it is not. The storage-optimized data distribution scheme corresponding to the data distribution strategy of the full comparison calculation in the distributed system can therefore be described by the following equation.

The optimized full-comparison calculation data distribution model under the distributed system is obtained according to the formula (13), the formula (14) and the formula (15) and is shown in the formula (16).

Data distribution of the full comparison calculation is performed according to formula (16); the resulting data distribution reduces the storage space used by the distributed system and achieves complete data localization and load balancing. In the following sections, an algorithm is designed and the model is implemented for the established data distribution model in combination with the particle swarm algorithm, and the effectiveness of the model is verified through experiments.
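To illustrate why two equally balanced schedules can need different amounts of storage, the following minimal Python sketch derives the files each node must hold from a task assignment. It assumes that full data locality means a node stores both input files of every task assigned to it; the function names and the example with m = 4 files and n = 2 nodes are hypothetical.

```python
def required_files(tasks, assignment, n):
    """files[p] = set of file numbers node p must store so that all its tasks are local."""
    files = [set() for _ in range(n)]
    for (i, j), p in zip(tasks, assignment):
        files[p].update((i, j))
    return files

def total_storage(files, sizes):
    """Sum of file sizes over all nodes, counting a file once per node that holds it."""
    return sum(sizes[f] for node_files in files for f in node_files)

tasks = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
sizes = {1: 1, 2: 1, 3: 1, 4: 1}

# Both schedules give each node three tasks, so the load is balanced in both.
grouped = [0, 0, 1, 0, 1, 1]    # node 0 gets (1,2), (1,3), (2,3)
scattered = [0, 1, 1, 1, 0, 0]  # node 0 gets (1,2), (2,4), (3,4)

print(total_storage(required_files(tasks, grouped, 2), sizes))    # 7
print(total_storage(required_files(tasks, scattered, 2), sizes))  # 8
```

Grouping tasks that share input files on the same node lowers the total storage without changing the load, which is the kind of improvement the optimized model looks for.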

The particle swarm optimization (PSO) algorithm used in the embodiment of the application is a swarm-based heuristic algorithm proposed in 1995 by the American social psychologist James Kennedy and the electrical engineer Russell Eberhart. Its optimization process is roughly as follows: a particle swarm is initialized in a given solution space whose dimension is determined by the number of variables of the problem to be optimized, and the initial positions, initial velocities, value range of the dynamic inertia weight and number of iterations of the particles are specified. In each iteration, the updates of particle position and flight velocity are determined by the individual optimal value and the global optimal value. Once the iteration stop condition is met, the global optimal position of the particles is the solution of the optimization problem.

The solving of the full-comparison calculation data distribution model by adopting particle swarm optimization mainly has the following advantages:

1) Particle swarm optimization is easy to implement and does not require the many operators used by genetic algorithms.

2) The particle swarm optimization adopts random initialization of the population, and uses the fitness value to evaluate the quality degree of the individual particles and perform certain random search.

3) In particle swarm optimization, many particles are optimized simultaneously. During optimization, suboptimal solutions can be temporarily accepted while individual optimal solutions are sought, and the global optimal solution is finally found by comparing the individual optimal solutions in each generation.

4) The full-comparison computation data distribution model takes whether a comparison task is arranged on a computation node as a decision variable, and the definition domain comprises two elements: 0 and 1, so the model can be solved using a discrete particle swarm algorithm.

Example 1

According to an aspect of the embodiment of the present invention, there is provided a data processing method, and fig. 2 is a schematic flow chart of the data processing method according to the embodiment of the present invention, as shown in fig. 2, including:

step S202, acquiring a data distribution algorithm of a distributed system for realizing load balance based on particle swarm optimization;

Optionally, acquiring the data distribution algorithm of the distributed system for realizing load balancing based on particle swarm optimization in step S202 includes: acquiring initialized particle swarm parameters; and optimizing the preset calculation formula according to the initialized particle swarm parameters to obtain a data distribution algorithm for realizing load balance based on particle swarm optimization.

Specifically, theoretical analysis of the full comparison data distribution problem shows that the DDBPSO model consists of two parts. The first part finds a data distribution scheme and a task scheduling scheme that balance the load of the distributed system. The second part performs an optimization based on the load balancing result of the first part and reduces the use of storage space in the distributed system while preserving load balance. For convenience of description, the first part of the DDBPSO model is referred to as the Data Distribution algorithm for Load Balancing Based on Particle Swarm Optimization (DDBPSOLB).

The DDBPSOLB algorithm is designed according to formula (13) to obtain a data distribution scheme for the load-balanced state of the distributed system. The problem is optimized with a particle swarm algorithm, for which some parameters must be specified. The parameter settings are shown in Table 1.

TABLE 1 DDBPSOLB parameter settings

In the initialized particle swarm parameters, the maximum number of iterations T is 100 and the particle swarm size N is 100. Because the number of comparison tasks in the full comparison problem grows with the number of data files, and because the full comparison calculation also depends on the number of nodes in the distributed system, the particle dimension D is adjusted dynamically.

The dimension of a single particle is specified as the product of the number of nodes n in the distributed system and the number of comparison tasks t_c.

The inertia weight is a very important control parameter in the particle swarm algorithm and controls the balance between exploitation and exploration in the solving algorithm. During optimization, the DDBPSOLB algorithm obtains feasible solutions with better fitness values generation by generation. It is therefore desirable that the particles are more active, that is, faster, when the optimization starts, and that their motion gradually stabilizes as the optimization progresses, so a linearly decreasing weight strategy is adopted in the DDBPSOLB algorithm.

The acceleration coefficients c₁ and c₂ adjust the maximum step size of flight toward the individual optimal position and toward the global optimal position respectively; the two coefficients determine how strongly the individual experience and the group experience of a particle influence its trajectory. The minimum velocity is V_min = -10 and the maximum velocity is V_max = 10.
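One way to picture a particle of dimension n × t_c is as a flattened 0/1 assignment matrix. The following minimal Python sketch assumes a task-major layout (n consecutive entries per task, exactly one of them equal to 1); the layout and the names `encode` and `decode` are illustrative assumptions, not taken from the original disclosure.

```python
def encode(assignment, n):
    """Flatten a task -> node list into a 0/1 particle of length n * len(assignment)."""
    particle = []
    for node in assignment:
        row = [0] * n
        row[node] = 1
        particle.extend(row)
    return particle

def decode(particle, n):
    """Recover the node of each task; reject particles that break the one-node-per-task rule."""
    if len(particle) % n != 0:
        raise ValueError("particle length must be a multiple of n")
    assignment = []
    for start in range(0, len(particle), n):
        row = particle[start:start + n]
        if sum(row) != 1:
            raise ValueError("each task must be assigned to exactly one node")
        assignment.append(row.index(1))
    return assignment

n, t_c = 3, 10                            # 3 nodes, 10 tasks (m = 5 files)
assignment = [k % n for k in range(t_c)]  # hypothetical round-robin schedule
particle = encode(assignment, n)
print(len(particle))                      # 30, the product n * t_c
```

With this layout the dimension grows with both the number of nodes and the number of tasks, which is why it has to be adjusted dynamically.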

Step S204, calculating the optimal particle position for load balancing in the distributed system according to a data distribution algorithm for realizing load balancing based on particle swarm optimization;

Optionally, in step S204, calculating the optimal positions of the load-balanced particles in the distributed system according to a data distribution algorithm for realizing load balancing based on particle swarm optimization includes: performing iterative computation on the data distribution algorithm for realizing load balancing based on particle swarm optimization according to a preset update rule to obtain the optimal positions of the load balancing particles in the distributed system; wherein the preset update rule includes: an inertia weight update rule, a particle flight velocity update rule, and a particle position update rule.

Specifically, the DDBPSOLB algorithm involves several update rules: an inertia weight update rule, a particle flight velocity update rule, and a particle position update rule.

Let w denote the inertia weight of the current particle, w_min and w_max the minimum and maximum values of the inertia weight, T_max the maximum number of iterations, and t the current iteration number of the DDBPSOLB algorithm. Then w can be formally expressed as:

Let v_ij(t) be the flight velocity of particle i in the j-th dimension in the t-th iteration, c₁ and c₂ the acceleration coefficients, RAND a random number between 0 and 1, x_ij(t) the position of particle i in the j-th dimension, p_ij(t) the individual optimal position of particle i, and g(t) the global optimal position as of the t-th iteration. The flight velocity v_ij(t+1) of particle i in the j-th dimension in the (t+1)-th iteration is:

v_ij(t+1)＝wv_ij(t)+c₁RAND[p_ij(t)-x_ij(t)]+c₂RAND[g(t)-x_ij(t)] (18)
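The inertia weight and velocity rules can be sketched in a few lines of Python. This is a hedged illustration: the linearly decreasing weight uses the common form w = w_max - (w_max - w_min)·t/T_max, which is an assumption; the defaults w_min = 0.4, w_max = 0.9 and c1 = c2 = 2.0 are placeholder values, not the Table 1 settings; and the global best is indexed per dimension. The velocity bounds of -10 and 10 follow the text above.

```python
import random

def inertia_weight(t, t_max, w_min=0.4, w_max=0.9):
    """Assumed linear schedule: w_max at t = 0, decreasing to w_min at t = t_max."""
    return w_max - (w_max - w_min) * t / t_max

def update_velocity(v, x, p_best, g_best, w, c1=2.0, c2=2.0,
                    v_min=-10.0, v_max=10.0, rng=random):
    """Apply formula (18) in every dimension, then clamp the result to [v_min, v_max]."""
    new_v = []
    for j in range(len(v)):
        vj = (w * v[j]
              + c1 * rng.random() * (p_best[j] - x[j])
              + c2 * rng.random() * (g_best[j] - x[j]))
        new_v.append(max(v_min, min(v_max, vj)))
    return new_v

w = inertia_weight(t=50, t_max=100)
v_next = update_velocity([1.0, -3.0], [0, 1], [1, 1], [1, 0], w)
print(w, v_next)
```

Early iterations use a large w so that particles keep most of their previous velocity, and later iterations damp it so that the swarm settles.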

Whether to update the position is determined by a probability q. Let the variable x_ij indicate whether the position is updated: x_ij = 1 means that it is updated and x_ij = 0 means that it is not. This is formally described by formula (19) and formula (20):

where v_ij is the current velocity of particle i in the j-th dimension and r is a random number between 0 and 1. When x_ij = 1, the position is updated. Before the position is updated, the encoding state x of particle i is backed up as x'. Two different nodes are chosen at random, and two comparison tasks are selected at random from these two nodes. The two comparison tasks are exchanged in x', the fitness value fit' of x' is calculated with the fitness function, and fit' is compared with the fitness of the individual optimal position p_ij(t) of particle i; if fit' is the better fitness value, the encoding of particle i is updated to x'. Although this update does not guarantee that the fitness of x' is better than that of x every time, requiring only that fit' be better than the individual optimal position of particle i embodies the idea of accepting a suboptimal solution with a certain probability and prevents the calculation from becoming trapped in a local optimum.
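The position update can be sketched as follows. Because formulas (19) and (20) are not reproduced above, the sketch assumes the sigmoid rule commonly used in discrete particle swarm optimization, q = 1 / (1 + e^(-v)), with an update when r < q. It also assumes that one task is taken from each of the two chosen nodes and that lower fitness is better. All function names are hypothetical.

```python
import math
import random

def should_update(v_ij, rng=random):
    """Assumed rule: update when a uniform random r falls below sigmoid(v_ij)."""
    q = 1.0 / (1.0 + math.exp(-v_ij))
    return rng.random() < q

def swap_move(assignment, rng=random):
    """Return a copy x' in which one task from each of two different nodes trades places."""
    nodes = sorted(set(assignment))
    if len(nodes) < 2:
        return list(assignment)
    p, q = rng.sample(nodes, 2)
    k1 = rng.choice([k for k, node in enumerate(assignment) if node == p])
    k2 = rng.choice([k for k, node in enumerate(assignment) if node == q])
    candidate = list(assignment)
    candidate[k1], candidate[k2] = q, p
    return candidate

def try_update(assignment, best_fit, fitness, rng=random):
    """Keep x' only if its fitness beats the particle's best fitness so far."""
    candidate = swap_move(assignment, rng)
    fit = fitness(candidate)
    if fit < best_fit:
        return candidate, fit
    return assignment, best_fit

rng = random.Random(0)
x = [0, 0, 1, 1, 2]  # node of each task
x_new = swap_move(x, rng)
print(x, x_new)
```

A swap keeps the number of tasks per node unchanged, so it only alters the load when the exchanged tasks differ in size.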

The fitness function of the DDBPSOLB algorithm is used for evaluating the load balance degree of the distributed system corresponding to the particle codes, and the mathematical principle of the fitness function follows the formula (9). The theoretical minimum value of the load balancing degree of the distributed system is 0, namely, the complete load balancing of the distributed system is realized.

The pseudo-code of the DDBPSOLB algorithm is shown in Table 2. Line 2 of the algorithm defines the input parameters: the number n of nodes in the distributed system, the upper limit u_t of node storage, and the list s of sizes of the files required for the full comparison calculation. Line 4 defines the output parameters: the task list task of the full comparison calculation, the total calculation amount nodeLoads of each node in the distributed system, and the global optimal position gbest. All three are used as input parameters of the DDBPSOBS algorithm. Line 6 checks the input parameters, for example whether a computing node is able to store one complete piece of data required for the full comparison calculation.

Line 9 of the algorithm sets the parameters required by particle swarm optimization, for example the maximum number of iterations T = 100. Lines 11 to 30 perform the iterative particle swarm optimization. Line 31 parses the load of each node from the global optimal solution gbest and stores it in nodeLoads.

TABLE 2 DDBPSOLB Algorithm pseudo-code description

Step S206, acquiring a data distribution algorithm of the distributed system for realizing optimized storage based on particle swarm optimization;

specifically, the second part of the DDBPSO model is referred to as a Data Distribution algorithm (DDBPSOBS) for realizing optimized Storage Based on Particle Swarm Optimization.

And S208, optimizing the storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for load balancing.

Optionally, in step S208, optimizing the storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and a particle optimal position for load balancing includes: initializing according to the optimal particle position, the number of nodes, a file size list, a task list and a node calculation amount of load balancing to obtain initialized particle swarm parameters; performing iterative optimization according to initialized particle swarm parameters, and updating the positions of the particles when a position updating rule is met; and recording the fitness corresponding to the optimal position obtained by each iteration, and coding the iteration records to obtain an optimization scheme of the storage space required by each node in the distributed system.

Specifically, the DDBPSOBS algorithm optimizes the storage space that each node in the distributed system must provide, under the condition that every node remains load balanced. After the DDBPSOLB algorithm has run, it yields a full-comparison task scheduling scheme that puts the distributed system in a load-balanced state, together with the computation amount to be borne by each node. The maximum value C_max is selected from the computation amounts of the nodes.

Like the DDBPSOLB algorithm, the DDBPSOBS algorithm is solved using particle swarm optimization. Except for the particle population size N, the particle swarm optimization parameters are kept consistent with Table 1. The population size N in the DDBPSOBS algorithm is 10; the smaller population is intended to improve the running efficiency of the algorithm.

Since the particle optimal position at which the system achieves load balancing has already been obtained by the DDBPSOLB algorithm, population initialization in the DDBPSOBS algorithm only needs to make N copies of that encoding and then adjust the particle encodings, subject to formula (14), so that the initial population is diverse. The fitness value in the DDBPSOBS algorithm is the size of the storage space of the distributed system corresponding to the current encoding of the particle.
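As an illustration of a storage-space fitness of this kind, the following Python sketch computes the total storage a task assignment implies: each node must hold every file used by at least one of its comparison tasks, and a file shared by several tasks on the same node is stored once. The list-based encoding (`assign[k]` is the node of task k) and the name `storage_required` are assumptions made for the example, not taken from Table 3.

```python
from itertools import combinations

def storage_required(assign, tasks, sizes, n_nodes):
    """Total storage over all nodes implied by a task assignment.

    assign[k] : node index of comparison task k (assumed encoding)
    tasks[k]  : pair (i, j) of file indices compared by task k
    sizes[f]  : size of file f
    """
    files_on_node = [set() for _ in range(n_nodes)]
    for (i, j), node in zip(tasks, assign):
        files_on_node[node].update((i, j))  # both files must be local
    return sum(sizes[f] for files in files_on_node for f in files)

# Three files give three pairwise comparison tasks.
sizes = [10, 20, 30]
tasks = list(combinations(range(len(sizes)), 2))  # (0,1), (0,2), (1,2)
print(storage_required([0, 0, 1], tasks, sizes, n_nodes=2))  # 110
```

In the usage example node 0 needs all three files (60) and node 1 needs files 1 and 2 (50), so grouping tasks that share files on the same node lowers the total.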

The DDBPSOBS algorithm is developed from formula (16). Its particle swarm update rules are basically consistent with those of the DDBPSOLB algorithm, except that the specific position update rule is changed. Once it is determined, with probability q, that the encoding of particle i is to be modified, the position encoding x of particle i is backed up as y. The task sets set₁ and set₂ on two randomly chosen compute nodes are selected. A task task₁ is randomly selected from set₁ and its size ts₁ is calculated. The size difference between each task in set₂ and task₁ is calculated, and the task with the smallest difference is selected as the exchange task task₂. task₁ and task₂ are exchanged on y. If the task amounts on both nodes are within C_max, the fitness value fit'' of the encoding y is calculated. If fit'' is better than the individual optimum of particle i, x is updated with y.
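The exchange step described above can be sketched in Python as follows. The sketch only produces the candidate encoding y; computing fit'' and comparing it with the individual optimum is left to the caller. The list-based encoding (`x[k]` is the node of task k), the per-task size list `cost`, and the name `closest_size_swap` are assumptions for illustration.

```python
import random

def closest_size_swap(x, cost, c_max, n_nodes, rng=random):
    """Return a candidate encoding y, or None if the exchange is rejected.

    task1 is drawn at random from one node; task2 is the task on the
    other node whose size is closest to task1. The exchange is kept
    only if both nodes stay within c_max.
    """
    y = list(x)  # back up x as y
    occupied = [p for p in range(n_nodes) if p in y]
    if len(occupied) < 2:
        return None
    a, b = rng.sample(occupied, 2)
    set1 = [k for k, p in enumerate(y) if p == a]
    set2 = [k for k, p in enumerate(y) if p == b]
    task1 = rng.choice(set1)
    task2 = min(set2, key=lambda k: abs(cost[k] - cost[task1]))
    y[task1], y[task2] = b, a  # exchange task1 and task2 on y
    loads = [0.0] * n_nodes
    for k, p in enumerate(y):
        loads[p] += cost[k]
    if loads[a] <= c_max and loads[b] <= c_max:
        return y
    return None
```

Choosing the closest-sized partner keeps the change in node load small, so the exchange is more likely to stay within C_max and preserve the balance found by DDBPSOLB.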

Table 3 gives the step description of the DDBPSOBS algorithm. The input parameters are the number of nodes n, the file size list s, the task list task, the node computation amounts nodeLoads, and the global optimal position gbest₁ of the DDBPSOLB algorithm. When the calculation completes, the final output of the DDBPSO algorithm is obtained, namely the task scheduling scheme and the file distribution scheme of the full comparison calculation in the distributed system. The input and output parameters are defined in lines 1 to 4 of the algorithm. Lines 6 to 22 carry out the initialization before the iterative optimization of the particle swarm, mainly of the parameter information required by particle swarm optimization. Line 9 assigns the value of gbest₁ to x using the repmat function. Lines 10 to 22 initialize the particle swarm: starting from the second particle, two tasks on two nodes are exchanged, with the requirement that neither node participating in the exchange exceeds the maximum node computation amount C_max found in line 8, as shown in line 18 of the algorithm. Lines 24 to 50 perform the iterative optimization; when the position update rule is satisfied, the particle position is updated as defined in lines 34 to 45. Line 49 records the fitness value of the optimal position obtained in each iteration, so that the fitness evolution of the algorithm can be inspected later. After the iterative optimization completes, line 51 parses the task scheduling scheme taskAssign and the file distribution scheme fileAssign of the full comparison calculation from the global optimal position encoding gbest of the DDBPSOBS algorithm.

TABLE 3 DDBPSOBS Algorithm pseudo-code description

In summary, the embodiment of the present application analyzes and evaluates the DDBPSO model using four evaluation indexes: the theoretical load balance value, the storage saving rate, the data localization rate, and the model computation time.

Theoretical load balance value. Load balancing is the primary goal of the DDBPSO model, which is expected to yield a task scheduling scheme and a data distribution scheme under which every node in the distributed system bears the same computation amount.

Storage saving rate. Reducing the storage usage of the distributed system is the second goal of the DDBPSO model. The storage saving rate takes as its denominator the storage required when all the data needed for the full comparison calculation is distributed to every node in the distributed system. With m data files, n nodes, and hr denoting the storage saving rate, hr is calculated as follows:

In formula (21), s_k denotes the size of file k, s_ij is the size of file j distributed to node i, and m' is the number of files distributed to node i.
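Formula (21) itself is not reproduced in this text. A form consistent with the definitions above, and with the Hadoop figures reported in the experiments (3 replicas on 4 nodes giving 25%, on 5 nodes giving 40%), is hr = 1 - (storage actually used) / (n × total file size). The Python sketch below assumes that form; the function name `storage_saving_rate` and the round-robin replica placement are hypothetical.

```python
def storage_saving_rate(file_sizes, files_per_node):
    """Assumed form of hr: 1 - used / (n * sum of all file sizes).

    file_sizes     : size s_k of every data file
    files_per_node : for each node i, the indices of the files stored on it
    """
    n = len(files_per_node)
    full_copy = n * sum(file_sizes)  # every file on every node
    used = sum(file_sizes[f] for node in files_per_node for f in node)
    return 1.0 - used / full_copy

def replicated_layout(m, n, replicas=3):
    """Place each of m files on `replicas` consecutive nodes (illustrative)."""
    layout = [[] for _ in range(n)]
    for f in range(m):
        for r in range(replicas):
            layout[(f + r) % n].append(f)
    return layout

sizes = [9.7] * 10
print(round(storage_saving_rate(sizes, replicated_layout(10, 4)), 2))  # 0.25
print(round(storage_saving_rate(sizes, replicated_layout(10, 5)), 2))  # 0.4
```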

Data localization rate. When a distributed cluster performs a calculation, network congestion results if a large number of compute nodes must fetch data over the network during the calculation. Therefore, before the full comparison calculation is performed, the data files are distributed to designated nodes, and each comparison task is assigned to a node where it can obtain its data locally. Distributing data in this way avoids network contention during the calculation.

Model computation time. In a real application scenario, users want the program to execute as fast as possible and the resulting solution to be as accurate as possible. These two goals are often difficult to satisfy simultaneously in big data computation, so the DDBPSO model proposed in the embodiment of the present application aims to improve the execution speed of the program at the cost of some solution accuracy. Model computation time is used to quantify this.

Two conditions are considered: whether the file sizes are identical, and whether the number of comparison tasks of the full comparison calculation is evenly divisible by the number of nodes in the distributed system. Combining these two conditions gives the following four experimental schemes.

1) The file sizes are identical, and the number of comparison tasks is evenly divisible by the number of nodes.

2) The file sizes are identical, and the number of comparison tasks is not evenly divisible by the number of nodes.

3) The file sizes are not all identical, and the number of comparison tasks is evenly divisible by the number of nodes.

4) The file sizes are not all identical, and the number of comparison tasks is not evenly divisible by the number of nodes.
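The divisibility condition in these schemes can be checked directly. Since a full comparison over m files compares every pair (i, j) with i < j, it has m(m-1)/2 tasks. The short Python sketch below enumerates them for the 10 files and the 5-node and 4-node clusters used in the experiments; the helper name `comparison_tasks` is hypothetical.

```python
from itertools import combinations

def comparison_tasks(m):
    """All pairs (i, j) with i < j over files numbered 1..m."""
    return list(combinations(range(1, m + 1), 2))

tasks = comparison_tasks(10)
print(len(tasks))      # 45 comparison tasks for 10 files
print(len(tasks) % 5)  # 0: evenly divisible across 5 nodes
print(len(tasks) % 4)  # 1: not evenly divisible across 4 nodes
```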

Experimental data. To facilitate the DDBPSO model verification experiments and the Hadoop comparison experiments, a gene sequence file downloaded from the National Center for Biotechnology Information is segmented. The gene sequence file is divided into 12 small files, not all of the same size, and 10 data files are selected from them for each experiment. After segmentation, the sizes of the data files are:

[9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 9.7 MB, 8.1 MB, 12.1 MB].

Before the DDBPSO model verification experiments, Hadoop is used to execute four groups of data distribution experiments, and a MapReduce program is written that reads the data files to simulate the file loading operation of sequence alignment. The data distribution scheme and the task scheduling of the full comparison calculation are recorded for each group of experiments and compared with the experimental results of the DDBPSO model. The file and node information for the four groups of experiments is shown in Table 4:

Table 4 Design of the experimental schemes

Hadoop data distribution experiment:

In the embodiment of the application, the data distribution results obtained with Hadoop are compared with those obtained with the DDBPSO model on three evaluation indexes: load balancing degree, storage saving, and data localization. The Hadoop cluster is configured as follows:

Host machine configuration: Intel i7-8750 CPU, 16 GB RAM, 1 TB SATA + 256 GB SSD, Windows 10 operating system. Virtualization software and version: VMware Workstation 10.

Virtual machine configuration: 2 CPU cores, 2 GB memory, 20 GB disk space, CentOS 6.8 operating system; 5 virtual machines are constructed. The Hadoop version is Hadoop 2.7.2.

The Hadoop data distribution experiment is carried out on a host machine, data required by one-time full comparison calculation is distributed to the HDFS from the host machine, and position statistics is carried out on each piece of uploaded data through a visual interface provided by Hadoop. The number of file copies in the HDFS is not modified and a default value of 3 is used.

The Hadoop-based data distribution schemes and the task execution results of the full comparison calculation, obtained by completing the four groups of experiments in sequence, are shown in Table 5. The statistics of the data distribution scheme are collected from the Web interface provided by Hadoop, and the task execution results are obtained from the simulated full comparison calculation program.

TABLE 5 Hadoop test results

The storage saving of the Hadoop cluster is obtained by comparison with a data distribution scheme in which all the data required for the full comparison calculation is distributed to every node. The storage savings for the four groups of experiments are shown in fig. 3. The Hadoop default replica count of 3 is used throughout the experiments. In experiment 2 and experiment 4 the number of compute nodes is 4 and the storage saving rate of the distributed system is 25%; in experiment 1 and experiment 3 the number of compute nodes is 5 and the storage saving rate is 40%.

The data distribution scheme and the task execution results in Table 5 are analyzed together with the way Hadoop serves data requests to HDFS: when a client requests data from HDFS, the NameNode directs the client to the qualifying DataNode closest to it. For the full comparison calculation, it can therefore be determined whether the data requested by each comparison task is local or resides on another node. The data localization rates of the nodes in the four experiments are shown in fig. 4. The figure shows that when Hadoop is used for the full comparison calculation with 3 replicas of each data file, the distributed system cannot achieve complete data localization.
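A per-node localization rate of this kind could be computed as sketched below in Python. The source does not spell out the exact definition, so the sketch assumes it is the fraction of file requests made by the tasks on a node that are satisfied by files stored on that node; the name `localization_rate` and the value returned for a node with no tasks are likewise assumptions.

```python
def localization_rate(tasks_on_node, files_on_node):
    """Fraction of file requests on one node that are served locally.

    tasks_on_node : list of (i, j) file pairs compared on this node
    files_on_node : set of file indices stored on this node
    """
    requests = [f for task in tasks_on_node for f in task]
    if not requests:
        return 1.0  # assumption: a node with no tasks counts as fully local
    local = sum(1 for f in requests if f in files_on_node)
    return local / len(requests)

# Node stores files 1 and 2 but must also compare file 1 with file 3.
print(localization_rate([(1, 2), (1, 3)], {1, 2}))  # 0.75
```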

DDBPSO model experiment

(1)5 computing nodes, 10 data files with identical size

First, experimental scheme 1 designed in section 4.2 is tested. The load balancing of the distributed system is shown in fig. 5, and the data distribution scheme and the task scheduling scheme of the full comparison calculation in the distributed system are shown in Table 6.

Table 6 scheme 1 experimental results

Fig. 5 shows directly that the task scheduling scheme given by the DDBPSO model achieves load balancing among the nodes of the distributed system. Table 6 gives the result for the case where the file sizes are identical and the number of comparison tasks is evenly divisible by the number of nodes. Compared with the data distribution scheme of Hadoop, the DDBPSO model achieves 100% data localization on every node of the distributed system, whereas the highest data localization rate of any node under Hadoop is only 70%. The storage saving rate of scheme 1, calculated according to formula (21), is 42% for the DDBPSO model, slightly higher than that of Hadoop. The DDBPSO model was solved using Matlab, with a computation time of only 29 seconds.

(2)4 computing nodes, 10 data files with identical size

The DDBPSO model is tested according to the experimental scheme 2 designed in section 4.2, the load balancing situation of the distributed system is shown in fig. 6, and the data distribution scheme and the task scheduling scheme of the full-comparative calculation in the distributed system are shown in table 7.

Table 7 scheme 2 experimental results

Fig. 6 shows that the task scheduling scheme given by the DDBPSO model basically achieves load balancing among the nodes of the distributed system. Table 7 gives the result for the case where the file sizes are identical and the number of comparison tasks is not evenly divisible by the number of nodes. Compared with the data distribution scheme of Hadoop, the DDBPSO model achieves 100% data localization on every node of the distributed system, whereas under Hadoop only node 1 reaches 100% and nodes 2, 3 and 4 all fall below 100%. In particular, node 3 has a data localization rate of only 20%, which means that most of the data must be fetched from other nodes when comparison tasks run on node 3. The storage saving rate of the DDBPSO model is 32.5%, higher than the 25% of Hadoop. The DDBPSO model was solved using Matlab, with a computation time of only 23 seconds.

(3)5 computing nodes, 10 data files with different sizes

Next, experimental scheme 3 designed in section 4.2 is tested using the DDBPSO model. The load balancing of the distributed system is shown in fig. 7, and the data distribution scheme and the task scheduling scheme of the full comparison calculation in the distributed system are shown in Table 8.

Table 8 scheme 3 experimental results

Fig. 7 shows that the task scheduling scheme given by the DDBPSO model basically achieves load balancing among the nodes of the distributed system. Table 8 gives the result for the case where the file sizes are not all identical and the number of comparison tasks is evenly divisible by the number of nodes. The storage saving rate of scheme 3, calculated according to formula (21), is 31.7% for the DDBPSO model, while that of the data distribution scheme given by Hadoop is 40%. Although the storage saving rate of the DDBPSO model is lower than that of Hadoop, the DDBPSO model achieves 100% data localization on every node of the distributed system, whereas the data localization rates under Hadoop range from 43% to 70%. The DDBPSO model was solved using Matlab, with a computation time of only 32 seconds.

(4)4 computing nodes, 10 data files with different sizes

Finally, the designed experimental scheme 4 is tested by using a DDBPSO model, the load balancing condition of the distributed system is shown in FIG. 8, and the data distribution scheme and the task scheduling scheme of the full-comparative calculation under the distributed system are shown in Table 9.

Table 9 scheme 4 experimental results

Fig. 8 shows that the task scheduling scheme given by the DDBPSO model basically achieves load balancing among the nodes of the distributed system. Table 9 gives the result for the case where the file sizes are not all identical and the number of comparison tasks is not evenly divisible by the number of nodes. The storage saving rate of scheme 4, calculated according to formula (21), is 28% for the DDBPSO model, while that of the data distribution scheme given by Hadoop is only 25%. Compared with the data distribution scheme of Hadoop, the DDBPSO model achieves 100% data localization on every node of the distributed system, whereas most nodes in the Hadoop cluster do not reach 100%. The DDBPSO model was solved using Matlab, with a computation time of only 22 seconds.

The embodiment of the application studies the data distribution problem of full comparison calculation, provides a data distribution model DDBPSO and related algorithms based on particle swarm optimization, and sets up four groups of experiments to verify the DDBPSO model and its algorithms. The experiments compare the Hadoop data distribution strategy with that of the DDBPSO model on the storage saving rate and the data localization rate, and analyze the load balancing degree and the computation time of the DDBPSO model. The experimental results show that the data distribution scheme given by the DDBPSO model achieves complete localization of the data files required by the tasks and can reduce the use of storage space in the distributed system. In terms of load balancing, the task scheduling scheme given by the DDBPSO model basically achieves load balancing among all nodes in the distributed system. In terms of computation time, the DDBPSO model solves for a data distribution scheme and a task scheduling scheme quickly, and completes the full-comparison data distribution work in a distributed system well. The DDBPSO model effectively addresses the data distribution problem of large-scale full comparison calculation and can help advance research in fields such as bioinformatics and natural language processing.

Example 2

According to an aspect of the embodiments of the present invention, there is provided a data processing apparatus, and fig. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention, as shown in fig. 9, including: the first obtaining module 92 is configured to obtain a data distribution algorithm for realizing load balancing based on particle swarm optimization in the distributed system; the first calculation module 94 is configured to calculate an optimal position of a load balancing particle in the distributed system according to a data distribution algorithm for realizing load balancing based on particle swarm optimization; the second obtaining module 96 is configured to obtain a data distribution algorithm for realizing optimized storage based on particle swarm optimization in the distributed system; and the second calculation module 98 is configured to optimize a storage space required by each node in the distributed system according to a data distribution algorithm for realizing optimized storage based on particle swarm optimization and a particle optimal position for load balancing.

Optionally, the first obtaining module 92 includes: the acquisition unit is used for acquiring initialization particle swarm parameters; and the optimization unit is used for optimizing the preset calculation formula according to the initialized particle swarm parameters to obtain a data distribution algorithm for realizing load balance based on particle swarm optimization.

Optionally, the first calculating module 94 includes: the first computing unit is used for performing iterative computation on a data distribution algorithm for realizing load balancing based on particle swarm optimization according to a preset updating rule to obtain the optimal position of the load balancing particles in the distributed system; wherein the preset update rule includes: an inertial weight update rule, a particle flight velocity update rule, and a particle position update rule.

Optionally, the second calculating module 98 includes: the initialization unit is used for initializing according to the optimal particle position, the number of nodes, a file size list, a task list and node calculation amount of load balancing to obtain initialized particle swarm parameters; the updating unit is used for carrying out iterative optimization according to the initialized particle swarm parameters and updating the positions of the particles when a position updating rule is met; and the second calculation unit is used for recording the fitness corresponding to the optimal position obtained by each iteration and obtaining the optimization scheme of the storage space required by each node in the distributed system by encoding the iteration records.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also fall within the protection scope of the present invention.

Claims

1. A data processing method, comprising:

acquiring a data distribution algorithm of a distributed system for realizing load balance based on particle swarm optimization, namely acquiring initialization particle swarm parameters, and optimizing a preset calculation formula according to the initialization particle swarm parameters to obtain the data distribution algorithm for realizing load balance based on particle swarm optimization, wherein the initialization particle swarm parameters comprise: maximum number of iterations, particle population size, particle dimension, inertia weight, first acceleration coefficient, second acceleration coefficient and particle velocity;

specifically, in a distributed cluster having a data set of m data files and n nodes, assuming a specific comparison algorithm as C (i, j), where i and j are two data files in the data set, the full comparison calculation can be formally described as:

M_i,j＝{C(i,j)|i<j,i＝1,2,…,m-1,j＝2,3,…,m} (1)

wherein M_i,j is the result of the comparison operation C(i, j), and all M_i,j together form the final result of the full comparison calculation;

assume that there are m data files in a dataset, each data file having a size of s_i, i = 1, 2, …, m; the formalization of the data files being the same or approximately the same size is described as follows:

when the comparison operation of two data files is carried out, the sum of the sizes of the two data files is in direct proportion to the calculation amount of the comparison operation; let c_ij represent the amount of computation of the comparison task C(i, j), then c_ij has the following relationship with data file i and data file j:

s_i + s_j ∝ c_ij, i < j, i = 1, 2, …, m-1, j = 2, 3, …, m (3)

from formula (2) it can be found that all c_ij will be the same or approximately the same; assuming that K comparison tasks are assigned to node p, each task being calculated as

x_kp is used to indicate whether a comparison task k is assigned to node p, and x_kp takes the value 0 or 1;

x_kp = 0 or 1, k = 1, 2, …, t_count, p = 1, 2, …, n (6)

with c_k representing the size of the k-th task, the task amount p_c on the node p in formula (4) can be redefined as:

from formula (7) and formula (8), the degree of deviation f_p of the calculation amount of the node p relative to c_avg can be obtained;

f_p＝|p_c-c_avg| (9)

where |·| denotes the absolute value function; the load balancing of the distributed system can then be described as:

let w_p represent the total size of the files distributed onto node p;

suppose the number of files distributed to node p is m' (m' ≤ m), and the size of file i in w_p is r_i; then:

let u_p (p = 1, 2, …, n) be the upper limit of storage space of node p; the sum of the sizes of the files distributed to node p cannot exceed u_p;

w_p≤u_p (12)

taking a formula (10) as a solving target, and obtaining a load balance solving model of the full comparison calculation in the distributed system as follows;

the scheduling of the full comparison tasks in the distributed system according to the formula (13) enables load balancing among the computing nodes to be realized, and data files required by each comparison task can be guaranteed to have data locality;

acquiring a data distribution algorithm of the distributed system for realizing optimized storage based on particle swarm optimization;

optimizing the storage space required by each node in the distributed system according to the data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for realizing load balancing, specifically calculating the optimal particle position for realizing load balancing in the distributed system according to the data distribution algorithm for realizing load balancing based on particle swarm optimization;

according to the data distribution algorithm for realizing optimized storage based on particle swarm optimization and the optimal particle position for load balancing, optimizing the storage space required by each node in the distributed system comprises the following steps:

initializing according to the optimal positions of the load-balanced particles, the number of nodes, a file size list, a task list and a node calculation amount to obtain initialized particle swarm parameters;

performing iterative optimization according to the initialized particle swarm parameters, and updating the positions of the particles when a position updating rule is met;

and recording the fitness corresponding to the optimal position obtained by each iteration, and coding the iteration records to obtain an optimization scheme of the storage space required by each node in the distributed system.

2. The method of claim 1, wherein the particle dimension is a product of a number of nodes in the distributed system and a number of comparison tasks.

3. The method of claim 1, wherein the first acceleration factor and the second acceleration factor are used to adjust the maximum step size of flight toward the individual optimal position and toward the global optimal position, respectively.

4. The method of claim 2, wherein said calculating the optimal location of load balancing particles in the distributed system according to the data distribution algorithm for load balancing based on particle swarm optimization comprises:

performing iterative computation on the data distribution algorithm for realizing load balancing based on particle swarm optimization according to a preset updating rule to obtain the optimal positions of the load-balanced particles in the distributed system; wherein the preset update rule comprises: an inertial weight update rule, a particle flight velocity update rule, and a particle position update rule.