CN110245022B - Parallel Skyline processing method and system under mass data - Google Patents


Info

Publication number
CN110245022B
Authority
CN
China
Prior art keywords
skyline
data
area
calculation
local
Prior art date
Legal status
Active
Application number
CN201910543347.3A
Other languages
Chinese (zh)
Other versions
CN110245022A
Inventor
鲁芹
梁心美
李名玉
Current Assignee
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN201910543347.3A
Publication of CN110245022A
Application granted
Publication of CN110245022B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The disclosure provides a parallel Skyline processing method and system under mass data, comprising: distributing web data to worker nodes: the web data are uploaded to HDFS (the Hadoop Distributed File System), split by HDFS, and the resulting data blocks are distributed to worker nodes for parallel computation; Skyline computation on the worker nodes: local candidate Skyline services are obtained in a local Skyline computation stage, each local candidate Skyline service is then transmitted over the network to the master node, and the master node's Skyline computation finally yields the global Skyline services; Skyline computation on the master node: the candidate Skyline services of all worker nodes are aggregated, and an improved Skyline algorithm divides all data into 4 regions to obtain the global Skyline services. The traditional Skyline algorithm is improved by partitioning the service set into regions, which removes a large number of data points that have no dominance relation and saves memory, and the improved Skyline algorithm is parallelized on the Spark platform.

Description

Parallel Skyline processing method and system under mass data
Technical Field
The disclosure relates to the technical field of computers, in particular to a parallel Skyline processing method and system under mass data.
Background
With the rapid development of service computing, more and more services share the same functional attributes but differ in their non-functional attributes. Traditional web service selection methods face great challenges when processing mass data, so quickly and effectively finding web services that satisfy different user requirements in massive web service data has become an urgent problem.
In addition, the rapid development of Internet technology generates a large amount of data, and mining valuable data from it has become an urgent problem. The Hadoop and Spark big data platforms were born in this context. Hadoop mainly operates on disk and stores intermediate results on disk, whereas Spark mainly computes in memory and introduces the RDD data model, so Spark iterates algorithms far faster than Hadoop; since its birth, Spark has quickly become a research focus.
Disclosure of Invention
The purpose of the embodiments of this specification is to provide a parallel Skyline processing method under mass data. The obtained global Skyline services are not dominated by any other data and reflect the distribution of the entire data set well, so web services meeting different requirements can be selected from these global Skyline services.
The embodiment of the specification provides a parallel Skyline processing method under mass data, which is realized by the following technical scheme:
the method comprises the following steps:
distributing web data to worker nodes: uploading the web data to HDFS (the Hadoop Distributed File System), splitting the data through HDFS, and distributing the resulting data blocks to worker nodes for parallel computation;
performing Skyline computation on the worker nodes: obtaining local candidate Skyline services in a local Skyline computation stage, transmitting each local candidate Skyline service to the master node over the network, and finally obtaining the global Skyline services through the master node's Skyline computation;
performing Skyline computation on the master node: aggregating the candidate Skyline services of all worker nodes, dividing all data into 4 regions with the improved Skyline algorithm, merging the data points of region 1 and region 3, implementing the Bitmap algorithm logic with Spark operators, and computing the final Skyline points of region 1 and region 3, thereby obtaining the global Skyline services.
In a further technical scheme, a QoS vector set is obtained by parsing the web service data, keys corresponding to the web services are then generated according to a certain distribution strategy, the whole web service data set is divided into different groups, and the web data of groups with the same key value are distributed to the same node for Skyline point calculation.
In a further technical scheme, the local Skyline computation processes the web service data distributed to the node, finds the point with the minimum QoS attribute in the local data through a Spark operator — this point is necessarily a Skyline point — and then performs region division only once, at that minimum point.
In a further technical scheme, the data set is divided into 4 regions, the data of region 1 and region 3 dominate region 2 and region 4, and the final computation regions are merged: the data points of region 1 and region 3 are combined, the Bitmap algorithm logic is implemented with Spark operators, and the final Skyline points of region 1 and region 3 are computed.
According to the further technical scheme, the global Skyline service is output to the user for selection.
The embodiments of this specification provide a parallel Skyline processing system under mass data, which is realized by the following technical scheme:
the system comprises:
a web data distribution module, which uploads web data to HDFS (the Hadoop Distributed File System), splits the data through HDFS, and distributes the resulting data blocks to worker nodes for parallel computation;
a worker-node Skyline computation module, which obtains local candidate Skyline services in a local Skyline computation stage, transmits each local candidate Skyline service to the master node over the network, and finally obtains the global Skyline services through the master node's Skyline computation;
a master-node Skyline computation module, which aggregates the candidate Skyline services of all worker nodes, divides all data into 4 regions with the improved Skyline algorithm, merges the data points of region 1 and region 3, implements the Bitmap algorithm logic with Spark operators, and computes the final Skyline points of region 1 and region 3, thereby obtaining the global Skyline services.
Compared with the prior art, the beneficial effects of this disclosure are:
the method improves the traditional Skyline algorithm by partitioning the service set into regions, which greatly reduces the number of data points without a dominance relation and saves memory; the improved Skyline algorithm is parallelized on the Spark platform, and experiments show that the parallelized Skyline algorithm handles massive web service data well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and, together with the description, serve to explain the disclosure; they do not limit the disclosure.
FIG. 1 is a schematic diagram illustrating data region partitioning according to an embodiment of the present disclosure;
FIG. 2 is a diagram of a Hadoop web management page, according to an example embodiment of the present disclosure;
FIG. 3 is a diagram of a web management page for Spark, according to an example of the present disclosure;
FIG. 4 is a process diagram of a master node of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the Spark UI monitoring the running state of the cluster according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of Spark Jobs in accordance with an exemplary embodiment of the disclosure;
FIG. 7 is a diagram of the experimental results obtained with the third and ninth columns of the QWS data, according to an example of the present disclosure;
FIG. 8 is a diagram of the experimental results obtained with the second and third columns of the QWS data, according to an example of the present disclosure;
FIG. 9 is a graph of the results of the parallelized algorithm on a data volume of 5 million records, according to an embodiment of the disclosure;
FIG. 10 is a graph of the results of the parallelized algorithm on a data volume of 10 million records, according to an embodiment of the disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example I
This embodiment discloses a parallel Skyline processing method under mass data. It addresses the problem that when the volume of processed web service data reaches hundreds of thousands of records, the computation time of the improved Skyline algorithm becomes very long, and when the data volume grows into the millions and beyond, a single computer stalls or cannot even compute a final result; computation therefore has to rely on a big data computing framework, and the Skyline algorithm is designed to run in parallel.
The improved Skyline algorithm in this embodiment performs region division only once, at the minimum point, in order to identify the dominated regions. (A service dominates another if it is at least as good in every QoS attribute and strictly better in at least one; Skyline points are the services dominated by no other service.)
The whole parallel computation process of this embodiment is divided into three stages: distributing the web data to the worker nodes, worker-node Skyline computation, and global Skyline computation.
In a specific implementation example, distributing web data to worker nodes: this is the first stage of parallel computation. The web data are uploaded to HDFS (the Hadoop Distributed File System), split by HDFS, and the resulting data blocks are distributed to worker nodes for parallel computation. A QoS vector set is obtained by parsing the web service data, keys corresponding to the web services are then generated according to a certain distribution strategy so that the whole web service data set is divided into different groups, and the web data of groups with the same key value are distributed to the same node for Skyline point calculation.
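As an illustration of this first stage, the sketch below (Scala on Spark) reads the service data from HDFS, parses each line into a QoS vector, derives a grouping key, and partitions the records so that services with the same key land on the same worker. The HDFS path, the comma-separated input format, the number of groups, and the hash-based key strategy are assumptions made only for this example; the patent merely states that keys are generated according to a certain distribution strategy.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object DistributeStage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SkylineDistribute").getOrCreate()
    val sc = spark.sparkContext

    val numGroups = 8 // hypothetical number of groups / worker partitions

    // Parse each HDFS line into a QoS vector (input assumed to be comma-separated numbers).
    val qosVectors = sc.textFile("hdfs:///data/web-services/qos.csv") // illustrative path
      .map(_.split(",").map(_.trim.toDouble).toVector)

    // Generate a key per service (here a simple hash) and partition by key so that
    // records with the same key value end up on the same worker node, ready for
    // the local Skyline computation of the next stage.
    val grouped = qosVectors
      .map(v => ((v.hashCode & Int.MaxValue) % numGroups, v))
      .partitionBy(new HashPartitioner(numGroups))

    grouped.persist()
    println(s"number of partitions: ${grouped.getNumPartitions}")
    spark.stop()
  }
}
```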
In a specific implementation example, the worker node performs Skyline computation: this is the second stage of parallel computation. The local Skyline computation processes the web service data distributed to the node and finds the point with the minimum QoS attribute in the local data through a Spark operator — this point is necessarily a Skyline point. Region division is then performed only once, at that minimum point, dividing the data set into 4 regions as shown in FIG. 1, so that the dominated regions can be identified.
The data of region 1 and region 3 dominate region 2 and region 4, which effectively filters out dominance checks between regions that have no dominance relation. The final computation regions are then merged: the data points of region 1 and region 3 are combined, the Bitmap algorithm logic is implemented with Spark operators, and the final Skyline points of region 1 and region 3 are computed.
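The following sketch spells out that local computation for the data held by one partition: a pivot with minimal QoS values is selected, every other point is assigned to one of the four regions relative to the pivot, the dominated region is pruned, and the Skyline of the remaining candidates is computed. Two simplifications are assumed and are not taken from the patent: the attributes are treated as cost-type (smaller is better) with the pivot chosen as the point of minimal attribute sum, and the final step uses a plain pairwise dominance check in place of the Bitmap encoding; the quadrant numbering of FIG. 1 is only approximated.

```scala
object LocalSkyline {
  type QoS = Vector[Double] // cost-type QoS attributes: smaller is better

  /** p dominates q iff p is no worse in every attribute and strictly better in at least one. */
  def dominates(p: QoS, q: QoS): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } && p.zip(q).exists { case (a, b) => a < b }

  /** Region of a point relative to the pivot (2-D illustration).
    * The two "mixed" quadrants play the role of regions 1 and 3 (candidates);
    * the quadrant dominated by the pivot plays the role of region 2 and is pruned;
    * the remaining quadrant (better in both attributes) cannot occur when the
    * pivot is a point of minimal attribute sum. */
  def region(p: QoS, pivot: QoS): Int = (p(0) < pivot(0), p(1) < pivot(1)) match {
    case (true, false)  => 1
    case (false, true)  => 3
    case (false, false) => 2
    case (true, true)   => 4
  }

  /** Pairwise Skyline check (stand-in for the Bitmap algorithm of the patent). */
  def skyline(points: Seq[QoS]): Seq[QoS] =
    points.filter(p => !points.exists(q => q != p && dominates(q, p)))

  /** Local Skyline of one partition: one-time region division at the pivot,
    * pruning of the dominated region, Skyline of the merged candidate regions. */
  def localSkyline(points: Seq[QoS]): Seq[QoS] = {
    if (points.isEmpty) return Seq.empty
    val pivot = points.minBy(_.sum)          // minimal-sum point, always a Skyline point
    val candidates = points.filter(p => p != pivot && region(p, pivot) != 2)
    pivot +: skyline(candidates)
  }
}
```

The pairwise check is quadratic in the number of candidates, which is tolerable here only because the one-time region division has already discarded the region dominated by the pivot.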
The local Skyline computation stage yields the local candidate Skyline services; each local candidate Skyline service is then transmitted to the master node over the network, and the master node's Skyline computation finally yields the global Skyline services.
In a specific implementation example, the master node performs Skyline computation: this is the third stage of parallel computation. It aggregates the candidate Skyline services of all worker nodes, divides all data into 4 regions as shown in FIG. 1 with the improved Skyline algorithm, merges the data points of region 1 and region 3, implements the Bitmap algorithm logic with Spark operators, and computes the final Skyline points of region 1 and region 3. This yields the global Skyline services, which may be output to the user for selection. The global Skyline services are not dominated by any other data and reflect the distribution of the whole data set well, so web services meeting different requirements can be selected from them.
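Continuing the two sketches above, the wiring of the three stages could look as follows: each partition emits its local candidate Skyline services with `mapPartitions`, the candidates are gathered on the master (the Spark driver), and the same region-dividing routine is applied once more to obtain the global Skyline services. Collecting to the driver is an assumption made here because the candidate set is expected to be small; the patent does not prescribe this particular operator choice.

```scala
// Stage 2: local candidate Skyline services, computed independently on each worker partition
// of the keyed RDD `grouped` from the first sketch.
val localCandidates = grouped.mapPartitions { it =>
  LocalSkyline.localSkyline(it.map(_._2).toSeq).iterator
}

// Stage 3: gather all local candidates on the master node and apply the same
// improved (region-dividing) Skyline computation once more; the result is the
// set of global Skyline services, which can be output for the user to choose from.
val globalSkyline = LocalSkyline.localSkyline(localCandidates.collect().toSeq)
globalSkyline.foreach(println)
```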
In this method, when the data volume is small, the algorithm takes longer than on a single machine because starting up the Spark cluster is comparatively time-consuming; when the data volume is very large, the parallelized Skyline algorithm can still compute the result. The time taken by the parallelized algorithm was measured for data sets of different sizes.
Table 1: algorithmic temporal comparison
The experiment has two parts. The data set of the first part is the QWS Dataset, which contains 2507 real, valid web services spanning multiple functional domains. The QWS Dataset has 11 QoS attributes in total, and the first 8 QoS attributes were selected for the experiment. The experiments mainly verify the results on a single machine and in distributed mode.
The data set of the second part of the experiment is a large simulated data set; when the data volume reaches more than ten million records, a single machine runs slowly or cannot run at all, and this part verifies how the algorithm performs in distributed mode.
Experimental analysis: first, a Hadoop cluster of three nodes was built, and a Spark cluster was built on top of it. After Hadoop is started, the interface shown in FIG. 2 can be seen on the web management page; after Spark is started, the page shown in FIG. 3 can be seen. On the master node, the processes shown in FIG. 4 can be observed.
Meanwhile, FIG. 5 shows the improved Skyline algorithm submitted to the Spark cluster for execution; during execution, the running state of the cluster is monitored through the Spark UI.
The Spark jobs are shown in FIG. 6. The first part of the experiment was then run, selecting two QoS attributes of the QWS dataset: the third column (response time) and the ninth column (latency). The experimental results are shown in FIG. 7; the Skyline points obtained by the improved Skyline algorithm are the same on a single machine and in distributed mode and are marked by circles in the figure, namely: (0.4000, 6.0000), (0.7000, 2.0000), (0.3000, 64.0000), (1.0000).
Two QoS attributes of the QWS dataset, the second column (availability) and the third column (reliability), were selected for the following experiment. Since these two columns are benefit-type data, larger values are better. The experimental results are shown in FIG. 8; the Skyline points obtained by the improved Skyline algorithm are the same on a single machine and in distributed mode and are marked by circles in the figure, namely: (67.0000, 0.5000), (48.0000, 0.8000), (13.0000, 0.9000), (83.0000, 0.3000), (12.0000, 1.7000).
Experiment one verifies the correctness of the parallelized Skyline algorithm.
Next, the second part of the experiments was run. The experimental data are large simulated data sets with 5 million and 10 million records respectively; at these sizes a single machine locks up and cannot compute a final result, whereas under distributed parallelization the final result is computed easily. FIG. 9 shows the result for 5 million records, and FIG. 10 shows the result for 10 million records.
Processing massive data on a single machine places high demands on computer hardware, occupies system resources, reduces algorithm efficiency, and can even jam the computer so that no final result can be computed; the parallelized method described above avoids these problems.
Example II
The embodiments of this specification provide a parallel Skyline processing system under mass data, which is realized by the following technical scheme:
the system comprises:
a web data distribution module, which uploads web data to HDFS (the Hadoop Distributed File System), splits the data through HDFS, and distributes the resulting data blocks to worker nodes for parallel computation;
a worker-node Skyline computation module, which obtains local candidate Skyline services in a local Skyline computation stage, transmits each local candidate Skyline service to the master node over the network, and finally obtains the global Skyline services through the master node's Skyline computation;
a master-node Skyline computation module, which aggregates the candidate Skyline services of all worker nodes, divides all data into 4 regions with the improved Skyline algorithm, merges the data points of region 1 and region 3, implements the Bitmap algorithm logic with Spark operators, and computes the final Skyline points of region 1 and region 3, thereby obtaining the global Skyline services.
For the implementation of the specific modules in this embodiment, refer to the parallel Skyline processing method under mass data in Example I; a detailed description is not repeated here.
Example III
This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the parallel Skyline processing method under mass data of Example I.
Example IV
This embodiment provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the parallel Skyline processing method under mass data of Example I.
It is to be understood that throughout the description of the present specification, reference to the term "one embodiment", "another embodiment", "other embodiments", or "first through nth embodiments", etc., is intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or materials described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (6)

1. A parallel Skyline processing method under mass data, characterized by comprising the following steps:
distributing web data to worker nodes: uploading the web data to HDFS (the Hadoop Distributed File System), splitting the data through HDFS, and distributing the resulting data blocks to worker nodes for parallel computation;
performing Skyline computation on the worker nodes: obtaining local candidate Skyline services in a local Skyline computation stage, transmitting each local candidate Skyline service to the master node over the network, and finally obtaining the global Skyline services through the master node's Skyline computation; the local Skyline computation processes the distributed web service data, finds the point with the minimum QoS attribute in the local data through a Spark operator, this point being a Skyline point, and then performs region division only once at that minimum point; the data set is divided into 4 regions, wherein the data of region 1 and region 3 dominate region 2 and region 4, the final computation regions are merged, the data points of region 1 and region 3 are combined, the Bitmap algorithm logic is implemented with Spark operators, and the final Skyline points of region 1 and region 3 are computed;
performing Skyline computation on the master node: aggregating the candidate Skyline services of all worker nodes, dividing all data into 4 regions with the improved Skyline algorithm, merging the data points of region 1 and region 3, implementing the Bitmap algorithm logic with Spark operators, and computing the final Skyline points of region 1 and region 3, thereby obtaining the global Skyline services.
2. The parallel Skyline processing method under mass data according to claim 1, wherein a QoS vector set is obtained by parsing, keys corresponding to the web services are then generated according to a certain distribution strategy, the whole web service data set is divided into different groups, and the web data of groups with the same key value are distributed to the same node for Skyline point calculation.
3. The parallel Skyline processing method under mass data according to claim 1, wherein the global Skyline services are output to the user for selection.
4. A parallel Skyline processing system under mass data, characterized by comprising:
a web data distribution module, which uploads web data to HDFS (the Hadoop Distributed File System), splits the data through HDFS, and distributes the resulting data blocks to worker nodes for parallel computation;
a worker-node Skyline computation module, which obtains local candidate Skyline services in a local Skyline computation stage, transmits each local candidate Skyline service to the master node over the network, and finally obtains the global Skyline services through the master node's Skyline computation; the local Skyline computation processes the distributed web service data, finds the point with the minimum QoS attribute in the local data through a Spark operator, this point being a Skyline point, and then performs region division only once at that minimum point; the data set is divided into 4 regions, wherein the data of region 1 and region 3 dominate region 2 and region 4, the final computation regions are merged, the data points of region 1 and region 3 are combined, the Bitmap algorithm logic is implemented with Spark operators, and the final Skyline points of region 1 and region 3 are computed;
a master-node Skyline computation module, which aggregates the candidate Skyline services of all worker nodes, divides all data into 4 regions with the improved Skyline algorithm, merges the data points of region 1 and region 3, implements the Bitmap algorithm logic with Spark operators, and computes the final Skyline points of region 1 and region 3, thereby obtaining the global Skyline services.
5. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the parallel Skyline processing method under mass data according to any one of claims 1-3.
6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the parallel Skyline processing method under mass data according to any one of claims 1-3.
CN201910543347.3A 2019-06-21 2019-06-21 Parallel Skyline processing method and system under mass data Active CN110245022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910543347.3A CN110245022B (en) 2019-06-21 2019-06-21 Parallel Skyline processing method and system under mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910543347.3A CN110245022B (en) 2019-06-21 2019-06-21 Parallel Skyline processing method and system under mass data

Publications (2)

Publication Number Publication Date
CN110245022A CN110245022A (en) 2019-09-17
CN110245022B true CN110245022B (en) 2021-11-12

Family

ID=67888762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910543347.3A Active CN110245022B (en) 2019-06-21 2019-06-21 Parallel Skyline processing method and system under mass data

Country Status (1)

Country Link
CN (1) CN110245022B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688993B (en) * 2019-12-10 2020-04-17 中国人民解放军国防科技大学 Spark operation-based computing resource determination method and device
CN112787870B (en) * 2021-02-25 2021-11-02 苏州大学 Parallel flexible Skyline service discovery method with service quality perception


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7707207B2 (en) * 2006-02-17 2010-04-27 Microsoft Corporation Robust cardinality and cost estimation for skyline operator
US20130132148A1 (en) * 2011-11-07 2013-05-23 Ecole Polytechnique Federale De Lausanne (Epfl) Method for multi-objective quality-driven service selection
CN104809210B (en) * 2015-04-28 2017-12-26 东南大学 One kind is based on magnanimity data weighting top k querying methods under distributed computing framework

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254016A (en) * 2011-07-22 2011-11-23 中国人民解放军国防科学技术大学 Cloud-computing-environment-oriented fault-tolerant parallel Skyline inquiry method
CN106055674A (en) * 2016-06-03 2016-10-26 东南大学 top-k arrangement query method based on metric space in distributed environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Efficient Hybrid QoS-Aware Web Service Selection Methods; 张晓侠; China Masters' Theses Full-text Database, Information Science and Technology; 2016-08-15; main text, pages 1 and 33-55 *

Also Published As

Publication number Publication date
CN110245022A (en) 2019-09-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant