CN110990367A

CN110990367A - Method for realizing GPS positioning cluster calculation performance optimization based on graph group clustering

Info

Publication number: CN110990367A
Application number: CN201911137142.1A
Authority: CN
Inventors: 陈亮; 邓翠珠; 戴传智
Original assignee: China Mobile Group Guangdong Co Ltd
Current assignee: China Mobile Group Guangdong Co Ltd
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2020-04-10

Abstract

The invention discloses a method for realizing GPS positioning cluster calculation performance optimization based on graph group clustering, which comprises the following steps: deploying and building the computing environment; putting a mobile operator's five-minute position snapshot table and base station longitude and latitude table into an HDFS distributed file system through external Kafka pushing and internal Flume receiving; reading the five-minute position snapshot table and the base station longitude and latitude table, and joining the two tables to obtain user position information data containing longitude and latitude information; deduplicating the user position data for one day in a set city from the user position information data to obtain a user longitude and latitude position statistical table; clustering the user longitude and latitude position information in the user longitude and latitude position statistical table by using a graph group clustering method; submitting the Spark program to a YARN cluster to run, and storing the obtained analysis result in the HDFS distributed file system. The invention reduces the amount of computation and the energy consumed by computation, and improves running performance.

Description

Method for realizing GPS positioning cluster calculation performance optimization based on graph group clustering

Technical Field

The invention relates to the field of big data processing, in particular to a method for realizing calculation performance optimization of a GPS positioning cluster based on graph group clustering.

Background

With the continuous development of mobile intelligent terminals, a variety of location-related services are offered to users, and such services will become a mainstream trend in future mobile terminal user services. GPS positioning on mobile signaling data is feasible with a graph community clustering algorithm, and its running performance can be optimized on a GPU cluster.

Although the prior art can apply the clustering method to computational analysis, in practical application, we find that the prior art scheme still has some inconveniences and disadvantages. The prior art has the following disadvantages:

For example, the invention with application number 201410360455.4 implements its program code in CUDA, a superset of the C language, and does not implement a distributed version of the K-means clustering method; it is therefore easily limited by the video memory of a single-machine GPU, and clustering computations on high-dimensional matrices are likely to fail. The invention with application number 201811589386.9 uses a distributed framework to process high-dimensional big data, but does not combine Hadoop with GPUs for acceleration.

The existing clustering method used for GPS positioning based on mobile signaling data needs to traverse all clusters, and has high energy consumption and poor operation performance.

Disclosure of Invention

The invention provides a method for realizing the calculation performance optimization of a GPS positioning cluster based on graph group clustering, aiming at overcoming the defects that the clustering method used for GPS positioning based on mobile signaling data in the prior art needs to traverse all clusters, and has high energy consumption and poor operation performance.

The primary objective of the present invention is to solve the above technical problems, and the technical solution of the present invention is as follows:

a method for realizing GPS positioning cluster calculation performance optimization based on graph group clustering comprises the following steps:

S1: building a GPU environment, a Spark cluster and a Hadoop cluster respectively, and building a GPU computing and analysis framework on the nodes equipped with the GPU environment;

S2: putting a mobile operator's five-minute position snapshot table and base station longitude and latitude table into an HDFS distributed file system through external Kafka pushing and internal Flume receiving;

S3: reading the five-minute position snapshot table and the base station longitude and latitude table from the HDFS distributed file system, and joining the two tables to obtain user position information data containing longitude and latitude information;

S4: deduplicating the user position data for one day in a set city from the user position information data to obtain a user longitude and latitude position statistical table;

S5: performing a mapPartitions operator operation on the user longitude and latitude position information in the user longitude and latitude position statistical table, and then clustering the user longitude and latitude pairs processed by the mapPartitions operator by using a graph group clustering method;

S6: submitting the Spark program to a YARN cluster to run, and storing the obtained analysis result in the HDFS distributed file system, wherein the analysis result is the base station attribution of each user.

Further, the fields in the five-minute position snapshot table of the mobile operator in step S2 include the number of the user terminal, the time of occurrence, and the base station cgi, and the fields in the longitude and latitude table of the base station are the base station cgi, the longitude, and the latitude.

Further, associating the two tables in step S3 means performing an inner join between the fields user terminal number, time of occurrence and base station cgi of the five-minute position snapshot table and the fields base station cgi, longitude and latitude of the base station longitude and latitude table.
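The inner join described for step S3 can be sketched in plain Python. This is only an illustration: the field names (`user_id`, `time`, `cgi`, `lon`, `lat`), the sample rows and the helper `inner_join_on_cgi` are hypothetical, and the join key is assumed to be the base station cgi because it is the only field the two tables share.

```python
# Hypothetical sample rows; field names are illustrative, not taken from the patent.
snapshots = [
    {"user_id": "u1", "time": "2019-11-19 08:00", "cgi": "460-00-1-101"},
    {"user_id": "u2", "time": "2019-11-19 08:05", "cgi": "460-00-1-102"},
    {"user_id": "u3", "time": "2019-11-19 08:05", "cgi": "460-00-1-999"},  # no matching station
]
stations = [
    {"cgi": "460-00-1-101", "lon": 113.26, "lat": 23.13},
    {"cgi": "460-00-1-102", "lon": 113.32, "lat": 23.12},
]


def inner_join_on_cgi(snapshots, stations):
    """Inner join: keep only snapshot rows whose cgi exists in the station table."""
    by_cgi = {s["cgi"]: s for s in stations}  # index the smaller table by the join key
    joined = []
    for row in snapshots:
        station = by_cgi.get(row["cgi"])
        if station is not None:
            # Attach the station coordinates to the user's snapshot row.
            joined.append({**row, "lon": station["lon"], "lat": station["lat"]})
    return joined


user_positions = inner_join_on_cgi(snapshots, stations)
```

Rows whose cgi has no entry in the base station table are dropped, which is the defining behavior of an inner join.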

Further, in step S4, the user location data in one day of the set city is deduplicated from the user location information data to obtain a user longitude and latitude location statistical table, which specifically includes:

S401: screening out, from the full user longitude and latitude position information data, the user longitude and latitude position information for the set city and date according to the fields longitude, latitude and time of occurrence;

S402: performing a deduplication operation on the user longitude and latitude position information screened out in step S401, keeping the first piece of position information of each user with the user terminal number as the unique identifier.
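A minimal sketch of the screening and deduplication in step S4 follows. It rests on assumptions the text does not state: the city is approximated by a longitude and latitude bounding box, the "first" record of a user is taken to be the earliest by time of occurrence, and the field names and the helper `first_position_per_user` are hypothetical.

```python
def first_position_per_user(rows, day, bbox):
    """Keep rows on `day` inside `bbox`, then the earliest row per user.

    bbox is (lon_min, lat_min, lon_max, lat_max); times are 'YYYY-MM-DD HH:MM'
    strings, so plain string comparison orders them chronologically.
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    first = {}
    for row in rows:
        on_day = row["time"].startswith(day)
        in_city = lon_min <= row["lon"] <= lon_max and lat_min <= row["lat"] <= lat_max
        if not (on_day and in_city):
            continue  # screening by date and position
        kept = first.get(row["user_id"])
        if kept is None or row["time"] < kept["time"]:
            first[row["user_id"]] = row  # deduplication: one row per user
    return list(first.values())


# Hypothetical sample data.
rows = [
    {"user_id": "u1", "time": "2019-11-19 08:05", "lon": 113.30, "lat": 23.10},
    {"user_id": "u1", "time": "2019-11-19 08:00", "lon": 113.26, "lat": 23.13},
    {"user_id": "u2", "time": "2019-11-18 23:55", "lon": 113.32, "lat": 23.12},  # wrong day
    {"user_id": "u3", "time": "2019-11-19 09:00", "lon": 116.40, "lat": 39.90},  # outside the box
]
bbox = (112.9, 22.5, 114.1, 23.9)
result = first_position_per_user(rows, "2019-11-19", bbox)
```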

Further, in step S5 a mapPartitions operator operation is performed on the user longitude and latitude position information in the user longitude and latitude position statistical table, and the user longitude and latitude pairs processed by the mapPartitions operator are then clustered by using a graph group clustering method; the specific steps are as follows:

S501: randomly dividing the user longitude and latitude position information in the user longitude and latitude position statistical table into a plurality of Partitions, performing a map function operation on each Partition, and extracting from the result the data required for clustering, namely user longitude and user latitude; the user longitude and latitude position information in the statistical table is held as RDD data;

S502: taking the obtained user longitude and user latitude information, initializing the N longitude and latitude positions as N vertices, each vertex forming a cluster on its own, and calculating the modularity M of the cluster network, wherein the calculation formula is as follows:

$$M = \frac{1}{2L} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ A_{ij} - \frac{k_i k_j}{2L} \right] \delta(c_i, c_j)$$

where L is the number of edges in the graph, N is the number of vertices, k_i is the degree of vertex i, A_ij is the corresponding entry of the adjacency matrix, c_i is the cluster of vertex i, and δ is the Kronecker function: δ(c_i, c_j) returns 1 if vertices i and j belong to the same cluster and 0 otherwise;

S503: randomly selecting two clusters as candidates for fusion, and calculating the modularity change ΔM that the fusion would cause;

S504: fusing the two clusters that give the largest increase ΔM, calculating the new modularity after the fusion, and recording it;

S505: repeating steps S503-S504, fusing one pair of clusters each time, calculating ΔM, and recording the new clustering and its modularity, until all vertices have been merged into a single cluster;

S506: examining all records of the clustering process, finding the clustering with the largest modularity value, and taking it as the final clustering structure.
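The agglomerative procedure of steps S502 to S506 can be sketched as below. Several things here are assumptions rather than the patent's method: the modularity is taken to be the standard Newman form, the graph is supplied as an explicit undirected edge list (the text does not say how edges between positions are built), every pair of clusters is evaluated instead of a random selection, and the modularity is recomputed from scratch rather than updated through ΔM. Choosing the merge with the highest resulting modularity is equivalent to choosing the largest ΔM.

```python
from itertools import combinations


def modularity(edges, community):
    """Newman modularity for an undirected, unweighted graph.

    Uses the per-cluster form sum_c [L_c / L - (d_c / 2L)^2], where L_c is the
    number of edges inside cluster c and d_c the sum of degrees in c.
    """
    L = len(edges)
    degree = {v: 0 for v in community}
    inside = {}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            inside[community[u]] = inside.get(community[u], 0) + 1
    degree_sum = {}
    for v, c in community.items():
        degree_sum[c] = degree_sum.get(c, 0) + degree[v]
    return sum(inside.get(c, 0) / L - (d / (2 * L)) ** 2 for c, d in degree_sum.items())


def greedy_merge(edges, vertices):
    community = {v: v for v in vertices}  # S502: each vertex is its own cluster
    history = [(modularity(edges, community), dict(community))]
    while len(set(community.values())) > 1:
        best = None
        for a, b in combinations(sorted(set(community.values())), 2):  # S503
            trial = {v: (a if c == b else c) for v, c in community.items()}
            m = modularity(edges, trial)
            if best is None or m > best[0]:
                best = (m, trial)
        community = best[1]                      # S504: apply the best merge
        history.append((best[0], dict(community)))  # S505: record, then repeat
    return max(history, key=lambda rec: rec[0])  # S506: highest-modularity record


# Hypothetical test graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_m, best_partition = greedy_merge(edges, range(6))
```

On this small graph the highest-modularity record separates the two triangles. The exhaustive pair search costs far more than an incremental ΔM update would, so it is suitable only for illustrating the control flow.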

Further, the step S505 specifically includes:

S5051: converting all RDD data in the Partition into NumPy data, comprising each pair of user longitude and user latitude, the modularity change ΔM and the new modularity M, and taking the converted data as the input; the output has the same length as the user longitude and latitude pairs and is a 3-dimensional NumPy type, in which the 1st dimension is the cluster group identifier and the 2nd and 3rd dimensions are the re-clustered data;

S5052: copying the input data from host to device, where host is the CPU and its memory and device is the GPU and its memory;

S5053: setting the grid and blocks for the GPU kernel function, where the grid is the set of all threads launched by one GPU kernel function, the grid contains a plurality of blocks, and each block contains a plurality of threads;

S5054: writing the GPU kernel with "assign each pair of vertices to the same cluster" as its algorithm logic, and running the GPU kernel on the GPU and its memory.
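The grid and block setup in step S5053 comes down to simple index arithmetic, which can be emulated on the CPU. The sketch below assumes a one-dimensional launch with one thread per data item and a block size of 256 threads; the helpers `launch_config` and `emulate_kernel` are hypothetical and stand in for a real GPU kernel launch.

```python
THREADS_PER_BLOCK = 256  # assumed block size for this sketch


def launch_config(n_items, threads_per_block=THREADS_PER_BLOCK):
    """Return (blocks_per_grid, threads_per_block) so that every item gets a thread."""
    blocks_per_grid = (n_items + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks_per_grid, threads_per_block


def emulate_kernel(data, kernel, threads_per_block=THREADS_PER_BLOCK):
    """CPU emulation of a 1-D kernel launch: each (block, thread) pair handles one index."""
    blocks, threads = launch_config(len(data), threads_per_block)
    out = [None] * len(data)
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            i = block_idx * threads + thread_idx  # global thread index
            if i < len(data):  # guard: the last block may have idle threads
                out[i] = kernel(data[i])
    return out


config = launch_config(1_000_000)
doubled = emulate_kernel(list(range(1000)), lambda x: x * 2)
```

The bounds check inside the loop mirrors the guard a real kernel needs whenever the item count is not a multiple of the block size.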

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

Through the graph group clustering method, the invention overcomes the drawback that traditional clustering methods must traverse all clusters by brute force, reducing the amount of computation and the energy consumption; it also uses GPUs for distributed cluster computing on a Hadoop/Spark framework, thereby improving running performance.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a flow chart of the community clustering algorithm.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

Example 1

Fig. 1 shows a flowchart of a method for optimizing the computation performance of a GPS positioning cluster based on graph community clustering.

In a specific embodiment, the environment is deployed on 3 servers equipped with GeForce GTX 1080 Ti GPUs, which includes building a GPU environment, a Spark cluster and a Hadoop cluster, and building a GPU computing and analysis framework on the nodes equipped with the GPU environment. Building the GPU environment comprises installing the NVIDIA driver and CUDA and performing the corresponding environment configuration.

It should be noted that the fields of the five-minute position snapshot table include the user terminal number, the time of occurrence and the base station cgi; the main fields of the base station longitude and latitude table are the base station cgi, longitude and latitude.

In a specific embodiment, the five-minute position snapshot table and the base station longitude and latitude table are read from the HDFS distributed file system, and an inner join is performed between the fields of the five-minute position snapshot table (user terminal number, time of occurrence, base station cgi) and the fields of the base station longitude and latitude table (base station cgi, longitude, latitude), finally obtaining user position information data containing longitude and latitude information;

in a specific embodiment, the specific steps are as follows:

S402: performing a deduplication operation on the user longitude and latitude position information screened out in step S401, keeping the first piece of position information (base station cgi, longitude and latitude) of each user with the user terminal number as the unique identifier.

In a specific embodiment, a mapPartitions operator is first used to randomly divide the RDD data to be processed into a plurality of Partitions, and a map function operation is then performed on each Partition, which helps improve the efficiency of the algorithm. The process of clustering the user longitude and latitude information in the user longitude and latitude position statistical table with the graph group clustering method comprises six steps, as shown in Fig. 2.
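The partition-wise processing described above can be emulated without a Spark installation. In PySpark, `RDD.mapPartitions` takes a function that receives an iterator over one partition's records and returns an iterable; the sketch below mimics that contract with plain lists. The round-robin split, the field names and the helpers `split_into_partitions`, `extract_lon_lat` and `map_partitions` are assumptions for illustration, and Spark's own partitioning may distribute rows differently.

```python
def split_into_partitions(records, num_partitions):
    """Round-robin split of the records into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions


def extract_lon_lat(partition):
    """Partition-level function: consumes an iterator of rows, yields (lon, lat) pairs."""
    for row in partition:
        yield (row["lon"], row["lat"])


def map_partitions(partitions, func):
    """Call func once per partition, as mapPartitions does, and collect each result."""
    return [list(func(iter(p))) for p in partitions]


# Hypothetical sample rows.
rows = [{"user_id": f"u{i}", "lon": 113.0 + i * 0.01, "lat": 23.0 + i * 0.01} for i in range(5)]
partitions = split_into_partitions(rows, 2)
pairs_per_partition = map_partitions(partitions, extract_lon_lat)
```

Because the function is invoked once per partition rather than once per record, any per-call setup such as converting to an array or copying data to a GPU is paid once per partition.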

in step S505, mapPartitions operator calculation and GPU acceleration are performed, and the specific steps include:

It should be noted that the modularity is a measure of the quality of a graph community division: the larger its value, the better the division.

S5052: copying input data from a CPU to a GPU;

in a specific embodiment, block is set to 256,

S6: submitting the Spark program to a YARN cluster to run, and storing the obtained analysis result in the HDFS distributed file system, wherein the analysis result is the base station attribution of each user.

In this embodiment, 1,000,000 user longitude and latitude position records are used for a clustering test; the Spark program is submitted to a YARN cluster to run, and the running times with and without the GPU are measured. The result is that the position clustering algorithm takes 3.6 s with the GPU and 27.4 s without it, so GPU cluster computing gives the graph group clustering algorithm a speedup of more than 6 times (27.4 s / 3.6 s ≈ 7.6).

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A method for realizing GPS positioning cluster calculation performance optimization based on graph group clustering is characterized by comprising the following steps:

2. The method of claim 1, wherein the GPS positioning cluster computing performance optimization is realized based on graph community clustering,

the fields in the five-minute position snapshot table of the mobile operator in step S2 include the user terminal number, the time of occurrence, and the base station cgi, and the fields in the base station longitude and latitude table are the base station cgi, longitude, and latitude.

3. The method as claimed in claim 2, wherein associating the two tables in step S3 means performing an inner join between the fields user terminal number, time of occurrence and base station cgi of the five-minute position snapshot table and the fields base station cgi, longitude and latitude of the base station longitude and latitude table.

4. The method according to claim 1, wherein in step S4, the user location data in one day of a set city is deduplicated from the user location information data to obtain a user longitude and latitude location statistical table, and the specific steps are as follows:

5. The method according to claim 1, wherein in step S5, a mapPartitions operator operation is performed on the user longitude and latitude position information in the user longitude and latitude position statistical table, and the user longitude and latitude pairs processed by the mapPartitions operator are then clustered by using a graph community clustering method; the specific steps are as follows:

s502: according to the longitude and user latitude information of the acquired user, initializing N longitude and latitude position information as N vertexes, wherein each vertex independently forms a cluster, and calculating the modularity M of the cluster network, wherein the calculation formula is as follows:

6. The method for optimizing the calculation performance of the GPS positioning cluster based on the graph community clustering according to claim 5, wherein the specific process of step S505 is as follows: