CN107729418B - SPARK and DBSCAN-based distributed visitor identification type method - Google Patents

Info

Publication number
CN107729418B
Authority
CN
China
Prior art keywords
mobile phone
key value
phone number
dbscan
signaling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710891930.4A
Other languages
Chinese (zh)
Other versions
CN107729418A (en)
Inventor
肖定和
于建港
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan Zhongzhixin Information Technology Co ltd
Original Assignee
Hainan Zhongzhixin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-09-27
Filing date
2017-09-27
Publication date
2020-11-17

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/535 Tracking the activity of the user

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a SPARK and DBSCAN-based distributed method for identifying visitor types, comprising the following steps: preliminarily screening mobile phone signaling data to obtain key-value pairs of the different base stations each user visits every day; hashing the signaling of the same mobile phone number over a period of time into the same Partition; mapping the signaling of the same mobile phone number in the same Partition into a key-value pair of the mobile phone number and a DBSCANPOINTS array; implementing a single-machine version of the DBSCAN algorithm; clustering the DBSCANPOINTS data, held as key-value pairs, with the DBSCAN algorithm; and obtaining key-value pairs of the mobile phone number and the category number. The invention can accurately count the number of out-of-province tourists from mobile phone signaling, thereby improving the reliability of tourism statistics.

Description

SPARK and DBSCAN-based distributed visitor identification type method
Technical Field
The invention relates to identification technology in the field of information processing, and in particular to a method for identifying tourist types.
Background
Existing statistics on the numbers of out-of-province workers and tourists often rely on questionnaire surveys, which have drawbacks: survey subjects and survey scope are difficult to select and control, and answer quality is unstable. In recent years, the number of out-of-province tourists has been counted by combining operators' mobile phone signaling with big data technology. Although every mobile phone number has a home location, local residents may use out-of-province numbers and people from other provinces may use local numbers, so out-of-province tourists cannot be distinguished by a number's home location alone.
SPARK is an in-memory distributed computing framework for big data. Operators' mobile phone signaling data is large (one day of data for Hainan province is about 1 TB), so the SPARK framework is adopted to process it. DBSCAN is a representative density-based clustering algorithm. At present there is no official distributed implementation of DBSCAN on SPARK, and the open-source third-party implementations available online partition a single group of data, cluster each part, and then merge the results. When many groups of data must be clustered this way, they can only be processed one after another, which is inefficient and complicates the processing flow.
Disclosure of Invention
Therefore, the invention aims to provide a SPARK and DBSCAN-based distributed tourist type identification method that can identify the types of out-of-province personnel from mobile phone signaling data and count the number of out-of-province tourists.
In the present invention, the following abbreviations and keywords are used:
SPARK: an in-memory distributed computing framework
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a density-based clustering algorithm
RDD (Resilient Distributed Datasets): resilient distributed dataset
Partition: the smallest unit of an RDD
A distributed visitor type identification method based on SPARK and DBSCAN comprises the following steps:
preliminarily screening the mobile phone signaling data to obtain key-value pairs of the different base stations each user visits every day;
hashing the signaling of the same mobile phone number over a period of time into the same Partition;
mapping the signaling of the same mobile phone number in the same Partition into a key-value pair of the mobile phone number and a DBSCANPOINTS array;
implementing a single-machine version of the DBSCAN algorithm;
clustering the DBSCANPOINTS data, held as key-value pairs, with the DBSCAN algorithm;
obtaining key-value pairs of the mobile phone number and the category number;
and aggregating by category number to obtain the numbers of out-of-province workers and tourists.
Compared with the prior art, the invention has the beneficial effects that:
The invention is based on the SPARK distributed computing framework, which can distribute computing tasks across multiple computers and thereby process large volumes of data. The DBSCAN clustering algorithm is used to cluster multiple groups of data simultaneously, and by running DBSCAN on the SPARK framework the types of out-of-province personnel can be accurately identified, so that the number of out-of-province tourists can be counted.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only preferred embodiments of the present invention; those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a distributed visitor identification type method based on SPARK and DBSCAN according to an embodiment of the present invention.
Detailed Description
The principles and features of the invention are described below with reference to the drawing. The illustrated embodiments are provided to explain the invention and not to limit its scope.
The invention provides a distributed visitor type identification method based on SPARK and DBSCAN which, as shown in Fig. 1, comprises: step 1, preliminarily screening mobile phone signaling data to obtain key-value pairs of the different base stations each user visits every day; step 2, hashing the signaling of the same mobile phone number over a period of time into the same Partition; step 3, mapping the signaling of the same mobile phone number in the same Partition into a key-value pair of the mobile phone number and a DBSCANPOINTS array; step 4, implementing a single-machine version of the DBSCAN algorithm; step 5, clustering the DBSCANPOINTS data, held as key-value pairs, with the DBSCAN algorithm; step 6, obtaining key-value pairs of the mobile phone number and the category number; and step 7, aggregating by category number to obtain the numbers of out-of-province workers and tourists.
Specifically, the preliminary screening of the mobile phone signaling data in the step 1 includes:
using the SPARK distributed computing framework, first screening, for each user and each day, the base stations at which the user's dwell time is less than 1 hour, and then removing the base stations that appear repeatedly among the screened base stations, to obtain key-value pairs of the different base stations each user visits every day, where each key-value pair has the form (mobile phone number, base station ID).
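The screening in step 1 can be sketched in plain Python as a stand-in for the SPARK job. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout `(phone, day, station, dwell_hours)` is hypothetical, and the sketch assumes that base stations with a dwell time under 1 hour are discarded before duplicates are removed, which is one reading of the text.

```python
from collections import defaultdict

# Hypothetical record layout: (phone_number, day, base_station_id, dwell_hours).
records = [
    ("13800000001", "2017-09-01", "BS_A", 5.0),
    ("13800000001", "2017-09-01", "BS_A", 2.5),  # same station again on the same day
    ("13800000001", "2017-09-01", "BS_B", 0.2),  # dwell time under 1 hour
    ("13800000001", "2017-09-02", "BS_C", 3.0),
    ("13800000002", "2017-09-01", "BS_A", 0.5),  # dwell time under 1 hour
]

def screen(records, min_dwell_hours=1.0):
    """Return {day: [(phone, station), ...]} with short stays and repeats removed.

    Assumption: a station is dropped when the dwell time is below the threshold.
    """
    seen = set()
    pairs_by_day = defaultdict(list)
    for phone, day, station, dwell_hours in records:
        if dwell_hours < min_dwell_hours:
            continue  # assumed meaning of the 1-hour screening
        key = (phone, day, station)
        if key in seen:
            continue  # the station already appeared for this user on this day
        seen.add(key)
        pairs_by_day[day].append((phone, station))
    return dict(pairs_by_day)

daily_pairs = screen(records)
```

Each surviving entry is a (mobile phone number, base station ID) pair for one day, which is the form the later steps consume.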
Specifically, hashing the signaling of the same mobile phone number over a period of time into the same Partition in step 2 includes:
hashing the signaling of the same mobile phone number into the same Partition of the RDD by using the groupByKey operator of the SPARK framework, so that the signaling of a given mobile phone number exists in only one Partition.
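The effect of step 2 can be imitated without SPARK. The sketch below is an illustration only: `NUM_PARTITIONS`, `partition_of` and `group_by_key` are hypothetical names, and `zlib.crc32` is used as a deterministic stand-in for whatever hash function the real partitioner applies.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_of(phone, num_partitions=NUM_PARTITIONS):
    """Map a phone number to a partition index with a deterministic hash."""
    return zlib.crc32(phone.encode("utf-8")) % num_partitions

def group_by_key(pairs, num_partitions=NUM_PARTITIONS):
    """Group (phone, station) pairs so that each phone lives in exactly one partition."""
    partitions = [dict() for _ in range(num_partitions)]
    for phone, station in pairs:
        bucket = partitions[partition_of(phone, num_partitions)]
        bucket.setdefault(phone, []).append(station)
    return partitions

pairs = [
    ("13800000001", "BS_A"),
    ("13800000002", "BS_A"),
    ("13800000001", "BS_C"),
    ("13800000003", "BS_D"),
]
partitions = group_by_key(pairs)
```

Because the partition index depends only on the key, every record for a phone number lands in the same bucket, so a later per-key computation never needs data from another partition.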
Specifically, step 3 maps the signaling of the same mobile phone number in the same Partition into a key-value pair of the mobile phone number and a DBSCANPOINTS array. This step yields, for each mobile phone number, the array of points that needs to be clustered with DBSCAN.
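A possible shape for the mapping in step 3 is sketched below. The text does not say how a signaling record becomes a point, so this sketch assumes a lookup table `STATION_COORDS` from base station ID to planar coordinates in meters; that table, the `Point` alias and the `to_points` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # assumed: planar (x, y) position in meters

# Hypothetical base-station coordinate table.
STATION_COORDS: Dict[str, Point] = {
    "BS_A": (0.0, 0.0),
    "BS_B": (500.0, 0.0),
    "BS_C": (12000.0, 3000.0),
}

# Output of the grouping step: phone number -> base stations visited (one entry per day).
grouped: Dict[str, List[str]] = {
    "13800000001": ["BS_A", "BS_B", "BS_A"],
    "13800000002": ["BS_C"],
}

def to_points(grouped: Dict[str, List[str]], coords: Dict[str, Point]) -> Dict[str, List[Point]]:
    """Turn each phone's station list into the array of points to be clustered."""
    return {
        phone: [coords[station] for station in stations if station in coords]
        for phone, stations in grouped.items()
    }

points_by_phone = to_points(grouped, STATION_COORDS)
```

Under this assumption a station visited on several days contributes several identical points, which is what lets a minimum-points threshold act as a minimum number of days.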
Specifically, step 4 implements a single-machine version of the DBSCAN algorithm, written in the SCALA language. The DBSCAN interface class LocalDBSCAN takes two parameters, eps of type Double and minPoints of type Int, and the interface is defined as follows:
[Figure BDA0001421378300000031: definition of the LocalDBSCAN interface (image not reproduced)]
Specifically, clustering the key-value pairs of mobile phone number and DBSCANPOINTS obtained in step 3 with the DBSCAN algorithm in step 5 includes:
using the map operator of the SPARK framework, the DBSCANPOINTS arrays obtained in step 3 are clustered with the DBSCAN algorithm of step 4. The eps parameter of the LocalDBSCAN interface from step 4 is set to 800 and minPoints to 11, which yields a clustering result for each mobile phone number. In the clustering result, 0 means that all points are noise points, and a number N means that all points are clustered into N classes, where N is a natural number; for example, 1 means that all points are clustered into 1 class, 2 means that all points are clustered into 2 classes, and so on.
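Steps 4 and 5 together can be sketched as a small single-machine DBSCAN that is applied independently to each phone number's point array. The text implements this in SCALA and applies it with the SPARK map operator; the version below is a plain-Python stand-in, the `fit` method name and its return value (the number of clusters) are assumptions, and points are assumed to be planar coordinates in meters.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
NOISE = -1

class LocalDBSCAN:
    """Single-machine DBSCAN with the two parameters named in the text."""

    def __init__(self, eps: float, min_points: int):
        self.eps = eps
        self.min_points = min_points

    def _neighbors(self, points: List[Point], i: int) -> List[int]:
        # A point counts as its own neighbor, as in standard DBSCAN.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= self.eps]

    def fit(self, points: List[Point]) -> int:
        """Return the number of clusters found (0 when every point is noise)."""
        labels = [None] * len(points)
        cluster = 0
        for i in range(len(points)):
            if labels[i] is not None:
                continue
            neighbors = self._neighbors(points, i)
            if len(neighbors) < self.min_points:
                labels[i] = NOISE
                continue
            cluster += 1
            labels[i] = cluster
            queue = list(neighbors)
            while queue:
                j = queue.pop()
                if labels[j] == NOISE:
                    labels[j] = cluster  # former noise becomes a border point
                if labels[j] is not None:
                    continue
                labels[j] = cluster
                j_neighbors = self._neighbors(points, j)
                if len(j_neighbors) >= self.min_points:
                    queue.extend(j_neighbors)  # j is a core point, keep expanding
        return cluster

points_by_phone: Dict[str, List[Point]] = {
    "13800000001": [(0.0, 0.0)] * 12 + [(5000.0, 0.0)] * 11,   # two dense places
    "13800000002": [(0.0, 0.0), (100.0, 0.0), (9000.0, 0.0)],  # too few points
}

dbscan = LocalDBSCAN(eps=800.0, min_points=11)  # parameter values from the text
category_by_phone = {phone: dbscan.fit(pts) for phone, pts in points_by_phone.items()}
```

Each phone number is clustered on its own, so the per-key calls are independent of one another and can run in parallel across partitions.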
Specifically, the key-value pairs of mobile phone number and category number obtained in step 6 represent, for each mobile phone number in the province, the number of categories into which its points can be divided when points within 800 meters of each other are connected into one region and a region must contain points for at least 11 days in one month. Since a worker's activity areas are generally a residence, a workplace and entertainment venues, users of out-of-province mobile phone numbers whose category number is 2 or 3 are workers, and the others are tourists.
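The decision rule stated above is simple enough to write down directly. The sketch below is illustrative only; the function name `classify` and the string labels are hypothetical, and the sample category numbers are made up.

```python
WORKER_CATEGORY_NUMBERS = {2, 3}  # category numbers the text associates with workers

def classify(category_number: int) -> str:
    """Apply the rule from the text: category number 2 or 3 is a worker, anything else a tourist."""
    return "worker" if category_number in WORKER_CATEGORY_NUMBERS else "tourist"

# Made-up (phone number -> category number) pairs for illustration.
category_by_phone = {
    "13800000001": 2,
    "13800000002": 0,
    "13800000003": 3,
    "13800000004": 1,
}

type_by_phone = {phone: classify(n) for phone, n in category_by_phone.items()}
```

A category number of 0 (all points are noise) falls on the tourist side of the rule, which matches a visitor who never stays near one place for 11 days in a month.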
Specifically, aggregating by category number in step 7 to obtain the numbers of out-of-province workers and tourists includes:
mapping the key-value pairs of mobile phone number and category number obtained in step 6 into key-value pairs of the form (category number, 1), and counting the number of people for each category number by using the reduce operator of the SPARK framework, from which the number of out-of-province tourists can then be counted.
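The counting in step 7 follows the usual map-then-reduce pattern. The sketch below uses `functools.reduce` as a plain-Python stand-in for the SPARK operator; the `merge` helper and the sample data are hypothetical. In SPARK a per-key sum of this kind is commonly written with `reduceByKey`, although the text only names a reduce operator.

```python
from functools import reduce

# Made-up (phone number -> category number) pairs for illustration.
category_by_phone = {
    "13800000001": 2,
    "13800000002": 0,
    "13800000003": 3,
    "13800000004": 1,
    "13800000005": 2,
}

# Map step: one (category number, 1) pair per phone number.
mapped = [(category, 1) for category in category_by_phone.values()]

def merge(counts, pair):
    """Reduce step: add one pair's count to the running total for its category."""
    category, one = pair
    counts[category] = counts.get(category, 0) + one
    return counts

people_per_category = reduce(merge, mapped, {})

# Apply the rule from the text: category numbers 2 and 3 are workers, the rest tourists.
workers = sum(n for c, n in people_per_category.items() if c in (2, 3))
tourists = sum(n for c, n in people_per_category.items() if c not in (2, 3))
```

Only the small per-category totals need to be combined at the end, so the amount of data exchanged in this step stays small regardless of how many phone numbers were processed.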
As the description of the embodiments shows, compared with conventional survey methods, the embodiments of the invention improve accuracy, survey efficiency and survey coverage. The invention clusters multiple groups of data by combining the DBSCAN algorithm with the SPARK distributed computing framework, and applies the algorithm for the first time to identifying out-of-province workers and out-of-province tourists from mobile phone signaling, which places it at the leading edge of the industry.
The above description covers only preferred embodiments of the present invention and is not to be construed as limiting it. Any modification, equivalent substitution or improvement made within the spirit and principle of the present invention is intended to fall within its scope of protection.

Claims (7)

1. A SPARK and DBSCAN-based distributed visitor type identification method, characterized by comprising the following steps:
step 1: preliminarily screening the mobile phone signaling data to obtain key-value pairs of the different base stations each user visits every day;
step 2: hashing the signaling of the same mobile phone number over a period of time into the same Partition;
step 3: mapping the signaling of the same mobile phone number in the same Partition into a key-value pair of the mobile phone number and a DBSCANPOINTS array;
step 4: implementing a single-machine version of the DBSCAN algorithm;
step 5: clustering the DBSCANPOINTS data, held as key-value pairs, with the DBSCAN algorithm;
step 6: obtaining key-value pairs of the mobile phone number and the category number;
step 7: aggregating by category number to obtain the numbers of out-of-province workers and tourists.
2. The method of claim 1, wherein step 1 comprises:
first screening, for each user and each day, the base stations at which the user's dwell time is less than 1 hour, by using the SPARK distributed computing framework;
removing the base stations that appear repeatedly among the base stations each user visits every day;
and obtaining key-value pairs of the different base stations each user visits every day, where each key-value pair has the form (mobile phone number, base station ID).
3. The method of claim 1, wherein the step 2 comprises:
hashing the signaling of the same mobile phone number into the same Partition of the resilient distributed dataset by using the groupByKey operator of the SPARK distributed computing framework, so that the signaling of a given mobile phone number exists in only one Partition.
4. The method of claim 1, wherein step 3 comprises:
obtaining, for each mobile phone number, the array of points that needs to be clustered with DBSCAN.
5. The method according to claim 1, wherein step 4 is implemented in the SCALA language, and the DBSCAN algorithm interface class LocalDBSCAN has two parameters, eps of type Double and minPoints of type Int.
6. The method of claim 1, wherein the step 5 comprises:
clustering the DBSCANPOINTS data with the DBSCAN algorithm through the map operator of the SPARK distributed computing framework to obtain the clustering result for each mobile phone number, namely a key-value pair of the form (mobile phone number, category number), wherein for the category number, 0 means that all points are noise points and a number N means that all points are clustered into N classes, N being a natural number.
7. The method of claim 1, wherein the step 7 comprises:
mapping the key-value pairs of the form (mobile phone number, category number) into key-value pairs of the form (category number, 1), and counting the number of people in each category by using the reduce operator of the SPARK distributed computing framework.
CN201710891930.4A 2017-09-27 2017-09-27 SPARK and DBSCAN-based distributed visitor identification type method Active CN107729418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710891930.4A CN107729418B (en) 2017-09-27 2017-09-27 SPARK and DBSCAN-based distributed visitor identification type method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710891930.4A CN107729418B (en) 2017-09-27 2017-09-27 SPARK and DBSCAN-based distributed visitor identification type method

Publications (2)

Publication Number Publication Date
CN107729418A CN107729418A (en) 2018-02-23
CN107729418B CN107729418B (en) 2020-11-17

Family

ID=61207063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710891930.4A Active CN107729418B (en) 2017-09-27 2017-09-27 SPARK and DBSCAN-based distributed visitor identification type method

Country Status (1)

Country Link
CN (1) CN107729418B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897420A (en) * 2017-02-24 2017-06-27 Southeast University A kind of resident Activity recognition method of user's trip based on mobile phone signaling data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10423973B2 (en) * 2013-01-04 2019-09-24 PlaceIQ, Inc. Analyzing consumer behavior based on location visitation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
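基于YARN和Spark框架的数据挖掘算法并行研究 [Research on parallelizing data mining algorithms based on the YARN and Spark frameworks]; 陈名辉 (Chen Minghui); China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》); 2017-02-28 (No. 02); pp. 14-15, 17-27 *
基于手机信令数据的流动人口出行特性分析方法研究 [Research on methods for analyzing the travel characteristics of the floating population based on mobile phone signaling data]; 马春景 (Ma Chunjing); China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》); 2017-03-31 (No. 03); pp. 33, 41-49 *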

Also Published As

Publication number Publication date
CN107729418A (en) 2018-02-23

Similar Documents

Publication Publication Date Title
US11257103B2 (en) Device-dwell graphs
Hong et al. Exposure density and neighborhood disparities in COVID-19 infection risk
US10580025B2 (en) Micro-geographic aggregation system
CN106844781B (en) Data processing method and device
CN102163214B (en) Numerical map generation device and method thereof
CN106162544B (en) A kind of generation method and equipment of geography fence
CN110046174B (en) population migration analysis method and system based on big data
CN109684052A (en) Transaction analysis method, apparatus, equipment and storage medium
CN107395680A (en) Shop group's information push and output intent and device, equipment
US20210150631A1 (en) Machine learning approach to automatically disambiguate ambiguous electronic transaction labels
CN106991090A (en) The analysis method and device of public sentiment event entity
CN109087132A (en) A kind of the customer problem method for pushing and device of knowledge based map
CN110765280B (en) Address recognition method and device
Manley et al. New forms of data for understanding urban activity in developing countries
CN109213554A (en) A kind of icon layout method, computer readable storage medium and terminal device
Amirkhanyan et al. Real-time clustering of massive geodata for online maps to improve visual analysis
CN108345662A (en) A kind of microblog data weighted statistical method of registering considering user distribution area differentiation
CN107729418B (en) SPARK and DBSCAN-based distributed visitor identification type method
Nemoto et al. Is informal employment a result of market segmentation? evidence from china
CN113094444A (en) Data processing method, data processing apparatus, computer device, and medium
CN110619090A (en) Regional attraction assessment method and device
CN110851868A (en) Position representative element generation method for track data release
Guo et al. Global network centrality of university rankings
CN113918577B (en) Data table identification method and device, electronic equipment and storage medium
CN109657950A (en) Hierarchy Analysis Method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant