CN114374953A - APP usage prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIDS - Google Patents
- Publication number
- CN114374953A (application number CN202210014131.XA)
- Authority
- CN
- China
- Prior art keywords
- base station
- app
- user
- data
- under
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/60—Subscription-based services using application servers or record carriers, e.g. SIM application toolkits
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/22—Traffic simulation tools or models
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/06—Testing, supervising or monitoring using simulated traffic
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides an APP usage prediction method and system under a multi-source feature conversion base station based on Hadoop and RAPIDS. The method preprocesses data with the Hadoop big data framework, extracts features from the internet access data of users connected to a base station, and computes multi-source data of the users of a given base station under different time windows to construct multi-dimensional user profiles. Based on Flink, the user profile features are mapped to base station profile features in each time window, yielding feature profiles of the base station and of its adjacent base stations under different time windows. Prediction uses an Attention-LSTM multi-model ensemble voting method, and model training is accelerated on multiple GPUs with RAPIDS. The method predicts APP usage from changes in people flow and crowd behavior under the base station: it constructs a feature mapping method and performs feature extraction and mapping at the level of the crowd formed by individuals, so that the model and features better reflect the influence of individual behavior on APP usage under the base station and the APP usage prediction is more accurate.
Description
Technical Field
The invention belongs to the field of information technology, and particularly relates to an APP usage prediction method and system under a multi-source feature conversion base station based on Hadoop and RAPIDS.
Background
With the development of the mobile internet, smartphones have become indispensable personal devices that touch almost every aspect of daily life. According to data published by the China Academy of Information and Communications Technology, smartphone shipments in December 2019 were 28.931 million units, down 13.7% year on year, of which 5.414 million were 5G phones. For the whole of 2019, domestic smartphone shipments were 372 million units, down 4.7% year on year, of which 13.769 million were 5G phones. From the market and enterprise perspective the penetration rate keeps rising, but the market is not yet saturated and can still grow rapidly. This broad market drives companies such as China Mobile, Apple and Xiaomi to provide and develop better mobile phone platforms. Driven by the progress of smartphone technology and platforms and by the enthusiasm of individual developers, a large number of mobile applications (APPs) have been created to serve a broad user base, bringing great convenience to daily life. From the internet service provider's perspective, however, the rapidly growing mobile phone market and the growing number of phones in use pose a tremendous challenge to global digital infrastructure.
Most smartphone services, such as online games, video and navigation, rely on cloud servers, so smartphones depend heavily on base stations or on indoor and outdoor WiFi. Outdoors, where no WiFi is available, the user experience depends almost entirely on the signal quality provided by nearby base stations. Because changes to the urban base station layout lag behind urban development and the mobile phone market, internet service providers must provide a good internet experience to users connected to base stations, indoors and outdoors, while the total number of mobile phones grows rapidly.
The arrival of the 5G era, the rapid development of base station construction technology, the rapid advance of digital infrastructure, the construction of smart cities and the pairing of macro and micro base stations all allow internet service providers to change the layout of urban base stations dynamically according to macroscopic mobile phone usage. Research on predicting the number of base station connections from base station and geographic location, and research on APP usage prediction that differs from previous work, therefore provide a reference for dynamic base station planning in the 5G era and in smart cities.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a method and a system for predicting APP usage under a multi-source feature conversion base station based on Hadoop and RAPIDS.
To achieve this purpose, the invention adopts the following technical scheme: an APP usage prediction method under a multi-source feature conversion base station based on Hadoop and RAPIDS, comprising the following steps:
acquiring historical data of the current base station and storing it in a Hadoop cluster; preprocessing the data with a MapReduce/Spark computing engine; cleaning and reconstructing the data set by removing abnormal values, unifying data formats and converting data types; and sorting the data by time;
dividing the cleaned and sorted data into time windows, performing feature filtering and mixed sampling on the features within each time window, and constructing a data set in a sliding-time-window manner;
classifying the balanced data set under each time window, wherein, before classification, the APPs are divided into three categories, low latency, high bandwidth and multi-connection, according to the usage characteristics of the existing APP types and their applicable scenarios;
predicting the APP category most likely to be used in the next time window by a voting mechanism based on the Attention-LSTM model;
the Attention-LSTM model is optimized as follows:
acquiring the internet access data of users connected to the base station, the data comprising individual user data and internet access records;
obtaining the APP name and internet URL (uniform resource locator) used by each user from the internet access data, mapping them to the APP type according to the correspondence between APP names and APP types, and obtaining the APP type most used under the base station in the current time window by dual weighting of time and traffic;
establishing the basic features of the users under the current base station from the data of the most-used APP type under the base station in the current time window, and then building multi-dimensional user profiles from these basic features;
based on the multi-dimensional user profiles, mapping user profile features to base station profile features in different time windows with Flink to obtain the feature profiles of the base station and of its adjacent base stations in different time windows;
performing feature filtering and mixed sampling on the features within each time window, constructing a data set in a sliding-time-window manner, and training and predicting with the Attention-LSTM model, with multi-GPU acceleration based on RAPIDS during training, to obtain the optimized Attention-LSTM model.
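The sliding-time-window construction described above can be sketched in plain Python. This is a minimal illustration only, not the Flink job the method relies on: the record layout, the `sliding_windows` helper and the window and step lengths are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def sliding_windows(records: Sequence[Tuple[float, dict]],
                    window: float, step: float) -> List[List[dict]]:
    """Group time-sorted (timestamp, payload) records into overlapping windows.

    Window k covers the half-open interval
    [first_ts + k * step, first_ts + k * step + window).
    """
    if not records:
        return []
    first_ts, last_ts = records[0][0], records[-1][0]
    windows = []
    start = first_ts
    while start <= last_ts:
        windows.append([payload for ts, payload in records
                        if start <= ts < start + window])
        start += step
    return windows


# Hypothetical records: (seconds since start, internet access record).
records = [(0, {"app": "video"}), (30, {"app": "game"}),
           (70, {"app": "video"}), (130, {"app": "chat"})]
wins = sliding_windows(records, window=60, step=30)
print([len(w) for w in wins])
```

Each window can then be turned into one training sample, with the most-used APP category of the following window as its label.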
Acquiring the internet access data of users connected to the base station specifically comprises: extracting from the base station data the user internet access data used to train the model, comprising the URL, request start time, request end time, uplink traffic, downlink traffic, user personal data and base station position.
A sliding time window with a fixed step is constructed with Flink; secondary data cleaning, conversion and abnormal value processing are performed within the time window; the data in the window are processed in a sliding manner and the base station and user feature profiles are computed, yielding the internet access data of users under the base station for each fixed time step; the APP name used by the user is resolved from the URL of the internet access request by means of an APP category lookup table; and the APPs used are divided into three categories, low latency, high bandwidth and multi-connection, according to the APP name, the correspondence between internet URLs and APP categories, and the usage characteristics of the APPs;
based on the APP categories used, each category is weighted by usage time and traffic, and the category with the largest weight is taken as the APP category most used under the current base station;
the weighted calculation formula is:
where $W_j$ is the weight of APP $j$ under the current base station; $t_{ij}$ is the time for which the $i$-th individual uses APP $j$; $d_{ij}$ is the downlink traffic of APP $j$ under the current base station; $up_{ij}$ is the uplink traffic of APP $j$ under the current base station; $\mathrm{SUM}(t)$ is the total usage duration of all APPs under the current base station; $\mathrm{SUM}(d)$ and $\mathrm{SUM}(up)$ are, respectively, the total downlink and total uplink traffic of all APPs under the current base station; and $a$, $b$ and $c$ are weighting coefficients.
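The formula itself is not reproduced in the text, so the following is only one plausible reading of the symbol definitions, stated here as an assumption: $W_j = a \cdot \sum_i t_{ij} / \mathrm{SUM}(t) + b \cdot \sum_i d_{ij} / \mathrm{SUM}(d) + c \cdot \sum_i up_{ij} / \mathrm{SUM}(up)$. The function name, record layout and coefficient values are illustrative.

```python
from collections import defaultdict


def app_weights(usage, a=0.5, b=0.3, c=0.2):
    """Assumed form of the time-and-traffic weighting (not the patent's exact formula).

    usage: iterable of (user_id, app, seconds, downlink_bytes, uplink_bytes).
    Returns {app: weight}, each weight being a weighted sum of the app's share
    of total usage time, downlink traffic and uplink traffic.
    """
    t, d, up = defaultdict(float), defaultdict(float), defaultdict(float)
    for _user, app, seconds, down, uplink in usage:
        t[app] += seconds
        d[app] += down
        up[app] += uplink
    # Guard against division by zero when a total is empty.
    sum_t = sum(t.values()) or 1.0
    sum_d = sum(d.values()) or 1.0
    sum_up = sum(up.values()) or 1.0
    return {app: a * t[app] / sum_t + b * d[app] / sum_d + c * up[app] / sum_up
            for app in t}


# Hypothetical usage records for one base station in one time window.
usage = [("u1", "video", 600, 900, 50),
         ("u2", "video", 300, 600, 30),
         ("u1", "chat", 300, 10, 20)]
weights = app_weights(usage)
most_used = max(weights, key=weights.get)
print(most_used, weights)
```

With coefficients that sum to 1 the weights also sum to 1, which makes them easy to compare across time windows; that normalization is a choice of this sketch.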
User statistical features, internet access features and user movement features within each fixed time window are extracted from the sliding internet access data of the users over a plurality of fixed time windows. The user statistical features comprise age and gender; the user movement features comprise moving distance, number of stay points, stay duration, radius of gyration and average moving speed; the internet access features comprise the duration and time period of network access, wherein:
a. Activity range: the minimum circumscribed circle of the user's position point set is computed, and the activity range for each hour is the area of that circle. The circle is given by:
$$f(a, b, r):\ \max_{p_{i,j} \in R} \mathrm{Dis}(p_{i,j}, p_{a,b}) \le r$$
where $f(a, b, r)$ denotes the circle centered at longitude and latitude $(a, b)$ with radius $r$, $\mathrm{Dis}(p_{i,j}, p_{a,b})$ denotes the straight-line distance between $(a, b)$ and $(i, j)$, and $R$ is the user's position point set.
b. Radius of gyration:
$$r = \max_{p_{i,j} \in R} \mathrm{Dis}(p_{i,j}, p_{a,b})$$
where $R$ is the user's position point set and $r$ is the radius of gyration.
c. Number of new locations:
$$c = \mathrm{Count}(p_{i,j}), \quad p_{i,j} \in R'$$
where $\mathrm{Count}(\cdot)$ counts the points and $R'$ is the difference between the user's position point sets in the current time period and the previous time period.
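The three movement features can be sketched as follows. Two simplifications are assumptions of this example rather than part of the method: the circle center $(a, b)$ is approximated by the centroid of the points instead of the exact minimum circumscribed circle, and distance is measured with the haversine formula on (latitude, longitude) pairs.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def gyration_radius_km(points):
    """Largest distance from the centre to any point (centroid used as the centre)."""
    a = sum(p[0] for p in points) / len(points)
    b = sum(p[1] for p in points) / len(points)
    return max(haversine_km(p, (a, b)) for p in points)


def activity_area_km2(points):
    """Area of the circle whose radius is the gyration radius."""
    return math.pi * gyration_radius_km(points) ** 2


def new_location_count(current, previous):
    """Number of points seen in the current period but not in the previous one."""
    return len(set(current) - set(previous))


# Hypothetical position records for one user.
radius = gyration_radius_km([(0.0, 0.0), (0.0, 2.0)])
area = activity_area_km2([(0.0, 0.0), (0.0, 2.0)])
new_points = new_location_count([(1, 1), (2, 2), (3, 3)], [(1, 1)])
print(round(radius, 1), new_points)
```

Replacing the centroid with an exact minimum enclosing circle (for example Welzl's algorithm) would match item a more closely, at the cost of a longer implementation.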
The mapping from the multi-dimensional user profile to the base station profile is as follows:
$a_{ij}$ is the $i$-th feature vector of the $j$-th individual under the base station,
$D_{nj}$ is the $j$-th feature vector of the $n$-th base station, obtained by mapping the group features onto the base station,
$p(a_{ij})$ denotes the probability of $a_{ij}$ occurring in the entire sequence,
$T$ is the cut-off time of the current time window,
$t$ is the occurrence time of the individual feature,
$\alpha$ is the time decay factor;
and the basic feature vector of the base station is calculated from this mapping and the basic feature vectors of the crowd under the base station to obtain the base station profile.
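The mapping formula is not reproduced in the text. As an illustration only, the sketch below assumes a probability-weighted sum with exponential time decay, in which each individual value $v$ observed at time $t$ contributes $p(v) \cdot v \cdot e^{-\alpha (T - t)}$ to the base station feature; the function name and the empirical-frequency estimate of $p$ are likewise assumptions.

```python
import math
from collections import Counter


def station_feature(observations, cutoff, alpha):
    """Aggregate one feature over the users of a base station (assumed form).

    observations: list of (value, t), one per user, for a single scalar feature.
    cutoff: cut-off time T of the current window; alpha: time decay factor.
    p(value) is estimated as the empirical frequency of the value.
    """
    counts = Counter(value for value, _ in observations)
    n = len(observations)
    return sum((counts[value] / n) * value * math.exp(-alpha * (cutoff - t))
               for value, t in observations)


# Hypothetical observations of one feature: (value, occurrence time).
obs = [(2.0, 10), (2.0, 8), (4.0, 10)]
no_decay = station_feature(obs, cutoff=10, alpha=0.0)
with_decay = station_feature(obs, cutoff=10, alpha=0.5)
print(no_decay, with_decay)
```

With a positive decay factor, older observations contribute less, so the aggregate is dominated by behavior close to the end of the window.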
When Flink is used to compute the user profiles and base station profiles of a sliding time window for feature filtering and mixed data sampling, a GBDT model evaluates the features of the existing data, all features whose evaluation score is below a set threshold are filtered out, and the pruned features form a data set; the data set is then divided into a training set and a test set;
the training set is processed with the Borderline-SMOTE oversampling technique, which divides the minority-class samples in the training set into three classes, Safe, Danger and Noise: a Safe sample is one for which more than half of the surrounding samples are minority-class samples; a Danger sample is one for which more than half of the surrounding samples are majority-class samples, and is regarded as lying on the class boundary; a Noise sample is one whose surrounding samples are all majority-class samples, and is regarded as noise; the Danger samples are then oversampled: for each, a minority-class sample is randomly selected among its K nearest neighbors and used to generate new minority-class samples.
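The Safe/Danger/Noise split and the oversampling of Danger samples can be sketched in plain Python. This follows the commonly described Borderline-SMOTE procedure (a sample is Danger when at least half, but not all, of its m nearest neighbors are majority-class); the data, the neighbor counts and the helper names are assumptions of the example, and the "improved SMOTE" of the method may differ in detail.

```python
import math
import random


def categorize(X, y, minority, m=3):
    """Label each minority sample as 'safe', 'danger' or 'noise' from its m nearest neighbours."""
    labels = {}
    for i, xi in enumerate(X):
        if y[i] != minority:
            continue
        nbrs = sorted((j for j in range(len(X)) if j != i),
                      key=lambda j: math.dist(xi, X[j]))[:m]
        majority_count = sum(1 for j in nbrs if y[j] != minority)
        if majority_count == len(nbrs):
            labels[i] = "noise"
        elif majority_count >= len(nbrs) / 2:
            labels[i] = "danger"
        else:
            labels[i] = "safe"
    return labels


def oversample_danger(X, y, minority, labels, k=3, seed=0):
    """Create one synthetic point per danger sample by interpolating towards
    a randomly chosen one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    for i, tag in labels.items():
        if tag != "danger":
            continue
        nbrs = sorted((j for j in minority_idx if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:k]
        if not nbrs:
            continue
        j = rng.choice(nbrs)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


# Hypothetical 2-D samples; class 1 is the minority class.
X = [(0.0, 0), (0.1, 0), (0.2, 0), (0.95, 0), (1.0, 0), (1.1, 0),
     (1.2, 0), (2.9, 0), (3.0, 0), (3.1, 0), (3.2, 0)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
labels = categorize(X, y, minority=1, m=3)
synthetic = oversample_danger(X, y, minority=1, labels=labels)
print(labels, synthetic)
```

Only the boundary samples are used as seeds, so the synthetic points concentrate where the classifier is most likely to confuse the classes.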
Predicting the APP category most likely to be used in the next time window with the voting mechanism based on the Attention-LSTM model proceeds as follows: based on the extracted base station feature profile, the crowd features and the APP categories, an Attention mechanism is combined with an LSTM model; the LSTM model extracts time-series feature profiles from the base station data, which, together with the multi-dimensional profile obtained from the user conversion in the previous step, form the feature set; several ensemble learning combinations, including boosting and bagging, are constructed, and GBDT, random forest, SVM and KNN models perform multi-model prediction voting, each with an initial voting weight of 0.25; models are selected by k-fold cross-validation, with the training set and test set for each round of training and prediction drawn by sampling with replacement; the hyper-parameters of the GBDT, random forest, SVM, KNN and Attention-LSTM models are tuned by Bayesian optimization, and the model combination with the best evaluation in k-fold cross-validation is selected for the final prediction; a three-class classification model is constructed, training is accelerated on multiple GPUs with RAPIDS, and the model predicts which of the low-latency, high-bandwidth and multi-connection categories the APP most used under the specified base station at the next moment belongs to.
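The weighted voting step can be illustrated with a small sketch. The initial weight of 0.25 per model comes from the text; the model names, the example predictions and the `weighted_vote` helper are hypothetical, and the four classifiers themselves are not implemented here.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the category with the largest total voting weight.

    predictions: {model_name: predicted_category}
    weights: {model_name: voting_weight}
    """
    scores = defaultdict(float)
    for model, category in predictions.items():
        scores[category] += weights[model]
    return max(scores, key=scores.get)


# Initial voting weights as described: 0.25 for each of the four models.
weights = {"gbdt": 0.25, "random_forest": 0.25, "svm": 0.25, "knn": 0.25}
# Hypothetical per-model predictions for the next time window.
predictions = {"gbdt": "high_bandwidth", "random_forest": "high_bandwidth",
               "svm": "low_latency", "knn": "multi_connection"}
winner = weighted_vote(predictions, weights)
print(winner)
```

Adjusting the weights after cross-validation lets a better-performing model outvote two weaker ones, which is the effect the voting-weight tuning is meant to achieve.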
In another aspect, the invention also provides an APP usage prediction system under a multi-source feature conversion base station based on Hadoop and RAPIDS, comprising a basic feature extraction module, an APP identification module, a profile construction module, a time-series feature construction module, a feature sample data set balancing module, a RAPIDS-based multi-GPU acceleration module and a deep learning classification and recognition module;
the basic feature extraction module constructs user internet access records and user position records, and extracts the users' APP usage preferences and behavior features from these two kinds of data;
the APP identification module converts the internet access records of user traffic into records of the APPs used, and identifies the APP usage characteristics under the base station in different time windows;
the profile construction module constructs user profiles from the output of the basic feature extraction module, maps the user profiles to a base station APP usage profile, and identifies the APP usage sequence and usage features of the base station;
the time-series feature construction module extracts time-series features from the output of the basic feature extraction module and outputs the time-series features of user behavior; the feature sample data set balancing module adjusts the data set with an improved SMOTE algorithm, sampling and balancing the different classes to obtain a sample-balanced test set and training set;
the multi-GPU acceleration module uses RAPIDS to parallelize deep learning model training across multiple GPUs and thereby speed up training;
the deep learning prediction module uses the balanced feature sample data set to construct several ensemble learning combinations, including boosting and bagging, adjusts the voting weights by k-fold cross-validation and Bayesian parameter tuning, constructs a prediction model for low-latency, high-bandwidth and multi-connection APPs, and predicts the APP category most likely to be used at the specified base station at the next moment.
In addition, the invention provides a computer device comprising a processor and a memory, wherein the memory stores a computer-executable program and the processor reads part or all of the program from the memory and executes it; when the processor executes part or all of the program, the APP usage prediction method under a multi-source feature conversion base station based on Hadoop and RAPIDS is implemented.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the APP usage prediction method under a multi-source feature conversion base station based on Hadoop and RAPIDS.
Compared with the prior art, the invention has at least the following beneficial effects:
the complete APP usage prediction method under a multi-source feature conversion base station based on Hadoop and RAPIDS provided by the invention performs data processing, storage and computation on the Hadoop big data framework, combines Flink with deep learning, and uses RAPIDS for multi-GPU acceleration; it predicts APP usage under the base station from changes in people flow and crowd behavior, constructs a feature mapping method, and performs feature extraction and mapping at the level of the crowd formed by individuals, so that the model and features better reflect the influence of individual behavior on APP usage under the base station and the APP usage prediction is more accurate;
the invention divides the various types of APPs into three categories, low latency, high bandwidth and multi-connection, according to their characteristics, and predicts the APP category used at the base station at the next moment with a three-class prediction model;
the method performs data processing, storage and computation on the Hadoop big data framework, combines Flink with deep learning, and uses RAPIDS for multi-GPU acceleration;
the invention extracts features of users' internet access and movement behavior, explains changes in APP usage from the perspective of user behavior, and analyzes users' internet access behavior in greater depth to obtain more implicit user features; through user-to-base-station feature mapping, the individual behaviors of users are converted into behavior features of the base station at different moments, the potential changes in APP usage under the base station are extracted from the user perspective, an end-to-end prediction model is designed on Flink, and the data are truncated and processed in a sliding manner according to the time window;
the invention is based on an Attention-LSTM multi-model ensemble voting mechanism: the Attention-LSTM further extracts time-series features; the GBDT, random forest, SVM and KNN models vote on the prediction with assigned voting weights; parameters are tuned online in real time by Bayesian optimization and k-fold cross-validation; and the best-performing model is selected for prediction.
Drawings
FIG. 1 is a flow chart of a prediction method that can be implemented in accordance with the present invention.
Detailed Description
The invention provides an APP usage prediction method and system under a multi-source feature conversion base station based on Hadoop and RAPIDS. The method uses the Hadoop big data framework to store, clean and compute data; extracts features from the internet access data of users under the base station; computes multi-source data, such as internet access data, position data and personal data, of the users under a given base station under different time windows to construct multi-dimensional user profiles; constructs time-series profiles with Flink; computes the APP usage preference of each time window under the base station by traffic-and-time weighting; and converts APP names into three APP categories, low latency, high bandwidth and multi-connection, based on the usage characteristics of the existing APP types. The multi-dimensional user profiles are then converted into a base station profile with the conversion formula from the user profile to the base station profile. On the resulting multi-dimensional base station feature profile, GBDT and an improved SMOTE are used for feature selection and data set balancing for model training and prediction. k-fold cross-validation is then performed on the processed data set, and prediction uses an Attention-LSTM-based multi-model ensemble voting method with model training accelerated on multiple GPUs by RAPIDS: specifically, the Attention-LSTM extracts features, the GBDT, random forest, SVM and KNN models vote on the prediction, the model hyper-parameters are tuned by Bayesian optimization and k-fold cross-validation, and the best-performing model is selected to predict APP usage under the base station.
The method specifically comprises the following steps:
Step one, constructing basic data from the user internet access data: data storage and preprocessing are performed on Hadoop, and MapReduce/Spark performs the first round of cleaning, conversion and format unification to form preliminary usable basic data; basic features, such as user internet access features, behavior features and individual features, are then extracted from the basic data with Flink;
Step two, using an APP-URL conversion unit to convert the URLs accessed by the user into the names of the APPs used; calculating the APP usage preference of the base station in a specific time window by traffic weighting; and then converting the APP names into APP categories based on the usage characteristics of the APP types;
TABLE 1 APP-URL to APP type conversion table (partial list, for demonstration)
TABLE 2 APP type to prediction category conversion table
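The two-stage lookup described by Tables 1 and 2 (URL to APP name and type, then APP type to prediction category) can be sketched as below. The table contents are not reproduced in the text, so every host name, APP name, APP type and category assignment in this example is a hypothetical placeholder.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for Table 1: host suffix -> (APP name, APP type).
HOST_TO_APP = {
    "video.example.com": ("ExampleVideo", "video"),
    "game.example.com": ("ExampleGame", "online_game"),
    "chat.example.com": ("ExampleChat", "instant_messaging"),
}

# Hypothetical stand-in for Table 2: APP type -> prediction category.
TYPE_TO_CATEGORY = {
    "video": "high_bandwidth",
    "online_game": "low_latency",
    "instant_messaging": "multi_connection",
}


def classify_url(url):
    """Resolve a request URL to (APP name, prediction category), or None if unknown."""
    host = urlparse(url).hostname or ""
    for suffix, (app_name, app_type) in HOST_TO_APP.items():
        if host == suffix or host.endswith("." + suffix):
            return app_name, TYPE_TO_CATEGORY[app_type]
    return None


print(classify_url("http://cdn.video.example.com/clip.mp4"))
print(classify_url("https://unknown.example.org/"))
```

Matching on the host suffix lets content-delivery subdomains resolve to the same APP; requests that match no entry are left unclassified rather than guessed.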
Step three, performing feature mapping within a Flink time window to convert the individual features of users into base station features;
Step four, performing feature selection and sample rebalancing with GBDT and an improved SMOTE algorithm to obtain the data set for training and prediction;
Step five, constructing an Attention-LSTM on the data set for time-series feature extraction, constructing an ensemble learning model from GBDT, random forest, SVM and KNN for voting prediction, and accelerating model training on multiple GPUs with RAPIDS; the optimal model parameters are selected by k-fold cross-validation and Bayesian optimization to predict APP usage.
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings.
For the full internet access data and anonymized individual user data of users under a base station, provided by operator big data, the Hadoop big data framework performs the first round of cleaning and computation on the call data, profile data, position data and internet access data, and Flink-based ETL and sliding computation are applied to the result, laying the data foundation for the following steps.
Step one, constructing basic data from the user internet access data: data storage and preprocessing are performed on Hadoop, and MapReduce/Spark performs the first round of cleaning, conversion and format unification to form preliminary usable basic data; basic features, such as user internet access features, behavior and position features and individual features, are then extracted from the basic data with Flink;
the internet access features comprise: the total, mean, variance and trend of traffic usage within the time window; URL request statistics, such as access trends and counts for Baidu, Google, QQ and WeChat; and APP usage statistics, such as the number of uses, proportion and duration of common APPs and the usage statistics of special APPs.
The behavior and position features comprise: the user's moving distance, average moving speed, real-time maximum distance to the accessed base station, average count and entropy, the number and proportion of internet accesses at the workplace and at home, the number and proportion of position records in the daytime and at night, and the radius of gyration of the user's movement;
the individual features comprise: user gender, age, mobile account opening time and cumulative active days.
Step two, using the existing APP-URL conversion unit to convert the URLs accessed by the user into the names of the APPs used; calculating the APP usage preference of the base station in a specific time window by traffic weighting; and then converting the APP names into APP categories based on the existing rules;
a sliding time window with a fixed step is constructed with Flink; ETL processing is performed on the data within the time window; the data in the window are processed in a sliding manner and the base station feature profile and the multi-dimensional user profiles are computed, yielding the internet access data of users under the base station for each fixed time step; the APP name used by the user is resolved from the URL (uniform resource locator) of the internet access request by means of the APP category lookup table; and the APPs used are divided into three categories, low latency, high bandwidth and multi-connection, according to the existing rules and the APP category lookup table; based on the APP categories used, each category is weighted by usage time and traffic, and the category with the largest weight is taken as the APP category most used under the current base station;
the weighting formula is:
where $W_j$ is the weight of APP $j$ under the current base station; $t_{ij}$ is the time for which the $i$-th individual uses APP $j$; $d_{ij}$ is the downlink traffic of APP $j$ under the current base station; $up_{ij}$ is the uplink traffic of APP $j$ under the current base station; $\mathrm{SUM}(t)$ is the total usage duration of all APPs under the current base station; $\mathrm{SUM}(d)$ and $\mathrm{SUM}(up)$ are, respectively, the total downlink and total uplink traffic of all APPs under the current base station; and $a$, $b$ and $c$ are weighting coefficients.
Step three, performing feature mapping within a Flink time window to convert the individual features of users into base station features:
the user profile is mapped to the base station profile as follows:
$a_{ij}$ is the $i$-th feature vector of the $j$-th individual under the base station,
$D_{nj}$ is the $j$-th feature vector of the $n$-th base station, obtained by mapping the group features onto the base station,
$p(a_{ij})$ denotes the probability of $a_{ij}$ occurring in the entire sequence,
$T$ is the cut-off time of the current time window,
$t$ is the occurrence time of the individual feature,
$\alpha$ is the time decay factor.
According to this mapping, the feature vector of the base station can be calculated from the feature vectors of the crowd under the base station, which yields the base station profile.
Step four, carrying out feature selection and sample rebalancing using an improved SMOTE algorithm and GBDT to obtain a data set for training and prediction;
Feature filtering and mixed sampling are performed on the multi-dimensional user portrait and the base station feature portrait obtained for the time window. A GBDT model is used to evaluate the features of the existing data, features whose evaluation scores fall below a set threshold are filtered out, and the features are pruned to form a data set. In the present invention the threshold is set to 0.2. The data set is then divided into a training set and a test set.
The training set is processed with the Borderline-SMOTE oversampling technique. The minority-class samples in the training set are divided into three classes, namely Safe, Danger and Noise: a Safe sample is one for which more than half of the surrounding samples are minority-class samples; a Danger sample is one for which more than half of the surrounding samples are majority-class samples, and it is regarded as a sample on the class boundary; a Noise sample is one whose surrounding samples are all majority-class samples, and it is regarded as noise. The minority-class samples of the Danger class are then oversampled: minority-class samples are randomly selected by the K-nearest-neighbor method and used to generate the new minority-class samples.
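The Safe/Danger/Noise split and the oversampling of Danger samples can be sketched in plain Python as follows. This illustrates the general Borderline-SMOTE idea rather than the improved variant of the invention, whose details are not given here. The function names, the Euclidean distance, the default k and the handling of an exact tie (treated as Safe) are all assumptions.

```python
import math
import random


def classify_minority(samples, minority_label, k=5):
    """Label each minority sample as 'safe', 'danger' or 'noise'.

    samples: list of (feature_vector, label).
    Returns {sample_index: kind} for minority samples only.
    """
    kinds = {}
    for i, (vector, label) in enumerate(samples):
        if label != minority_label:
            continue
        # k nearest neighbours among all other samples.
        neighbours = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(vector, samples[j][0]),
        )[:k]
        majority = sum(1 for j in neighbours if samples[j][1] != minority_label)
        if majority == len(neighbours):
            kinds[i] = "noise"      # surrounded only by majority samples
        elif majority > len(neighbours) / 2:
            kinds[i] = "danger"     # on the class boundary
        else:
            kinds[i] = "safe"       # ties are treated as safe (assumption)
    return kinds


def oversample_danger(samples, minority_label, k=5, n_new=1, rng=None):
    """Create synthetic minority samples around the 'danger' samples."""
    rng = rng or random.Random(0)
    kinds = classify_minority(samples, minority_label, k)
    minority = [i for i, (_, lab) in enumerate(samples) if lab == minority_label]
    synthetic = []
    for i, kind in kinds.items():
        if kind != "danger":
            continue
        vector = samples[i][0]
        # k nearest minority neighbours of the danger sample.
        near = sorted(
            (j for j in minority if j != i),
            key=lambda j: math.dist(vector, samples[j][0]),
        )[:k]
        if not near:
            continue
        for _ in range(n_new):
            other = samples[rng.choice(near)][0]
            gap = rng.random()
            # Interpolate between the danger sample and a random neighbour.
            point = [x + gap * (y - x) for x, y in zip(vector, other)]
            synthetic.append((point, minority_label))
    return synthetic
```

Only Danger samples seed new points, so the synthetic data concentrates near the decision boundary, while Noise samples are left out to avoid amplifying outliers.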
An Attention-LSTM is constructed on the data set for time-series feature extraction, an ensemble learning model is constructed from GBDT, random forest, SVM and KNN, model training is accelerated on multiple GPUs based on RAPIDS, and the final prediction is made by voting; the optimal model parameters are selected by k-fold cross validation and Bayesian optimization to predict APP usage.
Based on the extracted base station portrait features, the crowd features and the divided APP categories, an Attention mechanism is combined with an LSTM model: the LSTM extracts the time-series features of the portrait, and these are combined with the original basic portrait features to construct a feature set. Several different ensemble learning combinations, including boosting and bagging, are constructed; GBDT, random forest, SVM and KNN are selected for multi-model prediction voting, and the voting weight of each of the four models is initialized to 0.25. Models are selected by k-fold cross validation, and the test set and training set for each round of training and prediction are drawn by sampling with replacement. The hyper-parameters of the GBDT, random forest, SVM, KNN and Attention-LSTM models are tuned by Bayesian optimization, and the model combination with the best evaluation in k-fold cross validation is selected for the final prediction. A three-class classification model is thereby constructed, which predicts whether the most used APP under a specified base station at the next moment is a low-latency, high-bandwidth or multi-connection APP.
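The voting step can be illustrated with a small weighted-vote function. This is a sketch only: it assumes each of the four models outputs a single predicted category, that the votes are summed per category using the model weights, and that the weights start equal (0.25 each for four models) before any tuning. The model names and category strings are hypothetical labels.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine per-model category predictions by weighted voting.

    predictions: {model_name: predicted_category}
    weights:     {model_name: weight}; defaults to equal weights,
                 i.e. 0.25 each when four models vote.
    """
    if not predictions:
        return None
    if weights is None:
        weights = {name: 1.0 / len(predictions) for name in predictions}
    scores = defaultdict(float)
    for name, category in predictions.items():
        scores[category] += weights[name]
    # On a tie, the category that received its first vote earliest wins.
    return max(scores, key=scores.get)


# Example with equal initial weights.
example = {
    "gbdt": "high_bandwidth",
    "random_forest": "high_bandwidth",
    "svm": "low_latency",
    "knn": "multi_connection",
}
predicted = weighted_vote(example)
```

Passing an explicit `weights` dictionary stands in for the weight adjustment obtained from cross validation and Bayesian tuning; how those weights are actually derived is not specified in this sketch.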
In another aspect, the invention also provides an APP usage prediction system under a multi-source feature conversion base station based on Hadoop and RAPIDS, comprising a basic feature extraction module, an APP identification module, a portrait construction module, a time-series feature construction module, a feature sample data set balancing module, a RAPIDS-based multi-GPU acceleration module, and a deep learning classification and identification module;
the basic feature extraction module is used for constructing the user internet access records and the user movement location records, and for extracting the users' APP usage preferences and behavior features from these two kinds of data;
the APP identification module is used for converting the internet access records of user traffic into records of the APPs used, and for identifying the usage characteristics of the APPs under the base station in different time windows;
the portrait construction module is used for constructing user portraits based on the basic feature extraction module, mapping the user portraits into a base station APP usage portrait, and identifying the base station's APP usage sequence and usage features;
the time-series feature construction module is used for extracting time-series features from the output of the basic feature extraction module and outputting the time-series features of user behavior; the feature sample data set balancing module adjusts the data set with an improved SMOTE algorithm, sampling and balancing the different classes to obtain a sample-balanced test set and training set;
the multi-GPU acceleration module applies RAPIDS-based multi-card parallel acceleration to deep learning model training in order to speed up model training;
the deep learning classification and identification module uses the balanced feature sample data set to construct several different ensemble learning combinations, including boosting and bagging, adjusts the voting weights by k-fold cross validation and Bayesian parameter tuning, constructs a prediction model for low-latency, high-bandwidth and multi-connection APPs, and predicts the APP category most likely to be used at the specified base station at the next moment.
In addition, the invention also provides computer equipment comprising a processor and a memory. The memory stores a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and in doing so implements the APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS.
The invention also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS is implemented.
The device for predicting APP usage may be a laptop, a tablet, a desktop, or a workstation.
The processor may be a graphics processing unit (GPU), a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
The memory of the invention may be an internal storage unit of a notebook computer, tablet computer, desktop computer, mobile phone or workstation, such as main memory or a hard disk; an external storage unit, such as a removable hard disk or a flash memory card, may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The random access memory may include a resistive random access memory (ReRAM).
Claims (10)
1. An APP usage prediction method under a multi-source feature conversion base station based on Hadoop and RAPIDS, characterized by comprising the following steps:
acquiring historical data of a current base station and storing the historical data in a Hadoop cluster; preprocessing the data with a MapReduce/Spark calculation engine; cleaning and reconstructing the data set by removing abnormal values, unifying data formats and converting data types; and sorting the data by time;
dividing the cleaned and sorted data into time windows, performing feature filtering and mixed sampling on the features within each time window, and constructing a data set in a sliding-time-window manner;
classifying the balanced data set within each time window, wherein, before classification, the APPs are divided into three categories, namely low latency, high bandwidth and multi-connection, according to the usage characteristics of the existing APP categories and their applicable scenarios;
predicting the APP category most likely to be used in the next time window by a voting mechanism based on the Attention-LSTM model;
the Attention-LSTM model is optimized as follows:
acquiring the internet access data of users connected to the base station, the data comprising individual user data and internet access data;
acquiring, from the internet access data, the APP name and internet access URL (uniform resource locator) used by each user, mapping them to the category to which the APP belongs according to the correspondence between APP names and APP categories, and obtaining the most used APP category of the base station in the current time window by dual weighting of time and traffic;
for the base station in the current time window, establishing the basic features of the users under the current base station using the data of the most used APP category, and then establishing a multi-dimensional portrait of each user based on the basic features;
based on the multi-dimensional user portraits, performing, with Flink, the mapping calculation from user portrait features to base station portrait features in different time windows, to obtain the base station feature portraits of the base station and its adjacent base stations in different time windows;
performing feature filtering and mixed sampling on the features within each time window, constructing a data set in a sliding-time-window manner, and training the Attention-LSTM model and making predictions with it, wherein multi-GPU acceleration based on RAPIDS is applied during training, to obtain the optimized Attention-LSTM model.
2. The APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in claim 1, characterized in that acquiring the internet access data of users connected to the base station specifically comprises: extracting, from the base station data, the user internet access data used for training the model, the data comprising the URL, request start time, request end time, uplink traffic, downlink traffic, personal user data and base station location.
3. The APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in claim 1, characterized in that a sliding time window with a fixed step length is constructed using Flink; a second round of data cleaning, conversion and abnormal value processing is performed within the time window; the window is slid over the data to calculate the base station and user feature portraits and to obtain the internet access data of the users under the base station for each fixed time step; the APP name used by a user is parsed from the URL of the internet access request using an APP category lookup table; and the APPs used are then divided into three categories, namely low latency, high bandwidth and multi-connection, according to the APP name, the correspondence between internet access URLs and APP categories, and the usage characteristics of the APPs;
based on the APP categories in use, the APPs are weighted by usage time and traffic, and the category with the largest weight is taken as the most used APP category under the current base station;
the weighting formula is:
where: $w_j$ denotes the weight of APP j under the current base station; $t_{ij}$ denotes the time for which the i-th individual uses APP j; $d_{ij}$ denotes the downlink traffic of APP j under the current base station; $up_{ij}$ denotes the uplink traffic of APP j under the current base station; SUM(t) denotes the sum of the usage durations of all APPs under the current base station; SUM(d) and SUM(up) denote the sums of the downlink traffic and the uplink traffic, respectively, of all APPs under the current base station; and a, b and c are weighting coefficients.
4. The APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in claim 1, characterized in that the users' internet access data in a plurality of sliding fixed time windows is used to extract, for each fixed time window, the users' statistical features, internet access features and movement data, wherein the statistical features include age and gender; the movement data features include movement distance, number of stop points, stop duration, radius of gyration and average movement speed; and the internet access features include the duration and time period of network access; wherein
a. the activity range is calculated as follows: the minimum circumscribed circle of the user's location point set is computed, and the activity range for each hour is the area of that circle:
$f(a, b, r) = \max\big(\mathrm{Dis}(p_{i,j}, p_{a,b})\big) \le r$
where $f(a, b, r)$ denotes the circle centered at the longitude-latitude point $(a, b)$ with radius $r$; $\mathrm{Dis}(p_{i,j}, p_{a,b})$ denotes the straight-line distance between $(a, b)$ and $(i, j)$; and $p_{i,j} \in R$, where $R$ is the user's location point set; this is the expression of the circumscribed circle of the user's location point set $R$;
b. the radius of gyration is calculated as:
$r = \max\big(\mathrm{Dis}(p_{i,j}, p_{a,b})\big)$
c. the number of newly appearing locations is:
$c = \mathrm{Count}(p_{i,j})$
where $\mathrm{Count}()$ denotes counting the number of points; in item c, $p_{i,j} \in R'$, where $R'$ is the difference set between the user's location points in the current time period and those in the previous time period; in items a and b, $p_{i,j} \in R$, where $R$ is the user's location point set, and $r$ is the radius of gyration.
5. The APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in claim 1, characterized in that the mapping relation from the multi-dimensional user portrait to the base station portrait is as follows, where:
$a_{ij}$ is the i-th feature vector of the j-th individual under the base station,
$D_{nj}$ is the j-th feature vector of the n-th base station, obtained by mapping the group features onto the base station,
$p(a_{ij})$ denotes the probability that $a_{ij}$ occurs in the entire sequence,
$T$ is the cut-off time of the current time window,
$t$ is the occurrence time of the individual feature,
$\alpha$ is a time attenuation factor;
and the basic feature vector of the base station is calculated from the basic feature vectors of the crowd under the base station according to the mapping relation, to obtain the base station portrait.
6. The APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in claim 1, characterized in that, when feature filtering and mixed sampling are performed on the user portrait and the base station portrait calculated with Flink for the sliding time window, a GBDT model is used to evaluate the features of the existing data, features whose evaluation scores fall below a set threshold are filtered out, and the features are pruned to form a data set; the data set is then divided into a training set and a test set;
the training set is processed with the Borderline-SMOTE oversampling technique, and the minority-class samples in the training set are divided into three classes, namely Safe, Danger and Noise, wherein a Safe sample is one for which more than half of the surrounding samples are minority-class samples; a Danger sample is one for which more than half of the surrounding samples are majority-class samples, and it is regarded as a sample on the class boundary; and a Noise sample is one whose surrounding samples are all majority-class samples, and it is regarded as noise; the minority-class samples of the Danger class are oversampled, minority-class samples being randomly selected by the K-nearest-neighbor method and used to generate the new minority-class samples.
7. The APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in claim 1, characterized in that predicting the APP category most likely to be used in the next time window by the voting mechanism based on the Attention-LSTM model is performed as follows: based on the extracted base station feature portrait, the crowd features and the divided APP categories, an Attention mechanism is combined with an LSTM model; the LSTM model extracts the time-series feature portrait of the base station data, and a feature set is constructed from it together with the multi-dimensional portrait of the base station obtained by conversion from the users in the preceding step; several different ensemble learning combinations, including boosting and bagging, are constructed, multi-model prediction voting is performed with GBDT, random forest, SVM and KNN, and the initial voting weight of each of the GBDT, random forest, SVM and KNN models is set to 0.25; models are selected by k-fold cross validation, and the test set and training set for each round of training and prediction are drawn by sampling with replacement; the hyper-parameters of the GBDT, random forest, SVM, KNN and Attention-LSTM models are tuned by Bayesian optimization, the model combination with the best evaluation in k-fold cross validation is selected for the final prediction, and a three-class classification model is constructed, wherein multi-GPU accelerated training based on RAPIDS is applied when training the models, and the model predicts whether the most used APP under the specified base station at the next moment is a low-latency, high-bandwidth or multi-connection APP.
8. An APP usage prediction system under a multi-source feature conversion base station based on Hadoop and RAPIDS, characterized by comprising a basic feature extraction module, an APP identification module, a portrait construction module, a time-series feature construction module, a feature sample data set balancing module, a RAPIDS-based multi-GPU acceleration module, and a deep learning classification and identification module;
the basic feature extraction module is used for constructing the user internet access records and the user movement location records, and for extracting the users' APP usage preferences and behavior features from these two kinds of data;
the APP identification module is used for converting the internet access records of user traffic into records of the APPs used, and for identifying the usage characteristics of the APPs under the base station in different time windows;
the portrait construction module is used for constructing user portraits based on the basic feature extraction module, mapping the user portraits into a base station APP usage portrait, and identifying the base station's APP usage sequence and usage features;
the time-series feature construction module is used for extracting time-series features from the output of the basic feature extraction module and outputting the time-series features of user behavior; the feature sample data set balancing module adjusts the data set with an improved SMOTE algorithm, sampling and balancing the different classes to obtain a sample-balanced test set and training set;
the multi-GPU acceleration module applies RAPIDS-based multi-card parallel acceleration to deep learning model training in order to speed up model training;
the deep learning classification and identification module uses the balanced feature sample data set to construct several different ensemble learning combinations, including boosting and bagging, adjusts the voting weights by k-fold cross validation and Bayesian parameter tuning, constructs a prediction model for low-latency, high-bandwidth and multi-connection APPs, and predicts the APP category most likely to be used at the specified base station at the next moment.
9. Computer equipment, characterized by comprising a processor and a memory, wherein the memory stores a computer-executable program, the processor reads part or all of the computer-executable program from the memory and executes it, and in doing so the processor implements the APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the APP usage prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in any one of claims 1 to 7 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210014131.XA CN114374953B (en) | 2022-01-06 | 2022-01-06 | APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIDS |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114374953A (en) | 2022-04-19 |
CN114374953B (en) | 2023-09-05 |
Family
ID=81141534
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202210014131.XA CN114374953B (en) | Active | 2022-01-06 | 2022-01-06 | APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIDS |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114374953B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115617745A (en) * | 2022-10-13 | 2023-01-17 | 自然资源部国土卫星遥感应用中心 | Management method, management device and medium for satellite image data storage medium |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140006418A1 (en) * | 2012-07-02 | 2014-01-02 | Andrea G. FORTE | Method and apparatus for ranking apps in the wide-open internet |
US20160269542A1 (en) * | 2015-03-09 | 2016-09-15 | International Business Machines Corporation | Usage of software programs on mobile computing devices |
WO2018223944A1 (en) * | 2017-06-08 | 2018-12-13 | 中兴通讯股份有限公司 | Phone bill generation method, device, mobile edge platform and storage medium |
CN109492678A (en) * | 2018-10-24 | 2019-03-19 | 浙江工业大学 | A kind of App classification method of integrated shallow-layer and deep learning |
CN109857922A (en) * | 2019-01-18 | 2019-06-07 | 深圳壹账通智能科技有限公司 | Data evaluate and test model modelling approach, device, computer equipment and storage medium |
CN110266545A (en) * | 2019-06-28 | 2019-09-20 | 北京小米移动软件有限公司 | A kind of method, apparatus and medium dynamically distributing Internet resources |
CN110602322A (en) * | 2019-09-12 | 2019-12-20 | 北京车慧科技有限公司 | Method and device for layout of application icons and mobile terminal |
CN110619420A (en) * | 2019-07-31 | 2019-12-27 | 广东工业大学 | Attention-GRU-based short-term residential load prediction method |
CN111104969A (en) * | 2019-12-04 | 2020-05-05 | 东北大学 | Method for pre-judging collision possibility between unmanned vehicle and surrounding vehicle |
CN112163689A (en) * | 2020-08-18 | 2021-01-01 | 国网浙江省电力有限公司绍兴供电公司 | Short-term load quantile probability prediction method based on depth Attention-LSTM |
CN112215442A (en) * | 2020-11-27 | 2021-01-12 | 中国电力科学研究院有限公司 | Method, system, device and medium for predicting short-term load of power system |
CN112884236A (en) * | 2021-03-10 | 2021-06-01 | 南京工程学院 | Short-term load prediction method and system based on VDM decomposition and LSTM improvement |
Non-Patent Citations (4)
Title |
---|
杨青: "基于Hadoop的多维关联规则挖掘算法研究及应用", 《计算机工程与科学》 * |
梁烜彰: "高校数字化校园大数据平台设计研究", 《企业科技与发展》 * |
潘俊良;覃悦;韩长志;: "5G通信技术在农林业生产中的应用与展望", 安徽农学通报, no. 07 * |
程豪;吕晓玲;钟琰;范超;赵昱;: "大数据背景下智能手机APP组合推荐研究", 统计与信息论坛, no. 06 * |
Also Published As
Publication number | Publication date |
---|---|
CN114374953B (en) | 2023-09-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||