CN114374953B - APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIDS

Info

Publication number
CN114374953B
Authority
CN
China
Prior art keywords
base station
app
user
data
under
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210014131.XA
Other languages
Chinese (zh)
Other versions
CN114374953A (en)
Inventor
邹建华
付梁毓
赵玺
黄呈昊
陶敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG XI'AN JIAOTONG UNIVERSITY ACADEMY
Xian Jiaotong University
Original Assignee
GUANGDONG XI'AN JIAOTONG UNIVERSITY ACADEMY
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG XI'AN JIAOTONG UNIVERSITY ACADEMY, Xian Jiaotong University
Priority to CN202210014131.XA
Publication of CN114374953A
Application granted
Publication of CN114374953B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/60 Subscription-based services using application servers or record carriers, e.g. SIM application toolkits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22 Traffic simulation tools or models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/06 Testing, supervising or monitoring using simulated traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a system for predicting APP use under a base station with multi-source feature conversion, based on Hadoop and RAPIDS. The method preprocesses data with the Hadoop big data framework, extracts features from the internet access data of users connected to the base station, and computes multi-source data for the users at a given base station under different time windows to construct a multi-dimensional user portrait. Using Flink, it maps user portrait features to base station portrait features in each time window, obtaining base station feature portraits under different time windows as well as those of adjacent base stations. Prediction uses an Attention-LSTM multi-model ensemble voting method, and model training is accelerated on multiple GPUs with RAPIDS. APP use is predicted with respect to changes in people flow and crowd behavior under the base station; a feature mapping method is constructed, and features are extracted and mapped at the level of the group formed by individuals, so that the model and its features reflect the influence of individual behavior on APP use under the base station and the APP use prediction becomes more accurate.

Description

APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIDS
Technical Field
The invention belongs to the field of information technology, and in particular relates to a method and a system for predicting APP use under a base station with multi-source feature conversion, based on Hadoop and RAPIDS.
Background
With the development of the mobile internet, smartphones have become indispensable personal devices that touch almost every aspect of daily life. According to data published by the national communications institute, smartphone shipments in December 2019 were 28.931 million units, down 13.7% year on year, of which 5.414 million were 5G phones. For the whole of 2019, domestic smartphone shipments were 372 million units, down 4.7% year on year, of which 13.769 million were 5G phones. From the market and enterprise perspective, although smartphone penetration keeps rising, the market is not yet saturated and can still develop rapidly. The broad mobile phone market drives manufacturers such as Apple and Xiaomi to provide and develop better mobile platforms. Driven by smartphone technology and platforms and by the enthusiasm of individual developers, a large number of mobile applications (APPs) have been created that serve a wide range of usage scenarios and make daily life considerably more convenient. From the internet service provider's perspective, however, the rapidly growing mobile phone market and phone ownership pose a tremendous challenge to digital infrastructure worldwide.
Because most smartphone services, such as online games, video and navigation, depend on cloud servers, smartphones rely heavily on base stations or on indoor and outdoor WiFi. Outdoors in particular, where WiFi is unavailable, the user experience depends almost entirely on the signal quality provided by nearby base stations. Changes in the layout of base stations in a city lag behind urban development and the growth of the mobile phone market, so internet service providers need to be able to provide a good internet experience, indoors and outdoors, to users connected to base stations while per-capita smartphone ownership grows rapidly.
With the arrival of the 5G era, the rapid development of base station construction technology, the rapid advance of digital infrastructure, the construction of smart cities, and the cooperation of macro and micro base stations, internet service providers can dynamically change the layout of base stations in a city according to city-wide mobile phone usage. Building on prediction research into base station connection counts and geographic positions, and differing from earlier work by predicting APP use, the invention provides a reference for the dynamic planning of base stations in the 5G era and in smart cities.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a method and a system for predicting APP use under a base station with multi-source feature conversion, based on Hadoop and RAPIDS. Data processing, storage and computation are performed on the Hadoop big data framework; Flink is combined with deep learning, and training is accelerated on multiple GPUs with RAPIDS. APP use is predicted with respect to changes in people flow and crowd behavior under the base station. A feature mapping method is constructed, and features are extracted and mapped at the level of the group formed by individuals, so that the model and its features reflect the influence of individual behavior on APP use under the base station and the APP use prediction becomes more accurate.
To achieve the above purpose, the technical scheme adopted by the invention is as follows. A method for predicting APP use under a base station with multi-source feature conversion, based on Hadoop and RAPIDS, comprises the following steps:
acquiring historical data of the current base station and storing it in a Hadoop cluster; preprocessing the data with the MapReduce/Spark computing engine; cleaning and reconstructing the data set by removing abnormal values, unifying data formats and converting data types; and sorting the data by time;
dividing the cleaned and sorted data into time windows, performing feature filtering and mixed sampling on the features within each time window, and constructing a data set in a sliding time window manner;
classifying the data set after time-window balancing; before classification, the APPs are divided into three categories (low delay, high bandwidth, and multiple connection) according to their applicable scenarios, using the usage characteristics of existing APP categories;
predicting the APP category most likely to be used in the next time window with a voting mechanism based on an Attention-LSTM model;
the optimization process of the Attention-LSTM model is as follows:
acquiring internet access data of the users connected to the base station, the data comprising individual user data and internet access data;
obtaining the names of the APPs used and the URLs visited from the user internet access data, mapping them to APP categories through the correspondence between APP names, URLs and APP categories, and obtaining the most-used APP category of the base station under the current time window by weighting on both time and traffic;
constructing the basic features of the users under the current base station from the data on the most-used APP category of the base station under the current time window, and constructing a multi-dimensional user portrait from these basic features;
based on the multi-dimensional user portrait, using Flink to map user portrait features to base station portrait features in different time windows, thereby obtaining base station feature portraits under different time windows and the base station feature portraits of adjacent base stations;
and performing feature filtering and mixed sampling on the features within each time window, constructing a data set in a sliding time window manner, training the Attention-LSTM model and predicting with it, and accelerating training on multiple GPUs with RAPIDS to obtain the optimized Attention-LSTM model.
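The sliding-time-window construction of the data set can be illustrated with a minimal sketch. It assumes that each time window has already been reduced to a single category code (0 for low delay, 1 for high bandwidth, 2 for multiple connection); the function name `sliding_windows` and the sample values are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence, Tuple

def sliding_windows(series: Sequence[int], window: int, step: int = 1) -> List[Tuple[List[int], int]]:
    """Pair each run of `window` consecutive labels with the label that follows it."""
    samples = []
    for start in range(0, len(series) - window, step):
        history = list(series[start:start + window])
        target = series[start + window]  # category of the next time window
        samples.append((history, target))
    return samples

# Hypothetical dominant-category codes per time window:
# 0 = low delay, 1 = high bandwidth, 2 = multiple connection
labels = [0, 1, 1, 2, 0, 1]
dataset = sliding_windows(labels, window=3)
```

Each pair in `dataset` holds the history a sequence model would see and the category it would be asked to predict.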
Acquiring the internet access data of the users connected to the base station specifically comprises: extracting, from the base station data, the user internet access data used to train the model, which includes the URL, request start time, request end time, uplink traffic, downlink traffic, personal user data and base station position.
A sliding time window with a fixed step length is constructed with Flink. Within the time window, a second round of data cleaning, conversion and outlier processing is performed, the data are processed in a sliding manner, and the base station and user feature portraits are computed, giving the internet access data of the users under the base station for the fixed time step. Using an APP category comparison table, the names of the APPs used are resolved from the URLs of the internet access requests, and the APPs are divided into three categories (low delay, high bandwidth, and multiple connection) according to the correspondence between APP names, URLs and APP categories and the usage characteristics of the APPs;
based on the APP categories used, the APPs are weighted by usage time and traffic, and the APP category with the largest weight is taken as the most-used APP category under the current base station;
the weight is computed from the following quantities:
$W_j$ denotes the weight of APP $j$ under the current base station; $t_{ij}$ denotes the time for which the $i$-th person uses APP $j$; $d_{ij}$ denotes the downlink traffic of APP $j$ used by the $i$-th person under the current base station; $up_{ij}$ denotes the corresponding uplink traffic; $\mathrm{SUM}(t)$ denotes the total usage time of all APPs under the current base station; $\mathrm{SUM}(d)$ and $\mathrm{SUM}(up)$ denote the total downlink and uplink traffic of all APPs under the current base station, respectively; and $a$, $b$ and $c$ are weighting coefficients.
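As a rough illustration of time-and-traffic weighting, the sketch below assumes that the weight of each APP is a linear combination of its normalized usage time, downlink traffic and uplink traffic. This form is an assumption made for illustration; the patent's exact formula is not reproduced in this text, and the function name `app_weights`, the record layout and the equal default coefficients are all hypothetical.

```python
from collections import defaultdict

def app_weights(records, a=1/3, b=1/3, c=1/3):
    """records: iterable of (user_id, app, seconds, down_bytes, up_bytes).

    Assumed form: W_j = a*sum_i(t_ij)/SUM(t) + b*sum_i(d_ij)/SUM(d) + c*sum_i(up_ij)/SUM(up).
    """
    t = defaultdict(float)
    d = defaultdict(float)
    up = defaultdict(float)
    for _user, app, seconds, down, uplink in records:
        t[app] += seconds
        d[app] += down
        up[app] += uplink
    # Guard against empty totals so the division is always defined.
    sum_t = sum(t.values()) or 1.0
    sum_d = sum(d.values()) or 1.0
    sum_up = sum(up.values()) or 1.0
    return {app: a * t[app] / sum_t + b * d[app] / sum_d + c * up[app] / sum_up for app in t}

# Hypothetical records for one base station and one time window.
records = [
    ("u1", "video", 60, 900, 50),
    ("u2", "video", 30, 600, 30),
    ("u1", "chat", 10, 500, 20),
]
weights = app_weights(records)
most_used = max(weights, key=weights.get)
```

With coefficients that sum to one, the weights of all APPs also sum to one, which makes them easy to compare across time windows.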
User internet access data are processed in a sliding manner over several fixed time windows, and the user statistical features, internet access features and user movement features within each fixed time window are extracted. The user statistical features include age and gender; the user movement features include movement distance, number of stay points, stay-point duration, radius of gyration and average movement speed; the internet access features include access duration and access time period. In particular:
a. Movement range: the minimum circumscribed circle of the user's location point set is computed, and the hourly movement range is the area of that circle. The circle satisfies:
$$f(a,b,r):\ \max_{p_{i,j}\in R}\mathrm{Dis}(p_{i,j},p_{a,b})\le r$$
where $f(a,b,r)$ denotes the circle centred at longitude and latitude $(a,b)$ with radius $r$, $\mathrm{Dis}(p_{i,j},p_{a,b})$ denotes the straight-line distance between $(a,b)$ and $(i,j)$, and $p_{i,j}\in R$, with $R$ the user's location point set. The formula expresses the circumscribed circle of the location point set $R$.
b. Radius of gyration:
$$r=\max_{p_{i,j}\in R}\mathrm{Dis}(p_{i,j},p_{a,b})$$
where $p_{i,j}\in R$, $R$ is the user's location point set, and $r$ is the radius of gyration.
c. Number of new places visited:
$$c=\mathrm{Count}(p_{i,j})$$
where $\mathrm{Count}(\cdot)$ denotes counting and $p_{i,j}\in R'$, with $R'$ the set difference between the user's location points in the current time period and those in the previous time period.
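The movement features can be sketched as follows. This is a simplified illustration: the centre of the circle is passed in by the caller instead of being found by a minimum-enclosing-circle algorithm, distances use the haversine formula with an assumed Earth radius of 6371 km, and the function names are hypothetical.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def covering_radius_km(points, center):
    """Largest distance from `center` to any location point, i.e. r = max Dis(p, center)."""
    return max(haversine_km(p, center) for p in points)

def new_place_count(current, previous):
    """Number of location points seen in the current period but not in the previous one."""
    return len(set(current) - set(previous))

# Hypothetical location points for one user.
radius = covering_radius_km([(0.0, 0.0), (0.0, 1.0)], center=(0.0, 0.0))
new_places = new_place_count([(1, 1), (2, 2)], [(1, 1)])
```

The area of the circle, and hence the hourly movement range, follows from the radius as `math.pi * radius ** 2`.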
The mapping from the multi-dimensional user portrait to the base station portrait is defined in terms of the following quantities:
$a_{ij}$ is the $i$-th feature vector of the $j$-th person under the base station,
$D_{nj}$ is the $j$-th feature vector of the $n$-th base station, obtained by mapping the group features to the base station,
$p(a_{ij})$ is the probability that $a_{ij}$ occurs in the whole sequence,
$T$ is the end time of the current time window,
$t$ is the time at which an individual feature occurs,
$\alpha$ is a time attenuation factor;
according to this mapping relation, the basic feature vector of the base station is computed from the basic feature vectors of the crowd under the base station, giving the base station portrait.
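One plausible reading of the quantities above is a probability-weighted sum of user feature vectors with exponential time decay. The sketch below assumes that form purely for illustration; the patent's actual mapping formula is not reproduced in this text, and the function name `base_station_feature` and the tuple layout are hypothetical.

```python
import math

def base_station_feature(user_features, alpha, window_end):
    """Aggregate user feature vectors into one base station feature vector.

    user_features: list of (vector, probability, t_seen) tuples.
    Assumed form: D = sum_j p(a_j) * exp(-alpha * (T - t_j)) * a_j.
    """
    dim = len(user_features[0][0])
    out = [0.0] * dim
    for vec, prob, t_seen in user_features:
        # Older observations (t_seen far before window_end) contribute less.
        w = prob * math.exp(-alpha * (window_end - t_seen))
        for k in range(dim):
            out[k] += w * vec[k]
    return out

# Hypothetical features of two users observed at the end of the window.
station = base_station_feature(
    [([2.0, 0.0], 0.5, 10.0), ([0.0, 4.0], 0.5, 10.0)],
    alpha=0.0,
    window_end=10.0,
)
```

Setting `alpha` to zero disables the decay, so the result reduces to a plain probability-weighted average of the user vectors.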
When feature filtering and mixed data sampling are applied to the user portraits and base station portraits computed by Flink over the sliding time window, a GBDT model is used to evaluate the features of the existing data; features whose evaluation scores fall below a set threshold are filtered out, and the remaining features form the data set. The data set is then divided into a training set and a test set;
the training set is processed with the Borderline-SMOTE oversampling technique. The minority-class samples in the training set are divided into three categories, Safe, Danger and Noise: a Safe sample is one for which more than half of the surrounding samples belong to the minority class; a Danger sample is one for which more than half of the surrounding samples belong to the majority class, and it is regarded as a sample on the boundary; a Noise sample is one for which all surrounding samples belong to the majority class, and it is regarded as noise. The Danger minority samples are then oversampled: using the K-nearest-neighbor method, a minority-class sample is randomly selected among the neighbors and used to synthesize new minority samples.
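The Safe/Danger/Noise split can be sketched in plain Python as below. It covers only the categorization step, not the synthesis of new samples; the function name `categorize_minority`, the choice of k = 3 and the sample points are hypothetical, and a sample with exactly half majority neighbors is counted as Safe here for simplicity.

```python
import math

def categorize_minority(samples, labels, minority, k=3):
    """Label each minority sample Safe, Danger or Noise from its k nearest neighbours."""
    result = {}
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != minority:
            continue
        others = sorted((j for j in range(len(samples)) if j != i),
                        key=lambda j: math.dist(x, samples[j]))
        neighbours = others[:k]
        majority = sum(labels[j] != minority for j in neighbours)
        if majority == len(neighbours):
            result[i] = "Noise"    # every neighbour is a majority sample
        elif majority > len(neighbours) / 2:
            result[i] = "Danger"   # boundary sample, the one that gets oversampled
        else:
            result[i] = "Safe"
    return result

# Hypothetical 2-D samples: label 1 is the minority class, 0 the majority class.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 5.5),
           (5.4, 5), (5, 4.6), (20, 20), (20, 21), (21, 20), (19, 20)]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
groups = categorize_minority(samples, labels, minority=1)
```

In practice a library implementation such as the `BorderlineSMOTE` class in imbalanced-learn would normally be used instead of hand-written neighbor searches.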
The class of APP most likely to be used in the next time window is predicted with a voting mechanism based on the Attention-LSTM model as follows. Based on the extracted base station feature portrait, the crowd features and the APP categories, the Attention mechanism is combined with an LSTM model: the LSTM model extracts time-series features from the base station data, and these form a feature set together with the multi-dimensional portrait obtained by converting user portraits to the base station. Several different ensemble learning combinations, including boosting and bagging, are constructed, and multi-model predictive voting is performed with GBDT, random forest, SVM and KNN, each given an initial voting weight of 0.25. Models are selected by k-fold cross-validation, and the test set and training set for each round of training and prediction are drawn by sampling with replacement. The hyperparameters of the GBDT, random forest, SVM, KNN and Attention-LSTM models are tuned by Bayesian optimization, and the model combination with the best evaluation in k-fold cross-validation is selected for the final prediction. A three-class model is thus built, with training accelerated on multiple GPUs using RAPIDS, to predict which of the low-delay, high-bandwidth and multiple-connection categories the most-used APP under a specified base station will belong to at the next moment.
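The weighted voting step can be illustrated with a minimal sketch. It assumes each of the four models has already produced a category prediction; the model keys, the category names and the tie-breaking rule (first category in `CATEGORIES` wins) are illustrative choices, not details stated in the patent.

```python
from collections import defaultdict

CATEGORIES = ("low_delay", "high_bandwidth", "multi_connection")

def weighted_vote(predictions, weights):
    """Return the category with the largest total voting weight."""
    scores = defaultdict(float)
    for model, category in predictions.items():
        scores[category] += weights[model]
    # Ties are resolved in favour of the category listed first in CATEGORIES.
    return max(CATEGORIES, key=lambda c: scores[c])

# Initial voting weight of 0.25 per model, as described in the text.
weights = {"gbdt": 0.25, "random_forest": 0.25, "svm": 0.25, "knn": 0.25}
# Hypothetical per-model predictions for the next time window.
predictions = {
    "gbdt": "high_bandwidth",
    "random_forest": "high_bandwidth",
    "svm": "low_delay",
    "knn": "multi_connection",
}
winner = weighted_vote(predictions, weights)
```

Adjusting the entries of `weights` after cross-validation is one way the voting weights could then be tuned per model.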
In another aspect, the invention provides an APP use prediction system under a base station with multi-source feature conversion, based on Hadoop and RAPIDS, comprising a basic feature extraction module, an APP identification module, a portrait construction and integration module, a time-series feature construction module, a feature sample data set balancing module, a RAPIDS-based multi-GPU acceleration module, and a deep learning classification and identification module;
the basic feature extraction module constructs the user internet access records and the user movement position records, and extracts the users' APP usage preferences and behavior features from these two kinds of data;
the APP identification module converts the internet access records of user traffic into records of the APPs used, and identifies the APP usage features under the base station in different time windows;
the portrait construction and integration module constructs the user portrait from the output of the basic feature extraction module, maps the user portrait to a base station APP usage portrait, and identifies the base station's APP usage sequence and usage features;
the time-series feature construction module extracts time-series features from the output of the basic feature extraction module and outputs the time-series features of user behavior; the feature sample data set balancing module adjusts the data set with an improved SMOTE algorithm, sampling and balancing the samples of different categories to obtain a balanced test set and training set;
the multi-GPU acceleration module uses RAPIDS to parallelize the training of the deep learning model across multiple GPU cards, speeding up model training;
the deep learning classification and identification module constructs several different ensemble learning combinations, including boosting and bagging, on the balanced feature sample data set, adjusts the voting weights through k-fold cross-validation and Bayesian parameter tuning, builds a prediction model for low-delay, high-bandwidth and multiple-connection APPs, and predicts the APP category most likely to be used at the next moment under a specified base station.
In addition, the invention provides a computer device comprising a processor and a memory. The memory stores computer-executable programs; the processor reads some or all of the programs from the memory and executes them, and in doing so can implement the method for predicting APP use under a base station with multi-source feature conversion based on Hadoop and RAPIDS.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, can implement the method for predicting APP use under a base station with multi-source feature conversion based on Hadoop and RAPIDS according to the invention.
Compared with the prior art, the invention has at least the following beneficial effects:
the method for predicting APP use under a base station with multi-source feature conversion based on Hadoop and RAPIDS predicts APP use with respect to changes in people flow and crowd behavior under the base station, constructs a feature mapping method, and extracts and maps features at the level of the group formed by individuals, so that the model and its features reflect the influence of individual behavior on APP use under the base station and the APP use prediction becomes more accurate;
the method divides the various kinds of APP into three categories (low delay, high bandwidth, and multiple connection) according to their characteristics, and predicts the APP category used at the next moment under the base station with a three-class prediction model;
data processing, storage and computation are performed on the Hadoop big data framework, and Flink is combined with deep learning, with training accelerated on multiple GPUs using RAPIDS;
the method extracts features of users' internet access and movement behavior, explains changes in APP use from the perspective of user behavior, and analyzes users' internet access behavior in greater depth to obtain more implicit user features; through the user-to-base-station feature mapping, the individual behaviors of users are converted into behavior features of the base station at different moments, and potential changes in APP use under the base station are extracted from the user's perspective; an end-to-end prediction model is designed on Flink, with data truncated and computed in a sliding manner according to the time window;
the invention uses a multi-model ensemble voting mechanism based on Attention-LSTM: the Attention-LSTM performs further time-series feature extraction, and the GBDT, random forest, SVM and KNN models vote on the prediction with assigned voting weights; parameters are tuned online in real time through Bayesian optimization and k-fold cross-validation, and the best-performing model is selected for prediction.
Drawings
FIG. 1 is a flow chart of a prediction method that can be implemented in the present invention.
Detailed Description
The invention provides a method and a system for predicting APP use under a base station with multi-source feature conversion, based on Hadoop and RAPIDS. The method stores, cleans and computes data with the Hadoop big data framework and extracts features from the internet access data of users under the base station. For the users under a given base station, multi-source data such as internet access data, position data and personal data are computed under different time windows to construct a multi-dimensional user portrait. A time-series portrait is constructed with Flink, the APP usage preference of each time window under the base station is computed by traffic and time weighting, and APP names are converted into three categories (low delay, high bandwidth, and multiple connection) based on the usage characteristics of existing APP categories. The multi-dimensional user portrait is then converted into a base station portrait using the conversion formula from the multi-dimensional user portrait to the multi-dimensional base station portrait. On the resulting multi-dimensional base station feature portrait, GBDT and an improved SMOTE algorithm are used for feature selection and data set balancing, for model training and prediction. k-fold cross-validation is then carried out on the processed data set, and prediction uses a multi-model ensemble voting method based on Attention-LSTM, with model training accelerated on multiple GPUs using RAPIDS. Specifically, the Attention-LSTM extracts features; the GBDT, random forest, SVM and KNN models vote on the prediction; the model hyperparameters are tuned through Bayesian optimization and k-fold cross-validation; and the best-performing model is selected for prediction, thereby predicting APP use under the base station.
The method specifically comprises the following steps:
Step one, constructing the basic data from user internet access data: the data are stored and preprocessed on Hadoop; MapReduce/Spark performs the first round of cleaning, conversion and format unification to form preliminary usable basic data; and basic features such as user internet access features, behavior features and individual features are extracted from the basic data with Flink;
Step two, converting the URLs visited by the user into the names of the APPs used, with an APP-URL conversion unit; computing the APP usage preference of the base station under a specific time window by traffic weighting; then converting the APP names into APP categories based on the usage characteristics of the APP categories;
Table 1. APP-URL and APP type conversion table (only a partial list is given here for illustration)
Table 2. APP type and prediction type conversion relationship
Step three, performing feature mapping within the Flink time window, converting the individual features of users into base station features;
Step four, performing feature selection and sample rebalancing with an improved SMOTE algorithm and GBDT to obtain the data set for training and prediction;
Step five, constructing an Attention-LSTM on the data set for time-series feature extraction; constructing an ensemble learning model from GBDT, random forest, SVM and KNN for voting prediction, with model training accelerated on multiple GPUs using RAPIDS; and selecting the optimal model parameters through k-fold cross-validation and Bayesian optimization to predict APP use.
For the purpose of making the technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the attached drawings.
For the full internet access data and the anonymized individual data of the users under the base station, extracted from the operator's big data, the call data, portrait data, position data and internet access data undergo a first round of data cleaning and computation on the Hadoop big data framework, and the resulting data undergo ETL and sliding computation on Flink, laying the data foundation for the following steps.
Step one, constructing the basic data from user internet access data: the data are stored and preprocessed on Hadoop; MapReduce/Spark performs the first round of cleaning, conversion and format unification to form preliminary usable basic data; and basic features such as user internet access features, behavior position features and individual features are extracted from the basic data with Flink;
the internet access features include: the total, mean, variance and trend of traffic usage within the time window; URL request statistics, such as access trends and counts for Baidu, Google, QQ, WeChat and similar services; and APP usage statistics, such as the count, proportion and duration of use of commonly used APPs, and usage statistics for special APPs.
The behavior position features include: the user's movement distance, average movement speed, real-time maximum distance to the accessed base station, average count and entropy of base station accesses, the count and proportion of internet access at the workplace and at home, the number and proportion of position records by day and by night, and the radius of gyration of the user's movement;
the individual features include: user gender, age, mobile phone account opening time and accumulated active days.
Step two, converting the URLs visited by the user into the names of the APPs used, with an existing APP-URL conversion unit; computing the APP usage preference of the base station under a specific time window by traffic weighting; then converting the APP names into APP categories based on existing rules;
a sliding time window with a fixed step length is constructed with Flink. Within the time window, ETL processing is performed, the data are processed in a sliding manner, and the base station feature portrait and the multi-dimensional user portrait are computed, giving the internet access data of the users under the base station for the fixed time step. Using the APP category comparison table, the names of the APPs used are resolved from the URLs of the internet access requests, and the APPs are divided into three categories (low delay, high bandwidth, and multiple connection) according to the existing rules and the APP category comparison table. Based on the APP categories used, the APPs are weighted by usage time and traffic, and the APP category with the largest weight is taken as the most-used APP category under the current base station;
the weighting formula is:
wherein: $W_j$ denotes the weight of APP $j$ under the current base station; $t_{ij}$ denotes the time for which the $i$-th person uses APP $j$; $d_{ij}$ denotes the downlink traffic of APP $j$ under the current base station; $up_{ij}$ denotes the corresponding uplink traffic; SUM(t) denotes the total usage time of all APPs under the current base station; SUM(d) and SUM(up) denote the total downlink and uplink traffic, respectively, of all APPs under the current base station; and a, b and c are weighting coefficients.
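The weighting expression itself is not reproduced in this text. The symbol definitions suggest a normalized weighted sum, so the sketch below assumes the form W_j = a·Σ_i t_ij / SUM(t) + b·Σ_i d_ij / SUM(d) + c·Σ_i up_ij / SUM(up). That form, the function names and the default coefficient values are assumptions for illustration, not the patent's stated formula.

```python
from collections import defaultdict

def app_weights(records, a=0.4, b=0.3, c=0.3):
    """Compute an assumed weight W_j per APP for one base station and time window.

    records: iterable of (app, seconds_used, downlink_bytes, uplink_bytes),
             one entry per user session.
    a, b, c: hypothetical weighting coefficients for time, downlink and uplink.
    """
    t, d, up = defaultdict(float), defaultdict(float), defaultdict(float)
    for app, seconds, down, upl in records:
        t[app] += seconds   # sum over users i of t_ij
        d[app] += down      # sum over users i of d_ij
        up[app] += upl      # sum over users i of up_ij
    sum_t, sum_d, sum_up = sum(t.values()), sum(d.values()), sum(up.values())

    def share(x, total):
        # Guard against an empty window or a window with no traffic.
        return x / total if total else 0.0

    return {j: a * share(t[j], sum_t) + b * share(d[j], sum_d) + c * share(up[j], sum_up)
            for j in t}

def most_used(weights):
    """Return the APP (or APP category) with the largest weight."""
    return max(weights, key=weights.get)
```

If the first field of each record holds the APP category rather than the APP name, the same function yields the most used category directly.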
Step three, feature mapping is carried out within the Flink time window, and the individual features of users are converted into base station features:
the mapping method from user portrait to base station portrait comprises the following steps:
wherein $a_{ij}$ is the $i$-th feature vector of the $j$-th person under the base station,
$D_{nj}$ is the $j$-th feature vector of the $n$-th base station, obtained by mapping the group features onto the base station,
$p(a_{ij})$ is the probability that $a_{ij}$ occurs in the whole sequence,
$T$ is the end time of the current time window,
$t$ is the time at which an individual feature occurs,
$\alpha$ is a time-decay factor;
according to the mapping relation, the characteristic vector of the base station can be calculated according to the characteristic vector of the crowd under the base station, and further the portrait is obtained.
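The mapping expression is likewise not reproduced in this text. One reading consistent with the listed symbols is a sum of user feature values, each weighted by its probability of occurrence and by an exponential decay in the age of the observation. The sketch below implements that reading for a single scalar feature; the exponential form exp(-alpha·(T - t)), the use of empirical frequency for p, and all names are assumptions.

```python
import math
from collections import Counter

def map_to_station(observations, T, alpha=0.1):
    """Aggregate one user-level feature into a base-station-level value.

    observations: list of (value, t) pairs, one per user observation, with t <= T.
    T:            end time of the current time window.
    alpha:        hypothetical time-decay factor.
    """
    n = len(observations)
    freq = Counter(value for value, _ in observations)
    return sum(value * (freq[value] / n) * math.exp(-alpha * (T - t))
               for value, t in observations)
```

With this form, an observation made at the end of the window contributes at full weight, and older observations contribute progressively less.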
Step four, performing feature selection and sample rebalancing using an improved SMOTE algorithm and GBDT to obtain the data set for training and prediction;
feature filtering and mixed sampling are performed on the user's multi-dimensional portrait and the base station feature portrait of the time window: a GBDT model is used to evaluate the features of the existing data, features whose evaluation score falls below a set threshold are filtered out, and the remaining features are pruned to form a data set; in the invention, the threshold is set to 0.2, and the data set is then divided into a training set and a test set.
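The threshold filter described above can be sketched as follows. The scores are taken as given; in practice they could come from a trained GBDT model (for example the `feature_importances_` attribute of a scikit-learn gradient boosting estimator), but that choice, like the function and variable names, is an assumption for this example.

```python
def filter_features(rows, names, scores, threshold=0.2):
    """Keep only the columns whose evaluation score reaches the threshold.

    rows:   list of samples, each a list of feature values ordered like `names`.
    names:  list of feature names.
    scores: mapping from feature name to evaluation score.
    """
    keep = [i for i, name in enumerate(names) if scores[name] >= threshold]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows
```

Note that if the scores are normalized to sum to 1, at most five features can reach a threshold of 0.2, so the threshold strongly limits the size of the resulting feature set.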
Borderline-SMOTE oversampling is applied to the training set. The minority-class samples in the training set are divided into three classes, Safe, Danger and Noise: a Safe sample is one for which more than half of the surrounding samples belong to the minority class; a Danger sample is one for which more than half of the surrounding samples belong to the majority class, and is regarded as lying on the boundary; a Noise sample is one for which all surrounding samples belong to the majority class, and is regarded as noise. The Danger samples of the minority class are oversampled: for each, a minority-class sample is randomly selected using the K-nearest-neighbor method and used to generate new minority samples.
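The Safe / Danger / Noise partition can be sketched in plain Python as below. The function names, the Euclidean distance, the default k and the treatment of an exact half-and-half neighbourhood as Safe are assumptions; the interpolation step follows the usual SMOTE construction of a point on the segment between a sample and a minority-class neighbour.

```python
import math
import random

def classify_minority(X, y, minority, k=5):
    """Label each minority sample Safe, Danger or Noise from its k nearest neighbours."""
    labels = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(xi, X[j]))[:k]
        majority = sum(1 for j in neighbours if y[j] != minority)
        if majority == len(neighbours):
            labels[i] = "Noise"      # every neighbour is majority class
        elif majority > len(neighbours) / 2:
            labels[i] = "Danger"     # boundary sample, candidate for oversampling
        else:
            labels[i] = "Safe"
    return labels

def synthesize(x, neighbour, rng=random):
    """Create one synthetic sample on the segment between x and a minority neighbour."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]
```

Only samples labelled Danger would be passed to `synthesize`, each paired with a randomly chosen minority-class neighbour.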
Step five, constructing an Attention-LSTM on the data set for time-series feature extraction; constructing an ensemble learning model from GBDT, random forest, SVM and KNN; accelerating model training on multiple GPUs with RAPIDS; and finally performing voting prediction; the optimal model parameters are selected by k-fold cross validation and Bayesian optimization to predict APP usage.
Based on the extracted base station portrait features, crowd features and divided APP categories, a combination of the Attention mechanism and the LSTM model is used: the LSTM extracts the time-series features of the portraits, which together with the original basic portrait features form the feature set. A plurality of different ensemble learning combinations including boosting and bagging are constructed; GBDT, random forest, SVM and KNN are selected for multi-model predictive voting, and the voting weight of each of the four models is initialized to 0.25. Model selection is carried out by k-fold cross validation, with the test set and training set for each round of training and prediction drawn by sampling with replacement. The hyperparameters of the GBDT, random forest, SVM, KNN and Attention-LSTM models are tuned by Bayesian optimization, and the model combination with the best evaluation in k-fold cross validation is selected for the final prediction; a three-class model is thus constructed to predict whether the APP most used under a specified base station at the next moment is a low-delay, high-bandwidth or multi-connection APP.
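The weighted voting step can be illustrated with a minimal sketch. The model names used as dictionary keys and the function name are hypothetical; with four models and no explicit weights, each model receives the initial weight of 0.25 mentioned above.

```python
from collections import defaultdict

CATEGORIES = ("low-delay", "high-bandwidth", "multi-connection")

def weighted_vote(predictions, weights=None):
    """Combine per-model category predictions into one category.

    predictions: mapping from model name to its predicted category.
    weights:     mapping from model name to voting weight; defaults to equal weights.
    """
    if weights is None:
        weights = {model: 1.0 / len(predictions) for model in predictions}
    tally = defaultdict(float)
    for model, category in predictions.items():
        tally[category] += weights[model]
    # On a tie, max() returns the category that was tallied first.
    return max(tally, key=tally.get)
```

Adjusting the weights after cross validation lets a single well-performing model outvote the others, as the second assertion in a quick check would show: with weights of 0.7 for one model and 0.1 for each of the rest, that model's prediction wins even when the other three disagree with it.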
In another aspect, the invention also provides an APP use prediction system under the multi-source feature conversion base station based on Hadoop and RAPIDS, comprising a basic feature extraction module, an APP identification module, a portrait construction and integration module, a time-series feature construction module, a feature sample data set balancing module, a RAPIDS-based multi-GPU acceleration module and a deep learning classification and identification module;
the basic feature extraction module is used for constructing the user's internet access records and mobile location records, and for extracting the user's APP usage preferences and behavioral features from these two kinds of data;
the APP identification module is used for converting the records of the user's traffic usage into records of the user's APP usage, and for identifying the APP usage characteristics under the base station in different time windows;
the portrait construction module is used for constructing a user portrait based on the basic feature extraction module, mapping the user portrait into a base station APP usage portrait, and identifying the base station's APP usage sequence and usage features;
the time-series feature construction module is used for extracting time-series features from the output of the basic feature extraction module and outputting the time-series features of user behavior; the feature sample data set balancing module uses an improved SMOTE algorithm to adjust the data set, sampling and balancing the samples of the different categories to obtain a sample-balanced test set and training set;
the multi-GPU acceleration module uses RAPIDS to parallelize the training of the deep learning model across multiple cards, in order to speed up model training;
the deep learning prediction module is used for constructing, from the balanced feature sample data set, a plurality of different ensemble learning combinations including boosting and bagging, adjusting the voting weights by k-fold cross validation and Bayesian parameter tuning, constructing a prediction model for low-delay, high-bandwidth and multi-connection APPs, and predicting the type of APP most likely to be used at the next moment under a specified base station.
In addition, the invention also provides a computer device comprising a processor and a memory, wherein the memory stores a computer executable program, the processor reads part or all of the computer executable program from the memory and executes it, and when doing so the processor can implement the APP use prediction method under the multi-source feature conversion base station based on Hadoop and RAPIDS.
A computer readable storage medium may also be provided, where a computer program is stored, where the computer program, when executed by a processor, can implement the method for predicting APP usage under a multi-source feature conversion base station based on Hadoop and RAPIDS according to the present invention.
The device used for APP prediction can be a notebook computer, a tablet computer, a desktop computer or a workstation.
The processor may be a graphics processing unit (GPU), a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
The memory can be an internal memory unit of a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation, such as a memory and a hard disk; external storage units such as removable hard disks, flash memory cards may also be used.
Computer readable storage media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer readable storage medium may include: read-only memory (ROM), random access memory (RAM), a solid state drive (SSD), an optical disk, etc. The random access memory may include resistive random access memory (ReRAM).

Claims (9)

1. An APP use prediction method under a multi-source feature conversion base station based on Hadoop and RAPIDS, characterized by comprising the following steps:
acquiring historical data of a current base station, storing the historical data into a Hadoop cluster, preprocessing the data by adopting a MapReduce/Spark calculation engine, cleaning and reconstructing a data set in a mode of cleaning abnormal values of the data, unifying data formats and converting data types, and sequencing according to time;
performing time window division on the cleaned and sequenced data, performing feature filtering and mixed sampling on features in the time window, and constructing a data set in a sliding time window mode;
classifying the data set after time-window balancing, wherein, before classification, the APPs are divided into three categories (low delay, high bandwidth and multi-connection) according to their applicable scenarios, using the usage characteristics of the existing APP categories,
and predicting the APP category most likely to be used in the next time window by adopting a voting mechanism based on an Attention-LSTM model;
the optimization process of the Attention-LSTM model is as follows:
acquiring internet surfing data of a user connection base station, wherein the internet surfing data comprises user individual data and internet surfing data;
acquiring the used APP name and Internet surfing URL based on user Internet surfing data, mapping the APP name, internet surfing URL and APP category corresponding relation into the category to which the APP belongs, and obtaining the most used APP category of the base station under the current time window in a time and flow dual weighting mode;
constructing the basic features of the users under the current base station according to the data of the most used APP category of the base station in the current time window, and constructing a multi-dimensional portrait of the user based on the basic features;
based on the multi-dimensional portrait of the user, performing, with Flink, the mapping calculation from user portrait features to base station portrait features in different time windows, so as to obtain the base station feature portraits in different time windows and the base station feature portraits of adjacent base stations;
performing feature filtering and mixed sampling on the features in a time window, constructing a data set in a sliding time window manner, and training and predicting with the Attention-LSTM model, with multi-GPU acceleration based on RAPIDS during training, to obtain an optimized Attention-LSTM model; the prediction of the APP category most likely to be used in the next time window by the voting mechanism based on the Attention-LSTM model is as follows: based on the extracted base station feature portrait, crowd features and divided APP categories, a combination of the Attention mechanism and the LSTM model is used, the LSTM model extracting the time-series feature portrait of the base station's data, which together with the multi-dimensional portrait obtained by converting the users on the base station forms the feature set; a plurality of different ensemble learning combinations including boosting and bagging are constructed, multi-model predictive voting is performed with GBDT, random forest, SVM and KNN, and an initial voting weight of 0.25 is set for each of the GBDT, random forest, SVM and KNN models; model selection is carried out by k-fold cross validation, with the test set and training set for each round of training and prediction drawn by sampling with replacement; the hyperparameters of the GBDT, random forest, SVM, KNN and Attention-LSTM models are tuned by Bayesian optimization, the model combination with the best evaluation in k-fold cross validation is selected for the final prediction, and a three-class model is constructed, with multi-GPU accelerated training based on RAPIDS during model training, to predict whether the APP most used under a designated base station at the next moment is a low-delay, high-bandwidth or multi-connection APP.
2. The method for predicting APP usage under a multi-source feature conversion base station based on Hadoop and RAPIDS according to claim 1, wherein the obtaining of the internet data of the user connection base station is specifically: and extracting user internet surfing data for training the model from the base station data, wherein the user internet surfing data comprises url, request starting time, request ending time, uplink flow, downlink flow, user personal data and base station position.
3. The method for predicting APP use under a multi-source feature conversion base station based on Hadoop and RAPIDS according to claim 1, wherein a sliding time window with a fixed step length is constructed using Flink; the data in the time window undergo a second round of data cleaning, conversion and outlier processing; the window is slid over the data and the base station and user feature portraits are calculated, yielding the internet usage data of the users under the base station at a fixed time step; using an APP category comparison table, the APP names used by the user are resolved from the URLs of the internet requests, and the APPs used by the user are then divided into three categories (low delay, high bandwidth and multi-connection) according to the correspondence among APP names, internet URLs and APP categories and the usage characteristics of the APPs;
based on the used APP types, performing APP weighting according to the use time and the flow, and calculating the APP type with the largest weight as the most used APP type under the current base station;
the weighted calculation formula is:
wherein: $W_j$ denotes the weight of APP $j$ under the current base station; $t_{ij}$ denotes the time for which the $i$-th person uses APP $j$; $d_{ij}$ denotes the downlink traffic of APP $j$ under the current base station; $up_{ij}$ denotes the corresponding uplink traffic; SUM(t) denotes the total usage time of all APPs under the current base station; SUM(d) and SUM(up) denote the total downlink and uplink traffic, respectively, of all APPs under the current base station; and a, b and c are weighting coefficients.
4. The method for predicting APP usage under a multi-source feature conversion base station based on Hadoop and RAPIDS according to claim 1, wherein user statistics, internet usage features and user movement data within a fixed time window are extracted from the users' sliding internet usage data over a plurality of fixed time windows; the user statistics include age and gender; the user movement data features include moving distance, number of stay points, stay point duration, radius of gyration and average moving speed; the internet usage features include internet usage duration and internet usage time period; wherein
a. the activity range is calculated as follows: the minimum circumscribed circle of the user's location point set is computed, and the hourly activity range is the area of that circle, the circle being given by:
$f(a, b, r) = \max(\mathrm{Dis}(p_{i,j}, p_{a,b})) \le r$
wherein $f(a, b, r)$ represents a circle centered at longitude and latitude $(a, b)$ with radius $r$; $\mathrm{Dis}(p_{i,j}, p_{a,b})$ represents the straight-line distance between $(a, b)$ and $(i, j)$, where $p_{i,j} \in R$ and $R$ is the user's location point set; the expression defines the circumscribed circle of the location point set $R$;
b. the radius of gyration is calculated as:
$r = \max(\mathrm{Dis}(p_{i,j}, p_{a,b}))$
c. the number of occurrences of new locations is:
$c = \mathrm{Count}(p_{i,j})$
wherein $\mathrm{Count}()$ denotes counting; $p_{i,j} \in R'$, where $R'$ is the difference between the user's location points in the current time period and those in the previous time period; $R$ is the user's location point set and $r$ is the radius of gyration.
5. The method for predicting APP usage under a multi-source feature conversion base station based on Hadoop and RAPIDS according to claim 1, wherein the relationship of multi-dimensional portrait to base station portrait mapping of the user is:
wherein $a_{ij}$ is the $i$-th feature vector of the $j$-th person under the base station,
$D_{nj}$ is the $j$-th feature vector of the $n$-th base station, obtained by mapping the group features onto the base station,
$p(a_{ij})$ is the probability that $a_{ij}$ occurs in the whole sequence,
$T$ is the end time of the current time window,
$t$ is the time at which an individual feature occurs,
$\alpha$ is a time-decay factor;
and according to the mapping relation, calculating the basic feature vector of the base station according to the basic feature vector of the crowd under the base station, and obtaining the base station portrait.
6. The method for predicting APP use under a multi-source feature conversion base station based on Hadoop and RAPIDS according to claim 1, wherein, when feature filtering and mixed sampling are performed on the user portrait and the base station portrait of the sliding time window calculated by Flink, a GBDT model is used to evaluate the features of the existing data, features whose evaluation score falls below a set threshold are filtered out, and the remaining features are pruned to form a data set; the data set is then divided into a training set and a test set;
Borderline-SMOTE oversampling is applied to the training set, and the minority-class samples in the training set are divided into three classes, Safe, Danger and Noise, wherein a Safe sample is one for which more than half of the surrounding samples belong to the minority class; a Danger sample is one for which more than half of the surrounding samples belong to the majority class, and is regarded as lying on the boundary; and a Noise sample is one for which all surrounding samples belong to the majority class, and is regarded as noise; the Danger samples of the minority class are oversampled, a minority-class sample being randomly selected for each using the K-nearest-neighbor method and used to generate new minority samples.
7. An APP use prediction system under a multi-source feature conversion base station based on Hadoop and RAPIDS, characterized by comprising a basic feature extraction module, an APP identification module, a portrait construction and integration module, a time-series feature construction module, a feature sample data set balancing module, a RAPIDS-based multi-GPU acceleration module and a deep learning classification and identification module, the system being used to implement the method of any one of claims 1-6;
the basic feature extraction module is used for constructing the user's internet access records and mobile location records, and for extracting the user's APP usage preferences and behavioral features from these two kinds of data;
the APP identification module is used for converting the records of the user's traffic usage into records of the user's APP usage, and for identifying the APP usage characteristics under the base station in different time windows;
the portrait construction module is used for constructing a user portrait based on the basic feature extraction module, mapping the user portrait into a base station APP usage portrait, and identifying the base station's APP usage sequence and usage features;
the time-series feature construction module is used for extracting time-series features from the output of the basic feature extraction module and outputting the time-series features of user behavior; the feature sample data set balancing module uses an improved SMOTE algorithm to adjust the data set, sampling and balancing the samples of the different categories to obtain a sample-balanced test set and training set;
the multi-GPU acceleration module uses RAPIDS to parallelize the training of the deep learning model across multiple cards, in order to speed up model training;
the deep learning prediction module is used for constructing, from the balanced feature sample data set, a plurality of different ensemble learning combinations including boosting and bagging, adjusting the voting weights by k-fold cross validation and Bayesian parameter tuning, constructing a prediction model for low-delay, high-bandwidth and multi-connection APPs, and predicting the type of APP most likely to be used at the next moment under a specified base station.
8. A computer device, comprising a processor and a memory, wherein the memory is configured to store a computer executable program, the processor reads part or all of the computer executable program from the memory and executes the computer executable program, and the processor can implement the method for predicting APP usage under the multi-source feature conversion base station based on Hadoop and RAPIDS according to any one of claims 1 to 6 when executing part or all of the computer executable program.
9. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the computer program can implement the method for predicting APP usage under a multi-source feature conversion base station based on Hadoop and RAPIDS as claimed in any one of claims 1 to 6.
CN202210014131.XA 2022-01-06 2022-01-06 APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIS Active CN114374953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210014131.XA CN114374953B (en) 2022-01-06 2022-01-06 APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210014131.XA CN114374953B (en) 2022-01-06 2022-01-06 APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIS

Publications (2)

Publication Number Publication Date
CN114374953A CN114374953A (en) 2022-04-19
CN114374953B true CN114374953B (en) 2023-09-05

Family

ID=81141534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210014131.XA Active CN114374953B (en) 2022-01-06 2022-01-06 APP use prediction method and system under multi-source feature conversion base station based on Hadoop and RAPIS

Country Status (1)

Country Link
CN (1) CN114374953B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115617745B (en) * 2022-10-13 2023-08-01 自然资源部国土卫星遥感应用中心 Management method, management device and medium for satellite image data storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018223944A1 (en) * 2017-06-08 2018-12-13 中兴通讯股份有限公司 Phone bill generation method, device, mobile edge platform and storage medium
CN109492678A (en) * 2018-10-24 2019-03-19 浙江工业大学 A kind of App classification method of integrated shallow-layer and deep learning
CN109857922A (en) * 2019-01-18 2019-06-07 深圳壹账通智能科技有限公司 Data evaluate and test model modelling approach, device, computer equipment and storage medium
CN110266545A (en) * 2019-06-28 2019-09-20 北京小米移动软件有限公司 A kind of method, apparatus and medium dynamically distributing Internet resources
CN110602322A (en) * 2019-09-12 2019-12-20 北京车慧科技有限公司 Method and device for layout of application icons and mobile terminal
CN110619420A (en) * 2019-07-31 2019-12-27 广东工业大学 Attention-GRU-based short-term residential load prediction method
CN111104969A (en) * 2019-12-04 2020-05-05 东北大学 Method for pre-judging collision possibility between unmanned vehicle and surrounding vehicle
CN112163689A (en) * 2020-08-18 2021-01-01 国网浙江省电力有限公司绍兴供电公司 Short-term load quantile probability prediction method based on depth Attention-LSTM
CN112215442A (en) * 2020-11-27 2021-01-12 中国电力科学研究院有限公司 Method, system, device and medium for predicting short-term load of power system
CN112884236A (en) * 2021-03-10 2021-06-01 南京工程学院 Short-term load prediction method and system based on VDM decomposition and LSTM improvement

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140006418A1 (en) * 2012-07-02 2014-01-02 Andrea G. FORTE Method and apparatus for ranking apps in the wide-open internet
US9609118B2 (en) * 2015-03-09 2017-03-28 International Business Machines Corporation Usage of software programs on mobile computing devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨青.基于Hadoop的多维关联规则挖掘算法研究及应用.《计算机工程与科学》.2019,全文. *

Also Published As

Publication number Publication date
CN114374953A (en) 2022-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant