CN112990382B - Base station co-site identification method based on big data - Google Patents


Info

Publication number
CN112990382B
Authority
CN
China
Prior art keywords: data, cell, base station, model, samples
Prior art date
Legal status
Active
Application number
CN202110509326.7A
Other languages
Chinese (zh)
Other versions
CN112990382A (en)
Inventor
寇红侠
Current Assignee
Orange Frame Technology Jiangsu Co ltd
Original Assignee
Orange Frame Technology Jiangsu Co ltd
Priority date
Filing date
Publication date
Application filed by Orange Frame Technology Jiangsu Co., Ltd.
Priority to CN202110509326.7A
Publication of CN112990382A
Application granted
Publication of CN112990382B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/10 Scheduling measurement reports; Arrangements for measurement reports
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application discloses a base station co-site identification method based on big data, relating to the field of base station co-site identification and comprising the following steps. S1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, serving cell Id, serving cell TA, serving cell RSRP, serving cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, serving cell longitude, serving cell latitude, and a co-site flag. The application cleans the measurements of different-network frequency-band signals in the wireless environment measurement report (MR) and then applies machine learning to classify co-sited sites. This overcomes the influence of inaccurate site information in the resource management system, accurately identifies whether a base station is shared, provides strong support for operators deploying co-site sharing, and constitutes a scientific, effective and low-cost solution.

Description

Base station co-site identification method based on big data
Technical Field
The application relates to the technical field of base station co-site identification, in particular to a base station co-site identification method based on big data.
Background
Wireless environment measurement report (MR) data in a mobile communication network accurately reflect network coverage and give operators good tool support for understanding their wireless coverage. Good network coverage is a basic guarantee of an operator's survival. However, as the mobile communication network evolves further, in particular during the gradual transition from 4G to 5G, the wireless network adopts ever higher frequency bands with shorter wavelengths, which multiplies the construction scale. According to incomplete statistics, there are already more than 4 million 4G sites nationwide, and the number of 5G sites will be more than 3 times that of 4G, directly increasing operators' total investment cost.
Co-site sharing is a good cost optimization strategy. The three major operators (China Mobile, China Telecom and China Unicom) have already established a tower group (China Tower), which builds base stations and leases them to the three operators, who pay according to usage. Owing to historical legacy issues, the three operators each own a large number of their own sites, so operator-owned sites and shared sites cannot be well distinguished and classified, which in turn affects the tower group's site cost allocation. At present sites are distinguished and classified mainly by their longitude and latitude, but because a large amount of the basic site information in the operators' resource management systems is inconsistent with the actual sites (mainly because the systems are not updated in time after later site migration), the existing site classification is inaccurate.
Disclosure of Invention
In order to overcome the above drawbacks of the prior art, an embodiment of the present application provides a big-data-based method for identifying co-sited base stations, which cleans the measurements of different-network frequency-band signals in the wireless environment measurement report (MR) and then applies machine learning to classify co-sited sites, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present application provides the following technical solution: a base station co-site identification method based on big data, comprising the following steps:
S1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, serving cell Id, serving cell TA, serving cell RSRP, serving cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, serving cell longitude, serving cell latitude, and a co-site flag;
S2, data processing: process the MR data and the engineering parameter data to obtain new data; from the new data, select MR samples whose serving-cell RSRP value lies within a certain range, count the MR samples of each base station by base station SiteId, and keep only base stations whose number of MR samples exceeds a set value;
S3, feature extraction: for each base station (SiteId dimension), compute over all MR samples the mean, variance and coefficient of variation of the serving-cell RSRP and of the neighbor-cell RSRP, and the correlation coefficient between serving-cell RSRP and serving-cell TA; compute the same statistics separately for each distinct serving-cell TA value; these values are the feature data of each base station, the co-site flag is the label of each base station, and the two together form the new data;
S4, algorithm modeling: split the feature-extracted data into a training set and a test set in a certain proportion, train models on the training set with classification algorithms (random forest, GBDT, XGBoost), and verify the test set with the trained models;
S5, model selection: train on the training data with the random forest, GBDT and XGBoost algorithms respectively, obtain an optimal model for each algorithm by continuously tuning parameters, and verify the test set with the trained models;
S6, model application: after selecting the final model as above, save the model; then collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
Further, step S2 comprises the following sub-steps:
S21, match the MR data with the engineering parameter data by cell Id to obtain the base station to which each cell belongs and the cell's position coordinates (longitude and latitude), and delete matched records whose position coordinates are empty;
S22, for the data processed in step S21, calculate the distance from each MR sample to the base station from the user's and the cell's position coordinates (longitude and latitude); delete MR samples far from the base station and samples whose distance is inconsistent with the TA, obtaining new data;
S23, from the data obtained in step S22, select MR samples whose serving-cell RSRP value lies within a certain range, count the MR samples of each base station by base station SiteId, and keep only base stations whose number of MR samples exceeds a set value.
Further, the random forest algorithm in step S4 comprises the following steps:
S411, use the bootstrap method to draw, with replacement, K new bootstrap sample sets from the training set and construct K classification trees from them; the samples not drawn each time form K out-of-bag data sets;
S412, at each node of each tree, randomly select m < M variables, calculate the information content of each, and choose the variable with the strongest classification ability for node splitting;
S413, grow every decision tree fully, without pruning;
S414, assign each terminal node the mode (majority) class of the samples falling in that node;
S415, classify new observations with all trees, the final class being decided by majority vote.
Further, the GBDT algorithm in step S4 comprises the following steps:
S421, initialize the estimates of all samples on the K categories; F_k(x), an N×K matrix over the N samples, can be initialized to 0 or set randomly;
S422, repeat the following learning update M times;
S423, apply a Logistic transformation to each sample's function estimates, converting them into the probability that the sample belongs to each class via the transformation

p_k(x) = exp(F_k(x)) / Σ_{l=1}^{K} exp(F_l(x));

initially every class estimate of a sample is 0, so the class probabilities are all equal; as the estimates are updated, the probabilities change accordingly;
S424, traverse the probability of each category for all samples (note that this step iterates over categories, not over samples);
S425, compute the probability gradient of each sample on the k-th class: from the predicted probability p_k that a sample belongs to class k and the indicator y_k of whether it truly belongs to class k, learning proceeds through a regression tree algorithm by building a cost function and following its gradient; the log-likelihood form of the cost function is

L = − Σ_{k=1}^{K} y_k log p_k(x);

differentiating the cost function gives the residual

ỹ_k = y_k − p_k(x);

S426, learn a regression tree with J leaf nodes along the gradient: taking all samples, with the residual of each sample's probability on the k-th category as the update direction, learn a regression tree with J leaves; the basic learning process is that of an ordinary regression tree: traverse the feature dimensions of the samples, select a feature as a split point under the minimum-mean-square-error criterion, and stop once J leaf nodes have been learned;
S427, compute the gain of each leaf node; the gain formula of leaf node j is

γ_jk = ((K − 1)/K) · (Σ_{x_i ∈ R_jk} ỹ_ik) / (Σ_{x_i ∈ R_jk} |ỹ_ik| (1 − |ỹ_ik|));

S428, update the estimates of all samples under the k-th class; the gain obtained in the previous step is gradient-based and is used to update the sample estimates:

F_k(x) ← F_k(x) + Σ_{j=1}^{J} γ_jk · 1(x ∈ R_jk);

under the k-th class in the m-th iteration, the estimates F of all samples are obtained by adding the gain vector (formed from the gain values of the J leaf nodes) to the estimates from iteration m−1. After M iterations the final estimate matrix of all samples under all classes is obtained, from which multi-class classification can be realized.
Further, the XGBoost algorithm in step S4 comprises the following steps:
S431, define the complexity of a tree: split the tree into a structure part q and a leaf-weight part w, where w is a vector giving the output value of each leaf node; a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
S432, the boosting tree model in XGBoost: as in GBDT, XGBoost's boosting model fits residuals; the difference is that squared loss is not necessarily required when selecting split nodes, and compared with GBDT a regularization term for the complexity of the tree model is added, giving the loss function

Obj^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t);

S433, rewrite the objective function: in XGBoost the loss function is expanded directly by a second-order Taylor expansion (provided the loss function has continuous first and second derivatives); with g_i and h_i denoting the first and second derivatives of the loss, the leaf node area is assumed to be

I_j = { i | q(x_i) = j },  G_j = Σ_{i∈I_j} g_i,  H_j = Σ_{i∈I_j} h_i,

so the objective function can be converted into

Obj^(t) = Σ_{j=1}^{T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT;

differentiating with respect to w_j and setting the derivative to 0 yields

w_j* = − G_j / (H_j + λ),  Obj* = −(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT;

S434, the scoring function of a tree structure: for a given tree structure, the Obj value above represents at most how much the objective decreases; we call it the structure score, and it can be viewed as a function scoring a tree structure, much like the Gini index but more general; one could enumerate all possible tree structures, compute their structure scores and keep the one with the smallest Obj, but this is computationally prohibitive; the more common greedy approach instead tries, at each step, to split an existing leaf node (the first leaf node being the root node), the gain after the split being

Gain = (1/2) [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ.

Gain is used as the criterion for deciding whether to split: if Gain < 0, the leaf node is not split. All candidate splits must still be evaluated for each split; in practice the samples are first sorted by feature value, a single scan over the sorted samples' g_i yields G_L and G_R for every candidate split, and the split is then chosen by its Gain score.
Further, the verification in step S5 calculates the precision, recall and F1 score of each model:

Precision P = TP / (TP + FP),  Recall R = TP / (TP + FN),  F1 = 2PR / (P + R),

where TP is the number of positive samples correctly predicted as positive, FP is the number of negative samples wrongly predicted as positive, and FN is the number of positive samples wrongly predicted as negative;
as the definitions show, improving one of precision and recall tends, to some extent, to lower the other, so the F1 score is used to summarize the identification performance: the F1 scores of the three models on the test set are compared, the model with the largest F1 score is selected as the final model, and its classification result is output.
The technical effects and advantages of the application are as follows:
Compared with the prior art, the application cleans the measurements of different-network frequency-band signals in the wireless environment measurement report (MR) and then applies machine learning to classify co-sited sites. Verification shows that the method overcomes the influence of inaccurate site information in the resource management system, accurately identifies whether a base station is shared, provides strong support for operators deploying co-site sharing, and is a scientific, effective and low-cost solution.
Drawings
FIG. 1 is a flow chart of the present application.
Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
As shown in FIG. 1, the method for identifying co-sited base stations based on big data comprises the following steps:
S1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, serving cell Id, serving cell TA, serving cell RSRP, serving cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, serving cell longitude, serving cell latitude, and a co-site flag;
S2, data processing: process the MR data and the engineering parameter data to obtain new data; from the new data, select MR samples whose serving-cell RSRP value lies within a certain range, count the MR samples of each base station by base station SiteId, and keep only base stations whose number of MR samples exceeds a set value;
Step S2 comprises the following sub-steps:
S21, match the MR data with the engineering parameter data by cell Id to obtain the base station to which each cell belongs and the cell's position coordinates (longitude and latitude), and delete matched records whose position coordinates are empty;
S22, for the data processed in step S21, calculate the distance from each MR sample to the base station from the user's and the cell's position coordinates (longitude and latitude); delete MR samples far from the base station and samples whose distance is inconsistent with the TA, obtaining new data;
S23, from the data obtained in step S22, select MR samples whose serving-cell RSRP value lies within a certain range, count the MR samples of each base station by base station SiteId, and keep only base stations whose number of MR samples exceeds a set value;
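The cleaning rules of sub-steps S21 to S23 can be sketched as follows. This is a minimal illustration, not the patented implementation: the concrete thresholds (5 km maximum distance, the roughly 78.12 m per LTE TA step, a 500 m TA tolerance, the RSRP window and the minimum sample count) and the field names are assumptions chosen for the example only.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    # Great-circle distance in metres between user and cell coordinates.
    R = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def clean_samples(samples, max_dist_m=5000, ta_step_m=78.12,
                  ta_tolerance_m=500, rsrp_range=(-120, -60), min_count=3):
    # samples: dicts holding user/cell coordinates, TA, serving-cell RSRP, site id.
    kept = []
    for s in samples:
        if s["cell_lon"] is None or s["cell_lat"] is None:
            continue  # S21: drop records whose matched cell coordinates are empty
        d = haversine_m(s["user_lon"], s["user_lat"], s["cell_lon"], s["cell_lat"])
        if d > max_dist_m:
            continue  # S22: drop samples far from the base station
        if abs(d - s["ta"] * ta_step_m) > ta_tolerance_m:
            continue  # S22: drop samples whose distance disagrees with the TA
        if not (rsrp_range[0] <= s["rsrp"] <= rsrp_range[1]):
            continue  # S23: keep serving-cell RSRP only within a plausible range
        kept.append(s)
    counts = {}
    for s in kept:
        counts[s["site_id"]] = counts.get(s["site_id"], 0) + 1
    # S23: keep only base stations with enough MR samples
    return [s for s in kept if counts[s["site_id"]] >= min_count]
```

A record with empty cell coordinates, one whose user position is tens of kilometres from the cell, or one whose distance contradicts its TA is silently discarded; the remainder is filtered again by per-site sample count.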
S3, feature extraction: for each base station (SiteId dimension), compute over all MR samples the mean, variance and coefficient of variation of the serving-cell RSRP and of the neighbor-cell RSRP, and the correlation coefficient between serving-cell RSRP and serving-cell TA; compute the same statistics separately for each distinct serving-cell TA value; these values are the feature data of each base station, the co-site flag is the label of each base station, and the two together form the new data;
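A minimal sketch of the per-base-station statistics named in step S3 (mean, variance, coefficient of variation, and the correlation between serving-cell RSRP and TA). The function names and the use of the absolute mean in the coefficient of variation (RSRP means are negative dBm values) are illustrative assumptions.

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rsrp_features(rsrp, ta):
    # Per-site statistics over the MR samples of one base station.
    n = len(rsrp)
    mean = sum(rsrp) / n
    var = sum((r - mean) ** 2 for r in rsrp) / n  # population variance
    cv = math.sqrt(var) / abs(mean) if mean else 0.0  # coefficient of variation
    return {"mean": mean, "variance": var, "cv": cv,
            "rsrp_ta_corr": pearson(rsrp, ta)}
```

The same function would be applied once over all MR samples of a site and once per distinct TA value, as the step describes.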
S4, algorithm modeling: split the feature-extracted data into a training set and a test set in a certain proportion, train models on the training set with classification algorithms (random forest, GBDT, XGBoost), and verify the test set with the trained models;
the algorithm of the random forest comprises the following steps:
s411, randomly extracting K new self-service sample sets from the training set in a put-back way by applying a bootstrap method, and constructing K classification trees by the self-service sample sets, wherein samples which are not extracted each time form K pieces of out-of-bag data;
s412, randomly extracting M < M variables at each node of each number, calculating the information content of each variable, and then selecting one variable with the most classification capability from the M variables for node splitting;
s413, completely generating all decision trees without pruning;
s414, determining the category of the terminal node by the mode category corresponding to the node;
s415, classifying the new observation points by using all trees, wherein the classification is generated by a majority decision principle;
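The bootstrap-plus-majority-vote mechanics of steps S411 to S415 can be illustrated with a toy ensemble of decision stumps (one-level trees). A real implementation would grow full unpruned trees as S413 requires, so this is only a sketch; all names and the data set are invented for the example.

```python
import random
from collections import Counter

def majority(labels):
    # Mode class of a set of labels (S414); empty leaves default to class 0.
    return Counter(labels).most_common(1)[0][0] if labels else 0

def best_stump(sample, feats):
    # S412: among the candidate features, pick the (feature, threshold)
    # split with the fewest misclassifications on the bootstrap sample.
    best = None
    for f in feats:
        for x, _ in sample:
            t = x[f]
            left = majority([y for xs, y in sample if xs[f] <= t])
            right = majority([y for xs, y in sample if xs[f] > t])
            err = sum((left if xs[f] <= t else right) != y for xs, y in sample)
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    return best[1:]

def train_forest(data, n_trees, m, rng):
    n_feats = len(data[0][0])
    trees = []
    for _ in range(n_trees):
        # S411: bootstrap sample drawn with replacement
        sample = [rng.choice(data) for _ in data]
        feats = rng.sample(range(n_feats), m)  # S412: m < M random features
        trees.append(best_stump(sample, feats))
    return trees

def predict(trees, x):
    # S415: majority vote over all trees
    votes = [left if x[f] <= t else right for f, t, left, right in trees]
    return Counter(votes).most_common(1)[0][0]
```

On a separable toy data set of `(features, label)` pairs, the ensemble reproduces the labels by majority vote even though individual bootstrap samples differ.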
The GBDT algorithm comprises the following steps:
S421, initialize the estimates of all samples on the K categories; F_k(x), an N×K matrix over the N samples, can be initialized to 0 or set randomly;
S422, repeat the following learning update M times;
S423, apply a Logistic transformation to each sample's function estimates, converting them into the probability that the sample belongs to each class via the transformation

p_k(x) = exp(F_k(x)) / Σ_{l=1}^{K} exp(F_l(x));

initially every class estimate of a sample is 0, so the class probabilities are all equal; as the estimates are updated, the probabilities change accordingly;
S424, traverse the probability of each category for all samples (note that this step iterates over categories, not over samples);
S425, compute the probability gradient of each sample on the k-th class: from the predicted probability p_k that a sample belongs to class k and the indicator y_k of whether it truly belongs to class k, learning proceeds through a regression tree algorithm by building a cost function and following its gradient; the log-likelihood form of the cost function is

L = − Σ_{k=1}^{K} y_k log p_k(x);

differentiating the cost function gives the residual

ỹ_k = y_k − p_k(x);

S426, learn a regression tree with J leaf nodes along the gradient: taking all samples, with the residual of each sample's probability on the k-th category as the update direction, learn a regression tree with J leaves; the basic learning process is that of an ordinary regression tree: traverse the feature dimensions of the samples, select a feature as a split point under the minimum-mean-square-error criterion, and stop once J leaf nodes have been learned;
S427, compute the gain of each leaf node; the gain formula of leaf node j is

γ_jk = ((K − 1)/K) · (Σ_{x_i ∈ R_jk} ỹ_ik) / (Σ_{x_i ∈ R_jk} |ỹ_ik| (1 − |ỹ_ik|));

S428, update the estimates of all samples under the k-th class; the gain obtained in the previous step is gradient-based and is used to update the sample estimates:

F_k(x) ← F_k(x) + Σ_{j=1}^{J} γ_jk · 1(x ∈ R_jk);

under the k-th class in the m-th iteration, the estimates F of all samples are obtained by adding the gain vector (formed from the gain values of the J leaf nodes) to the estimates from iteration m−1; after M iterations the final estimate matrix of all samples under all classes is obtained, from which multi-class classification can be realized;
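The Logistic transformation of S423, the residual (negative gradient) of S425 and the leaf-node value of S427 can be sketched in a few lines. `leaf_gain` follows the standard multiclass gradient-boosting leaf-value formula; since the patent's own formula images are not reproduced in the text, treating that formula as the intended one is an assumption.

```python
import math

def softmax(scores):
    # S423: Logistic transform of per-class estimates F_k(x) into probabilities.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residual(y_onehot, probs):
    # S425: gradient of the log-likelihood loss, y_k - p_k, per class.
    return [y - p for y, p in zip(y_onehot, probs)]

def leaf_gain(leaf_residuals, K):
    # S427: leaf value gamma = (K-1)/K * sum(r) / sum(|r| * (1 - |r|)).
    num = sum(leaf_residuals)
    den = sum(abs(r) * (1 - abs(r)) for r in leaf_residuals)
    return (K - 1) / K * num / den if den else 0.0
```

With all estimates initialized to 0, `softmax` returns equal probabilities for every class, matching the remark after S423 that the initial class probabilities are equal.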
The XGBoost algorithm comprises the following steps:
S431, define the complexity of a tree: split the tree into a structure part q and a leaf-weight part w, where w is a vector giving the output value of each leaf node; a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control model overfitting;
S432, the boosting tree model in XGBoost: as in GBDT, XGBoost's boosting model fits residuals; the difference is that squared loss is not necessarily required when selecting split nodes, and compared with GBDT a regularization term for the complexity of the tree model is added, giving the loss function

Obj^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t);

S433, rewrite the objective function: in XGBoost the loss function is expanded directly by a second-order Taylor expansion (provided the loss function has continuous first and second derivatives); with g_i and h_i denoting the first and second derivatives of the loss, the leaf node area is assumed to be

I_j = { i | q(x_i) = j },  G_j = Σ_{i∈I_j} g_i,  H_j = Σ_{i∈I_j} h_i,

so the objective function can be converted into

Obj^(t) = Σ_{j=1}^{T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT;

differentiating with respect to w_j and setting the derivative to 0 yields

w_j* = − G_j / (H_j + λ),  Obj* = −(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT;

S434, the scoring function of a tree structure: for a given tree structure, the Obj value above represents at most how much the objective decreases; we call it the structure score, and it can be viewed as a function scoring a tree structure, much like the Gini index but more general; one could enumerate all possible tree structures, compute their structure scores and keep the one with the smallest Obj, but this is computationally prohibitive; the more common greedy approach instead tries, at each step, to split an existing leaf node (the first leaf node being the root node), the gain after the split being

Gain = (1/2) [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ;

Gain is used as the criterion for deciding whether to split: if Gain < 0, the leaf node is not split; all candidate splits must still be evaluated for each split, so in practice the samples are first sorted by feature value, a single scan over the sorted samples' g_i yields G_L and G_R for every candidate split, and the split is chosen by its Gain score;
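The split gain of step S434 is easy to state in code; the default values for λ and γ below are arbitrary illustration choices, not values from the patent.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 [ GL^2/(HL+lam) + GR^2/(HR+lam)
    #              - (GL+GR)^2/(HL+HR+lam) ] - gamma
    def score(G, H):
        # structure score contribution of one leaf: G^2 / (H + lambda)
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```

If the returned Gain is negative the leaf is left unsplit, matching the criterion in the text; γ thus acts as a minimum gain required to justify a split.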
S5, model selection: train on the training data with the random forest, GBDT and XGBoost algorithms respectively, obtain an optimal model for each algorithm by continuously tuning parameters, and verify the test set with the trained models;
the verification in step S5 calculates the precision, recall and F1 score of each model:

Precision P = TP / (TP + FP),  Recall R = TP / (TP + FN),  F1 = 2PR / (P + R),

where TP is the number of positive samples correctly predicted as positive, FP is the number of negative samples wrongly predicted as positive, and FN is the number of positive samples wrongly predicted as negative;
as the definitions show, improving one of precision and recall tends, to some extent, to lower the other, so the F1 score is used to summarize the identification performance: the F1 scores of the three models on the test set are compared, the model with the largest F1 score is selected as the final model, and its classification result is output;
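The precision/recall/F1 computation used for model selection in step S5, in code form; the counts in the usage below are invented example numbers.

```python
def precision_recall_f1(tp, fp, fn):
    # tp: positives correctly predicted positive
    # fp: negatives wrongly predicted positive
    # fn: positives wrongly predicted negative
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a model that finds 8 of 10 truly co-sited stations while raising 2 false alarms scores P = R = F1 = 0.8; the model with the largest F1 on the test set is kept.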
S6, model application: after selecting the final model as above, save the model; then collect MR measurement report data and engineering parameter data, process them, classify the base stations with the saved model, and output the identification results for all base stations.
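Step S6 (save the selected model, then classify freshly processed data with it) can be sketched with Python's pickle. The `ThresholdClassifier` is a hypothetical stand-in for whichever of the three trained models wins in step S5, and the feature values are invented.

```python
import os
import pickle
import tempfile

class ThresholdClassifier:
    # Hypothetical stand-in for the model selected in step S5.
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, feature_values):
        # 1 = co-sited base station, 0 = not co-sited
        return [1 if v >= self.threshold else 0 for v in feature_values]

# S6: save the final model ...
model_path = os.path.join(tempfile.mkdtemp(), "cosite_model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(ThresholdClassifier(0.5), fh)

# ... later, load it and classify newly processed base-station features
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)
results = loaded.predict([0.2, 0.9])
```

The same save-then-load pattern applies to scikit-learn or XGBoost model objects, which are also picklable, so batch identification runs can reuse a model trained once.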
A final few points should be noted: first, in the description of the present application, unless otherwise specified and defined, the terms "mounted", "connected" and "coupled" are to be construed broadly and may denote mechanical or electrical connections or a direct connection between two elements, while "upper", "lower", "left", "right", etc. merely indicate relative positional relationships, which may change when the absolute position of the described object changes;
second, the drawings of the disclosed embodiments show only the structures involved in those embodiments, other structures following common designs, and the same or different embodiments of the present disclosure may be combined with one another where no conflict arises;
finally, the foregoing description of the preferred embodiments is not intended to limit the application to the precise forms disclosed; any modifications, equivalents and alternatives falling within the spirit and principles of the application are intended to be included within its scope.

Claims (6)

1. The method for identifying the common station address of the base station based on the big data is characterized by comprising the following steps:
s1, data collection: the MR data and the industrial data of the multi-day wireless measurement report are collected, and the used index variables are as follows: time, base station SiteId, own cell Id, own cell TA, own cell RSRP, own cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, longitude of user, latitude of user, longitude of own cell, latitude of own cell, and sign of whether to co-station;
s2, data processing: the MR data and the engineering parameter data are processed to obtain new data; from the new data, MR sampling points whose serving-cell RSRP value lies within a certain range are selected; the processed data are then counted per base station by SiteId, and only the MR sampling points of base stations whose sampling-point count exceeds a set value are retained;
s3, feature extraction: for each base station (by SiteId), the serving-cell RSRP mean, serving-cell RSRP variance, serving-cell RSRP dispersion coefficient, neighbor-cell RSRP mean, neighbor-cell RSRP variance and neighbor-cell RSRP dispersion coefficient are calculated over all MR sampling points, and the same statistics plus the TA value are calculated separately for each distinct serving-cell TA; these constitute the feature data of each base station; the co-site flag is the label data of each base station, and the feature data and label data together form the new data set;
s4, algorithm modeling: the feature-extracted data are divided into a training set and a test set in a certain proportion; classification algorithms are used to train models on the training set, and the trained models are verified on the test set; the classification algorithms are random forest, GBDT (gradient boosting decision tree) and XGBoost;
s5, model selection: the training data are trained with the random forest, GBDT and XGBoost algorithms respectively; each algorithm obtains its optimal model through continuous parameter tuning, and the trained models are used to verify the test set;
s6, model application: after the final model is selected as described above, the model is saved; MR measurement report data and engineering parameter data are collected and processed, the base stations are classified with the saved model, and the identification results of all base stations are output.
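The per-site aggregation of step S3 can be sketched as below; the input layout and the key names `SiteId`, `SRSRP` (serving-cell RSRP) and `NRSRP` (neighbor-cell RSRP) are illustrative assumptions, and the dispersion coefficient is taken as std/mean:

```python
from collections import defaultdict
from statistics import mean, pvariance
from math import sqrt

def site_features(samples):
    """Aggregate MR samples per base station (SiteId dimension).

    Returns, per site, (mean, variance, dispersion coefficient) of the
    serving-cell RSRP followed by the same triple for the neighbor-cell RSRP.
    """
    by_site = defaultdict(list)
    for s in samples:
        by_site[s["SiteId"]].append((s["SRSRP"], s["NRSRP"]))

    def stats(xs):
        m, v = mean(xs), pvariance(xs)
        return (m, v, sqrt(v) / m)  # dispersion coefficient = std / mean

    feats = {}
    for site, vals in by_site.items():
        srsrp = [v[0] for v in vals]
        nrsrp = [v[1] for v in vals]
        feats[site] = stats(srsrp) + stats(nrsrp)
    return feats
```

In the full scheme the same statistics would additionally be computed per distinct TA value; this sketch shows only the per-site aggregation.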
2. The method for identifying a base station co-site based on big data according to claim 1, wherein said step S2 comprises the sub-steps of:
s21, the MR data are matched with the engineering parameter data through the cell Id to obtain the base station to which each cell belongs and the position coordinates (longitude and latitude) of each cell; records whose position coordinates are empty are deleted from the matched data;
s22, for the data processed in step S21, the distance between each MR sampling point and the base station is calculated from the position coordinates (longitude and latitude) of the user and of the cell; MR sampling points far from the base station, and sampling points whose distance does not match the TA, are deleted to obtain new data;
s23, for the data obtained in step S22, MR sampling points whose serving-cell RSRP value lies within a certain range are selected; the processed data are counted per base station by SiteId, and only the MR sampling points of base stations whose sampling-point count exceeds a set value are retained.
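The distance screening of step S22 might look like this sketch; the haversine formula on WGS-84 coordinates, the common LTE convention of roughly 78.12 m per TA step, and the numeric thresholds are all assumptions for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(a))

def keep_sample(user_lon, user_lat, cell_lon, cell_lat, ta,
                max_dist_m=10000, ta_step_m=78.12, tol_m=500):
    """Step-S22 filter: drop samples far from the site or whose
    distance disagrees with the TA; all thresholds are illustrative."""
    d = haversine_m(user_lon, user_lat, cell_lon, cell_lat)
    if d > max_dist_m:                       # far from the base station
        return False
    return abs(d - ta * ta_step_m) <= tol_m  # distance/TA mismatch
```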
3. The method for identifying the co-site of the base station based on big data as set forth in claim 1, wherein the algorithm of the random forest in the step S4 comprises the steps of:
s411, K new bootstrap sample sets are drawn from the training set by random sampling with replacement (the bootstrap method), and K classification trees are built from these sample sets; the samples not drawn in each round form K sets of out-of-bag data;
s412, at each node of each tree, m variables (m < M, with M the total number of variables) are drawn at random, the information content of each variable is calculated, and the variable with the strongest classification ability among the m variables is selected for node splitting;
s413, every decision tree is grown fully, without pruning;
s414, the class of each terminal node is determined by the mode (majority) class of the samples at that node;
s415, new observations are classified using all the trees, the final class being produced by the majority-vote principle.
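Steps S411-S415 describe a standard random forest; a scikit-learn equivalent on illustrative toy data (the parameters map onto the claim's K trees, m < M variables per node, out-of-bag data and majority vote) might be:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # K bootstrap sample sets / classification trees
    max_features="sqrt",  # m < M variables tried at each node
    bootstrap=True,       # sampling with replacement (bootstrap method)
    oob_score=True,       # evaluate on the out-of-bag data
    random_state=0,
)
# Toy data: the label equals the first feature, so the forest should
# recover that rule exactly.
X = [[0, 0], [1, 1], [0, 1], [1, 0]] * 10
y = [0, 1, 0, 1] * 10
clf.fit(X, y)
pred = clf.predict([[0, 0]])
```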
4. The method for identifying the co-site of the base station based on the big data according to claim 1, wherein the algorithm of GBDT in the step S4 comprises the following steps:
s421, the estimated values of all samples on the K classes, F_k(x), are initialized as an all-zero matrix or set randomly;
s422, the following learning-and-update process is repeated M times;
s423, a Logistic transformation is applied to the current function estimates of the samples, converting each sample's estimates into probabilities of belonging to each class via the transformation formula:

p_k(x) = exp(F_k(x)) / Σ_{l=1..K} exp(F_l(x))

at initialization the estimated value of every class is 0, so the probabilities of all classes are equal; as the estimates are updated, the probabilities change accordingly;
s424, for all samples, the probability of each class is traversed; in this step the traversal is over the classes rather than over all samples;
s425, the probability gradient of each sample on the k-th class is solved: the gap between the probability that a sample belongs to class k and whether the sample truly belongs to class k is fitted by a regression tree algorithm; a cost function is established and learned by gradient descent, the log-likelihood form of the cost function being:

L = − Σ_{k=1..K} y_k · log p_k(x)

differentiating the cost function yields the residual (negative gradient):

r_k = y_k − p_k(x)
s426, a regression tree with J leaf nodes is learned along the gradient direction: all samples x_i, i = 1, 2, …, N are input, with the residual of each sample's probability on the k-th class as the update direction, and a regression tree with J leaves is learned; the basic learning process is that of an ordinary regression tree: the feature dimensions of the samples are traversed and a feature is selected as a split point under the minimum-mean-square-error criterion, and learning stops once J leaf nodes have been produced;
s427, the gain of each leaf node is obtained; the gain of leaf node j on class k is computed as:

γ_jk = ((K − 1) / K) · ( Σ_{x_i ∈ R_jk} r_ik ) / ( Σ_{x_i ∈ R_jk} |r_ik| · (1 − |r_ik|) )
s428, the estimated values of all samples under the k-th class are updated; the gain obtained in the previous step is based on the gradient, and the sample estimates are updated with it:

F_k,m(x) = F_k,m−1(x) + Σ_{j=1..J} γ_jk · 1(x ∈ R_jk)

under the k-th class in the m-th iteration, the estimates F of all samples are obtained from the sample estimates of iteration m−1 plus the gain vector, the gain vector being formed from the gain values of the J leaf nodes; after M rounds of iterative learning the final estimate matrix of all samples under all classes is obtained, and multi-class classification is realized on the basis of this estimate matrix.
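A small sketch of the K-class logistic (softmax) transform and the per-class probability gradient used in steps S423-S425, assuming the standard multinomial GBDT formulation:

```python
from math import exp

def softmax(scores):
    """Convert per-class estimates F_k(x) into class probabilities p_k(x)."""
    ex = [exp(s) for s in scores]
    tot = sum(ex)
    return [e / tot for e in ex]

def gradients(y_onehot, scores):
    """Residual (negative gradient) of the multinomial log-likelihood:
    r_k = y_k - p_k(x); this is what the next regression tree fits."""
    p = softmax(scores)
    return [yk - pk for yk, pk in zip(y_onehot, p)]
```

With all estimates initialized to 0, every class gets equal probability, matching the claim's description of the initial state.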
5. The method for identifying the co-site of the base station based on big data as set forth in claim 1, wherein the algorithm of Xgboost in step S4 comprises the steps of:
s431, the complexity of the tree is defined: the tree is split into a structure part q and a leaf-node weight part w, where w is a vector holding the output value of each leaf node:

f_t(x) = w_q(x), w ∈ R^T, q: R^d → {1, 2, …, T}

a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control overfitting of the model;
s432, the boosted tree model in XGBoost: as in the GBDT method, the boosting model of XGBoost also fits residuals; the difference is that the minimum square loss is not necessarily required when split nodes are selected; the loss function is as follows, adding, compared with GBDT, a regularization term based on the complexity of the tree model:

Obj^(t) = Σ_{i=1..n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)
s433, the objective function is rewritten: in XGBoost the loss function is expanded directly to second order with a Taylor expansion, provided the loss function is first- and second-order continuously differentiable; defining the leaf-node region as:

I_j = { i | q(x_i) = j }

the objective function can be converted into:

Obj = Σ_{j=1..T} [ G_j · w_j + (1/2) · (H_j + λ) · w_j² ] + γ · T, with G_j = Σ_{i ∈ I_j} g_i and H_j = Σ_{i ∈ I_j} h_i

taking the derivative with respect to w_j and setting it to 0 gives:

w_j* = − G_j / (H_j + λ), Obj* = − (1/2) · Σ_{j=1..T} G_j² / (H_j + λ) + γ · T
s434, the scoring function of the tree structure: the structure score is a function similar to the Gini index and scores a tree structure; enumerating every possible structure, finding the one with the minimum Obj score and comparing structure scores would yield the optimal tree, but the computational cost of this method is far too great; instead, a greedy method is used to split an existing leaf node each time, the initial leaf node being the root node, and the gain after splitting is:

Gain = (1/2) · [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ

Gain is taken as the condition for deciding whether to split: if Gain < 0, the leaf node is not split. Rather than listing all partition schemes, for each split the samples are sorted by g_i from small to large and traversed once to check whether each node needs to be split; G_L and G_R are accumulated in a single scan of the samples, and the split is then made according to the Gain score.
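The greedy single-scan split search of step S434 can be sketched as follows; `lam` and `gamma` stand for the λ and γ regularization terms, and the function assumes the per-sample gradients `g` and hessians `h` are already sorted by the candidate feature's value:

```python
def best_split(g, h, lam=1.0, gamma=0.0):
    """Scan once over sorted samples, accumulating G_L/H_L, and score each
    candidate split with the XGBoost gain formula; splits with Gain < 0
    are rejected (best_pos stays None)."""
    G, H = sum(g), sum(h)

    def score(Gs, Hs):
        return Gs * Gs / (Hs + lam)

    best_gain, best_pos = 0.0, None
    GL = HL = 0.0
    for i in range(len(g) - 1):          # split between sample i and i+1
        GL += g[i]
        HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_pos = gain, i + 1
    return best_gain, best_pos
```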
6. The method for identifying the co-site of the base station based on big data as claimed in claim 1, wherein: the verification in step S5 is to calculate the precision, recall and F_1 value of each model, the calculation formulas being:

precision = TP / (TP + FP), recall = TP / (TP + FN), F_1 = 2 · precision · recall / (precision + recall)

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives;
as the definitions of recall and precision show, improving one of them generally causes the other to decrease, so the F_1 value combines the two to reflect the overall identification effect; the F_1 values of the three models on the test set are compared, the model with the largest F_1 value is selected as the final model, and the classification result is output.
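The comparison metrics of claim 6, written out under the standard definitions of TP, FP and FN (a minimal sketch):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false
    negative counts; F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```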
CN202110509326.7A 2021-05-11 2021-05-11 Base station co-site identification method based on big data Active CN112990382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110509326.7A CN112990382B (en) 2021-05-11 2021-05-11 Base station co-site identification method based on big data


Publications (2)

Publication Number Publication Date
CN112990382A CN112990382A (en) 2021-06-18
CN112990382B true CN112990382B (en) 2023-11-21

Family

ID=76337493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110509326.7A Active CN112990382B (en) 2021-05-11 2021-05-11 Base station co-site identification method based on big data

Country Status (1)

Country Link
CN (1) CN112990382B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401527A (en) * 2021-12-21 2022-04-26 中国电信股份有限公司 Load identification method and device of wireless network and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103907368A (en) * 2011-12-27 2014-07-02 松下电器产业株式会社 Server device, base station device, and identification number establishment method
CN106131953A (en) * 2016-07-07 2016-11-16 上海奕行信息科技有限公司 A kind of method realizing mobile subscriber location based on frequency weighting in community in the period
CN109302714A (en) * 2018-12-07 2019-02-01 南京华苏科技有限公司 Realize that base station location is studied and judged and area covered knows method for distinguishing based on user data
CN112418445A (en) * 2020-11-09 2021-02-26 深圳市洪堡智慧餐饮科技有限公司 Intelligent site selection fusion method based on machine learning

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20120120887A1 (en) * 2010-11-12 2012-05-17 Battelle Energy Alliance, Llc Systems, apparatuses, and methods to support dynamic spectrum access in wireless networks


Non-Patent Citations (2)

Title
"Simulation of Base Station Coverage Based on Machine Learning"; Wang Wang; Computer & Telecommunication; 2019-05-31; vol. 2018, no. 11; full text *
"Automatic Site Identification and Hardware-to-Site Mapping for Base Station Self-configuration"; T. Bandh et al.; 2011 IEEE 73rd Vehicular Technology Conference (VTC Spring); 2011-07-18; full text *


Similar Documents

Publication Publication Date Title
CN106792514B (en) User position analysis method based on signaling data
CN110991690B (en) Multi-time wind speed prediction method based on deep convolutional neural network
CN112347970B (en) Remote sensing image ground object identification method based on graph convolution neural network
CN109151750B (en) LTE indoor positioning floor distinguishing method based on recurrent neural network model
CN108446616B (en) Road extraction method based on full convolution neural network ensemble learning
CN112135248B (en) WIFI fingerprint positioning method based on K-means optimal estimation
CN114092832A (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
US20220191818A1 (en) Method and Apparatus for Obtaining Emission Probability, Method and Apparatus for Obtaining Transition Probability, and Sequence Positioning Method and Apparatus
CN106681305A (en) Online fault diagnosing method for Fast RVM (relevance vector machine) sewage treatment
CN107027148A (en) A kind of Radio Map classification and orientation methods based on UE speed
CN113592132B (en) Rainfall objective forecasting method based on numerical weather forecast and artificial intelligence
CN112990382B (en) Base station co-site identification method based on big data
CN113297174B (en) Land utilization change simulation method based on deep learning
CN112004233B (en) Network planning method based on big data mining
CN113780345A (en) Small sample classification method and system facing small and medium-sized enterprises and based on tensor attention
CN111461192B (en) River channel water level flow relation determination method based on multi-hydrological station linkage learning
CN110290466A (en) Floor method of discrimination, device, equipment and computer storage medium
CN106993296A (en) The performance estimating method and device of terminal
CN111343664B (en) User positioning method, device, equipment and medium
CN108898157B (en) Classification method for radar chart representation of numerical data based on convolutional neural network
CN115567871A (en) WiFi fingerprint indoor floor identification and position estimation method
CN115292381A (en) Virtual currency mining behavior identification method based on extreme gradient lifting algorithm
CN112487724B (en) Urban dynamic expansion simulation method based on partition and improved CNN-CA model
CN114818849A (en) Convolution neural network based on big data information and anti-electricity-stealing method based on genetic algorithm
CN114331206A (en) Point location addressing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant