CN112990382A - Base station common-site identification method based on big data - Google Patents
Base station common-site identification method based on big data
- Publication number
- CN112990382A (application number CN202110509326.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- base station
- sample
- cell
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24323: Classification techniques; tree-organised classifiers
- H04W24/10: Supervisory, monitoring or testing arrangements; scheduling measurement reports
- Y02D30/70: Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a base station co-site identification method based on big data, in the field of base station co-site identification, comprising the following steps. S1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site label for the cell. The invention cleans the data measured for different networks' frequency-band signals in the wireless-environment measurement report (MR) and then classifies co-site stations with a machine learning method. This overcomes the influence of inaccurate site information in the resource management system, accurately identifies whether a base station is shared, provides strong support for operators to implement co-site sharing, and constitutes a scientific, effective and low-cost solution.
Description
Technical Field
The invention relates to the technical field of base station co-site identification, and in particular to a base station co-site identification method based on big data.
Background
Wireless-environment measurement report (MR) data in a mobile communication network accurately reflect network coverage and give operators a good tool for understanding the coverage of the wireless network; good network coverage is fundamental to an operator's survival. However, as mobile communication networks evolve, especially from 4G to 5G, the wavelength of the frequency bands used by the wireless network becomes shorter and shorter, multiplying the number of stations that must be built. By incomplete statistics, the number of existing 4G sites in China already exceeds 4 million, and the number of 5G sites will be more than 3 times that of 4G, directly increasing operators' total investment cost.
Co-site sharing is a good cost-optimization strategy. The three operators (China Mobile, China Telecom and China Unicom) have jointly established a tower company (China Tower), which builds base stations and leases them to the three operators, who pay according to usage. For historical reasons, the three operators also own a large number of their own sites, and these existing self-built sites cannot be well distinguished from shared sites, which in turn affects the apportioning of the tower company's site costs. When existing sites are distinguished, they are classified mainly by site longitude and latitude; however, the classification is inaccurate because the basic site information in the operators' resource management systems differs greatly from on-site information (mainly because the resource management system is not updated in time after a site is relocated).
Disclosure of Invention
To overcome the above deficiencies of the prior art, embodiments of the present invention provide a base station co-site identification method based on big data, which cleans the data measured for different networks' frequency-band signals in the wireless-environment measurement report (MR) and then classifies co-site stations with a machine learning method, so as to solve the problems described in the background.
In order to achieve the purpose, the invention provides the following technical scheme: a base station common-site identification method based on big data comprises the following steps:
S1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site label for the cell;
S2, data processing: process the MR data and the engineering parameter data into a new dataset; from it, select the MR sampling points whose local-cell RSRP value lies within a certain range, count each base station's MR sampling points by base station SiteId, and retain only base stations whose MR sampling-point count exceeds a set value;
S3, feature extraction: for each base station (by SiteId), calculate over all of its MR sampling points the mean, variance and coefficient of variation of local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and calculate the same statistics separately for each distinct local-cell TA value; these values form the feature data of each base station, the co-site label forms its label data, and the two together form the new dataset;
S4, algorithm modeling: split the feature-extracted data into a training set and a test set in a certain proportion, train models on the training set with classification algorithms (random forest, GBDT and XGBoost), and verify the trained models on the test set;
s5, model selection: respectively carrying out model training on training data by using random forest, GBDT and Xgboost algorithms, obtaining an optimal model by each algorithm through continuously adjusting parameters, and then verifying a test set by using the trained models;
S6, model application: after the final model is selected as above, save it; then collect new MR measurement report data and engineering parameter data, process the data, classify the base stations with the saved model, and output the identification results for all base stations.
Further, the step S2 includes the following sub-steps:
S21, match the MR data with the engineering parameters by cell CellId to obtain the base station to which each cell belongs and the cell's position coordinates (longitude and latitude), and delete matched records whose position coordinates are empty;
S22, for the data processed in step S21, calculate the distance from each MR sampling point to the base station from the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
S23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP value lies within a certain range, count each base station's MR sampling points by base station SiteId, and retain only base stations whose MR sampling-point count exceeds a set value.
Further, the algorithm of the random forest in the step S4 includes the following steps:
S411, draw K new bootstrap sample sets from the training set by random sampling with replacement, and build one classification tree from each of them; the samples not drawn in each round form K out-of-bag datasets;
S412, at each node of each tree, randomly select m < M variables, compute the information content of each, and then choose the variable with the strongest classification ability among the m for node splitting;
s413, generating all decision trees completely without pruning;
s414, the category of the terminal node is determined by the mode category corresponding to the node;
S415, a new observation is classified by all trees, and its class is decided by majority vote.
Further, the algorithm of GBDT in step S4 includes the following steps:
S421, initialize the estimated values of all samples for the K classes; F_k(x) forms a matrix, which can be initialized to all zeros or set randomly;
S422, repeat the following learning and updating process M times;
S423, apply a logistic (softmax) transformation to each sample's function estimates, converting them into the probability that the sample belongs to each class, through the following transformation formula:
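The transformation formula itself was rendered as an image in the original; in Friedman's multiclass gradient boosting, which this description follows, it is the standard softmax transform, reconstructed here:

```latex
p_k(x) = \frac{e^{F_k(x)}}{\sum_{l=1}^{K} e^{F_l(x)}}, \qquad k = 1, \dots, K .
```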
Initially, every class's estimated value is 0, so the probabilities of belonging to each class are all equal; as the estimates are updated below, the probabilities change accordingly;
S424, iterate over the probabilities of every class for all samples; the loop runs once per class, not once per sample;
S425, compute the probability gradient of each sample for the k-th class. A regression tree fits the gap between the predicted probability that samples belong to class k and whether they truly belong to it; learning follows the usual route of defining a cost function and differentiating it (gradient descent). The log-likelihood form of the cost function is:
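The formula was an image in the original; for one-hot labels y_k and the softmax probabilities p_k(x), the standard multiclass log-loss is:

```latex
L\left(\{y_k, F_k(x)\}_{k=1}^{K}\right) = -\sum_{k=1}^{K} y_k \log p_k(x) .
```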
Differentiating the cost function with respect to F_k gives:
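The derivative was likewise an image; the standard negative gradient, which is the residual fitted by the next regression tree, is:

```latex
-\frac{\partial L}{\partial F_k(x)} = y_k - p_k(x) .
```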
S426, learn a regression tree with J leaf nodes along the gradient direction. All samples are input, with each sample's probability residual on the k-th class as the update direction, and a regression tree with J leaves is fitted. The basic procedure is that of an ordinary regression tree: traverse the samples' feature dimensions and select one feature as a split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been grown;
s427, the gain of each leaf node is calculated, and the gain calculation formula of each node is as follows:
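The gain (leaf value) formula is absent from the text; in Friedman's multiclass GBDT, for leaf region R_{jkm} with residuals r_{ik} = y_{ik} - p_k(x_i), it is:

```latex
\gamma_{jkm} = \frac{K-1}{K} \cdot
\frac{\sum_{x_i \in R_{jkm}} r_{ik}}
     {\sum_{x_i \in R_{jkm}} \left| r_{ik} \right| \left(1 - \left| r_{ik} \right|\right)} .
```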
s428, updating the estimated values of all samples in class K, where the gain obtained in the previous step is calculated based on the gradient, and the estimated values of the samples can be updated by using the gain:
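The update formula was an image in the original; the standard form adds each leaf's gain to the samples falling in that leaf:

```latex
F_{k,m}(x) = F_{k,m-1}(x) + \sum_{j=1}^{J} \gamma_{jkm} \, \mathbf{1}\!\left(x \in R_{jkm}\right) .
```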
In the m-th iteration, for class k, the estimated values F of all samples are obtained from the estimates of iteration m-1 plus the gain of each of the J leaf nodes, applied to the samples that fall in that leaf. After M iterations of learning, the final estimate matrix of all samples over all classes is obtained, and multi-class classification can be performed from this estimate matrix.
Further, the algorithm of Xgboost in step S4 includes the following steps:
S431, define the complexity of the tree: first split the tree into a structure part q and a leaf-node weight part w, where w is a vector giving the output value of each leaf node;
A regularization term Ω(f_t) is introduced to control the complexity of the tree, thereby effectively controlling model overfitting;
S432, the boosting tree model in XGBoost: like GBDT, XGBoost's boosting model also fits residuals; the difference is that the loss used when selecting split nodes is not necessarily squared loss, and, compared with GBDT, a regularization term based on the tree model's complexity is added. The loss function is as follows:
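The loss function referred to here was an image; in the XGBoost formulation, the regularized objective at round t, with the complexity term Ω defined over the number of leaves T and the leaf weights w, is:

```latex
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 .
```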
S433, rewrite the objective function: XGBoost directly applies a second-order Taylor expansion to the loss function (this requires the loss to have continuous first and second derivatives), and we denote the leaf node regions as:
our objective function can be converted into:
Differentiating with respect to w_j and setting the derivative to 0 gives:
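The three missing formulas in S433 follow the standard XGBoost derivation. With g_i and h_i the first and second derivatives of the loss, I_j = {i : q(x_i) = j} the sample set of leaf j, and G_j = Σ_{i∈I_j} g_i, H_j = Σ_{i∈I_j} h_i:

```latex
\mathrm{Obj}^{(t)} \simeq \sum_{j=1}^{T}\left[G_j w_j + \tfrac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T,
\qquad
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\mathrm{Obj}^{*} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T .
```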
S434, the scoring function of a tree structure: the Obj value above represents how much the objective can be reduced at most for a given tree structure, and may be called a structure score; it can be viewed as a more general scoring function for tree structures, analogous to the Gini index. To find the tree structure with the smallest Obj score one could enumerate all possibilities and compare structure scores, but that is computationally prohibitive; more commonly a greedy method is used, which at each step tries to split an existing leaf node (the first leaf node being the root) and computes the gain obtained after the split as follows:
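The split-gain formula was an image; the standard XGBoost form, comparing the structure scores of the left child, the right child and the unsplit node, is:

```latex
\mathrm{Gain} = \frac{1}{2}\left[
\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
- \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma .
```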
Gain is used as the condition for deciding whether to split: if Gain < 0, the leaf node is not split. Every split still requires enumerating all candidate partitions; in practice, the gradient statistics g_i of all samples are first sorted so that, for each candidate partition, G_L and G_R can be obtained in a single scan of the samples, after which the split is chosen by the Gain score.
Further, the verification in step S5 calculates the precision, recall and F1 value of each model, with the following formulas:
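The formulas, rendered as images in the original, are the standard ones:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} .
```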
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
From the definitions of recall and precision it can be seen that, to some extent, improving one of them tends to reduce the other; the F1 value therefore provides a comprehensive measure of identification performance. The F1 values of the three models on the test set are compared, the model with the largest F1 value is selected as the final model, and its classification results are output.
The invention has the technical effects and advantages that:
compared with the prior art, the method and the device have the advantages that the data measured by the different network frequency band signals in the wireless environment measurement report MR are cleaned, and then the machine learning method is adopted to realize the classification of the common-site sites. The verification proves that the method successfully overcomes the influence of inaccurate station information in a resource management system, can accurately identify whether the base station is a shared base station, provides powerful support for landing of co-station sharing of operators, and is a scientific, effective and low-cost solution.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, a base station co-site identification method based on big data includes the following steps:
S1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data; the main index variables used are: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbor cell NCellId, neighbor cell frequency point, neighbor cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and a co-site label for the cell;
S2, data processing: process the MR data and the engineering parameter data into a new dataset; from it, select the MR sampling points whose local-cell RSRP value lies within a certain range, count each base station's MR sampling points by base station SiteId, and retain only base stations whose MR sampling-point count exceeds a set value;
step S2 includes the following substeps:
S21, match the MR data with the engineering parameters by cell CellId to obtain the base station to which each cell belongs and the cell's position coordinates (longitude and latitude), and delete matched records whose position coordinates are empty;
S22, for the data processed in step S21, calculate the distance from each MR sampling point to the base station from the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points far from the base station and sampling points whose distance is inconsistent with the TA, obtaining new data;
S23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP value lies within a certain range, count each base station's MR sampling points by base station SiteId, and retain only base stations whose MR sampling-point count exceeds a set value;
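The geometric filtering in steps S21 and S22 can be sketched as follows. This is a minimal illustration: the 5 km distance cap, the ~78.12 m LTE TA step and the 2-bin tolerance are assumed values, not parameters given in the patent.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def clean_mr_samples(samples, max_dist_m=5000.0, ta_bin_m=78.12, ta_tol_bins=2.0):
    """Keep MR sampling points whose user-to-cell distance is plausible
    and consistent with the reported timing advance (TA)."""
    kept = []
    for s in samples:
        if s["cell_lon"] is None or s["cell_lat"] is None:
            continue  # S21: drop records with empty position coordinates
        d = haversine_m(s["user_lon"], s["user_lat"], s["cell_lon"], s["cell_lat"])
        if d > max_dist_m:
            continue  # S22: drop sampling points far from the base station
        if abs(d / ta_bin_m - s["ta"]) > ta_tol_bins:
            continue  # S22: drop points whose distance contradicts the TA
        kept.append(s)
    return kept
```

The per-station counting of S23 is then a simple group-by on SiteId over the surviving points.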
S3, feature extraction: for each base station (by SiteId), calculate over all of its MR sampling points the mean, variance and coefficient of variation of local-cell and neighbor-cell RSRP and the correlation coefficient between local-cell RSRP and local-cell TA, and calculate the same statistics separately for each distinct local-cell TA value; these values form the feature data of each base station, the co-site label forms its label data, and the two together form the new dataset;
s4, algorithm modeling: dividing the data with the characteristics extracted into a training set and a test set according to a certain proportion, performing model training on the training set by using a classification algorithm (random forest, GBDT and Xgboost), and verifying the test set by using the trained model;
the algorithm of the random forest comprises the following steps:
S411, draw K new bootstrap sample sets from the training set by random sampling with replacement, and build one classification tree from each of them; the samples not drawn in each round form K out-of-bag datasets;
S412, at each node of each tree, randomly select m < M variables, compute the information content of each, and then choose the variable with the strongest classification ability among the m for node splitting;
s413, generating all decision trees completely without pruning;
s414, the category of the terminal node is determined by the mode category corresponding to the node;
S415, a new observation is classified by all trees, and its class is decided by majority vote;
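The bootstrap-and-vote skeleton of S411 and S415 can be sketched as follows; growing the trees themselves (S412 to S414) is omitted, and each tree is stood in for by an arbitrary callable.

```python
import random
from collections import Counter

def bootstrap_sets(train, k, seed=0):
    """S411: draw K bootstrap sample sets with replacement; the samples not
    drawn for a given tree form its out-of-bag (OOB) set."""
    rng = random.Random(seed)
    n = len(train)
    sets = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = [train[i] for i in idx]
        oob = [train[i] for i in sorted(set(range(n)) - set(idx))]
        sets.append((in_bag, oob))
    return sets

def forest_predict(trees, x):
    """S415: every tree votes, and the majority class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```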
the algorithm of GBDT includes the following steps:
S421, initialize the estimated values of all samples for the K classes; F_k(x) forms a matrix, which can be initialized to all zeros or set randomly;
S422, repeat the following learning and updating process M times;
S423, apply a logistic (softmax) transformation to each sample's function estimates, converting them into the probability that the sample belongs to each class;
Initially, every class's estimated value is 0, so the probabilities of belonging to each class are all equal; as the estimates are updated below, the probabilities change accordingly;
S424, iterate over the probabilities of every class for all samples; the loop runs once per class, not once per sample;
S425, compute the probability gradient of each sample for the k-th class. A regression tree fits the gap between the predicted probability that samples belong to class k and whether they truly belong to it; learning follows the usual route of defining a log-likelihood cost function and differentiating it (gradient descent);
S426, learn a regression tree with J leaf nodes along the gradient direction. All samples are input, with each sample's probability residual on the k-th class as the update direction, and a regression tree with J leaves is fitted. The basic procedure is that of an ordinary regression tree: traverse the samples' feature dimensions and select one feature as a split point under the minimum mean-squared-error criterion; learning stops once J leaf nodes have been grown;
S427, calculate the gain of each leaf node from the residuals that fall in it;
S428, update the estimated values of all samples in class k using the gains obtained in the previous step, which are gradient-based. In the m-th iteration, for class k, the estimated values F of all samples are obtained from the estimates of iteration m-1 plus the gain of each of the J leaf nodes, applied to the samples that fall in that leaf. After M iterations of learning, the final estimate matrix of all samples over all classes is obtained, and multi-class classification can be performed from this estimate matrix;
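The softmax transform of S423 and the residual (negative gradient) of S425 can be sketched directly; labels are assumed to be one-hot vectors.

```python
import math

def softmax_probs(f_scores):
    """S423: logistic (softmax) transform of a sample's per-class scores F_k(x)."""
    mx = max(f_scores)                       # subtract max for numerical stability
    e = [math.exp(f - mx) for f in f_scores]
    s = sum(e)
    return [v / s for v in e]

def class_residuals(y_onehot, probs):
    """S425: negative gradient of the log-loss, y_k - p_k, which is the
    fitting target for the next regression tree."""
    return [y - p for y, p in zip(y_onehot, probs)]
```

With all scores initialized to zero, every class probability is 1/K, exactly as S423 describes.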
the algorithm of Xgboost comprises the following steps:
S431, define the complexity of the tree: first split the tree into a structure part q and a leaf-node weight part w, where w is a vector giving the output value of each leaf node;
A regularization term Ω(f_t) is introduced to control the complexity of the tree, thereby effectively controlling model overfitting;
S432, the boosting tree model in XGBoost: like GBDT, XGBoost's boosting model also fits residuals; the difference is that the loss used when selecting split nodes is not necessarily squared loss, and, compared with GBDT, a regularization term based on the tree model's complexity is added to the loss function;
S433, rewrite the objective function: XGBoost directly applies a second-order Taylor expansion to the loss function (this requires the loss to have continuous first and second derivatives). Grouping samples by the leaf node they fall in turns the objective into a quadratic in each leaf weight w_j; differentiating with respect to w_j and setting the derivative to zero yields the optimal leaf weights;
S434, the scoring function of a tree structure: the Obj value above represents how much the objective can be reduced at most for a given tree structure, and may be called a structure score; it can be viewed as a more general scoring function for tree structures, analogous to the Gini index. To find the tree structure with the smallest Obj score one could enumerate all possibilities and compare structure scores, but that is computationally prohibitive; more commonly a greedy method is used, which at each step tries to split an existing leaf node (the first leaf node being the root) and computes the gain obtained after the split;
Gain is used as the condition for deciding whether to split: if Gain < 0, the leaf node is not split. Every split still requires enumerating all candidate partitions; in practice, the gradient statistics g_i of all samples are first sorted so that, for each candidate partition, G_L and G_R can be obtained in a single scan of the samples, after which the split is chosen by the Gain score;
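The leaf-weight and split-gain arithmetic of S433 and S434 reduces to a few lines once a node's gradient sum G and Hessian sum H are known; lam and gamma below are the regularization constants λ and γ.

```python
def leaf_weight(g_sum, h_sum, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

def split_gain(gl, hl, gr, hr, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left (G_L, H_L) and right (G_R, H_R);
    gamma penalises the extra leaf, so a split with Gain < 0 is rejected."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma
```

A perfectly separating split (gradients of opposite sign on the two sides) produces a large positive gain, while splitting a homogeneous node yields a gain of at most -gamma.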
s5, model selection: respectively carrying out model training on training data by using random forest, GBDT and Xgboost algorithms, obtaining an optimal model by each algorithm through continuously adjusting parameters, and then verifying a test set by using the trained models;
the verification in step S5 is to calculate the accuracy, recall, and F1 value of each model, and the calculation formula is as follows:
wherein, TP is the number of positive classes judged as positive classes, FP is the number of negative classes judged as positive classes, FN is the number of positive classes judged as negative classes;
From the definitions of recall and precision it can be seen that, to some extent, improving one of them tends to reduce the other; the F1 value therefore provides a comprehensive measure of identification performance. The F1 values of the three models on the test set are compared, the model with the largest F1 value is selected as the final model, and its classification results are output;
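Model selection by F1 value, as described, can be sketched as follows; the model names and confusion counts are illustrative only.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def select_best_model(results):
    """results maps model name -> (TP, FP, FN) on the test set; the model
    with the largest F1 value is chosen as the final model."""
    return max(results, key=lambda name: f1_score(*results[name]))
```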
S6, model application: after the final model is selected as above, save it; then collect new MR measurement report data and engineering parameter data, process the data, classify the base stations with the saved model, and output the identification results for all base stations.
Finally, it should be noted that: first, in the description of this application, unless otherwise specified and limited, the terms "mounted", "connected" and "coupled" are to be understood broadly: they may denote a mechanical connection, an electrical connection or communication between two elements, and may be direct; "upper", "lower", "left" and "right" indicate only relative positional relationships, which may change when the absolute position of the described object changes;
second, in the drawings of the disclosed embodiments of the invention, only the structures related to the disclosed embodiments are shown; other structures may follow common designs, and in the absence of conflict the same embodiment and different embodiments of the invention may be combined with each other;
and finally, the above description covers only preferred embodiments of the present invention and is not to be construed as limiting it; any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to fall within its scope.
Claims (6)
1. A base station co-site identification method based on big data, characterized by comprising the following steps:
s1, data collection: collect multi-day wireless measurement report (MR) data and engineering parameter data, the main index variables used being: time, base station SiteId, local cell CellId, local cell TA, local cell RSRP, local cell frequency point, neighbour cell NCellId, neighbour cell frequency point, neighbour cell RSRP, user longitude, user latitude, local cell longitude, local cell latitude, and the co-site flag of the local cell;
s2, data processing: process the MR data and engineering parameter data to obtain new data; from the new data select the MR sampling points whose local-cell RSRP values lie within a set range, count the number of MR sampling points of each base station by base station SiteId, and retain only the base stations whose MR sampling-point count exceeds a set value;
s3, feature extraction: for each base station (SiteId dimension), compute over all its MR sampling points the RSRP mean, variance and coefficient of variation of the local cell and of the neighbour cells, and the correlation coefficient between local-cell RSRP and local-cell TA; compute the same statistics separately for the sampling points of each distinct local-cell TA value. These values form the feature data of each base station; the co-site flag is the label data of each base station; together the two kinds of data form the new data set;
s4, algorithm modeling: divide the feature-extracted data into a training set and a test set in a set proportion, train models on the training set with the classification algorithms (random forest, GBDT and Xgboost), and verify the test set with the trained models;
s5, model selection: train models on the training data with the random forest, GBDT and Xgboost algorithms respectively; each algorithm obtains its optimal model through continuous parameter tuning, and the trained models are then verified on the test set;
s6, model application: after the final model is selected as above, it is saved; MR measurement report data and engineering parameter data are then collected and processed, the base stations are classified with the saved model, and the identification results of all base stations are output.
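As a rough sketch of the S3 feature extraction: per base station, the RSRP mean, variance, coefficient of variation and the RSRP-TA correlation are computed over its MR sampling points. Field names (`site_id`, `rsrp`, `ta`) and the plain-Python statistics are assumptions, not the patent's implementation:

```python
import math
from collections import defaultdict

def stats(xs):
    """Mean, population variance and coefficient of variation of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cv = math.sqrt(var) / mean if mean else float("nan")
    return mean, var, cv

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def extract_features(mr_points):
    """mr_points: list of dicts with keys site_id, rsrp, ta (assumed schema)."""
    by_site = defaultdict(list)
    for p in mr_points:
        by_site[p["site_id"]].append(p)
    feats = {}
    for site, pts in by_site.items():
        rsrp = [p["rsrp"] for p in pts]
        ta = [p["ta"] for p in pts]
        mean, var, cv = stats(rsrp)
        feats[site] = {"rsrp_mean": mean, "rsrp_var": var,
                       "rsrp_cv": cv, "rsrp_ta_corr": pearson(rsrp, ta)}
    return feats
```

The per-TA statistics mentioned in S3 would follow the same pattern with a second grouping key.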
2. The big-data-based base station co-site identification method according to claim 1, wherein step S2 comprises the following sub-steps:
s21, match the MR data with the engineering parameters by cell CellId to obtain the base station to which each cell belongs and the position coordinates (longitude and latitude) of the cell, and delete records whose position coordinates are empty;
s22, from the data processed in step S21, calculate the distance from each MR sampling point to the base station using the user's position coordinates (longitude and latitude) and the cell's position coordinates (longitude and latitude); delete MR sampling points that are too far from the base station and sampling points whose distance does not match the TA, obtaining new data;
s23, from the data obtained in step S22, select the MR sampling points whose local-cell RSRP values lie within a set range, count the number of MR sampling points of each base station by base station SiteId, and retain only the base stations whose MR sampling-point count exceeds a set value.
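The distance and TA-consistency filter of S22 might be sketched as follows; the haversine formula, the LTE TA step of roughly 78.12 m, and the two-step tolerance are assumptions, since the claim does not give its exact distance rule:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (longitude, latitude) points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def ta_consistent(distance_m, ta, step_m=78.12, slack_steps=2):
    """Keep a sampling point only if its distance falls near the ring implied by TA.
    step_m and slack_steps are illustrative values, not from the patent."""
    return abs(distance_m - ta * step_m) <= slack_steps * step_m
```

A sampling point would be retained only when `ta_consistent(haversine_m(...), ta)` holds and the distance is below the far-point cutoff.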
3. The big data-based base station co-site identification method according to claim 1, wherein the algorithm of the random forest in the step S4 includes the following steps:
s411, apply the bootstrap method to randomly draw, with replacement, K new bootstrap sample sets from the training set and construct K classification trees from them; the samples not drawn each time form the K out-of-bag data sets;
s412, at each node of each tree, randomly select m < M variables, compute the information content of each variable, and choose the variable with the strongest classification ability among the m for node splitting;
s413, grow all decision trees fully, without pruning;
s414, determine the class of each terminal node as the modal class of the samples at that node;
s415, classify each new observation with all the trees and decide its class by majority rule.
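Steps S411-S415 can be sketched as below, with a depth-1 stump standing in for a fully grown unpruned tree and a single random feature per node playing the role of the m < M variable subset (both are simplifications, not the patent's implementation):

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """S411: sample the training set with replacement (same size as the original)."""
    return [rng.choice(data) for _ in data]

def train_stump(sample, n_features, rng):
    """S412-S414: pick a random feature, split at its median value, and label
    each side with the modal class (the information measure is omitted)."""
    f = rng.randrange(n_features)
    xs = sorted(sample, key=lambda d: d[0][f])
    thr = xs[len(xs) // 2][0][f]
    left = Counter(y for x, y in sample if x[f] <= thr)
    right = Counter(y for x, y in sample if x[f] > thr)
    lmaj = left.most_common(1)[0][0]
    rmaj = right.most_common(1)[0][0] if right else lmaj
    return lambda x, f=f, thr=thr, l=lmaj, r=rmaj: l if x[f] <= thr else r

def random_forest(data, k=25, seed=0):
    """Train k learners on k bootstrap sets; S415: classify by majority rule."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    trees = [train_stump(bootstrap(data, rng), n_features, rng) for _ in range(k)]
    def predict(x):
        return Counter(t(x) for t in trees).most_common(1)[0][0]
    return predict
```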
4. The big-data-based base station co-site identification method according to claim 1, wherein the GBDT algorithm in step S4 comprises the following steps:
s421, initialize the estimated values F_k(X) of all samples on the K classes; F_k(X) is a matrix that can be initialized to all zeros or set randomly;
s422, repeat the following learning and updating process M times;
s423, perform a Logistic (softmax) transformation on the function estimates of the samples, converting each sample's estimates into the probability of the sample belonging to each class:

P_k(X) = exp(F_k(X)) / Σ_{l=1}^{K} exp(F_l(X))

P_k(X) denotes the probability that sample X belongs to class k, F_k(X) the matrix of estimated values of sample X on class k, and F_l(X) the estimate of sample X on class l;
initially the estimated value of every class is 0, so the probabilities of belonging to each class are equal; as the estimates are updated below, the probabilities change accordingly;
s424, traverse the probability of every class for all samples; note that the traversal is over the classes, not over the samples;
s425, compute the probability gradient of each sample on the k-th class; the gap between the probability that a sample belongs to class k and whether it truly belongs to class k is learned through a regression tree, by the usual gradient-descent route of building a cost function and differentiating it; the cost function takes the log-likelihood form:

L = − Σ_{k=1}^{K} y_k log P_k(X)

where P_k(X) is as above and y_k indicates whether the sample truly belongs to class k;
differentiating the cost function gives the residual:

ỹ_ik = y_ik − P_{k,m−1}(x_i)

where P_{k,m−1}(x_i) denotes the probability, at iteration m−1, that sample x_i belongs to class k, and P_{k,m−1}(X) the corresponding probabilities of all samples; the other symbols are consistent with the representation above;
s426, learn a regression tree with J leaf nodes along the gradient: all samples are input, and each sample's residual of the probability on the k-th class is used as the update direction; the basic learning process is the same as for an ordinary regression tree: traverse the feature dimensions of the samples, select a feature as the split point under the minimum mean-square-error criterion, and stop once J leaf nodes have been learned;
s427, compute the gain of each leaf node:

γ_jkm = ((K−1)/K) · ( Σ_{x_i ∈ R_jkm} ỹ_ik ) / ( Σ_{x_i ∈ R_jkm} |ỹ_ik| (1 − |ỹ_ik|) )

where ỹ_ik is the residual of the probability of sample i on class k and R_jkm is the set of samples on leaf j of the class-k tree at iteration m;
s428, update the estimated values of all samples on class k; the gain obtained in the previous step is based on the gradient, and the estimates are updated with it:

F_km(X) = F_{k,m−1}(X) + Σ_{j=1}^{J} γ_jkm · 1(X ∈ R_jkm)

F_km(X) denotes the estimated values of the samples on class k at iteration m, F_{k,m−1}(X) those at iteration m−1; J denotes the number of leaf nodes, and γ_jkm the gain of leaf node j on class k at iteration m, consistent with the expression above;
on class k of the m-th iteration, the estimated values F of all samples are thus obtained by adding the gains of the J leaf nodes, applied through the indicator vector, to the estimates of iteration m−1; after the M iterations of learning, the final estimate matrix of all samples on all classes is obtained, and multi-class classification is carried out on this estimate matrix.
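Two of the per-iteration quantities above, the Logistic (softmax) transformation of S423 and the probability residual of S425, can be sketched as:

```python
import math

def softmax(scores):
    """S423: turn per-class scores F_k(X) into probabilities P_k(X)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def residuals(F, Y):
    """S425: gradient residuals y_ik - P_k(x_i) that the next tree is fit to.
    F: per-sample score lists; Y: per-sample one-hot labels."""
    return [[y - p for y, p in zip(ys, softmax(fs))] for fs, ys in zip(F, Y)]
```

With all scores initialized to 0, every class probability is 1/K, matching the remark after S423.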
5. The big data based base station co-site identification method according to claim 1, wherein the algorithm of Xgboost in step S4 comprises the following steps:
s431, define the complexity of the tree: split the tree into a structure part q and a leaf-node weight part w, where w is a vector giving the output value of each leaf node:

f_t(x) = w_q(x), w ∈ R^T

w: the leaf weight vector; q: the structure of the tree; x: the input; f_t(x): the tree model for input x; T: the number of leaves;
a regularization term Ω(f_t) is introduced to control the complexity of the tree and thereby effectively control overfitting of the model:

Ω(f_t) = γT + (1/2) λ Σ_{j=1}^{T} w_j²

γ: a hyperparameter (weight coefficient); T: the number of leaf nodes; λ: a hyperparameter (weight coefficient); w_j²: the square of the output score on leaf node j;
s432, the boosting tree model in XGboost: like GBDT, the boosting model of XGboost also fits residuals; the difference is that the loss used when selecting split nodes is not necessarily the minimum squared loss, and, compared with GBDT, a regularization term on the complexity of the tree model is added:

ζ = Σ_i l(ŷ_i, y_i) + Σ_{k=1}^{K} Ω(f_k)

ζ: the loss function of the model; i: the i-th sample; ŷ_i: the estimated value of the i-th sample; y_i: the true value of the i-th sample; l(ŷ_i, y_i): the per-sample loss, which is 0 when ŷ_i = y_i and positive otherwise; K: the number of trees; Ω(f_k): the complexity of the k-th tree, as in the formula above;
s433, rewrite the objective function: in XGboost the loss function is expanded directly by a second-order Taylor expansion (provided the loss function has continuous first and second derivatives), and the sample set of each leaf node is defined as:

I_j = { i | q(x_i) = j }

I_j: the set of samples on leaf j; i: the i-th sample; x_i: the input of the i-th sample; j: the j-th leaf node; q(x_i): the structure function of x_i, mapping it to a leaf;
the objective function can then be converted into:

Obj^(t) = Σ_{j=1}^{T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT, where G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i

t: the t-th tree; T: the number of leaf nodes; j: the j-th leaf node; I_j: consistent with the expression above; i: the i-th sample; g_i, h_i: the first- and second-order gradients of the loss at sample i; w_j: the output score on leaf node j; w_j²: its square; λ, γ: weight coefficients;
differentiating with respect to w_j and setting the derivative to 0 gives:

w_j* = − G_j / (H_j + λ), Obj* = − (1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT

with all symbols consistent with the statements above;
s434, scoring function of the tree structure: the Obj value above represents the largest possible reduction of the objective once a tree structure is specified, so it can be called a structure score; like the Gini index, it is a more general function for scoring tree structures. To find the tree structure with the smallest Obj one could enumerate all possible structures and compare their scores, but this is computationally prohibitive; the usual approach is greedy: each time an existing leaf node is tentatively split (the first leaf node is the root node), and the gain after the split is:

Gain = (1/2) [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ

Gain is used as the criterion for whether to split: if Gain < 0, the leaf node is not split; each split still requires enumerating all candidate partitions, so in practice the gradient statistics g_i of all samples are first sorted by feature value, G_L and G_R for every candidate partition are accumulated in a single scan, and the split is made according to the Gain score.
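The closed-form leaf weight and structure score of S433 can be sketched as a minimal illustration, with the per-leaf sums G_j, H_j supplied by the caller:

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight w_j* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def structure_score(leaf_stats, lam=1.0, gamma=0.0):
    """Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    for leaf_stats given as a list of (G_j, H_j) pairs."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * len(leaf_stats)
```

Comparing `structure_score` before and after a candidate split reproduces the Gain formula of S434.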
6. The big-data-based base station co-site identification method according to claim 1, wherein the verification in step S5 computes the precision, recall and F1 value of each model:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP is the number of positive samples judged positive, FP is the number of negative samples judged positive, and FN is the number of positive samples judged negative;
by the definitions of recall and precision, raising one of the two beyond a certain point is likely to lower the other, so the F1 value gives a combined measure of identification performance; the F1 values of the three models on the test set are compared, the model with the largest F1 value is taken as the final model, and its classification results are output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110509326.7A CN112990382B (en) | 2021-05-11 | 2021-05-11 | Base station co-site identification method based on big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110509326.7A CN112990382B (en) | 2021-05-11 | 2021-05-11 | Base station co-site identification method based on big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112990382A true CN112990382A (en) | 2021-06-18 |
CN112990382B CN112990382B (en) | 2023-11-21 |
Family
ID=76337493
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110509326.7A Active CN112990382B (en) | 2021-05-11 | 2021-05-11 | Base station co-site identification method based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112990382B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114401527A (en) * | 2021-12-21 | 2022-04-26 | 中国电信股份有限公司 | Load identification method and device of wireless network and storage medium |
CN118301658A (en) * | 2024-06-05 | 2024-07-05 | 亚信科技(中国)有限公司 | Common site detection method, apparatus, device, storage medium and program product |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120120887A1 (en) * | 2010-11-12 | 2012-05-17 | Battelle Energy Alliance, Llc | Systems, apparatuses, and methods to support dynamic spectrum access in wireless networks |
CN103907368A (en) * | 2011-12-27 | 2014-07-02 | 松下电器产业株式会社 | Server device, base station device, and identification number establishment method |
CN106131953A (en) * | 2016-07-07 | 2016-11-16 | 上海奕行信息科技有限公司 | A kind of method realizing mobile subscriber location based on frequency weighting in community in the period |
CN109302714A (en) * | 2018-12-07 | 2019-02-01 | 南京华苏科技有限公司 | Realize that base station location is studied and judged and area covered knows method for distinguishing based on user data |
CN112418445A (en) * | 2020-11-09 | 2021-02-26 | 深圳市洪堡智慧餐饮科技有限公司 | Intelligent site selection fusion method based on machine learning |
Non-Patent Citations (2)
Title |
---|
T. BANDH ET AL.: "Automatic Site Identification and Hardware-to-Site Mapping for Base Station Self-configuration", 2011 IEEE 73RD VEHICULAR TECHNOLOGY CONFERENCE (VTC SPRING), 18 July 2011 (2011-07-18) |
WANG WANG: "Simulation of Base Station Coverage Based on Machine Learning", Computer & Telecommunication, vol. 2018, no. 11, 31 May 2019 (2019-05-31) |
Also Published As
Publication number | Publication date |
---|---|
CN112990382B (en) | 2023-11-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||