CN116156538A

CN116156538A - Quality difference cell root cause positioning method based on SMOTE-ReliefF-XGBoost algorithm

Info

Publication number: CN116156538A
Application number: CN202310153935.2A
Authority: CN
Inventors: 王宁; 崔天姿; 申凌峰; 陈鑫; 史文祥; 陈任翔
Original assignee: Zhengzhou University
Current assignee: Zhengzhou University
Priority date: 2023-02-23
Filing date: 2023-02-23
Publication date: 2023-05-23

Abstract

The invention provides a method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm, comprising the following steps: establishing a cell base station information matrix and labeling the root cause of each base station's poor quality according to historical experience; applying a data preprocessing method suited to the characteristics of each type of base station information; balancing the classes of data with the SMOTE oversampling method; because the data features are redundant, first calculating a weight for each feature with the ReliefF algorithm and then selecting the important features by the Sequential Forward Selection method; and finally feeding the processed data into 4 independent XGBoost classification models for training, with a genetic algorithm used for parameter tuning. By adopting the SMOTE oversampling technique, the invention effectively avoids the problem of over-fitting; by adopting the ReliefF feature selection algorithm, it eliminates redundant features, saves computer memory, and improves model performance; and by adopting a genetic algorithm to optimize the XGBoost model, it reduces the model's training time.

Description

Method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm

Technical field

The invention belongs to the technical field of communication, and in particular relates to a method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm.

Background art

With the continuous and rapid development of mobile communication technology, wireless networks have become an indispensable part of life and work. Mobile communication networks keep growing in scale, their structure is becoming more and more complex, and network problems of all kinds arise as a result.

A wireless network transmits information through base stations. Each base station holds data about its local network, mainly including feature information such as bandwidth, frequency band, macro station power, air-interface uplink/downlink service traffic, busy-hour average uplink/downlink PRB utilization, and dual-stream ratio.

Poor wireless network quality at a base station is affected by many factors, such as the number of users and the base station power. To optimize a base station's wireless network, the root cause of its poor quality must be known first, so that the right remedy can be applied. Analyzing the cause of poor quality from the data of poor-quality base stations is a problem that wireless network optimization must face. If poor-quality base stations are not optimized in time, user experience suffers and users may even be lost.

At present, the root cause of poor base station wireless network quality is mainly analyzed manually, which is time-consuming and laborious. Few algorithms automate this root cause analysis. Existing algorithms select features mainly by subjective judgment and so do not make full use of the available feature information, and current automated root cause location algorithms are too simple to locate the root cause accurately.

Therefore, in view of the above technical problems, a method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm is proposed to achieve automated and reliable location of the root cause of poor quality.

Summary of the invention

The purpose of the present invention is to address the deficiencies of the prior art and improve the reliability of intelligent root cause location by providing a method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm.

To this end, the present invention is implemented through the following main steps:

S1: Establish a cell base station information matrix from information such as each cell base station's bandwidth, scenario, average number of users, and total traffic, and mark the cause of each base station's poor quality according to its status information. The causes of poor quality are coverage problems, interference problems, capacity problems, and antenna feeder problems, corresponding to labels 1 to 4 respectively;

S2: Label encoding: mark the specific sub-problem under each of the four labels in S1. Coverage problems are divided into weak coverage, over-coverage, and overlapping coverage, corresponding to labels 0-2; interference problems are divided into external interference and intra-system interference, corresponding to labels 0-1; capacity problems are divided into high load with license restriction or unbalanced traffic load between sectors, and high load due to other causes, corresponding to labels 0-1; antenna feeder problems are divided into dual-stream ratio problems and antenna feeder disconnection problems, corresponding to labels 0-1. In addition, for some base stations the cause of poor quality does not belong to a given problem; these form the blank set, whose code immediately follows the codes of the sub-problems under each problem. For example, the blank set under the coverage problem is coded 3;

S3: Data preprocessing: process each kind of cell base station information according to its characteristics. Discrete information such as frequency band, region, bandwidth, and the number of RRC connection establishment failures caused by resource allocation failure is one-hot encoded; information with large values, such as the weekly average downlink traffic in the cell's own busy hour (GByte) and the air-interface uplink service traffic (GByte), is standardized with the Z-Score method;

S4: Data balancing: use the SMOTE method to balance the sub-problem data under each root-cause problem, so that the sub-problem labels under each problem label correspond to nearly the same amount of base station data;

S5: Feature selection: for the balanced data under each label problem from S4, calculate the importance score of each feature with the ReliefF algorithm, then select the important features for each label problem by the Sequential Forward Selection method and eliminate irrelevant features;

S6: Model training: for each label problem, the matrix formed by the features selected in S5 and the corresponding sub-problem labels is fed into one of 4 XGBoost classification models, which are trained independently; a genetic optimization algorithm tunes the hyperparameters to those that minimize the loss function (the cross-entropy function), and model performance is tested by cross-validation;

S7: Root cause location: apply the best hyperparameters found in S6 to the models, then input the cell base station information matrix into the models to obtain the root cause of each base station's poor quality, so that the base station layout and related configuration can be optimized according to that root cause.
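The two-level labelling of S1 and S2 can be pictured as a small lookup table. The sketch below is only an illustration: the names `ROOT_CAUSES`, `SUB_LABELS`, `blank_code`, and `encode` are hypothetical, and the English sub-problem names are translations rather than terms fixed by the patent.

```python
# Hypothetical lookup tables for the labels described in S1 and S2.
ROOT_CAUSES = {1: "coverage", 2: "interference", 3: "capacity", 4: "antenna feeder"}

# Sub-problem codes are the list positions (0, 1, 2, ...).
SUB_LABELS = {
    "coverage": ["weak coverage", "over-coverage", "overlapping coverage"],
    "interference": ["external interference", "intra-system interference"],
    "capacity": ["high load: license limit or sector imbalance", "high load: other causes"],
    "antenna feeder": ["dual-stream ratio problem", "antenna feeder disconnection"],
}


def blank_code(problem):
    """Code of the blank set: the next integer after the sub-problem codes."""
    return len(SUB_LABELS[problem])


def encode(problem, sub_problem=None):
    """Return the integer sub-label; None means the station is in the blank set."""
    if sub_problem is None:
        return blank_code(problem)
    return SUB_LABELS[problem].index(sub_problem)
```

With this layout each of the four problems yields its own small multi-class target, which is what allows one classifier per problem in S6.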

Steps S1-S3 can be specifically described as follows.

Assume that the cell base station information is represented by $F = \{X_1, X_2, \ldots, X_n\}^T = \{F_1, F_2, \ldots, F_m\}$, where $X_k$ ($1 \le k \le n$) is the record of the $k$-th base station and $F_i$ ($1 \le i \le m$) is the $i$-th feature.

The cell base station labels are represented by $L = \{L^1, L^2, L^3, L^4\}$, where $L^j$ ($1 \le j \le 4$) holds the labels for the $j$-th root-cause problem.

Features in the cell base station information $F$ that take large values are standardized with the Z-Score formula

$$F_i' = \frac{F_i - \mu_i}{\sigma_i},$$

where $\mu_i$ is the mean of feature $F_i$ and $\sigma_i$ is its standard deviation.

Discrete-valued features in $F$ are one-hot encoded. Finally, the data are divided into a training set and a test set.
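The preprocessing of S3 can be sketched in a few lines of standard-library Python. This is a minimal illustration under assumptions the patent does not state: the population standard deviation is used, the test ratio is 0.2, and the sample values and band names are invented.

```python
import math
import random


def z_score(column):
    """Standardize one numeric feature: (x - mean) / std (population std)."""
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / len(column))
    if sigma == 0:
        return [0.0 for _ in column]  # a constant feature carries no information
    return [(x - mu) / sigma for x in column]


def one_hot(column):
    """One-hot encode a discrete feature; categories are sorted for a stable order."""
    categories = sorted(set(column))
    return [[1 if value == c else 0 for c in categories] for value in column]


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle the row indices and split the rows into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_ratio))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


traffic_gbyte = [2.0, 4.0, 6.0]   # made-up traffic values
band = ["F", "D", "F"]            # made-up band names
scaled = z_score(traffic_gbyte)
encoded = one_hot(band)
```

After standardization the traffic feature has zero mean and unit variance, so it no longer dominates distance calculations in the later SMOTE and ReliefF steps.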

Step S4 can be specifically described as follows.

First, a minority-class sample $X_t$ ($1 \le t \le n$) is randomly selected from the training set. Its distance to each remaining sample $X_s$ is then calculated with the Euclidean distance formula

$$d(X_t, X_s) = \sqrt{\sum_{i=1}^{m} (x_{t,i} - x_{s,i})^2},$$

where $x_{t,i}$ denotes the value of feature $F_i$ for sample $X_t$, and one of the $k$ ($k = 3$) nearest-neighbor samples is selected at random.

Finally, a point is randomly selected on the line segment between the minority sample $X_t$ and the randomly selected neighbor and is taken as the new synthetic sample. Sampling is repeated until the numbers of samples in all classes are balanced.
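The SMOTE procedure of S4 can be sketched as follows. This is an illustrative standard-library version, not the patent's implementation: the function names and the seed are hypothetical, and neighbors are searched among the other minority-class samples, as in standard SMOTE, which is an assumption about what "remaining samples" means here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [s for s in minority if s is not base]
        neighbours = sorted(others, key=lambda s: euclidean(base, s))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Calling `smote(minority, n_new)` with `n_new` equal to the shortfall of a sub-problem class relative to the largest class brings that class up to the same size.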

In S5, the ReliefF algorithm first randomly selects a sample $R$ from the training set, calculates the Euclidean distance between $R$ and the samples of its own class and of the other classes, and selects its $k$ nearest neighbors from the same class and from each different class.

The weight of each feature is calculated iteratively. In the standard ReliefF form, which matches the symbols defined here, the update is

$$W(F_i) \leftarrow W(F_i) - \sum_{j=1}^{k} \frac{\mathrm{diff}(F_i, R, H_j)}{m k} + \sum_{C \notin \mathrm{class}(R)} \frac{P(C)}{1 - P(\mathrm{class}(R))} \sum_{j=1}^{k} \frac{\mathrm{diff}(F_i, R, M_j(C))}{m k},$$

where $\mathrm{diff}(F_i, R, H_j)$ is the difference between samples $R$ and $H_j$ on feature $F_i$, $H_j$ is the $j$-th nearest-neighbor sample in the same class $C \in \mathrm{class}(R)$, $M_j(C)$ is the $j$-th nearest-neighbor sample in a class $C \notin \mathrm{class}(R)$, $P(C)$ is the proportion of samples in class $C$ (a symbol of the standard form), $m$ is the number of iterations, and $k$ is the number of nearest-neighbor samples selected.

$\mathrm{diff}(A, R_1, R_2)$ is calculated as follows, again in the standard ReliefF form:

$$\mathrm{diff}(A, R_1, R_2) = \begin{cases} \dfrac{|R_1[A] - R_2[A]|}{\max(A) - \min(A)}, & A \text{ continuous}, \\ 0, & A \text{ discrete and } R_1[A] = R_2[A], \\ 1, & A \text{ discrete and } R_1[A] \ne R_2[A]. \end{cases}$$
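The ReliefF weighting can be sketched as below. This is a hedged illustration of the standard ReliefF update for numeric features only, not the patent's code: the function name `relieff`, the defaults `n_iter=20` and `k=3`, and the class-prior weighting of the misses are assumptions taken from the textbook algorithm.

```python
import random


def relieff(X, y, n_iter=20, k=3, seed=0):
    """Return one ReliefF weight per feature for numeric data X and labels y."""
    rng = random.Random(seed)
    n_features = len(X[0])
    spans = []
    for i in range(n_features):
        column = [row[i] for row in X]
        spans.append((max(column) - min(column)) or 1.0)  # avoid dividing by zero

    def diff(i, a, b):
        # Range-normalized difference of two samples on feature i
        return abs(a[i] - b[i]) / spans[i]

    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    classes = sorted(set(y))
    prior = {c: y.count(c) / len(y) for c in classes}
    weights = [0.0] * n_features
    for _ in range(n_iter):
        r = rng.randrange(len(X))
        for c in classes:
            pool = [X[j] for j in range(len(X)) if y[j] == c and j != r]
            nearest = sorted(pool, key=lambda s: distance(X[r], s))[:k]
            if not nearest:
                continue
            if c == y[r]:
                scale = -1.0  # nearest hits lower the weight
            else:
                scale = prior[c] / (1.0 - prior[y[r]])  # nearest misses raise it
            for i in range(n_features):
                total = sum(diff(i, X[r], s) for s in nearest)
                weights[i] += scale * total / (n_iter * len(nearest))
    return weights
```

A feature that separates the classes ends up with a larger weight than one that varies mostly within a class, which is the ranking the feature selection step consumes.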

In S5, the Sequential Forward Selection algorithm starts from the initial feature subset $S = \varnothing$. Guided by the feature weights calculated by the ReliefF algorithm, features are added to $S$ one at a time, and the procedure stops once the feature subset that gives the best model performance has been found.
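One way to read this selection step is sketched below. It is an assumption-laden illustration: features are tried in descending weight order, the search stops at the first feature that fails to improve the score, and `evaluate` is a hypothetical callback standing in for whatever model-performance measure is used.

```python
def sequential_forward_selection(weights, evaluate):
    """Greedy forward selection guided by precomputed feature weights.

    weights:  one importance score per feature (for example from ReliefF).
    evaluate: hypothetical callback mapping a list of feature indices
              to a validation score (higher is better).
    """
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected = []                  # start from the empty subset
    best_score = float("-inf")
    for feature in ranked:
        score = evaluate(selected + [feature])
        if score > best_score:     # keep the feature only if it helps
            selected.append(feature)
            best_score = score
        else:
            break                  # stop once the next feature no longer helps
    return selected, best_score


def toy_score(indices):
    """Toy score: features 1 and 2 are useful, every feature costs 0.1."""
    return len(set(indices) & {1, 2}) - 0.1 * len(indices)


chosen, score = sequential_forward_selection([0.1, 0.9, 0.5], toy_score)
```

Ranking by weight first means the model only has to be evaluated once per candidate feature, instead of once per remaining feature at every step as in an unguided forward search.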

The essence of the XGBoost classification algorithm in S6 is to combine several base classifier models into a single ensemble model.

The essence of the genetic optimization algorithm in S6 is to tune parameters by simulating, on a computer, the process of biological and human evolution.

Beneficial effects of the present invention:

1. The present invention uses the ReliefF algorithm to calculate correlation weights between features and labels; this supervised feature selection improves the effectiveness of the selected features;

2. To address class imbalance in the data, the present invention uses the SMOTE oversampling algorithm to balance the amount of data in each class, effectively preventing the model from over-fitting and from being affected by biased weights;

3. The present invention adopts a divide-and-conquer approach to the major root causes, applying four independent XGBoost classification models to locate the root cause of poor base station quality, which effectively improves the reliability of root cause location;

4. The present invention uses a genetic algorithm to tune the hyperparameters of the XGBoost classification models, which can effectively reduce model training time and improve model performance.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is the overall system flow chart of the method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to the present invention.

Fig. 2 is the overall system algorithm block diagram of the method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the flow shown in the accompanying drawings. This embodiment does not limit the present invention; any structural, methodological, or functional changes made by those of ordinary skill in the art on the basis of this embodiment fall within the protection scope of the present invention.

The present invention provides a method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm. Referring to Figs. 1-2, the method includes the steps:

S5: Feature selection: for the balanced data under each label problem from S4, calculate the importance score of each feature with the ReliefF algorithm, then select the important features for each label problem by the Sequential Forward Selection method and eliminate irrelevant features;

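Wherein, steps S1-S3 can be specifically described as follows.

Assume that the cell base station information is represented by $F = \{X_1, X_2, \ldots, X_n\}^T = \{F_1, F_2, \ldots, F_m\}$, where $1 \le k \le n$ indexes the samples $X_k$ and $1 \le i \le m$ indexes the features $F_i$.

The cell base station labels are represented by $L = \{L^1, L^2, L^3, L^4\}$, where $1 \le j \le 4$.

Features in the cell base station information $F$ that take large values are standardized with the Z-Score method of step S3.

Specifically, in S4 the sub-problem data are balanced with the SMOTE method.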

Specifically, as shown in Fig. 2, the ReliefF algorithm in S5 first randomly selects a sample $R$ from the training set, calculates the Euclidean distance between $R$ and the samples of its own class and of the other classes, and selects its $k$ nearest neighbors from the same class and from each different class.

Specifically, in S5 the Sequential Forward Selection algorithm starts from the initial feature subset $S = \varnothing$. Guided by the feature weights calculated by the ReliefF algorithm in S5, features are added to the feature subset $S$ one at a time, and the procedure stops once the feature subset that gives the best model performance has been found.

As shown in Fig. 2, the essence of the XGBoost classification algorithm in S6 is to combine several base classifier models into a single ensemble model. The base classifiers are generally decision trees, and the algorithm keeps adding base classifiers to the ensemble until its accuracy no longer increases significantly.

Specifically, assume that XGBoost contains $K$ base classifiers $f_k$, $1 \le k \le K$. For the $i$-th sample $X_i$ ($1 \le i \le n$), the output of the $j$-th XGBoost model ($1 \le j \le 4$) is the sum of the base classifier outputs:

$$\hat{y}_i^{\,j} = \sum_{k=1}^{K} f_k(X_i).$$

The following objective function is constructed, written here in the standard XGBoost form of a loss term plus a regularization term:

$$Obj^j = \sum_{i=1}^{n} l\left(y_i^j, \hat{y}_i^{\,j}\right) + \sum_{k=1}^{K} \Omega(f_k),$$

where $l$ is the loss function (the cross-entropy function of S6), $y_i^j$ is the true label of sample $X_i$ for the $j$-th problem, and $\Omega(f_k)$ penalizes the complexity of base classifier $f_k$.

The XGBoost model is trained by minimizing this objective function, that is, by solving $\arg\min Obj^j$.
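The additive output and the cross-entropy loss can be illustrated without any library. The sketch below is not XGBoost itself: the base learners are hand-written toy stumps (hypothetical names `stump_a`, `stump_b`), no trees are fitted, and the regularization term of the objective is left out for brevity.

```python
import math


def ensemble_output(base_learners, x):
    """Sum the raw per-class scores of all base learners for one sample."""
    n_classes = len(base_learners[0](x))
    total = [0.0] * n_classes
    for f in base_learners:
        for c, score in enumerate(f(x)):
            total[c] += score
    return total


def softmax(scores):
    """Turn raw scores into class probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(samples, labels, base_learners):
    """Mean cross-entropy of the ensemble over a labelled data set."""
    loss = 0.0
    for x, y in zip(samples, labels):
        probs = softmax(ensemble_output(base_learners, x))
        loss -= math.log(probs[y])
    return loss / len(samples)


# Two toy "stumps" on feature 0; real base learners would be fitted decision trees.
def stump_a(x):
    return [1.0, -1.0] if x[0] < 0.5 else [-1.0, 1.0]


def stump_b(x):
    return [0.5, -0.5] if x[0] < 0.3 else [-0.5, 0.5]
```

In this toy setting, adding `stump_b` to `stump_a` lowers the loss on samples both stumps classify correctly, which mirrors the idea of adding base classifiers while they still improve the objective.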

Specifically, the above optimization is carried out with the genetic optimization algorithm of S6, whose essence is to tune parameters by simulating, on a computer, the process of biological and human evolution.
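A genetic search over hyperparameters can be sketched as follows. Everything specific here is an assumption made for illustration: the population size, the mutation rate, uniform crossover, keeping the fitter half as parents, and the two tuned parameters (`learning_rate`, `max_depth`). The function `toy_loss` is a made-up stand-in for a cross-validated loss.

```python
import random


def genetic_search(fitness, bounds, pop_size=12, generations=15,
                   mutation_rate=0.2, seed=0):
    """Minimize `fitness` over real-valued parameters within `bounds`."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        return [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover

    def mutate(ind):
        # Each gene is resampled within its bounds with probability mutation_rate
        return [rng.uniform(lo, hi) if rng.random() < mutation_rate else g
                for g, (lo, hi) in zip(ind, bounds)]

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children        # elitism: parents survive
    return min(population, key=fitness)


def toy_loss(params):
    """Toy stand-in for a cross-validated loss; its minimum is at (0.1, 6)."""
    learning_rate, max_depth = params
    return (learning_rate - 0.1) ** 2 + ((max_depth - 6) / 10) ** 2


best = genetic_search(toy_loss, bounds=[(0.01, 0.5), (2, 10)])
```

In a real run the fitness function would train a model with the candidate hyperparameters and return its cross-validated loss, and integer parameters such as tree depth would be rounded before use.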

The above are only preferred embodiments of the present invention and describe its technical principle specifically. These descriptions serve only to explain the principle of the invention and cannot be interpreted in any way as limiting its protection scope. On the basis of this explanation, any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention, as well as other specific embodiments that those skilled in the art can conceive without creative effort, shall all fall within the protection scope of the present invention.

Claims

1. A method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm, characterized in that the method comprises the following steps:

S1: Establish a cell base station information matrix from information such as each cell base station's bandwidth, scenario, average number of users, and total traffic, and mark the cause of each base station's poor quality according to its status information; the causes of poor quality are coverage problems, interference problems, capacity problems, and antenna feeder problems, corresponding to labels 1 to 4 respectively;

S2: Label encoding: mark the specific sub-problem under each of the four labels in S1; coverage problems are divided into weak coverage, over-coverage, and overlapping coverage, corresponding to labels 0-2; interference problems are divided into external interference and intra-system interference, corresponding to labels 0-1; capacity problems are divided into high load with license restriction or unbalanced traffic load between sectors, and high load due to other causes, corresponding to labels 0-1; antenna feeder problems are divided into dual-stream ratio problems and antenna feeder disconnection problems, corresponding to labels 0-1; in addition, for some base stations the cause of poor quality does not belong to a given problem, and these form the blank set, whose code immediately follows the codes of the sub-problems under each problem; for example, the blank set under the coverage problem is coded 3;

S3: Data preprocessing: process each kind of cell base station information according to its characteristics; discrete information such as frequency band, region, bandwidth, and the number of RRC connection establishment failures caused by resource allocation failure is one-hot encoded; information with large values, such as the weekly average downlink traffic in the cell's own busy hour (GByte) and the air-interface uplink service traffic (GByte), is standardized with the Z-Score method;

S4: Data balancing: use the SMOTE method to balance the sub-problem data under each root-cause problem, so that the sub-problem labels under each problem label correspond to nearly the same amount of base station data;

S5: Feature selection: for the balanced data under each label problem from S4, calculate the importance score of each feature with the ReliefF algorithm, then select the important features for each label problem by the Sequential Forward Selection method and eliminate irrelevant features;

S6: Model training: for each label problem, the matrix formed by the features selected in S5 and the corresponding sub-problem labels is fed into one of 4 XGBoost classification models, which are trained independently; a genetic optimization algorithm tunes the hyperparameters to those that minimize the loss function (the cross-entropy function), and model performance is tested by cross-validation;

S7: Root cause location: apply the best hyperparameters found in S6 to the models, then input the cell base station information matrix into the models to obtain the root cause of each base station's poor quality, so that the base station layout and related configuration can be optimized according to that root cause.

2. The method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to claim 1, characterized in that steps S1-S3 can be specifically described as:

Assume that the cell base station information is represented by $F = \{X_1, X_2, \ldots, X_n\}^T = \{F_1, F_2, \ldots, F_m\}$, where $X_k$ ($1 \le k \le n$) is the record of the $k$-th base station and $F_i$ ($1 \le i \le m$) is the $i$-th feature;

the cell base station labels are represented by $L = \{L^1, L^2, L^3, L^4\}$, where $L^j$ ($1 \le j \le 4$) holds the labels for the $j$-th root-cause problem;

features in the cell base station information $F$ that take large values are standardized with the Z-Score formula

$$F_i' = \frac{F_i - \mu_i}{\sigma_i},$$

where $\mu_i$ is the mean of feature $F_i$ and $\sigma_i$ is its standard deviation; discrete-valued features in the cell base station information $F$ are one-hot encoded; finally, the data are divided into a training set and a test set.

3. The method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to claim 1, characterized in that the SMOTE algorithm in S4 can be specifically described as:

First, a minority-class sample $X_t$ ($1 \le t \le n$) is randomly selected from the training set, and its distance to each remaining sample $X_s$ is calculated with the Euclidean distance formula

$$d(X_t, X_s) = \sqrt{\sum_{i=1}^{m} (x_{t,i} - x_{s,i})^2},$$

where $x_{t,i}$ denotes the value of feature $F_i$ for sample $X_t$; one of the $k$ ($k = 3$) nearest-neighbor samples is selected at random; finally, a point is randomly selected on the line segment between the minority sample $X_t$ and the randomly selected neighbor and is taken as the new synthetic sample; sampling is repeated until the numbers of samples in all classes are balanced.

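4. The method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to claim 1, characterized in that the ReliefF algorithm in S5 first randomly selects a sample $R$ from the training set, calculates the Euclidean distance between $R$ and the samples of its own class and of the other classes, and selects its $k$ nearest neighbors from the same class and from each different class; the weight of each feature is calculated iteratively, the update in the standard ReliefF form being:

$$W(F_i) \leftarrow W(F_i) - \sum_{j=1}^{k} \frac{\mathrm{diff}(F_i, R, H_j)}{m k} + \sum_{C \notin \mathrm{class}(R)} \frac{P(C)}{1 - P(\mathrm{class}(R))} \sum_{j=1}^{k} \frac{\mathrm{diff}(F_i, R, M_j(C))}{m k},$$

where $\mathrm{diff}(F_i, R, H_j)$ is the difference between samples $R$ and $H_j$ on feature $F_i$, $H_j$ is the $j$-th nearest-neighbor sample in the same class $C \in \mathrm{class}(R)$, $M_j(C)$ is the $j$-th nearest-neighbor sample in a class $C \notin \mathrm{class}(R)$, $P(C)$ is the proportion of samples in class $C$ (a symbol of the standard form), $m$ is the number of iterations, and $k$ is the number of nearest-neighbor samples selected; $\mathrm{diff}(A, R_1, R_2)$ is calculated as follows, again in the standard ReliefF form:

$$\mathrm{diff}(A, R_1, R_2) = \begin{cases} \dfrac{|R_1[A] - R_2[A]|}{\max(A) - \min(A)}, & A \text{ continuous}, \\ 0, & A \text{ discrete and } R_1[A] = R_2[A], \\ 1, & A \text{ discrete and } R_1[A] \ne R_2[A]. \end{cases}$$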

5. The method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to claim 1, characterized in that in S5 the Sequential Forward Selection algorithm starts from the initial feature subset $S = \varnothing$; guided by the feature weights calculated by the ReliefF algorithm, features are added to $S$ one at a time, and the procedure stops once the feature subset that gives the best model performance has been found.

6. The method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to claim 1, characterized in that the essence of the XGBoost classification algorithm in S6 is to combine several base classifier models into a single ensemble model.

7. The method for locating the root cause of poor-quality cells based on the SMOTE-ReliefF-XGBoost algorithm according to claim 1, characterized in that the essence of the genetic optimization algorithm in S6 is to tune parameters by simulating, on a computer, the process of biological and human evolution.