WO2024001102A1 - Method, apparatus, and device for intelligent identification of family circles in the communications industry - Google Patents

Method, apparatus, and device for intelligent identification of family circles in the communications industry

Info

Publication number
WO2024001102A1
Authority
WO
WIPO (PCT)
Prior art keywords
family
data
model
broadband
circles
Prior art date
Application number
PCT/CN2022/141223
Other languages
English (en)
French (fr)
Inventor
谢国城
张伟斌
陈静旋
徐少强
杜昭
贾雪飞
廖小文
Original Assignee
广东亿迅科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广东亿迅科技有限公司
Publication of WO2024001102A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Definitions

  • a method, device and equipment for intelligent identification of family circles in the communications industry.
  • the present invention relates to the field of communication technology, and more specifically, to a method, device and equipment for intelligent identification of family circles in the communication industry.
  • the home market is one of the key competitive markets in the communications industry. With the development of full-service and converged packages, the home market is becoming more and more important, and it has broad room for growth: in addition to mobile SIM cards and winning new subscribers from other networks, it covers the development and layout of the entire industry chain, such as home broadband and the IPTV and home smart devices built on broadband. Therefore, accurately identifying family member relationships is of great practical significance. Based on the need to develop the home market, identification of home users is one of the key points. Existing home user identification models often build "social network" models based on users' call records and other data, and use "community discovery" algorithms to mine closely connected groups as suspected home customers.
  • the general method is: use the user's call records as the basis for building connections; after determining the connection relationship between users, use community segmentation algorithms to divide closely connected communities as suspected family customers.
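  • the traditional family circle recognition model uses call behavior as the basis for pairing two numbers. It has the following shortcomings: first, the established family member relationships are easily disturbed by intermediate nodes with large out-degree and in-degree, such as real estate agents, food delivery riders, and couriers, who rely on phone calls to maintain customer relationships; because of these intermediate nodes, community division can easily merge two groups that are not family members into the same family; second, the traditional model only identifies family relationships between number pairs and does not adequately identify relationships in families of three or four members; third, it ignores broadband dpi information, although the broadband that family members connect to in common is an important indicator of family relationships. The basis for traditional model identification is therefore not comprehensive enough, and the results have poor stability and low accuracy.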
  • the technical problem to be solved by the present invention is to address the above-mentioned deficiencies of the prior art.
  • the purpose of the present invention is to provide a method, device and equipment for intelligent identification of family circles in the communication industry, so as to solve the problem that the basis for traditional model identification is not comprehensive enough and that the results obtained have poor stability and low accuracy.
  • the invention provides a method for intelligent identification of family circles in the communications industry, which includes: designing wide table requirements for a broadband classification model, and extracting broadband dpi data from a database; performing de-extreme value and MinMax standardization processing on the broadband dpi data.
  • the prediction results of the family circle intelligent recognition model and the original data are further integrated and imported into the knowledge graph to obtain a family relationship graph.
  • the positive samples and negative samples are input into multiple decision tree algorithm models for training to obtain multiple pre-selected models, test samples are used to test the effects of each pre-selected model, the performance of each pre-selected model is evaluated through evaluation indicators, and the results of the pre-selected models are stacked to obtain the family circle intelligent recognition model.
  • multiple decision tree algorithm models include at least LightGBM, RandomForest, and xgboost algorithm models. Furthermore, a five-fold cross-validation method was used to conduct a comprehensive evaluation of the model prediction robustness of the family circle intelligent recognition model.
  • the number pairs of the positive samples satisfy the following three conditions at the same time: there is a primary and secondary card relationship, there is call behavior, and the numbers share the same residential community or the same regularly connected broadband wifi account; the negative samples are number pairs without a primary and secondary card relationship.
  • the K-means algorithm was used for cluster analysis and comparison to obtain three categories of broadband classification model results: home wifi, workplace wifi, and consumption place wifi.
  • the broadband dpi data includes the following fields: broadband account, number of connected devices, average usage time of connected devices, number of newly connected devices, number of dropped connected devices, average device connection frequency, proportion of devices connected between 7:00 and 21:00, and proportion of devices connected between 21:00 and 7:00.
  • the present invention provides a device for intelligent identification of family circles in the communications industry, which includes: a first acquisition module, used to extract broadband dpi data from a database, perform de-extreme value and MinMax standardization processing on the broadband dpi data, and then obtain the results of the broadband classification model through cluster analysis and comparison; the second acquisition module is used to extract the number pairs with call behavior from the database, and obtain the call behavior data of the number pairs and the location data of the number.
  • a preprocessing module used to associate the call behavior data and the location data of the number with the broadband classification model results, and calculate the degree of overlap of different paired numbers therein to obtain the initial wide table data; check the field quality and distribution of the initial wide table data, process missing values and outliers in fields, and then conduct correlation coefficient tests on pairs of variables; calculate iv values for variable pairs that fail the test, and eliminate the variable with the lower iv value in each pair to obtain the preprocessed data.
  • a training module is used to select all positive samples from the preprocessed data and extract a set proportion of negative samples, and to input the positive samples and negative samples into the decision tree algorithm model for training to obtain the family circle intelligent recognition model; a prediction module is used to predict the family relationship probability of actual data with the family circle intelligent recognition model, and to label family circles whose probability is greater than a set threshold as potential family circles.
  • the present invention provides an electronic device.
  • the device includes a processor and a memory: the memory is used to store program code and transmit the program code to the processor; the processor is used to execute the above-mentioned method of intelligent identification of family circles in the communications industry according to the instructions in the program code.
  • the present invention has the following advantages: 1. It gains insight into family relationships through multiple dimensions such as call behavior, WiFi analysis, and location signaling data, uses the existing primary and secondary card relationships to define the positive and negative family circle samples, and designs a reasonable family circle identification scheme. 2. Through broadband DPI analysis, it uses clustering to divide WiFi into three major categories, and feeds the overlap of number pairs in each category of WiFi into the model as a label, thereby improving the model effect. 3. It uses the knowledge graph to further analyze whether the composition structure of family members is reasonable and to verify the reliability of the family relationship recognition model from the side. On the basis of the original data information, the data is reprocessed, analyzed and correlated to effectively ensure the availability of model identification results and realize the application value of big data.
  • Figure 1 is a flow chart for identifying WiFi classification in the present invention.
  • Figure 2 is a flow chart of the present invention.
  • Figure 3 is a radar chart of clustering results in the present invention.
  • Figure 4 is an example diagram of five-fold cross-validation in the present invention.
  • Figure 5 is a ROC curve diagram of each model in the present invention.
  • Figure 6 is a flow chart of family unit identification in the present invention.
  • Figure 7 is a family relationship map in the present invention.
  • Broadband dpi data includes the following fields: broadband account number, number of connected devices, average usage time of connected devices, number of newly connected devices, number of dropped connected devices, average device connection frequency, proportion of devices connected between 7:00 and 21:00, and proportion of devices connected between 21:00 and 7:00, as shown in Table 1.
  • cluster analysis and comparison are performed to obtain the results of the broadband classification model.
  • the capping method is used to remove extreme values and MinMax standardization
  • the K-means algorithm is used for cluster analysis and comparison.
  • Home WiFi is characterized by a small number of connected devices, high frequency, long duration, and Internet access mainly during non-working hours; workplace WiFi is characterized by a large number of connected devices, high frequency, long duration, and Internet access mainly during working hours; consumption-place WiFi is characterized by a large number of connected devices, short duration, and a large number of incoming and outgoing devices.
  • Extract the number pairs with call behavior from the database. The extraction can be limited to a set period, such as the current month or the last 3 months. Obtain the call behavior data of the number pairs and the location data of the numbers. Correlate the call behavior data and number location data with the results of the broadband classification model, and calculate the overlap of different paired numbers to obtain the initial wide table data, as shown in Table 2.
  • call behavior: number of calls per month, number of call days per month, average number of calls per day, coefficient of variation of the number of calls in the past 3 months, trend of the number of calls in the past 3 months, number of calls on weekdays, number of call days on weekdays, call duration on weekdays, number of calls on rest days and holidays, number of call days on rest days and holidays, call duration on rest days and holidays, number of calls during non-working hours on weekdays (21:00-7:00), number of call days during non-working hours on weekdays (21:00-7:00), call duration during non-working hours on weekdays (21:00-7:00), number of short calls (call duration less than 60s), standard deviation of the number of calls on rest days and holidays divided by the standard deviation of the number of calls during working hours on weekdays, degree of overlap of call circles, whether there is a core communication circle (calls exchanged every month within half a year), the shortest call duration, the longest call duration; location data: number of identical base stations at night (0:00-6:00), number of identical base stations among the top-10 resident base stations, number of identical top-10 resident base stations on weekdays, number of identical top-10 resident base stations during non-working hours on weekdays (21:00-7:00), number of identical top-10 resident base stations during working hours on weekdays (7:00-21:00), number of identical top-10 resident base stations on holidays.
  • P value is a parameter used to determine the result of the hypothesis test: it is the probability, when the null hypothesis is true, of obtaining a result more extreme than the observed sample result. Variables with a P value less than 0.05 are initially screened out as candidates.
  • Pearson correlation coefficient test is used for continuous variables, and chi-square test is used for categorical variables.
  • the number pairs in the positive samples meet the following three conditions at the same time: there is a primary and secondary card relationship, there is a call behavior, the same permanent residence or the same permanent broadband wifi account; the negative sample is a number pair that does not have a primary and secondary card relationship.
  • the family circle intelligent recognition model is used to predict the probability of family relationships in actual data, and family circles whose probability is greater than the set threshold are labeled as potential family circles. Furthermore, positive samples and negative samples are input into a variety of decision tree algorithm models for training to obtain a variety of pre-selected models. Test samples are used to test the effects of each pre-selected model, and the performance of each pre-selected model is evaluated through evaluation indicators such as precision rate, hit rate, coverage rate, f1 value, auc value, lift, and area under the ROC curve; the results of the pre-selected models are then stacked to obtain the family circle intelligent recognition model.
  • the various decision tree algorithm models in this implementation include at least LightGBM, RandomForest, and xgboost algorithm models. The main parameters of the finally determined optimal models of LightGBM, RandomForest, and xgboost are as follows.
  • a five-fold cross-validation method is used to conduct a comprehensive evaluation of the robustness of model prediction, that is, the model is evaluated and selected in different model parameter spaces based on the set training set and test set. This keeps the complexity of the model reasonable, avoids an overly complex parameter space, reduces the risk of model overfitting, and enables the model to achieve good prediction results when used in actual online applications.
  • this embodiment also trains LightGBM, RandomForest, and xgboost models and the traditional family circle recognition model f directly on the data shown in Table 2, using the aforementioned parameters; a data set with known labels is then used to test each model (i.e. g_1^*, g_2^*, g_3^*, f), calculate the area under the ROC curve, draw the ROC curve and compare. The area under the ROC curve of each model is shown in Table 4, and the ROC curve is shown in Figure 5.
  • the area under the ROC curve of the final model g obtained by the technology proposed by the present invention is significantly higher than the area under the ROC curve of the model f obtained by the existing technology, that is, the technical effect of the present invention is more excellent.
  • the relevant data of users to be predicted in the new data set are organized into the form of Table 2 through the same feature engineering operation, and the features contained in it are then input into the three models respectively, which output three probabilities that the number pair under test belongs to a family circle. The average of these probabilities is used as the final output probability.
  • the probability threshold of the potential family circle is set at 0.5, and the family circle with a probability greater than this value is labeled as a potential family circle. The results are shown in Table 5.
  • the prediction results of the family circle intelligent recognition model and the original data are further integrated and imported into the knowledge graph to obtain the family relationship graph.
  • the knowledge graph is Neo4j.
  • the prediction results of the family circle intelligent recognition model and the original data are further integrated into a data format that Neo4j accepts as input.
  • the family circle character relationships, relationship probability information, and character attribute information are shown in Tables 6 and 7.
  • the above data information is placed into the import directory of the local Neo4j instance; after the data is loaded, a program is executed to visualize it and obtain the family relationship map, as shown in Figure 7, which makes the relationships between people easy to view.
  • the multi-person family map relationships are further analyzed based on the family relationship map results and sent to marketers in the form of labels so that they can selectively carry out marketing activities.
  • "entities" are used to express the nodes in the graph
  • "relationships" are used to express the "edges" and "arrow pointing" in the graph.
  • the number of times a node appears represents the number of users who have identified a family relationship with the user. The more users there are, the larger the node will be and will be highlighted in the network.
  • the color of the node indicates whether the user is a user on a different network. If the user is on a different network, it will be marked in red. If it is a user on the local network, it will be marked in blue.
  • the thickness of the edge represents the number of calls between users. The thicker the edge, the more frequent the calls between users.
  • the arrows point to indicate the proportion of call duration between the calling and called users, and the user with a high proportion of calling calls points to the user with a low proportion of calling calls.
  • a device for intelligent identification of family circles in the communications industry, including a first acquisition module for extracting broadband dpi data from a database, performing de-extreme value and MinMax standardization processing on the broadband dpi data, and performing cluster analysis and comparison to obtain the broadband classification model results; a second acquisition module for extracting the number pairs with call behavior from the database, and obtaining the call behavior data of the number pairs and the location data of the number; a preprocessing module for correlating the call behavior data and the location data of the number with the results of the broadband classification model, calculating the overlap between different paired numbers to obtain the initial wide table data, testing the field quality and distribution of the initial wide table data, processing missing values and outliers in the fields, performing a correlation coefficient test on each pair of variables, calculating the iv value for variable pairs that fail the test, and eliminating the variable with the lower iv value in each pair to obtain the preprocessed data; a training module for selecting all positive samples from the preprocessed data, extracting a set proportion of negative samples, and inputting them into the decision tree algorithm model for training to obtain the family circle intelligent recognition model; and a prediction module for predicting the family relationship probability of actual data and labeling family circles whose probability is greater than a set threshold as potential family circles.
  • An electronic device includes a processor and a memory: the memory is used to store program code and transmit the program code to the processor; the processor is used to execute the above-mentioned method for intelligent identification of family circles in the communications industry according to instructions in the program code.
  • the above are only the preferred embodiments of the present invention. It should be pointed out that those skilled in the art can also make several modifications and improvements without departing from the structure of the present invention, and these will not affect the effect of implementing the present invention or the utility of the patent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

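A method, apparatus, and device for intelligent identification of family circles in the communications industry, relating to the field of communication technology and addressing the technical problem that traditional family circle recognition models have poor practicality and low accuracy. The method includes: extracting broadband DPI data from a database and obtaining broadband classification model results through cluster analysis and comparison; extracting number pairs with call behavior from the database, obtaining the call behavior data of the number pairs and the location data of the numbers, and associating them with the user WiFi parsing and classification data to obtain initial wide-table data; auditing the field quality and distribution of the initial wide-table data and applying filling and replacement to obtain preprocessed data; selecting positive and negative samples from the preprocessed data and inputting them into tree algorithm models for training to obtain a family circle intelligent recognition model; and using the family circle intelligent recognition model to predict the family relationship probability of actual data, then applying the knowledge acquisition and knowledge reasoning steps of a knowledge graph to create family units and display them visually.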

Description

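Method, apparatus, and device for intelligent identification of family circles in the communications industry
The present invention relates to the field of communication technology, and more specifically to a method, apparatus, and device for intelligent identification of family circles in the communications industry.
The home market is one of the key competitive markets in the communications industry. With the development of full-service and converged packages, the home market is becoming increasingly important, and it also has broad room for growth: beyond mobile SIM cards and winning new subscribers from other networks, it spans the development and layout of the whole industry chain, including home broadband and the IPTV and smart home devices built on broadband. Accurately identifying family member relationships is therefore of great practical significance. Given the need to develop the home market, identifying home users is one of the key tasks. Existing home user identification models usually build a "social network" model from data such as users' call records, and use "community discovery" algorithms to mine closely connected groups as suspected family customers. The usual approach is to use users' call records as the basis for building edges; once the edges between users are determined, community division algorithms split out closely connected communities, which are treated as suspected family customers. Traditional family circle recognition models use call behavior as the basis for pairing two numbers and have the following shortcomings. First, the family member relationships they establish are easily disturbed by intermediate nodes with large out-degree and in-degree, such as real estate agents, food delivery riders, and couriers, who rely on phone calls to maintain customer relationships; because of these intermediate nodes, community division can easily merge two groups that are not family members into one family. Second, traditional models identify family relationships only between number pairs and do not adequately identify relationships in families of three or four members. Third, they ignore broadband DPI information, although the broadband that family members connect to in common is an important indicator of family relationships. The basis for identification in traditional models is therefore not comprehensive enough, and the results have poor stability and low accuracy.
The technical problem to be solved by the present invention is to address the above deficiencies of the prior art. The purpose of the present invention is to provide a method, apparatus, and device for intelligent identification of family circles in the communications industry, so as to solve the problem that the basis for identification in traditional models is not comprehensive enough and that the results have poor stability and low accuracy.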
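The present invention provides a method for intelligent identification of family circles in the communications industry, comprising: designing the wide-table requirements of a broadband classification model and extracting broadband DPI data from a database; applying extreme-value removal and MinMax standardization to the broadband DPI data, then performing cluster analysis and comparison to obtain broadband classification model results; extracting number pairs with call behavior from the database and obtaining the call behavior data of the number pairs and the location data of the numbers; associating the call behavior data and the location data with the broadband classification model results, and computing the degree of overlap of the paired numbers within them to obtain initial wide-table data; checking the field quality and distribution of the initial wide-table data, handling missing values and outliers in the fields, then running pairwise correlation coefficient tests on the variables, computing the IV value for each variable pair that fails the test and removing the variable with the lower IV value in the pair, to finally obtain preprocessed data; selecting all positive samples from the preprocessed data and drawing negative samples at a set ratio; inputting the positive and negative samples into a decision tree algorithm model for training to obtain a family circle intelligent recognition model; and using the family circle intelligent recognition model to predict the family relationship probability of actual data, labeling family circles whose probability exceeds a set threshold as potential family circles. As a further improvement, the prediction results of the family circle intelligent recognition model and the original data are further integrated and imported into a knowledge graph to obtain a family relationship graph. Further, the positive and negative samples are input into multiple decision tree algorithm models for training to obtain multiple pre-selected models, test samples are used to test the effect of each pre-selected model, the performance of each pre-selected model is evaluated through evaluation indicators, and the results of the pre-selected models are stacked to obtain the family circle intelligent recognition model. Further, the multiple decision tree algorithm models include at least the LightGBM, RandomForest, and xgboost algorithm models. Further, five-fold cross-validation is used to comprehensively evaluate the prediction robustness of the family circle intelligent recognition model. Further, the number pairs of the positive samples satisfy the following three conditions at the same time: a primary and secondary card relationship exists, call behavior exists, and the numbers share the same residential community or the same regularly connected broadband wifi account; the negative samples are number pairs without a primary and secondary card relationship. Further, the K-means algorithm is used for cluster analysis and comparison to obtain three categories of broadband classification model results: home wifi, workplace wifi, and consumption-place wifi. Further, the broadband DPI data include the following fields: broadband account, number of connected devices, average usage duration of connected devices, number of newly connected devices, number of dropped connected devices, average device connection frequency, proportion of devices connected between 7:00 and 21:00, and proportion of devices connected between 21:00 and 7:00. The present invention provides an apparatus for intelligent identification of family circles in the communications industry, comprising: a first acquisition module, used to extract broadband DPI data from a database, apply extreme-value removal and MinMax standardization to the broadband DPI data, and then perform cluster analysis and comparison to obtain broadband classification model results; and a second acquisition module, used to extract number pairs with call behavior from the database and obtain the call behavior data of the number pairs and the location data of the numbers.
The apparatus further comprises: a preprocessing module, used to associate the call behavior data and the location data with the broadband classification model results and compute the degree of overlap of the paired numbers within them to obtain initial wide-table data, to check the field quality and distribution of the initial wide-table data, handle missing values and outliers in the fields, run pairwise correlation coefficient tests on the variables, compute the IV value for each variable pair that fails the test, and remove the variable with the lower IV value in the pair to finally obtain preprocessed data; a training module, used to select all positive samples from the preprocessed data, draw negative samples at a set ratio, and input the positive and negative samples into a decision tree algorithm model for training to obtain the family circle intelligent recognition model; and a prediction module, used to predict the family relationship probability of actual data with the family circle intelligent recognition model and label family circles whose probability exceeds a set threshold as potential family circles. The present invention provides an electronic device comprising a processor and a memory: the memory is used to store program code and transmit the program code to the processor; the processor is used to execute the above method for intelligent identification of family circles in the communications industry according to the instructions in the program code.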
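Compared with the prior art, the present invention has the following advantages: 1. It gains insight into family relationships through multiple dimensions such as call behavior, wifi parsing, and location signaling data, uses existing primary and secondary card relationships to define the positive and negative family circle samples, and designs a reasonable family circle identification scheme. 2. Through broadband DPI parsing, it uses clustering to divide wifi into three major categories and feeds the degree of overlap of each number pair in each wifi category into the model as a label, thereby improving the model effect. 3. It uses the knowledge graph to further analyze whether the composition of family members is reasonable, verifying the reliability of the family relationship recognition model from another angle. On the basis of the original data, the data are reprocessed, analyzed, and correlated, which effectively ensures that the model's identification results are usable and realizes the application value of big data. 4. By identifying family relationships between number pairs and then deriving family units from the knowledge graph, it avoids leaving family circle identification at the operational level and instead portrays actual family circle relationships precisely, providing data support for retaining the operator's own subscribers and winning over subscribers from other networks, which effectively reduces losses and increases revenue. Brief description of the drawings. Figure 1 is a flow chart for identifying wifi classification in the present invention. Figure 2 is a flow chart of the present invention. Figure 3 is a radar chart of the clustering results in the present invention. Figure 4 is an example diagram of five-fold cross-validation in the present invention. Figure 5 is a ROC curve diagram of each model in the present invention. Figure 6 is a flow chart of family unit identification in the present invention. Figure 7 is a family relationship graph in the present invention.
The present invention is further described below with reference to the specific embodiments shown in the drawings. Referring to Figures 1 to 7, a method for intelligent identification of family circles in the communications industry comprises: designing the wide-table requirements of the broadband classification model and extracting broadband DPI data from the database. The broadband DPI data include the following fields: broadband account, number of connected devices, average usage duration of connected devices, number of newly connected devices, number of dropped connected devices, average device connection frequency, proportion of devices connected between 7:00 and 21:00, and proportion of devices connected between 21:00 and 7:00, as shown in Table 1.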
  
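After extreme-value removal and MinMax standardization of the broadband DPI data, cluster analysis and comparison are performed to obtain the broadband classification model results. Preferably, extreme values are removed by capping, the data are MinMax-standardized, and the K-means algorithm is used for the cluster analysis and comparison. The elbow method shows that the clustering sum of squared errors becomes increasingly stable from K = 3 onward, so this embodiment uses K = 3. The resulting radar chart is shown in Figure 3, and the broadband classification model yields three categories: home wifi, workplace wifi, and consumption-place wifi. Home wifi is characterized by few connected devices, high frequency, long duration, and Internet access mainly outside working hours; workplace wifi by many connected devices, high frequency, long duration, and Internet access mainly during working hours; consumption-place wifi by many connected devices, short duration, and a large inflow and outflow of devices. Number pairs with call behavior are extracted from the database; the extraction can be limited to a set period, such as the current month or the last 3 months, and the call behavior data of the number pairs and the location data of the numbers are obtained. The call behavior data and location data are associated with the broadband classification model results, and the degree of overlap of the paired numbers within them is computed to obtain the initial wide-table data, as shown in Table 2. The call behavior fields are: monthly call count, monthly call days, average daily call count, coefficient of variation of the call count over the last 3 months, trend of the call count over the last 3 months, weekday call count, weekday call days, weekday call duration, rest-day and holiday call count, rest-day and holiday call days, rest-day and holiday call duration, weekday non-working-hours (21:00-7:00) call count, weekday non-working-hours (21:00-7:00) call days, weekday non-working-hours (21:00-7:00) call duration, short-call count (call duration under 60 s), standard deviation of rest-day and holiday call counts divided by standard deviation of weekday working-hours call counts, call-circle overlap, core communication circle flag (calls exchanged every month for half a year), shortest call duration, and longest call duration. The location data fields are: number of identical base stations at night (0:00-6:00), number of identical base stations among the top-10 resident base stations, number of identical top-10 resident base stations on weekdays, number of identical top-10 resident base stations during weekday non-working hours (21:00-7:00), number of identical top-10 resident base stations during weekday working hours (7:00-21:00), and number of identical top-10 resident base stations on holidays.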
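The capping, MinMax, and K-means steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the quantile cut-offs (1% and 99%), the deterministic farthest-point initialisation, and all function names are hypothetical choices made here so that the example is reproducible.

```python
def cap(values, lower_q=0.01, upper_q=0.99):
    """Clip values to the empirical [lower_q, upper_q] quantile range (capping)."""
    ordered = sorted(values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]


def min_max(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def preprocess(rows):
    """Cap and MinMax-scale every column of a list of numeric rows."""
    columns = [min_max(cap(list(col))) for col in zip(*rows)]
    return [list(point) for point in zip(*columns)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=3, max_iter=100):
    """Lloyd's K-means; initial centers chosen by farthest-point search (assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append([sum(col) / len(members) for col in zip(*members)] if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels, centers
```

Each input row would hold numeric DPI fields for one broadband account, for example the number of connected devices, the average usage duration, and the daytime connection share; `kmeans(preprocess(rows), k=3)` then returns one cluster label per account. Which cluster corresponds to home, workplace, or consumption-place wifi still has to be read off the cluster centers, as the radar chart in Figure 3 does.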
   
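The field quality and distribution of the initial wide-table data are checked, and missing values and outliers in the fields are handled. For example, the fields "weekday non-working-hours call count" and "weekday non-working-hours call days" contain missing values. Based on an analysis of the data distribution, the weekday non-working-hours call count, which follows a normal distribution, is filled with the mean, and the weekday non-working-hours call days, which is left-skewed, is filled with the median. Outliers in the number fields are also handled: negative values in the shortest call duration field are replaced with the smallest value greater than 0. To reduce the effect of multicollinearity among the indicators, feature selection is used to choose the final model input features. First, a statistical test is used to compute the P value between each variable and the target variable, and variables with a P value below 0.05 are retained in an initial screening. (The P value is a parameter used to judge the result of a hypothesis test: it is the probability, when the null hypothesis is true, of obtaining a result more extreme than the observed sample result. Continuous variables are tested with the Pearson correlation coefficient test and categorical variables with the chi-square test. A P value below 0.05 indicates that the variable is significantly correlated with the target variable.) Variables with zero information entropy are deleted: statistical analysis shows that the "terminals swapped" field takes the value "No" for every record, so it adds nothing to the model and is deleted. Strongly correlated variables are deleted: pairwise correlation coefficient tests are run on the variables with the p value set to 0.05; for each variable pair that fails the test, the IV value is computed and the variable with the lower IV value in the pair is removed, finally yielding the preprocessed data. (IV, information value, expresses how much a feature contributes to predicting the target, that is, the feature's predictive power; in general, the higher the IV value, the stronger the feature's predictive power and the greater its information contribution.) The final model input features are shown in Table 3.
All positive samples are selected from the preprocessed data, and negative samples are drawn at a set ratio. Specifically, the preprocessed data are divided into 70% training data and 30% test data, and samples are drawn from the training data and the test data respectively at a positive-to-negative ratio between 1:3 and 1:10. The positive and negative samples are input into a decision tree algorithm model for training to obtain the family circle intelligent recognition model. The number pairs of the positive samples satisfy the following three conditions at the same time: a primary and secondary card relationship exists, call behavior exists, and the numbers share the same residential community or the same regularly connected broadband wifi account; the negative samples are number pairs without a primary and secondary card relationship. The family circle intelligent recognition model is used to predict the family relationship probability of actual data, and family circles whose probability exceeds the set threshold are labeled as potential family circles. Further, the positive and negative samples are input into multiple decision tree algorithm models for training to obtain multiple pre-selected models, test samples are used to test the effect of each pre-selected model, and the performance of each pre-selected model is evaluated through evaluation indicators such as precision, hit rate, coverage, f1 value, auc value, lift, and area under the ROC curve; the results of the pre-selected models are then stacked to obtain the family circle intelligent recognition model. The multiple decision tree algorithm models in this embodiment include at least the LightGBM, RandomForest, and xgboost algorithm models. The main parameters of the finally determined optimal LightGBM, RandomForest, and xgboost models are as follows.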
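The elimination of strongly correlated variables by IV can be illustrated with a short sketch. It makes several assumptions that the text does not specify: a fixed threshold on the absolute Pearson coefficient stands in for the significance test, IV is computed over equal-width bins with additive smoothing, and the feature names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def information_value(feature, target, bins=4, smooth=0.5):
    """IV = sum over bins of (pos_share - neg_share) * ln(pos_share / neg_share)."""
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / bins or 1.0
    pos = [smooth] * bins  # smoothing avoids log(0) for empty bins
    neg = [smooth] * bins
    for value, label in zip(feature, target):
        index = min(int((value - lo) / width), bins - 1)
        if label == 1:
            pos[index] += 1
        else:
            neg[index] += 1
    total_pos, total_neg = sum(pos), sum(neg)
    return sum(
        (p / total_pos - n / total_neg) * math.log((p / total_pos) / (n / total_neg))
        for p, n in zip(pos, neg)
    )


def drop_correlated(features, target, threshold=0.8):
    """For each strongly correlated pair, drop the variable with the lower IV."""
    kept = dict(features)
    names = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(features[a], features[b])) > threshold:
                iv_a = information_value(features[a], target)
                iv_b = information_value(features[b], target)
                del kept[a if iv_a < iv_b else b]
    return list(kept)
```

With a dictionary mapping feature names to columns and a 0/1 target, `drop_correlated` returns the names that survive. Replacing the fixed threshold with a proper significance test on the correlation coefficient would follow the description more closely.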
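The sample definition and the 70/30 split with ratio-based negative sampling can be sketched as follows. The field names (`primary_secondary`, `has_calls`, `same_community`, `same_wifi_account`, `label`), the fixed seed, and the choice of 1:3 within the stated 1:3 to 1:10 range are assumptions made for illustration only.

```python
import random


def label_pair(pair):
    """Return 1 for a positive sample, 0 for a negative one, None if neither applies."""
    if not pair["primary_secondary"]:
        return 0  # negative: no primary/secondary card relationship
    if pair["has_calls"] and (pair["same_community"] or pair["same_wifi_account"]):
        return 1  # positive: all three conditions hold
    return None  # card relationship exists but the other conditions fail


def split_and_sample(rows, neg_per_pos=3, train_share=0.7, seed=42):
    """Split 70/30, keep every positive, and draw neg_per_pos negatives per positive."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    parts = {"train": shuffled[:cut], "test": shuffled[cut:]}
    sampled = {}
    for name, part in parts.items():
        positives = [r for r in part if r["label"] == 1]
        negatives = [r for r in part if r["label"] == 0]
        n_neg = min(len(negatives), neg_per_pos * len(positives))
        sampled[name] = positives + rng.sample(negatives, n_neg)
    return sampled
```

Pairs for which `label_pair` returns `None` fit neither definition in the text and would simply be left out of training in this sketch.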
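Using the above parameters, each algorithm is trained on the same data set to obtain the corresponding pre-selected model. This yields three pre-selected models g_1^*, g_2^*, and g_3^*; soft voting (probability averaging) over the model results gives the final family relationship probability, which constitutes the family circle intelligent recognition model g.
Further, considering the large volume of call behavior data, five-fold cross-validation is used to comprehensively evaluate the robustness of model prediction. That is, based on the set training set and test set, models are evaluated and selected in different model parameter spaces, which keeps the complexity of the model reasonable, avoids an overly complex parameter space, reduces the risk of overfitting, and allows the model to achieve good prediction results in actual online applications.
To demonstrate the advantage of the family circle intelligent recognition model g over the prior art, this embodiment also trains LightGBM, RandomForest, and xgboost models and the traditional family circle recognition model f directly on the data shown in Table 2, using the aforementioned parameters. A data set with known labels is then used to test each model (i.e. g_1^*, g_2^*, g_3^*, f), compute the area under the ROC curve, draw the ROC curves, and compare them. The area under the ROC curve of each model is shown in Table 4, and the ROC curves are shown in Figure 5.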
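As an illustration of five-fold cross-validation, the sketch below generates the train and validation index sets for each fold. It assumes contiguous, unshuffled folds and uses a hypothetical function name; the patent does not state how the folds are formed.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size
```

Each candidate parameter setting would be trained on `train` and scored on `validation` for all five folds, and the averaged score compared across settings. Shuffling or stratifying by label before splitting is a common refinement when positives are rare.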
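The area under the ROC curve used to compare the models can be computed directly from labels and scores. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half); the function name is hypothetical and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

For example, with labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75.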
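The results show that the area under the ROC curve of the final model g obtained with the technique proposed by the present invention is clearly higher than that of the model f obtained with the prior art, that is, the technical effect of the present invention is better. Validation against family numbers collected from the operator's own employees confirms this as well. In actual application, the relevant data of the users to be predicted in a new data set are organized into the form of Table 2 through the same feature engineering operations; the features are then input into the three models respectively, which output three probabilities that the number pair under test belongs to a family circle, and their mean is used as the final output probability. This embodiment sets the potential family circle probability threshold at 0.5 and labels family circles whose probability exceeds this value as potential family circles. The results are shown in Table 5.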
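The averaging of the three model outputs and the 0.5 threshold can be sketched as below. The function names and the representation of a number pair as a tuple are assumptions; the three probability lists stand for the outputs of the three pre-selected models.

```python
def soft_vote(probability_lists):
    """Average the per-model probabilities for each number pair (soft voting)."""
    return [sum(ps) / len(ps) for ps in zip(*probability_lists)]


def label_potential_circles(pairs, probabilities, threshold=0.5):
    """Keep the pairs whose averaged probability is strictly greater than the threshold."""
    return [(pair, p) for pair, p in zip(pairs, probabilities) if p > threshold]
```

A pair whose averaged probability equals the threshold exactly is not labeled here, matching the "greater than" wording in the text.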
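Further, the prediction results of the family circle intelligent recognition model and the original data are further integrated and imported into a knowledge graph to obtain the family relationship graph. Preferably, the knowledge graph is Neo4j. Specifically, the prediction results of the family circle intelligent recognition model and the original data are integrated into a data format that Neo4j accepts as input; the family circle person relationships with their relationship probabilities, and the person attribute information, are shown in Tables 6 and 7.
The above data are placed into the import directory of the local Neo4j instance; after the data are loaded, a program is executed to visualize them and obtain the family relationship graph, as shown in Figure 7, which makes the relationships between people easy to view. Multi-person family relationships are further analyzed from the family relationship graph results and delivered to marketers in the form of labels, so that they can carry out marketing activities selectively. In the family relationship graph, "entities" express the nodes in the graph, and "relationships" express the "edges" and "arrow directions" in the graph. The number of times a node appears represents the number of users identified as having a family relationship with that user; the more such users, the larger the node and the more prominently it is displayed in the network. The color of a node indicates whether the user belongs to another network: users on other networks are marked in red and users on the operator's own network in blue. The thickness of an edge indicates the number of calls between users; the thicker the edge, the more frequent the calls. The arrow direction indicates the split of calling and called duration within a user pair: the arrow points from the user with the higher calling share to the user with the lower calling share. Knowledge reasoning rules are created through the above steps to complete the knowledge reasoning and identify family unit relationships. The knowledge reasoning rules are created one after another to complete the family relationship reasoning, and further analysis of the actual graph structure shows that the valid family unit relationship structures are as listed in Table 8.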
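The mapping from relationship records to graph display attributes can be sketched without a graph database. The record fields (`a`, `b`, `probability`, `calls`, `a_caller_share`, `other_network`) and the function name are hypothetical; the sketch only derives node size, node color, edge thickness, and arrow direction as described above, and does not show how the data would be loaded into Neo4j.

```python
from collections import defaultdict


def build_family_graph(relations, users):
    """Derive node and edge display attributes from predicted family relations."""
    degree = defaultdict(int)
    edges = []
    for rel in relations:
        degree[rel["a"]] += 1
        degree[rel["b"]] += 1
        # The arrow points from the user with the higher calling share to the other user.
        if rel["a_caller_share"] >= 0.5:
            source, target = rel["a"], rel["b"]
        else:
            source, target = rel["b"], rel["a"]
        edges.append({
            "from": source,
            "to": target,
            "width": rel["calls"],  # thicker edge means more calls
            "probability": rel["probability"],
        })
    nodes = {
        user: {
            "size": degree[user],  # number of identified family relations
            "color": "red" if users[user]["other_network"] else "blue",
        }
        for user in degree
    }
    return nodes, edges
```

A user who appears in two predicted relations gets node size 2, and a subscriber of another network is colored red, in line with the display conventions in the text.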
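If a number appears in a 5-person family unit, that number is removed from the 4-, 3-, and 2-person family units. The 4-person family units are pruned by the same recursive process. Duplicate numbers in family units arise in two ways: first, different units may contain a common number, in which case the probability sums of the units are compared and the unit with the larger probability sum is preferred; second, the same family unit may appear in several permutations, in which case only one record is kept. The knowledge graph network of family circles is stored in the property graph database Neo4j. This database uses a local indexing technique in place of the traditional global index to organize graph-structured data, so that queries for an entity's adjacent entities, relationships, and attributes require considerably less space, giving the knowledge graph a fast response.

An apparatus for intelligent identification of family circles in the communications industry comprises: a first acquisition module, used to extract broadband DPI data from a database, apply extreme-value removal and MinMax standardization to the broadband DPI data, and then perform cluster analysis and comparison to obtain broadband classification model results; a second acquisition module, used to extract number pairs with call behavior from the database and obtain the call behavior data of the number pairs and the location data of the numbers; a preprocessing module, used to associate the call behavior data and the location data with the broadband classification model results and compute the degree of overlap of the paired numbers within them to obtain initial wide-table data, to check the field quality and distribution of the initial wide-table data, handle missing values and outliers in the fields, run pairwise correlation coefficient tests on the variables, compute the IV value for each variable pair that fails the test, and remove the variable with the lower IV value in the pair to finally obtain preprocessed data; a training module, used to select all positive samples from the preprocessed data, draw negative samples at a set ratio, and input the positive and negative samples into a decision tree algorithm model for training to obtain the family circle intelligent recognition model; and a prediction module, used to predict the family relationship probability of actual data with the family circle intelligent recognition model and label family circles whose probability exceeds a set threshold as potential family circles.

An electronic device comprises a processor and a memory: the memory is used to store program code and transmit the program code to the processor; the processor is used to execute the above method for intelligent identification of family circles in the communications industry according to the instructions in the program code. The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the structure of the present invention, and these will not affect the effect of implementing the present invention or the utility of the patent.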
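The family-unit deduplication rules in the embodiment (larger units take precedence, conflicts are resolved by probability sum, permutations are collapsed) can be sketched as a greedy selection. This is a simplified reading: here a smaller candidate unit that shares any number with an already selected unit is discarded entirely, whereas the text speaks of removing the number from the smaller units. The input format and function name are hypothetical.

```python
def select_family_units(candidates):
    """Pick non-overlapping family units from (members, probability_sum) candidates."""
    # Collapse permutations of the same member set, keeping the best probability sum.
    best = {}
    for members, prob_sum in candidates:
        key = frozenset(members)
        if key not in best or prob_sum > best[key]:
            best[key] = prob_sum
    # Larger units first; among equal sizes, higher probability sum first.
    ordered = sorted(best.items(), key=lambda kv: (-len(kv[0]), -kv[1], sorted(kv[0])))
    used, chosen = set(), []
    for members, prob_sum in ordered:
        if used.isdisjoint(members):
            chosen.append((sorted(members), prob_sum))
            used |= members
    return chosen
```

Given candidates A-B-C (listed twice in different orders), A-B, D-E, and E-F, the sketch keeps A-B-C once, drops A-B because its members already belong to the larger unit, and keeps D-E over E-F when D-E has the larger probability sum.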

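Claims (10)

  1. A method for intelligent identification of family circles in the communications industry, characterized by comprising: designing the wide-table requirements of a broadband classification model and extracting broadband DPI data from a database; applying extreme-value removal and MinMax standardization to the broadband DPI data, then performing cluster analysis and comparison to obtain broadband classification model results; extracting number pairs with call behavior from the database and obtaining the call behavior data of the number pairs and the location data of the numbers; associating the call behavior data and the location data with the broadband classification model results, and computing the degree of overlap of the paired numbers within them to obtain initial wide-table data; checking the field quality and distribution of the initial wide-table data, handling missing values and outliers in the fields, then running pairwise correlation coefficient tests on the variables, computing the IV value for each variable pair that fails the test and removing the variable with the lower IV value in the pair, to finally obtain preprocessed data; selecting all positive samples from the preprocessed data and drawing negative samples at a set ratio; inputting the positive and negative samples into a decision tree algorithm model for training to obtain a family circle intelligent recognition model; and using the family circle intelligent recognition model to predict the family relationship probability of actual data, labeling family circles whose probability exceeds a set threshold as potential family circles.
  2. The method for intelligent identification of family circles in the communications industry according to claim 1, characterized in that the prediction results of the family circle intelligent recognition model and the original data are further integrated and imported into a knowledge graph to obtain a family relationship graph.
  3. The method for intelligent identification of family circles in the communications industry according to claim 1, characterized in that the positive and negative samples are input into multiple decision tree algorithm models for training to obtain multiple pre-selected models, test samples are used to test the effect of each pre-selected model, the performance of each pre-selected model is evaluated through evaluation indicators, and the results of the pre-selected models are stacked to obtain the family circle intelligent recognition model.
  4. The method for intelligent identification of family circles in the communications industry according to claim 3, characterized in that the multiple decision tree algorithm models include at least the LightGBM, RandomForest, and xgboost algorithm models.
  5. The method for intelligent identification of family circles in the communications industry according to claim 1, characterized in that five-fold cross-validation is used to comprehensively evaluate the prediction robustness of the family circle intelligent recognition model.
  6. The method for intelligent identification of family circles in the communications industry according to claim 1, characterized in that the number pairs of the positive samples satisfy the following three conditions at the same time: a primary and secondary card relationship exists, call behavior exists, and the numbers share the same residential community or the same regularly connected broadband wifi account; the negative samples are number pairs without a primary and secondary card relationship.
  7. The method for intelligent identification of family circles in the communications industry according to claim 1, characterized in that the K-means algorithm is used for cluster analysis and comparison to obtain three categories of broadband classification model results: home wifi, workplace wifi, and consumption-place wifi.
  8. The method for intelligent identification of family circles in the communications industry according to claim 1, characterized in that the broadband DPI data include the following fields: broadband account, number of connected devices, average usage duration of connected devices, number of newly connected devices, number of dropped connected devices, average device connection frequency, proportion of devices connected between 7:00 and 21:00, and proportion of devices connected between 21:00 and 7:00.
  9. An apparatus for intelligent identification of family circles in the communications industry, characterized by comprising: a first acquisition module, used to extract broadband DPI data from a database, apply extreme-value removal and MinMax standardization to the broadband DPI data, and then perform cluster analysis and comparison to obtain broadband classification model results; a second acquisition module, used to extract number pairs with call behavior from the database and obtain the call behavior data of the number pairs and the location data of the numbers; a preprocessing module, used to associate the call behavior data and the location data with the broadband classification model results and compute the degree of overlap of the paired numbers within them to obtain initial wide-table data, to check the field quality and distribution of the initial wide-table data, handle missing values and outliers in the fields, run pairwise correlation coefficient tests on the variables, compute the IV value for each variable pair that fails the test, and remove the variable with the lower IV value in the pair to finally obtain preprocessed data; a training module, used to select all positive samples from the preprocessed data, draw negative samples at a set ratio, and input the positive and negative samples into a decision tree algorithm model for training to obtain the family circle intelligent recognition model;
    and a prediction module, used to predict the family relationship probability of actual data with the family circle intelligent recognition model and label family circles whose probability exceeds a set threshold as potential family circles.
  10. An electronic device, characterized in that the device comprises a processor and a memory: the memory is used to store program code and transmit the program code to the processor; the processor is used to execute, according to the instructions in the program code, the method for intelligent identification of family circles in the communications industry according to any one of claims 1-8.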
PCT/CN2022/141223 2022-06-30 2022-12-23 Method, apparatus, and device for intelligent identification of family circles in the communications industry WO2024001102A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210769422.X 2022-06-30
CN202210769422.XA CN115048472A (zh) 2022-06-30 2022-06-30 Method, apparatus, and device for intelligent identification of family circles in the communications industry

Publications (1)

Publication Number Publication Date
WO2024001102A1 (zh) 2024-01-04

Family

ID=83165916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141223 WO2024001102A1 (zh) 2022-06-30 2022-12-23 Method, apparatus, and device for intelligent identification of family circles in the communications industry

Country Status (2)

Country Link
CN (1) CN115048472A (zh)
WO (1) WO2024001102A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048472A (zh) * 2022-06-30 2022-09-13 广东亿迅科技有限公司 Method, apparatus, and device for intelligent identification of family circles in the communications industry

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
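US20160086185A1 (en) * 2014-10-15 2016-03-24 Brighterion, Inc. Method of alerting all financial channels about risk in real-time
CN109639478A (zh) * 2018-12-07 2019-04-16 中国移动通信集团江苏有限公司 Method, apparatus, device, and medium for identifying customers with family relationships
CN109784393A (zh) * 2019-01-07 2019-05-21 闽江学院 Family member identification and clustering method based on telecom big data
CN109829485A (zh) * 2019-01-08 2019-05-31 科大国创软件股份有限公司 User relationship mining method and system based on mobile communication data
CN115048472A (zh) * 2022-06-30 2022-09-13 广东亿迅科技有限公司 Method, apparatus, and device for intelligent identification of family circles in the communications industry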

Also Published As

Publication number Publication date
CN115048472A (zh) 2022-09-13

Similar Documents

Publication Publication Date Title
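CN109492026B (zh) Telecom fraud classification and detection method based on improved active learning
CN108154425B (zh) Offline merchant recommendation method combining social networks and location
CN111274338B (zh) Method for identifying users about to leave the country based on mobile big data
CN109783639A (zh) Intelligent assignment method and system for mediation cases based on feature extraction
US8255392B2 (en) Real time data collection system and method
CN105824813B (zh) Method and device for mining core users
CN109684373B (zh) Key related-person discovery method based on analysis of travel and call-record data
CN107527240A (zh) System and method for evaluating the word-of-mouth marketing effect of operator industry products
CN108924371B (zh) Method for identifying a customer account number from the incoming call number during electric power customer service
CN112053222A (zh) Knowledge graph-based method for detecting internet finance gang fraud
CN111221868A (zh) Data mining and analysis method applied to electric power customer channel preferences
WO2024001102A1 (zh) Method, apparatus, and device for intelligent identification of family circles in the communications industry
CN112101807A (zh) Method and related apparatus for comprehensive evaluation of group customer value in the telecom industry
CN111510368A (zh) Family group identification method, apparatus, device, and computer-readable storage medium
CN104217088B (zh) Method and system for optimizing operator mobile service resources
CN113435627A (zh) Method and device for predicting electric power customer complaints based on work-order trajectory information
Zubiaga et al. Political homophily in independence movements: analyzing and classifying social media users by national identity
CN111428092B (zh) Graph model-based precision marketing method for banks
JP7291100B2 (ja) Anomaly and change estimation method, program, and device using multiple posted time-series data
Wang et al. A Comparative Study on Contract Recommendation Model: Using Macao Mobile Phone Datasets
CN109274834B (zh) Method for identifying courier phone numbers based on call behavior
Caridi et al. A framework to approach problems of forensic anthropology using complex networks
He et al. Multi-dimensional boundary effects and regional economic integration: Evidence from the Yangtze River Economic Belt
EP3493082A1 (en) A method of exploring databases of time-stamped data in order to discover dependencies between the data and predict future trends
CN114387005A (zh) Graph classification-based method for identifying arbitrage gangs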

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22949178

Country of ref document: EP

Kind code of ref document: A1