CN112417076A

CN112417076A - Building personnel affiliation identification method based on big data mining technology

Info

Publication number: CN112417076A
Application number: CN202011330345.5A
Authority: CN
Inventors: 王彦青; 张清竹; 严莲; 郑紫薇; 赵海秀; 高梓枫; 王为强
Original assignee: EB INFORMATION TECHNOLOGY Ltd
Current assignee: Xinxun Digital Technology Hangzhou Co ltd
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2021-02-26
Anticipated expiration: 2040-11-24
Also published as: CN112417076B

Abstract

A building personnel affiliation identification method based on big data mining technology comprises the following steps: extracting base station data for each user's working period, determining the base station to which the user belongs during work, obtaining the building in which the user works, and dividing all users into different building user groups; building and training a building-user grouping model whose input is the user feature data of each building user group and whose output is a plurality of enterprise user groups into which the users are divided, the model working as follows: the enterprise similarity between every two users is calculated, a graph is then constructed with users as nodes and the enterprise similarity between users as edges, and the Louvain community discovery algorithm divides all users into a plurality of communities; and inputting the user feature data of the building user group to be identified into the building-user grouping model, which outputs the enterprise user group to which each user in that building user group belongs. The invention belongs to the field of communication and enables automatic identification of enterprise user groups in a building using user data and signaling data.

Description

Building personnel affiliation identification method based on big data mining technology

Technical Field

The invention relates to a building personnel affiliation identification method based on a big data mining technology, and belongs to the field of communication.

Background

Over the past thirty years, e-commerce has reached households across China and has become an irreplaceable form of industry. Currently, the mainstream and mature way of grouping users in the e-commerce market is based on user behavior, and its specific application is commodity recommendation. Enterprises strive to use big data to profile users, divide them into user groups with different characteristics, and provide dedicated marketing services to each group. Personalized recommendation can be broadly summarized into two types:

(a) Classify and label commodities, identify each user's categories of interest from behavior such as browsing and bookmarking, and recommend commodities of the same category to the user.

(b) Profile all users, apply different recommendation methods and content to different types of users, and recommend to users of the same type the products that other such users are interested in.

The above methods focus mainly on users' individual behavior and its influence over time. The grouping of users depends only on the characteristics of their network behavior, so the available information is limited and the dimensionality is relatively narrow. A large amount of information in current big data on user behavior remains to be discovered and applied, and the influence among group behavior characteristics and the dimensions for dividing user groups need comprehensive study. How to discover and apply new user grouping dimensions has therefore become a prominent topic.

Group users have long been the focus of enterprise customer maintenance. Unlike individual users, group users are convenient to maintain centrally, bring high benefit, and cost little to maintain. Research shows that customer concentration and an enterprise's financial benefit follow an inverted-U relationship: as customer concentration increases, the enterprise's financial benefit tends first to rise and then to fall. Reasonably developing enterprise group users and increasing customer concentration therefore helps improve the financial benefit of e-commerce companies and enterprises. Meanwhile, because users of the same group share similar characteristics, adjacent geographic locations and the like, managing existing group users in a unified manner reduces customer maintenance and supply costs. The clustering of customers has thus become a key point of enterprise customer maintenance.

Research also shows that the characteristics of the local social environment influence people's thinking and behavior, which is called the neighborhood effect. Groups have a great influence on individual behavior: sharing of online shopping can greatly increase users' implicit demand, the shopping behavior of adjacent groups influences each other, and product choices tend to converge. Analyzing the geographic concentration of user groups and the space they occupy therefore helps to discover the mutual influence among the shopping behavior of adjacent groups, assists e-commerce companies and enterprises in accurate recommendation and marketing, completes the reference dimensions for user recommendation, and makes deeper use of big data, thereby improving customer satisfaction and customer retention.

However, group affiliation is complex to define, so grouping users by group affiliation is also difficult; how to reasonably determine a user's group affiliation and whether the algorithm is effective are problems to be solved. How to make full use of user data and signaling data to divide users into groups by affiliated enterprise and geographic location, and thereby automatically identify the enterprise user groups in a building, has therefore become a technical problem of key concern to those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a building personnel affiliation identification method based on big data mining technology, which can make full use of user data and signaling data to group-divide users according to affiliation enterprises and geographic locations, thereby implementing automatic identification of enterprise user groups in a building.

In order to achieve the purpose, the invention provides a building personnel affiliation identification method based on big data mining technology, which comprises the following steps:

step one, setting a working period, extracting base station data of each user in the working period to determine a base station to which each user belongs during working according to the base station data, acquiring a home building of each user during working according to a building name contained in base station information, and finally dividing all users into different building user groups according to the home building of each user during working;

step two, building and training a building-user grouping model, wherein the input of the building-user grouping model is the characteristic data of all users in each building user group, the output of the building-user grouping model is a plurality of enterprise user groups formed by dividing all the users in the building user group, and the working flow of the building-user grouping model is as follows: calculating enterprise similarity between every two users according to input feature data of each user, and then constructing a graph by using a community discovery Louvain algorithm and taking each user as a node and the enterprise similarity between every two users as an edge, so that all users in a building user group are divided into a plurality of communities, wherein one community is an enterprise user group;

and step three, inputting the characteristic data of all users in the building user group to be identified into the trained building-user clustering model, and outputting and obtaining a plurality of enterprise user groups to which all the users in the building user group to be identified respectively belong.
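As background for step two, the Louvain algorithm searches for a partition of the graph that maximizes weighted modularity. The following minimal sketch computes only that objective for a given partition, so that candidate groupings can be compared; it is not the full Louvain procedure and not the patent's implementation. The function name `modularity`, the edge dictionary layout, and the toy graph in the usage note are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def modularity(edges: Dict[Tuple[Hashable, Hashable], float],
               community: Dict[Hashable, int]) -> float:
    """Weighted modularity Q of a partition of an undirected graph.

    edges maps an unordered user pair (u, v), u != v, to its edge weight
    (here: the enterprise similarity of the two users).
    community maps each user to a community label.
    """
    total = sum(edges.values())          # m: total edge weight
    if total == 0:
        return 0.0
    degree = defaultdict(float)          # weighted degree of each node
    internal = defaultdict(float)        # edge weight inside each community
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    comm_degree = defaultdict(float)     # summed degree of each community
    for node, k in degree.items():
        comm_degree[community[node]] += k
    # Q = sum over communities of L_c / m - (d_c / (2m))^2
    return sum(internal[c] / total - (comm_degree[c] / (2 * total)) ** 2
               for c in comm_degree)
```

For two triangles of users joined by a single edge (all weights 1), placing each triangle in its own community scores 5/14, while placing all six users in one community scores 0, so the split is preferred.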

Compared with the prior art, the invention has the following beneficial effects: because users with different enterprise affiliations are mixed in the same building, the invention studies the grouping and affiliation of users based on signaling data and big data mining technology, uses a graph-theoretic community discovery algorithm to divide the users in the same building into groups by enterprise affiliation and geographic location, and finally assigns each user to the enterprise to which the user belongs, thereby supporting accurate marketing and personalized recommendation and improving customer satisfaction.

Drawings

FIG. 1 is a flow chart of a building personnel affiliation identification method based on big data mining technology.

Fig. 2 is a specific flowchart of the step one in fig. 1, which extracts the base station data of each user in the working period to determine the base station to which each user belongs when working.

FIG. 3 is a specific flowchart of calculating, in step two of FIG. 1, the enterprise similarity between every two users according to the input feature data of each user.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.

As shown in fig. 1, the building personnel affiliation identification method based on big data mining technology of the present invention includes:

In step one, a user who is at home or at work generally stays in a specific place for a long time, so the time attached to a base station can be used as a feature for judging the user's state. The 24 hours of a day are first divided into 24 periods, numbered in order from hour 0 as T1, T2, ..., T24. According to research, and in order for the data to cover most people while keeping feature extraction accurate, T9-T12 and T14-T17 can be selected as the working periods. For each user, the base stations to which the user may belong during working time are then screened according to the time the user is attached to each base station in the working periods. The degree of membership expresses how far an element belongs to a fuzzy set and is a key issue in fuzzy pattern recognition; the invention converts each user's attribute vector for each staying base station into a membership vector by means of a membership function. Therefore, as shown in fig. 2, extracting the base station data of each user in the working period in step one of fig. 1, so as to determine from the base station data the base station to which each user belongs when working, may further include:

step 11, obtaining the base stations at which each user stays in the working period, and constructing an attribute vector of each user for each staying base station: $X_{ij} = (x_{ij1}, x_{ij2}, \ldots, x_{ijn})^T$, where $X_{ij}$ is the attribute vector of user i for its j-th staying base station, $x_{ij1}, x_{ij2}, \ldots, x_{ijn}$ are respectively the 1st, 2nd, ..., n-th base station data of the j-th staying base station of user i, and n is the total number of base station data items. The base station data include but are not limited to: the number of calls in the time period, the number of calls in the time period, the number of basic location updates in the time period, the number of periodic location updates, the number of short messages received in the time period, the number of short messages sent in the time period, the total communication time in the time period and the stay time in the time period. The following table is the base station data table for each staying base station in the working period of one user:

step 12, calculating, from the attribute vector of each user for each staying base station, the membership vector between each staying base station of the user and the standard working-state base station: $U_{ij} = (\mu_{ij1}, \mu_{ij2}, \ldots, \mu_{ijn})^T$, where $U_{ij}$ is the membership vector between the j-th staying base station of user i and the standard working-state base station, and each element of $U_{ij}$ is calculated as follows:

where $\mu_{ijz}$ is the z-th element of $U_{ij}$, $z \in [1, n]$, $x_{ijz}$ is the z-th base station data of user i for its j-th staying base station, $a_z$ is the standard value of the z-th base station data, and $\sigma_z$ is the standard deviation of the z-th base station data; the values of $a_z$ and $\sigma_z$ can be calculated from the mean of the z-th base station data over all staying base stations of all users in the sample data;

step 13, calculating the membership evaluation value of each user to each staying base station and the standard working state base station:

where $N_{ij}$ is the membership evaluation value between the j-th staying base station of user i and the standard working-state base station, and $\alpha_z$ is the weight corresponding to the z-th base station data, whose value is determined by training the building-user clustering model. The minimum is then selected from the membership evaluation values of all staying base stations of each user, and the staying base station corresponding to that minimum is the base station to which the user belongs when working.
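Steps 11 to 13 can be sketched as follows. The membership formula and the evaluation formula are not reproduced in this text, so the sketch substitutes assumed forms: each membership element is taken as the normalized deviation |x - a| / sigma from the standard working-state value, and the evaluation value is taken as the weighted sum of those elements. Both forms, together with all function names and sample numbers, are assumptions for illustration only. Only the final selection of the minimum evaluation value follows the description directly.

```python
from typing import Dict, List, Sequence


def membership_vector(x: Sequence[float], a: Sequence[float],
                      sigma: Sequence[float]) -> List[float]:
    """Assumed membership element: normalized deviation from the standard value."""
    return [abs(xz - az) / sz for xz, az, sz in zip(x, a, sigma)]


def evaluation_value(mu: Sequence[float], alpha: Sequence[float]) -> float:
    """Assumed evaluation value: weighted sum of the membership elements."""
    return sum(w * m for w, m in zip(alpha, mu))


def working_base_station(stations: Dict[str, Sequence[float]],
                         a: Sequence[float], sigma: Sequence[float],
                         alpha: Sequence[float]) -> str:
    """Return the staying base station with the minimum evaluation value.

    stations maps a base station id to the user's attribute vector for it.
    """
    scores = {
        station_id: evaluation_value(membership_vector(x, a, sigma), alpha)
        for station_id, x in stations.items()
    }
    return min(scores, key=scores.get)
```

With two attributes (for example a call count and a stay time in minutes), a station whose values lie close to the standard working-state values scores lower than one far from them and is therefore selected.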

The feature data of users can be obtained from different dimensions among the members of an enterprise, such as calls between them, payroll income, group building, and dining location. As shown in fig. 3, calculating the enterprise similarity between every two users according to the input feature data of each user in step two of fig. 1, described here taking users p and q as an example, may further include:

step 21, calculating the call feature similarity of the users p and q:

where $\theta_c$ is the weight of the c-th call feature, whose value may be determined by training the building-user clustering model,

are the attribute values of users p and q on the c-th call feature, and C is the number of call features, which may include but are not limited to: the total number of calls, the total call duration, the number of common contacts, and the number of calls with common contacts;

step 22, calculating the salary income characteristic similarity of the users p and q:

where $\delta_b$ is the weight of the b-th payroll income feature, whose value may be determined by training the building-user clustering model,

is the similarity of users p and q on the b-th payroll income feature, and B is the number of payroll income features, which may include but are not limited to: the three most frequently used bank short message interfaces, the fixed payment date, and the number of payments per month;

Since a payroll income feature may be discrete or continuous attribute data, take the b-th payroll income feature as an example. When the b-th payroll income feature is discrete attribute data, the similarity of users p and q on that feature is calculated as follows:

where the compared values are the b-th payroll income feature values of users p and q respectively. When the b-th payroll income feature is continuous attribute data, the similarity is calculated as follows:

where uban_max and uban_min are respectively the maximum and minimum values of the b-th payroll income feature, set according to actual business needs;

step 23, calculating the group-building feature similarity of users p and q:

For each historical holiday, the TM base stations with the longest stay time in a given period are extracted for each of users p and q and sorted by stay time from longest to shortest. The staying base stations of users p and q are then compared position by position in the ranking to obtain the number of identical staying base stations, and the group-building feature similarity value of users p and q on that holiday is the ratio of that number to TM. Finally, the group-building feature similarity of users p and q is the average of their group-building feature similarity values over all historical holidays. TM may be set according to actual service needs. For example, the three base stations with the longest stay time during T13-T17 of each holiday are selected; if on a given historical holiday the first-ranked base station of user p is the same as that of user q but the second- and third-ranked base stations differ, the group-building feature similarity value of users p and q on that holiday is 1/3;

step 24, calculating the dinner party feature similarity of users p and q:

For each working day in a statistical period, the base stations at which users p and q stay longest during a given time period are compared, and the number of days on which they are the same is counted. The dinner party feature similarity of users p and q is the ratio of that number of days to the total number of working days in the statistical period;

step 25, calculating the enterprise similarity of the users p and q:

where $\rho_1$, $\rho_2$, $\rho_3$ and $\rho_4$ are respectively the weights of the call feature similarity, the payroll income feature similarity, the group-building feature similarity and the dinner party feature similarity, whose values are determined by training the building-user clustering model.
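Steps 23 to 25 can be sketched as follows. The two ratio-based similarities follow the wording of steps 23 and 24. The combination in step 25 is not reproduced in this text, so the sketch assumes a weighted sum of the four similarities, which is an assumption rather than the patent's confirmed formula. Function names, the default weights and the sample data are likewise illustrative.

```python
from typing import Sequence


def group_building_similarity(p_holidays: Sequence[Sequence[str]],
                              q_holidays: Sequence[Sequence[str]],
                              tm: int) -> float:
    """Step 23. Each item lists one holiday's base stations for a user,
    already sorted by stay time from longest to shortest."""
    values = []
    for p_rank, q_rank in zip(p_holidays, q_holidays):
        # Compare the two rankings position by position over the top TM entries.
        same = sum(1 for a, b in zip(p_rank[:tm], q_rank[:tm]) if a == b)
        values.append(same / tm)
    return sum(values) / len(values) if values else 0.0


def dinner_party_similarity(p_days: Sequence[str],
                            q_days: Sequence[str]) -> float:
    """Step 24. Each item is the longest-stay base station of one working day
    in the chosen time period."""
    if not p_days:
        return 0.0
    same_days = sum(1 for a, b in zip(p_days, q_days) if a == b)
    return same_days / len(p_days)


def enterprise_similarity(call: float, payroll: float, group_building: float,
                          dinner: float,
                          rho: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Step 25, ASSUMED form: weighted sum of the four feature similarities."""
    return (rho[0] * call + rho[1] * payroll
            + rho[2] * group_building + rho[3] * dinner)
```

The example in step 23 is reproduced by `group_building_similarity([["A", "B", "C"]], [["A", "C", "D"]], 3)`, which gives 1/3 because only the first-ranked base station matches.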

After the building-user clustering model has divided all users in a building user group into a plurality of communities using the Louvain community discovery algorithm, two situations may remain: personnel of several enterprises may be present in the same enterprise user group, and personnel of the same enterprise may be spread across several enterprise user groups. Intra-group splitting and inter-group aggregation can be applied to these situations so that each single enterprise user group in the building is accurately identified, wherein:

1) For the situation in which personnel of several enterprises are present in the same enterprise user group, the method further comprises:

step A1, according to the enterprise similarity between every two users in each enterprise user group of the building user group, selecting from each enterprise user group a number of users with low enterprise similarity as reselected users, forming a reselected user group from all reselected users, and deleting the reselected users from the enterprise user groups to which they belonged;

step A2, calculating the similarity between each user in the reselecting user group and each enterprise user group in the building user group, wherein the similarity between the user and the enterprise user group is the mean value of the enterprise similarities between the user and all the users in the enterprise user group, selecting the enterprise user group with the highest similarity for each user in the reselecting user group, then judging whether the similarity between each user and the selected enterprise user group is greater than the enterprise similarity between a certain number of users in the selected enterprise user group, and if so, adding the user into the selected enterprise user group; if not, a new enterprise user group is constructed for the user, and the user is added into the new enterprise user group.
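A minimal sketch of steps A1 and A2 follows. The description leaves open how "low" similarity is decided in step A1 and exactly which in-group similarity a user must exceed in step A2; this sketch assumes two fixed thresholds, `low` and `join`, as stand-ins for both. The function names, the thresholds and the callable `sim` (returning the enterprise similarity of two users) are hypothetical.

```python
from typing import Callable, Iterable, List, Set

Sim = Callable[[str, str], float]


def user_group_similarity(user: str, group: Iterable[str], sim: Sim) -> float:
    """Mean enterprise similarity between a user and the other users of a group."""
    others = [v for v in group if v != user]
    if not others:
        return 0.0
    return sum(sim(user, v) for v in others) / len(others)


def split_and_reassign(groups: List[Set[str]], sim: Sim,
                       low: float, join: float) -> List[Set[str]]:
    """Intra-group splitting (A1) followed by reassignment (A2)."""
    reselected: List[str] = []
    kept: List[Set[str]] = []
    # Step A1: pull out users whose mean similarity to their own group is low.
    for group in groups:
        out = {u for u in group
               if len(group) > 1 and user_group_similarity(u, group, sim) < low}
        reselected.extend(sorted(out))
        if group - out:
            kept.append(set(group) - out)
    # Step A2: send each reselected user to the most similar group,
    # or open a new group when no group is similar enough.
    for user in reselected:
        best = max(kept, key=lambda g: user_group_similarity(user, g, sim),
                   default=None)
        if best is not None and user_group_similarity(user, best, sim) > join:
            best.add(user)
        else:
            kept.append({user})
    return kept
```

Because new single-user groups are appended to the candidate list as they are created, later reselected users from the same enterprise can join them.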

2) Aiming at the condition that the same enterprise personnel are distributed in a plurality of enterprise user groups, the method also comprises the following steps:

step B1, calculating the similarity between every two enterprise user groups in the building, wherein the similarity between the two enterprise user groups is the mean value of the enterprise similarities between all the users in the two enterprise user groups, and then combining a plurality of enterprise user groups with high similarity into one enterprise user group;

and step B2, judging whether the number of users of each enterprise user group is smaller than the threshold number of people, if so, calculating the similarity between the enterprise user group and other enterprise user groups in the building, and merging the enterprise user group into other enterprise user groups with the highest similarity.
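Steps B1 and B2 can be sketched as follows. The sketch reads "the mean value of the enterprise similarities between all the users in the two enterprise user groups" as the mean over cross-group user pairs, and it treats "high similarity" as exceeding a fixed `merge_threshold`; both readings are assumptions. The function names and the parameters `merge_threshold` and `min_size` are hypothetical.

```python
from typing import Callable, Iterable, List, Set

Sim = Callable[[str, str], float]


def group_similarity(g1: Iterable[str], g2: Iterable[str], sim: Sim) -> float:
    """Mean enterprise similarity over all cross-group user pairs (assumed reading)."""
    pairs = [(u, v) for u in g1 for v in g2]
    return sum(sim(u, v) for u, v in pairs) / len(pairs)


def merge_groups(groups: List[Set[str]], sim: Sim,
                 merge_threshold: float, min_size: int) -> List[Set[str]]:
    """Inter-group aggregation: merge similar groups (B1), then absorb small ones (B2)."""
    merged = [set(g) for g in groups if g]
    # Step B1: repeatedly merge the most similar pair while it exceeds the threshold.
    while len(merged) > 1:
        score, i, j = max(
            (group_similarity(merged[i], merged[j], sim), i, j)
            for i in range(len(merged))
            for j in range(i + 1, len(merged))
        )
        if score <= merge_threshold:
            break
        merged[i] |= merged[j]
        del merged[j]
    # Step B2: fold each undersized group into its most similar neighbour.
    while len(merged) > 1:
        small = next((k for k, g in enumerate(merged) if len(g) < min_size), None)
        if small is None:
            break
        group = merged.pop(small)
        target = max(merged, key=lambda g: group_similarity(group, g, sim))
        target |= group
    return merged
```

Each pass removes one group from the list, so both loops terminate after at most one iteration per initial group.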

After the weight parameters in the building-user clustering model have been determined by training the model in step two, the model effect can be evaluated with test samples, and the method further comprises:

step C1, inputting the characteristic data of all users in the tested building user group into the trained building-user clustering model, and outputting and obtaining a plurality of enterprise user groups to which all users in the tested building user group respectively belong;

step C2, acquiring the names of the users in each enterprise user group in the test building user group and the enterprise to which the users belong, and selecting the enterprise name with the largest number of users for each enterprise user group as the name of each enterprise user group;

step C3, calculating the clustering accuracy and the mixing rate:

where Accuracy is the clustering accuracy, Mess is the clustering mixing rate, $N_{uc}$ is the number of users grouped correctly, $N_u$ is the number of users tested, NC is the number of enterprises tested, $x \in [1, NC]$, $m_x$ is the number of users in the x-th enterprise user group that do not belong to the enterprise named by that group, and $M_x$ is the number of users in the x-th enterprise user group;

and step C4, judging whether the calculated clustering accuracy is greater than the accuracy threshold and the clustering mixing rate is less than the mixing rate threshold, if not, indicating that the model effect does not meet the requirement, and continuing to adjust the model.
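Steps C2 and C3 can be sketched as follows. The formulas for the two metrics are not reproduced in this text; based on the variable definitions, the sketch assumes that the accuracy is the share of correctly grouped users and that the mixing rate is the mean, over groups, of the share of users that do not belong to the group's named enterprise. These forms, the function name `evaluate_clustering` and the sample data are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def evaluate_clustering(groups: Sequence[Sequence[str]],
                        true_enterprise: Dict[str, str]) -> Tuple[float, float]:
    """Return (accuracy, mixing_rate) for non-empty enterprise user groups."""
    n_users = sum(len(g) for g in groups)
    n_correct = 0
    mixing: List[float] = []
    for group in groups:
        # Step C2: name the group after the enterprise with the most users in it.
        _name, majority = Counter(true_enterprise[u] for u in group).most_common(1)[0]
        n_correct += majority
        # Share of users not belonging to the named enterprise (m_x / M_x).
        mixing.append((len(group) - majority) / len(group))
    accuracy = n_correct / n_users             # assumed: N_uc / N_u
    mixing_rate = sum(mixing) / len(mixing)    # assumed: mean of m_x / M_x
    return accuracy, mixing_rate
```

The two returned values can then be compared with the accuracy threshold and the mixing rate threshold of step C4.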

If the model effect does not meet the requirements, model optimization can be carried out by perfecting a similarity measurement characteristic system and perfecting subdivision rules. On one hand, more non-contact characteristics are introduced to describe the similarity between users without direct contact; on the other hand, when the user grouping and the user subdivision are carried out, rules are defined to describe and divide users in different departments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A building personnel affiliation identification method based on big data mining technology is characterized by comprising the following steps:

2. The method of claim 1, wherein in step one, the base station data of each user in the working period is extracted to determine the base station to which each user works, and further comprising:

step 11, obtaining the base stations at which each user stays in the working period, and constructing an attribute vector of each user for each staying base station: $X_{ij} = (x_{ij1}, x_{ij2}, \ldots, x_{ijn})^T$, where $X_{ij}$ is the attribute vector of user i for its j-th staying base station, $x_{ij1}, x_{ij2}, \ldots, x_{ijn}$ are respectively the 1st, 2nd, ..., n-th base station data of the j-th staying base station of user i, n is the total number of base station data items, and the base station data comprise: the number of calls in the time period, the number of calls in the time period, the number of basic location updates in the time period, the number of periodic location updates, the number of short messages received in the time period, the number of short messages sent in the time period, the total communication time in the time period and the stay time in the time period;

step 12, calculating, from the attribute vector of each user for each staying base station, the membership vector between each staying base station of the user and the standard working-state base station: $U_{ij} = (\mu_{ij1}, \mu_{ij2}, \ldots, \mu_{ijn})^T$, where $U_{ij}$ is the membership vector between the j-th staying base station of user i and the standard working-state base station, and each element of $U_{ij}$ is calculated as follows:

where $\mu_{ijz}$ is the z-th element of $U_{ij}$, $z \in [1, n]$, $x_{ijz}$ is the z-th base station data of user i for its j-th staying base station, $a_z$ is the standard value of the z-th base station data, and $\sigma_z$ is the standard deviation of the z-th base station data;

where $N_{ij}$ is the membership evaluation value between the j-th staying base station of user i and the standard working-state base station, and $\alpha_z$ is the weight corresponding to the z-th base station data; the minimum is then selected from the membership evaluation values of all staying base stations of each user, and the staying base station corresponding to that minimum is the base station to which the user belongs when working.

3. The method according to claim 1, wherein in the second step, the enterprise similarity between every two users is calculated according to the feature data input to every user, which is described by taking users p and q as examples, and further comprising:

step 21, calculating the call feature similarity of the users p and q:

where $\theta_c$ is the weight of the c-th call feature,

are the attribute values of users p and q on the c-th call feature, C is the number of call features, and the call features comprise: the total number of calls, the total call duration, the number of common contacts, and the number of calls with common contacts;

where $\delta_b$ is the weight of the b-th payroll income feature,

is the similarity of users p and q on the b-th payroll income feature, B is the number of payroll income features, and the payroll income features comprise: the three most frequently used bank short message interfaces, the fixed payment date, and the number of payments per month;

For each historical holiday, the TM base stations with the longest stay time in a given period are extracted for each of users p and q and sorted by stay time from longest to shortest; the staying base stations of users p and q are then compared position by position in the ranking to obtain the number of identical staying base stations; the group-building feature similarity value of users p and q on that holiday is the ratio of that number to TM; and finally the group-building feature similarity of users p and q is the average of their group-building feature similarity values over all historical holidays;

step 24, calculating the dinner party feature similarity of the users p and q

step 25, calculating the enterprise similarity of the users p and q:

where $\rho_1$, $\rho_2$, $\rho_3$ and $\rho_4$ are respectively the weights of the call feature similarity, the payroll income feature similarity, the group-building feature similarity and the dinner party feature similarity.

4. The method of claim 3, wherein, in step 22, when the b-th payroll income feature is discrete attribute data, the similarity of users p and q on that feature is calculated as follows:

where the compared values are the b-th payroll income feature values of users p and q respectively; when the b-th payroll income feature is continuous attribute data, the similarity is calculated as follows:

where uban_max and uban_min are respectively the maximum and minimum values of the b-th payroll income feature.

5. The method as claimed in claim 1, wherein the building-user clustering model further comprises, after dividing all users in the building user group into a plurality of communities by using a community discovery Louvain algorithm:

6. The method as claimed in claim 1, wherein the building-user clustering model further comprises, after dividing all users in the building user group into a plurality of communities by using a community discovery Louvain algorithm:

step B1, calculating the similarity between every two enterprise user groups in the building user group, wherein the similarity between the two enterprise user groups is the mean value of the enterprise similarities between all the users in the two enterprise user groups, and then combining a plurality of enterprise user groups with high similarity into one enterprise user group;

and step B2, judging whether the number of users of each enterprise user group is smaller than the threshold number of people one by one, if so, calculating the similarity between the enterprise user group and other enterprise user groups in the building user group, and merging the enterprise user group into other enterprise user groups with the highest similarity.

7. The method of claim 1, wherein, after the weight parameters in the building-user clustering model have been determined by training the model in step two, the method further comprises:

step C3, calculating the clustering accuracy and the mixing rate: