CN111553390A

CN111553390A - User classification method and device, computer equipment and storage medium

Info

Publication number: CN111553390A
Application number: CN202010273736.1A
Authority: CN
Inventors: 孔清扬
Original assignee: OneConnect Financial Technology Co Ltd Shanghai
Current assignee: OneConnect Smart Technology Co Ltd; OneConnect Financial Technology Co Ltd Shanghai
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-08-18
Also published as: WO2021203854A1

Abstract

The application relates to a user classification method and device based on intelligent decision-making, computer equipment and a storage medium. The method comprises the following steps: acquiring a cluster distribution characteristic diagram of an SOM network trained according to a sample data set, wherein the sample data set comprises sample user data; determining the cluster number of the trained SOM network based on the cluster distribution characteristic diagram, adjusting the initial node number of the original SOM network according to the cluster number, determining the node number of the available SOM network, and obtaining the available SOM network; performing cluster analysis on a data set to be processed according to the available SOM network and determining a clustering result of the data set to be processed, wherein the data set to be processed is generated according to user data to be processed; and obtaining a corresponding user classification result according to the clustering result. The method uses the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the whole clustering process and enables accurate classification of users.

Description

User classification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a user classification method and apparatus, a computer device, and a storage medium.

Background

With the development of computer technology and the wide application of the internet, mobile terminals and the like in the life and work of people, more and more people choose to acquire data and information wanted by themselves through a network. Since different users have different internet access behaviors, in the financial industry or the telecommunication industry, accurate marketing can be realized by analyzing the user behaviors. The key point of realizing accurate marketing is effective user information segmentation, and the key problem of user information segmentation is how to dig out the characteristics of the user hidden in the data and classify according to the dug-out characteristics to obtain the classification condition of the user information.

The traditional user information segmentation is usually performed according to a one-dimensional attribute of users. For example, in the financial industry, users can be divided into high-end, middle-end and low-end customers according to their assets, and target groups for marketing activities can then be selected or discarded according to the marketing resource budget. However, with the increasing diversification of user demands and the continuous innovation of enterprise products, even users in the same tier, for example high-end users, differ markedly in their demands for the same product or service. Therefore, the conventional segmentation based on one-dimensional user information cannot reflect the characteristics and information of users in multiple aspects, cannot dynamically track changes in user behavior, and cannot meet the diversified requirements of users for products or services, so the resulting classification of user information is inaccurate.

Disclosure of Invention

In view of the above, it is necessary to provide a user classification method, apparatus, computer device and storage medium capable of improving user classification accuracy.

A method of user classification, the method comprising:

acquiring a clustering distribution characteristic diagram of the SOM network trained according to the sample data set; the sample data set comprises sample user data, the SOM network is a self-organizing mapping network, and the cluster distribution map is used for reflecting the number and distribution of clusters included in the trained SOM network;

determining the clustering number of the trained SOM network based on the clustering distribution characteristic diagram;

adjusting the initial node number of the original SOM network according to the cluster number, and determining the node number of the available SOM network to obtain the available SOM network; the node number of the available SOM network represents the node number which is obtained by adjusting the initial node number and is consistent with the cluster number;

performing clustering analysis on the data set to be processed according to the available SOM network, and determining a clustering result of the data set to be processed; the data set to be processed is generated according to the user data to be processed;

and obtaining a corresponding user classification result according to the clustering result.

In one embodiment, the method further comprises:

acquiring each node of an original SOM network output layer, and initializing each node;

acquiring a preset training end condition; the training end condition is that the weight error between two consecutive rounds of training reaches a preset threshold value;

acquiring each sample user data in a sample data set, and carrying out normalization processing on each sample user data;

determining first sample data from the sample user data after normalization processing, and determining the best matching node of the first sample data from each node of an original SOM network output layer;

acquiring any topological neighborhood of the optimal matching node, and determining an optimal matching neighborhood taking the optimal matching node as a center from the topological neighborhoods;

and returning to the step of determining the first sample data from the sample user data after the normalization processing until the training end condition is reached, and obtaining the SOM network trained according to the sample data set.

In one embodiment, the obtaining any topological neighborhood of the best matching node and determining a best matching neighborhood centered around the best matching node from the topological neighborhoods includes:

acquiring any topological neighborhood of the optimal matching node, wherein the topological neighborhood is an initial neighborhood with a preset range;

and according to the training time and the position of the optimal matching node in the initial neighborhood, shrinking the initial neighborhood, and determining the optimal matching neighborhood taking the optimal matching node as the center.

In one embodiment, the method further comprises: and adjusting the weight values of the optimal matching node and each node in the corresponding initial neighborhood by adopting the following formula:

m_i(t+1) = m_i(t) + α(t) h_ci(t) [x(t) - m_i(t)];

where t is the training step, i denotes a node, m_i(t) denotes the weight of node i at step t, α(t) denotes the learning rate, a monotonically decreasing learning coefficient with 0 < α(t) < 1, h_ci(t) is the neighborhood function, and x(t) denotes the input vector.
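By way of non-limiting illustration, one application of the above update rule can be sketched in Python as follows. The application only states that h_ci(t) is a neighborhood function; the Gaussian form used here, the two-dimensional grid coordinates, and the names `update_weights`, `grid_pos` and `sigma` are assumptions made for the sketch.

```python
import math

def update_weights(weights, grid_pos, bmu, x, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + alpha * h_ci * (x - m_i) to every node.

    weights  : list of weight vectors, one per output node
    grid_pos : list of (row, col) grid coordinates, one per node
    bmu      : index c of the best matching node
    x        : input sample vector
    alpha    : learning rate, 0 < alpha < 1
    sigma    : width of the (assumed) Gaussian neighborhood
    """
    center = grid_pos[bmu]
    new_weights = []
    for m_i, pos in zip(weights, grid_pos):
        # Squared grid distance between node i and the best matching node c.
        d2 = (pos[0] - center[0]) ** 2 + (pos[1] - center[1]) ** 2
        h_ci = math.exp(-d2 / (2.0 * sigma ** 2))
        new_weights.append([m + alpha * h_ci * (xv - m) for m, xv in zip(m_i, x)])
    return new_weights

# Toy values chosen only for illustration.
weights = [[0.0, 0.0], [1.0, 1.0]]
grid_pos = [(0, 0), (0, 1)]
updated = update_weights(weights, grid_pos, bmu=0, x=[1.0, 0.0], alpha=0.5, sigma=1.0)
```

In this sketch the best matching node moves halfway toward the sample, while its grid neighbor moves by a smaller amount because h_ci is less than 1 away from the center.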

In one embodiment, before performing cluster analysis on the data set to be processed according to the available SOM network and determining a clustering result of the data set to be processed, the method further includes:

acquiring user data to be processed, and determining the data type of each user data to be processed; the data types comprise continuous variable data types and category variable data types;

according to the data type, performing data preprocessing on the user data to be processed to obtain initial user data; the preprocessing comprises data normalization processing corresponding to the continuous variable data type and label coding processing corresponding to the category variable data type;

and performing missing value filling processing on the initial user data to generate a data set to be processed.

In one embodiment, the performing missing value padding processing on the initial user data to generate a to-be-processed data set includes:

determining a data loss type of the initial user data; the data missing types comprise information missing and behavior missing, and the information missing comprises continuous variable data missing and category variable data missing;

determining the mean value of initial user data belonging to the continuous variable data type, and performing missing value filling on the initial user data belonging to the continuous variable data missing according to the mean value to generate a data set to be processed;

or

determining a mode of initial user data belonging to the category variable data type, and performing missing value filling on the initial user data that belongs to the category variable data missing according to the mode to generate a data set to be processed;

or

creating a new constant, and filling missing values of initial user data belonging to behavior missing according to the constant to generate a data set to be processed.
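The three filling strategies described above (mean for continuous variables, mode for category variables, a new constant for missing behavior) may be sketched as follows. This is a minimal illustration only: the column names, the use of `None` to mark a missing value, and the constant `BEHAVIOR_MISSING = -1` are hypothetical and are not specified in the application.

```python
from collections import Counter
from statistics import mean

BEHAVIOR_MISSING = -1  # hypothetical constant marking "no behavior recorded"

def fill_missing(rows, continuous, categorical, behavioral):
    """Return a copy of rows with None values filled column by column.

    continuous columns  -> column mean
    categorical columns -> column mode
    behavioral columns  -> the constant BEHAVIOR_MISSING
    """
    filled = [dict(r) for r in rows]
    for col in continuous:
        col_mean = mean(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    for col in categorical:
        present = [r[col] for r in rows if r[col] is not None]
        col_mode = Counter(present).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = col_mode
    for col in behavioral:
        for r in filled:
            if r[col] is None:
                r[col] = BEHAVIOR_MISSING
    return filled

# Made-up records for illustration only.
rows = [
    {"age": 20, "gender": 1, "product_type": 2},
    {"age": None, "gender": 1, "product_type": None},
    {"age": 40, "gender": None, "product_type": 0},
]
filled = fill_missing(rows, ["age"], ["gender"], ["product_type"])
```

Using a separate constant for missing behavior keeps "the user never did this" distinguishable from an ordinary category value, which is one plausible reading of the third branch above.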

An apparatus for user classification, the apparatus comprising:

the cluster distribution characteristic map acquisition module is used for acquiring a cluster distribution characteristic map of the SOM network trained according to the sample data set; the sample data set comprises sample user data, the SOM network is a self-organizing mapping network, and the cluster distribution map is used for reflecting the number and distribution of clusters included in the trained SOM network;

the cluster number determining module is used for determining the cluster number of the trained SOM network based on the cluster distribution characteristic diagram;

the available SOM network generation module is used for adjusting the initial node number of the original SOM network according to the clustering number, determining the node number of the available SOM network and obtaining the available SOM network; the node number of the available SOM network represents the node number which is obtained by adjusting the initial node number and is consistent with the cluster number;

the cluster result determining module is used for carrying out cluster analysis on the data set to be processed according to the available SOM network and determining the cluster result of the data set to be processed; the data set to be processed is generated according to the user data to be processed;

and the user classification result generation module is used for obtaining a corresponding user classification result according to the clustering result.

In one embodiment, the apparatus further comprises:

the node acquisition module is used for acquiring each node of an original SOM network output layer and initializing each node;

the training end condition acquisition module is used for acquiring a preset training end condition;

the sample user data acquisition module is used for acquiring each sample user data in the sample data set and carrying out normalization processing on each sample user data;

the optimal matching node determining module is used for determining first sample data from the sample user data after normalization processing and determining the optimal matching node of the first sample data from each node of the output layer of the original SOM network;

the optimal matching neighborhood determining module is used for acquiring any topological neighborhood of the optimal matching node and determining an optimal matching neighborhood taking the optimal matching node as a center from the topological neighborhood;

and the trained SOM network generation module is used for returning the step of determining the first sample data from the sample user data after the normalization processing until a training end condition is reached, and obtaining the SOM network trained according to the sample data set.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

According to the user classification method, the user classification device, the computer equipment and the storage medium, the cluster distribution characteristic diagram of the SOM network trained according to the sample data set is obtained, wherein the sample data set comprises sample user data, and the cluster number of the trained SOM network is determined based on the cluster distribution characteristic diagram. The initial node number of the original SOM network is adjusted according to the cluster number, the node number of the available SOM network is determined, and the available SOM network is obtained, so that cluster analysis is performed on the data set to be processed according to the available SOM network and the clustering result of the data set to be processed is determined, wherein the data set to be processed is generated according to the user data to be processed. A corresponding user classification result is obtained according to the clustering result. The method and the device use the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, improve the efficiency and accuracy of the whole clustering process, and realize accurate classification of users.

Drawings

FIG. 1 is a diagram illustrating an exemplary user classification method;

FIG. 2 is a flow diagram that illustrates a method for user classification in one embodiment;

FIG. 3 is a diagram illustrating cluster distribution characteristics of an SOM network in one embodiment;

FIG. 4 is a flowchart illustrating a user classification method according to another embodiment;

FIG. 5 is a flowchart illustrating a user classification method according to still another embodiment;

FIG. 6 is a block diagram showing the structure of a user classifying device in one embodiment;

FIG. 7 is a block diagram of another embodiment of a user classification device;

FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The user classification method provided by the present application can be applied to the application environment shown in fig. 1, including the terminal 102 and the server 104, and particularly can be applied to the server 104, where the terminal 102 and the server 104 communicate through a network. The server 104 determines the number of clusters of the trained SOM network by obtaining the cluster distribution feature map of the SOM network trained according to the sample data set and based on the cluster distribution feature map. The sample data set comprises sample user data, and the clustering distribution map is used for reflecting the number and distribution of clustering clusters included in the trained SOM network. The server 104 adjusts the initial node number of the original SOM network according to the cluster number, determines the node number of the available SOM network, and obtains the available SOM network, wherein the node number of the available SOM network represents the node number which is obtained by adjusting the initial node number and is consistent with the cluster number. And performing clustering analysis on the data set to be processed according to the available SOM network to determine a clustering result of the data set to be processed. Wherein the data set to be processed is generated from the user data to be processed. Thereby obtaining a user classification result which can correspond to the clustering result and sending the user classification result to the terminal 102. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.

In one embodiment, as shown in fig. 2, a user classification method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:

step S202, a cluster distribution characteristic diagram of the SOM network trained according to a sample data set is obtained, wherein the sample data set comprises sample user data.

Specifically, an original SOM network is trained with a sample data set obtained from sample user data to obtain a trained SOM network, and the cluster distribution characteristic diagram of the trained SOM network is obtained. The SOM network is a self-organizing mapping network. Clustering means that tuples with the same attribute, namely the same value on a cluster code, are placed together in contiguous physical blocks in order to improve the query speed of a certain attribute or attribute group. The cluster distribution characteristic diagram reflects the number of clusters included in the trained SOM network and the distribution of each cluster.

Furthermore, because the SOM network structure emphasizes the adjacency relation between cluster center points, the correlation between adjacent clusters is stronger. The clusters can therefore be analyzed according to color boundaries: regions with similar colors are selected to form one cluster, and a plurality of clusters are formed according to a plurality of different colors, so that the cluster distribution characteristic diagram of the trained SOM network is obtained.
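The application does not state how the colors of the diagram are computed. One widely used visualization with the described property is the unified distance matrix (U-matrix), in which each node is shaded by the average distance between its weight vector and those of its grid neighbors; the sketch below assumes that technique and is not asserted to be the one used here. The function name `u_matrix` and the toy weights are illustrative.

```python
import math

def u_matrix(weights):
    """weights[r][c] is the weight vector of the output node at (r, c).

    Returns a grid holding, for each node, the mean Euclidean distance to
    its 4-connected grid neighbors. Low values indicate the interior of a
    cluster; high values indicate a boundary between clusters.
    """
    rows, cols = len(weights), len(weights[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            out[r][c] = sum(dists) / len(dists) if dists else 0.0
    return out

# A 1 x 4 output layer whose nodes fall into two groups (toy values).
weights = [[[0.0], [0.1], [5.0], [5.1]]]
u = u_matrix(weights)
```

In this toy example the two middle nodes receive large values because they sit on either side of the gap between the groups, which is the kind of boundary a color map would make visible.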

In this embodiment, referring to fig. 3, a plurality of cluster clusters may be formed according to the trained SOM network, including a first cluster 301, a second cluster 302, a third cluster 303, a fourth cluster 304, a fifth cluster 305, and a sixth cluster 306, to obtain a cluster distribution characteristic diagram of the trained SOM network.

SOM denotes a self-organizing mapping (Self-Organizing Map) network, an artificial neural network that simulates the way the human brain processes signals. It can map an arbitrary input pattern onto a one-dimensional, two-dimensional or even higher-dimensional discrete graph on the output layer while keeping the topological structure unchanged. The network consists of an input layer and an output layer (competition layer). The number of neurons in the input layer is n, the output layer is a one-dimensional or two-dimensional planar array composed of m neurons, and the network is fully connected, that is, each input node is connected to all output nodes. Through repeated learning of the input patterns, the SOM network makes the distribution of the weight vectors tend to agree with the probability distribution of the input patterns, which is referred to as probability preservation. The neurons of the output layer compete for the opportunity to respond to an input pattern, and the weights related to the winning neuron are adjusted in the direction more favorable to its winning. That is, with the winning neuron as the center, excitatory lateral feedback is exerted on nearby neurons and inhibitory lateral feedback on distant neurons: near neighbors excite each other and distant neighbors inhibit each other.

And step S204, determining the clustering number of the trained SOM network based on the clustering distribution characteristic diagram.

Specifically, the cluster distribution characteristic diagram is analyzed according to its color boundaries. When the color boundaries of the output picture are clear, they reflect the number of categories, and the number of categories is the cluster number. Regions with similar colors are selected to form one cluster, and a plurality of clusters are formed according to a plurality of different colors.

In this embodiment, referring to fig. 3, 6 cluster clusters can be obtained according to different fillings or color distributions by analyzing a cluster distribution characteristic diagram of the trained SOM network, and are respectively a first cluster 301, a second cluster 302, a third cluster 303, a fourth cluster 304, a fifth cluster 305, and a sixth cluster 306, so as to determine that the number of clusters of the trained SOM network is 6.

Further, for some complex data sets the cluster number cannot be determined from the cluster distribution characteristic diagram. In that case, a plurality of SOM networks are designed according to the possible numbers of categories, cluster analysis is performed with each of them, the silhouette coefficient of each SOM network is calculated, and the specific cluster number of the trained SOM network is determined from these coefficients.
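The comparison of candidate networks may be illustrated with the mean silhouette coefficient, computed for each point as (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. The following is a minimal sketch under stated assumptions: Euclidean distance is used, the candidate label lists are hard-coded stand-ins for the outputs of differently sized SOM networks, and choosing the candidate with the highest score is an assumed selection rule.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a coefficient of 0
        # The distance from p to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy one-dimensional data and hypothetical label assignments.
points = [[0.0], [1.0], [10.0], [11.0]]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

Here the two-cluster assignment scores higher than the three-cluster one, so 2 would be taken as the cluster number for this toy data.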

And S206, adjusting the initial node number of the original SOM network according to the cluster number, and determining the node number of the available SOM network to obtain the available SOM network.

Specifically, the initial node number of the original SOM network is obtained, and the initial node number of the original SOM network is adjusted according to the cluster number to obtain the available SOM network node number. The node number of the available SOM network represents the node number which is obtained by adjusting the initial node number and is consistent with the cluster number. The initial node number of the original SOM network is required to be adjusted under the condition that the initial node number of the original SOM network does not accord with the cluster number, when the adjusted node number is consistent with the cluster number, the available SOM network nodes are obtained, and the available SOM network is formed according to the available SOM network nodes.

And S208, performing cluster analysis on the data set to be processed according to the available SOM network, determining a cluster result of the data set to be processed, and generating the data set to be processed according to the user data to be processed.

Specifically, a to-be-processed data set generated according to-be-processed user data is obtained, and clustering analysis is performed on the to-be-processed data based on an available SOM network, so that a clustering result of the to-be-processed data is obtained. The clustering result comprises a plurality of clusters and the composition condition of each different cluster.

And step S210, obtaining a corresponding user classification result according to the clustering result.

Specifically, the clustering result is analyzed, the number of clustering clusters included in the clustering result and the size of each clustering cluster are determined, statistical indexes of each clustering cluster in each dimension are calculated, and user requirements in different clustering clusters are determined based on the statistical indexes.

The clustering result includes the number of clusters and the sizes of the different clusters. For example, the clustering result includes 10 clusters, namely cluster a, cluster b, cluster c and so on, where cluster a contains 10,000 customers in total, cluster b contains 5,000 customers, and so on.

Further, the statistical indexes of each cluster in every dimension are calculated. For example, in cluster a the age range is 18 to 25, 65% of the users have a high school education, the average amount of purchased products is 10,000 yuan, and 90% of the users purchase small-amount financing products. Based on the calculated statistical indexes, the user requirements in the different clusters can be determined. For example, the statistical indexes show that cluster a consists of young people with a demand for small-amount financing, so small-amount financing products can be pushed to the users in cluster a, thereby realizing precision marketing.
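The per-cluster statistics described above may be computed, for example, as in the following sketch. The field names (`age`, `amount`, `product`), the particular indexes chosen, and the records themselves are made up for illustration and do not come from the application.

```python
from collections import Counter, defaultdict
from statistics import mean

def summarize_clusters(users, labels):
    """Group users by cluster label and compute simple per-cluster indexes."""
    groups = defaultdict(list)
    for user, label in zip(users, labels):
        groups[label].append(user)
    summary = {}
    for label, members in groups.items():
        ages = [m["age"] for m in members]
        product_counts = Counter(m["product"] for m in members)
        top_product, top_count = product_counts.most_common(1)[0]
        summary[label] = {
            "size": len(members),
            "age_range": (min(ages), max(ages)),
            "mean_amount": mean(m["amount"] for m in members),
            "top_product": top_product,
            "top_product_share": top_count / len(members),
        }
    return summary

# Hypothetical records; labels stand in for the clustering result.
users = [
    {"age": 18, "amount": 8000, "product": "small_loan"},
    {"age": 25, "amount": 12000, "product": "small_loan"},
    {"age": 52, "amount": 300000, "product": "fund"},
]
labels = ["a", "a", "b"]
summary = summarize_clusters(users, labels)
```

A summary of this kind is what would allow a cluster to be described in terms such as "young users who mainly buy small-amount financing products".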

In the user classification method, the cluster distribution characteristic diagram of the SOM network trained according to the sample data set is obtained, wherein the sample data set comprises sample user data, and the cluster number of the trained SOM network is determined based on the cluster distribution characteristic diagram. The initial node number of the original SOM network is adjusted according to the cluster number, the node number of the available SOM network is determined, and the available SOM network is obtained, so that cluster analysis is performed on the data set to be processed according to the available SOM network and the clustering result of the data set to be processed is determined, wherein the data set to be processed is generated according to the user data to be processed. A corresponding user classification result is obtained according to the clustering result. The method uses the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, improves the efficiency and accuracy of the whole clustering process, and realizes accurate classification of users.

In one implementation, as shown in fig. 4, a user classification method is provided, which specifically includes the following steps:

step S402, acquiring each node of the original SOM network output layer, and initializing each node.

Specifically, each node of the output layer of the original SOM network is acquired and an initial weight is assigned to each node; the initialization covers the connection weights, the learning rate and the neighborhood of the SOM network.

The initial weights are obtained by the initialization operation, that is, each node randomly initializes its corresponding parameters, and the initial neighborhood is a large region comprising a plurality of nodes.

Step S404, acquiring a preset training end condition.

The training end condition is that the weight error between two consecutive rounds of training reaches a preset threshold. The weight error between two consecutive rounds of training is acquired and the training end condition is determined from it, which in turn yields the predefined training length.

Specifically, the training end condition sets a preset threshold for the weight error between two consecutive rounds of training; when the weight error reaches the preset threshold, training ends. For example, the training end condition may be that the weight error between two consecutive rounds of training is less than 0.03; when the weight error is 0.02, which is less than 0.03, training ends.
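The end condition in the example above may be checked, for instance, as in the following sketch. The application does not specify how the weight error is measured; taking it as the largest absolute change of any weight component between two consecutive rounds is an assumption, as is the function name `weights_converged`.

```python
def weights_converged(prev_weights, new_weights, threshold=0.03):
    """Return True when the largest absolute weight change between two
    consecutive rounds of training is below the preset threshold."""
    max_change = max(
        abs(new - old)
        for prev_vec, new_vec in zip(prev_weights, new_weights)
        for old, new in zip(prev_vec, new_vec)
    )
    return max_change < threshold

# A change of about 0.02 is below the 0.03 threshold, so training would end.
done = weights_converged([[0.50, 0.50]], [[0.52, 0.50]])
# A change of about 0.10 is not, so training would continue.
not_done = weights_converged([[0.50, 0.50]], [[0.60, 0.50]])
```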

Step S406, obtaining each sample user data in the sample data set, and performing normalization processing on each sample user data.

Specifically, each sample user data in the sample data set is acquired and its data type is determined. Normalization is performed on user data of the continuous variable data type, which may include deposit amount and age. Normalizing the continuous variables keeps their scales consistent; for example, the scale of deposit amount is much larger than that of age. Label encoding is performed on category variable data; for example, the education levels primary school, middle school and university are encoded as 0, 1 and 2, for the subsequent calculation of the Hamming distance between category variables.

The user data specifically includes age, gender, education level, occupation and region, as well as behavior data including the type of products purchased, the total amount of products purchased and the number of purchases. Age, the total amount of products purchased and the number of purchases are treated as continuous variables, and the other data, including gender, education level, occupation, region and type of products purchased, are treated as category variables.
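The preprocessing of the two variable types may be illustrated as follows. The application refers only to "normalization"; min-max scaling to the range [0, 1] is assumed here as one common choice, and the function names and sample values are illustrative.

```python
def min_max_normalize(values):
    """Scale a continuous column to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def label_encode(values, order=None):
    """Map category values to integer codes 0, 1, 2, ...

    If `order` is given, codes follow that order; otherwise the sorted
    distinct values are used.
    """
    categories = order if order is not None else sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

# Made-up sample values for illustration only.
ages = [18, 30, 42]
deposits = [1000, 50000, 99000]
education = ["primary school", "middle school", "university", "middle school"]

ages_n = min_max_normalize(ages)
deposits_n = min_max_normalize(deposits)
codes, mapping = label_encode(
    education, order=["primary school", "middle school", "university"]
)
```

After scaling, age and deposit amount both lie in [0, 1], so neither dominates the distance calculation merely because of its unit.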

Step S408, determining first sample data from the normalized sample user data, and determining a best matching node of the first sample data from each node of the original SOM network output layer.

Specifically, one sample user data is arbitrarily selected from the sample user data after normalization processing, the sample user data is determined to be first sample data, the distance between the first sample data and each node of the output layer of the SOM network is calculated, the node with the minimum distance from the first sample data is selected, and the node is determined to be the best matching node of the input sample data.

The node with the smallest distance to the sample data is called the best matching unit (BMU) of the input sample data. The distance between the sample data and each output node can be calculated with a distance formula for mixed-type attributes: the distance is calculated separately for the continuous variables and for the category variables, and the two are added to obtain the distance between the sample data and the output node. The Euclidean distance is used for continuous variables and the Hamming distance for category variables.

Further, different distance formulas may be adopted in the process of determining the best matching node, for example the Euclidean distance. The neighborhood function is used to judge whether a node lies within the best matching neighborhood and matches the retrieved information. That is, the radius of the neighborhood is calculated, all nodes are traversed to check whether they lie within that radius, and a weight vector update is performed on each node within the best matching neighborhood.
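The mixed-variable distance and the selection of the best matching node may be sketched as follows. The unweighted sum of the two partial distances follows the description above; holding the category components of each node as integer codes, and the names `mixed_distance` and `best_matching_node`, are assumptions of the sketch.

```python
import math

def mixed_distance(sample, node, continuous_idx, categorical_idx):
    """Euclidean distance on continuous components plus Hamming distance
    (count of differing positions) on category components."""
    euclid = math.sqrt(sum((sample[i] - node[i]) ** 2 for i in continuous_idx))
    hamming = sum(1 for i in categorical_idx if sample[i] != node[i])
    return euclid + hamming

def best_matching_node(sample, nodes, continuous_idx, categorical_idx):
    """Return the index of the node with the smallest mixed distance."""
    distances = [
        mixed_distance(sample, node, continuous_idx, categorical_idx)
        for node in nodes
    ]
    return min(range(len(nodes)), key=distances.__getitem__)

# Components 0 and 1 are normalized continuous variables; components 2 and 3
# are label-encoded category variables (toy values).
sample = [0.2, 0.4, 1, 2]
nodes = [
    [0.2, 0.4, 0, 2],
    [0.9, 0.9, 1, 2],
    [0.3, 0.4, 1, 2],
]
bmu = best_matching_node(sample, nodes, [0, 1], [2, 3])
```

In this toy case the first node matches the continuous components exactly but differs in one category, so it is farther from the sample than the third node, which is selected.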

Step S410, any topological neighborhood of the best matching node is obtained, and the best matching neighborhood taking the best matching node as the center is determined from the topological neighborhood.

Specifically, any topological neighborhood of the best matching node is obtained, wherein the topological neighborhood is an initial neighborhood with a preset range, the initial neighborhood is contracted according to training time and the position of the best matching node in the initial neighborhood, and the best matching neighborhood with the best matching node as the center is determined.

Further, a topological neighborhood of the best matching node is predefined, and the best matching neighborhood centered on the best matching node is determined. Because the best matching neighborhood differs at different moments during training, an initial neighborhood with a large range can be preset; during training, the initial neighborhood is shrunk around the best matching node according to the training time and the position of the best matching node, which yields the best matching neighborhood.
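The shrinking of the neighborhood over training time may be illustrated as follows. The application does not give a shrinking schedule; the exponential decay of the radius from an initial to a final value, the circular neighborhood on a two-dimensional grid, and all names and numeric values below are assumptions of the sketch.

```python
import math

def neighborhood_radius(initial_radius, step, total_steps, final_radius=1.0):
    """Radius decaying exponentially from initial_radius (step 0) to
    final_radius (step == total_steps)."""
    fraction = step / total_steps
    return initial_radius * (final_radius / initial_radius) ** fraction

def nodes_in_neighborhood(grid_pos, bmu, radius):
    """Indices of nodes whose grid distance to the best matching node is
    within the given radius."""
    center = grid_pos[bmu]
    return [i for i, pos in enumerate(grid_pos) if math.dist(pos, center) <= radius]

# A 5 x 5 output layer; the best matching node is the center node (2, 2).
grid_pos = [(r, c) for r in range(5) for c in range(5)]
bmu = grid_pos.index((2, 2))

early = nodes_in_neighborhood(grid_pos, bmu, neighborhood_radius(4.0, 0, 100))
middle = nodes_in_neighborhood(grid_pos, bmu, neighborhood_radius(4.0, 50, 100))
late = nodes_in_neighborhood(grid_pos, bmu, neighborhood_radius(4.0, 100, 100))
```

With these values the neighborhood initially covers the whole grid and ends as the best matching node plus its four direct neighbors.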

In one implementation, the weight of the best matching node and each node in the corresponding initial neighborhood is adjusted using the following formula:

m_i(t+1) = m_i(t) + α(t) h_ci(t) [x(t) - m_i(t)];
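A minimal sketch of this update rule follows. The symbols are read in the usual SOM sense, which is an assumption here because this passage does not define them: m_i(t) is the weight vector of node i, x(t) the input sample, α(t) the learning rate, and h_ci(t) the neighborhood function centered on the best matching node c. The Gaussian form of h_ci and the one-dimensional node layout are likewise illustrative choices.

```python
import math

def neighborhood(c, i, sigma):
    """Assumed Gaussian h_ci: equals 1.0 at the best matching node c
    and decays with the grid distance between nodes c and i."""
    return math.exp(-((c - i) ** 2) / (2.0 * sigma ** 2))

def update_weights(weights, x, c, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + alpha * h_ci * (x - m_i(t)) to every node
    of a one-dimensional output layer and return the new weights."""
    updated = []
    for i, m_i in enumerate(weights):
        h = neighborhood(c, i, sigma)
        updated.append([m + alpha * h * (xk - m) for m, xk in zip(m_i, x)])
    return updated

# Hypothetical three-node layer; node 1 is taken as the best matching node.
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
x = [1.0, 0.0]
new_weights = update_weights(weights, x, c=1, alpha=0.5, sigma=1.0)
```

The best matching node moves halfway toward the sample, from [0.5, 0.5] to [0.75, 0.25], while its neighbors move by a smaller amount scaled by h_ci.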

Step S412, when the training end condition is not met, return to the step of determining the first sample data from the normalized sample user data; when the training end condition is met, obtain the SOM network trained according to the sample data set.

Specifically, the training end condition is that the weight error between two consecutive rounds of training reaches a preset threshold; when it does, the training end condition is met and the training ends. When the weight error between two consecutive rounds of training has not reached the preset threshold, the training end condition is not met: first sample data is again determined from the normalized sample user data, and the training process of determining the best matching node of the first sample data, obtaining any topological neighborhood of the best matching node, and determining from the topological neighborhood the best matching neighborhood centered on the best matching node is executed again, until the training end condition is met.
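One way to read this end condition is as a loop that stops once the largest weight change between two consecutive training passes falls to or below the threshold. The sketch below follows that reading; the decay schedules for the learning rate and neighborhood width, the Gaussian neighborhood, the `max_epochs` safety cap, and the sample values are assumptions and are not taken from the application.

```python
import math

def train_until_stable(samples, weights, threshold, max_epochs=1000):
    """Repeat training passes until the largest weight change in one
    pass is <= threshold; returns (weights, number_of_passes)."""
    weights = [list(w) for w in weights]  # work on a copy
    for epoch in range(max_epochs):
        alpha = 0.5 / (1 + epoch)             # assumed learning-rate decay
        sigma = max(1.0 / (1 + epoch), 0.1)   # assumed neighborhood-width decay
        previous = [list(w) for w in weights]
        for x in samples:
            # Best matching node: smallest Euclidean distance to the sample.
            c = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
            for i, m_i in enumerate(weights):
                h = math.exp(-((c - i) ** 2) / (2 * sigma ** 2))
                weights[i] = [m + alpha * h * (xk - m) for m, xk in zip(m_i, x)]
        # Weight error between two consecutive passes.
        error = max(
            abs(a - b)
            for old, new in zip(previous, weights)
            for a, b in zip(old, new)
        )
        if error <= threshold:
            return weights, epoch + 1
    return weights, max_epochs

# Hypothetical one-dimensional samples forming two groups.
samples = [[0.0], [0.1], [0.9], [1.0]]
trained, passes = train_until_stable(samples, [[0.2], [0.8]], threshold=1e-3)
```

With these values the loop ends well before the cap, with one node settled near the lower group of samples and the other near the upper group.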

According to the user classification method, each sample user data in the sample data set is normalized, and first sample data is determined from the normalized sample user data. The best matching node of the first sample data is determined from the nodes of the output layer of the original SOM network, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood of the best matching node, until the training end condition is reached and the SOM network trained according to the sample data set is obtained. The trained SOM network can be used to determine the number of clusters corresponding to the user data to be processed, so that the original SOM network is adjusted to meet the clustering requirement of the user data to be processed, which further improves the accuracy of user classification.

In an embodiment, as shown in fig. 5, a user classification method is provided, which specifically includes the following steps:

Step S502, user data to be processed is obtained, and the data type of each user data to be processed is determined.

Specifically, user data to be processed is acquired, and the data type of each user data to be processed is determined. The data types include a continuous variable data type and a category variable data type. The user data specifically includes age, gender, education level, occupation and region, and the behavior data includes purchased product type, total purchase amount and number of purchases. Age, total purchase amount and number of purchases are determined to be continuous variables, and the other data, including gender, education level, occupation, region and purchased product type, are determined to be category variables.

Step S504, according to the data type, data preprocessing is carried out on the user data to be processed, and initial user data are obtained.

Specifically, the preprocessing includes data normalization processing corresponding to the continuous variable data type and label encoding processing corresponding to the category variable data type. The user data of the continuous variable data type, which may include deposit amount and age, is normalized so that the continuous variables have consistent scales; for example, the scale of the deposit amount is larger than that of age. The category variable data is label-encoded, for example by encoding primary school, middle school and university in the education level as 0, 1 and 2, for the subsequent calculation of the Hamming distance of the category variables.
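The two preprocessing operations can be sketched as follows. The application does not name a specific normalization, so min-max scaling is an assumed choice here, and the helper names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a continuous column to [0, 1] so that large-scale fields
    (e.g. deposit amount) do not dominate small-scale ones (e.g. age)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def label_encode(values, order):
    """Map each category to its integer position in `order`."""
    codes = {label: code for code, label in enumerate(order)}
    return [codes[v] for v in values]

# Hypothetical user records.
ages = [20, 35, 50]
deposits = [1_000, 51_000, 101_000]
education = ["university", "primary school", "middle school"]

norm_ages = min_max_normalize(ages)
norm_deposits = min_max_normalize(deposits)
edu_codes = label_encode(
    education, ["primary school", "middle school", "university"]
)
```

After scaling, age and deposit amount both lie in [0, 1] and contribute comparably to the Euclidean distance, while the encoded education levels 2, 0 and 1 are ready for the Hamming distance.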

In one embodiment, the data preprocessing further includes variable screening, specifically including:

User attributes and behavior data are obtained and determined as variables, wherein the user attributes include gender, age and education level, and the behavior data include deposit amount, loan amount, number of client logins and the like. The variables are screened: variables whose missing values exceed the preset missing value threshold are removed, the variance of each variable is calculated, and variables whose variance is lower than the preset variance threshold are deleted.
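The screening step might look like the sketch below. It assumes that the missing value threshold is expressed as a proportion of missing entries per variable and that the columns are already numeric; both points, as well as the function name, thresholds and data, are assumptions for illustration.

```python
from statistics import pvariance

def screen_variables(columns, max_missing_ratio, min_variance):
    """Keep only variables with few missing values and enough variance.
    Missing entries are represented by None."""
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > max_missing_ratio:
            continue  # too many missing values
        if not present or pvariance(present) < min_variance:
            continue  # nearly constant: carries little information
        kept[name] = values
    return kept

# Hypothetical normalized behavior data for four users.
columns = {
    "deposit_amount": [0.1, 0.9, 0.5, None],
    "loan_amount": [None, None, None, 0.4],
    "login_count": [0.5, 0.5, 0.5, 0.5],
}
kept = screen_variables(columns, max_missing_ratio=0.5, min_variance=0.01)
```

Here `loan_amount` is dropped for having too many missing values and `login_count` for having zero variance, so only `deposit_amount` remains.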

Step S506, performing missing value filling processing on the initial user data to generate a to-be-processed data set.

Specifically, the data missing type of the initial user data is determined, and the missing value filling processing is performed on the corresponding initial user data according to the data missing type. The data missing types comprise information missing and behavior missing, and the information missing comprises continuous variable data missing and category variable data missing.

Further, when it is determined that the initial user data has missing continuous variable data, the mean of the initial user data of the corresponding continuous variable data type is determined, and the missing values are filled with the mean to generate the data set to be processed.

When it is determined that the initial user data has missing category variable data, the mode of the initial user data of the corresponding category variable data type is determined, and the missing values are filled with the mode to generate the data set to be processed.

When it is determined that the initial user data has behavior missing, a new constant is created, and the missing values of the initial user data belonging to behavior missing are filled with the constant to generate the data set to be processed.
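The three filling rules can be sketched in a single helper. The representation of missing entries as `None`, the `kind` labels, and the constant value -1 are assumptions made for the sketch; the application only states that a new constant is created for behavior missing.

```python
from statistics import mean, mode

def fill_missing(values, kind, constant=-1):
    """Fill None entries: mean for continuous variables, mode for
    category variables, a fixed constant for missing behavior data."""
    present = [v for v in values if v is not None]
    if kind == "continuous":
        fill = mean(present)
    elif kind == "category":
        fill = mode(present)
    elif kind == "behavior":
        fill = constant
    else:
        raise ValueError(f"unknown missing type: {kind}")
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing entry each.
filled_ages = fill_missing([20, None, 40], "continuous")
filled_gender = fill_missing(["F", "M", None, "F"], "category")
filled_purchases = fill_missing([3, None, 1], "behavior")
```

Using a separate constant for behavior missing keeps "no recorded behavior" distinguishable from any observed value, whereas the mean or mode would blend it into the existing users.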

In the user classification method, user data to be processed is acquired and the data type of each user data to be processed is determined; data preprocessing is performed on each user data to be processed according to its data type to obtain initial user data; and missing value filling processing is performed on the initial user data to generate the data set to be processed. Preprocessing and missing value filling of the user data are thereby realized, the processing of invalid data in the subsequent classification process is avoided, the workload of the classification process is reduced, and the working efficiency of user classification is further improved.

It should be understood that although the steps in the flowcharts of fig. 2, 4 and 5 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 4 and 5 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 6, there is provided a user classification apparatus including: a cluster distribution characteristic map obtaining module 602, a cluster number determining module 604, an available SOM network generating module 606, a cluster result determining module 608, and a user classification result generating module 610, wherein:

the cluster distribution feature map obtaining module 602 is configured to obtain a cluster distribution feature map of an SOM network trained according to a sample data set, where the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the cluster distribution feature map is used to reflect the number and distribution of clusters included in the trained SOM network.

The cluster number determining module 604 is configured to determine the cluster number of the trained SOM network based on the cluster distribution feature map.

The available SOM network generation module 606 is configured to adjust the initial node number of the original SOM network according to the cluster number, and determine the node number of the available SOM network to obtain the available SOM network. The node number of the available SOM network is the node number obtained by adjusting the initial node number so that it is consistent with the cluster number.

The clustering result determining module 608 is configured to perform clustering analysis on the data set to be processed according to the available SOM network and determine a clustering result of the data set to be processed, where the data set to be processed is generated according to the user data to be processed.

The user classification result generating module 610 is configured to obtain a corresponding user classification result according to the clustering result.

The user classification device obtains the cluster distribution feature map of the SOM network trained according to the sample data set, wherein the sample data set includes sample user data, and determines the cluster number of the trained SOM network based on the cluster distribution feature map. The initial node number of the original SOM network is adjusted according to the cluster number, the node number of the available SOM network is determined, and the available SOM network is obtained, so that clustering analysis is performed on the data set to be processed according to the available SOM network and the clustering result of the data set to be processed is determined, wherein the data set to be processed is generated according to the user data to be processed. A corresponding user classification result is obtained according to the clustering result. The self-organizing mapping function of the SOM network is thereby used to mine the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the whole clustering process and realizes accurate classification of users.

In one embodiment, as shown in fig. 7, there is provided a user classification apparatus including: a node obtaining module 702, a training end condition obtaining module 704, a sample user data obtaining module 706, a best matching node determining module 708, a best matching neighborhood determining module 710, and a trained SOM network generating module 712, wherein:

the node obtaining module 702 is configured to obtain each node of the output layer of the original SOM network and initialize each node.

The training end condition obtaining module 704 is configured to obtain a preset training end condition.

The sample user data obtaining module 706 is configured to obtain each sample user data in the sample data set, and perform normalization processing on each sample user data.

The best matching node determining module 708 is configured to determine first sample data from the normalized sample user data, and determine the best matching node of the first sample data from the nodes of the output layer of the original SOM network.

The best matching neighborhood determining module 710 is configured to obtain any topological neighborhood of the best matching node, and determine a best matching neighborhood centered on the best matching node from the topological neighborhood.

The trained SOM network generating module 712 is configured to return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached, so as to obtain the SOM network trained according to the sample data set.

The user classification device normalizes each sample user data in the sample data set, and determines first sample data from the normalized sample user data. The best matching node of the first sample data is determined from the nodes of the output layer of the original SOM network, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood of the best matching node, until the training end condition is reached and the SOM network trained according to the sample data set is obtained. The trained SOM network can be used to determine the number of clusters corresponding to the user data to be processed, so that the original SOM network is adjusted to meet the clustering requirement of the user data to be processed, which further improves the accuracy of user classification.

In one implementation, there is provided a user classification apparatus, further comprising:

a data type determining module, configured to acquire the user data to be processed and determine the data type of each user data to be processed, wherein the data type includes a continuous variable data type and a category variable data type.

A data preprocessing module, configured to preprocess the user data to be processed according to the data type to obtain initial user data, wherein the preprocessing includes data normalization processing corresponding to the continuous variable data type and label encoding processing corresponding to the category variable data type.

A to-be-processed data set generating module, configured to perform missing value filling processing on the initial user data to generate the data set to be processed.

The user classification device acquires the user data to be processed and determines the data type of each user data to be processed, performs data preprocessing on each user data to be processed according to its data type to obtain initial user data, and performs missing value filling processing on the initial user data to generate the data set to be processed. Preprocessing and missing value filling of the user data are thereby realized, the processing of invalid data in the subsequent classification process is avoided, the workload of the classification process is reduced, and the working efficiency of user classification is further improved.

For the specific limitations of the user classification device, reference may be made to the above limitations of the user classification method, which are not repeated here. The modules in the user classification device can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing user data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a user classification method.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:

acquiring a clustering distribution characteristic diagram of the SOM network trained according to the sample data set; the sample data set comprises sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution map is used for reflecting the number and distribution of clusters included in the trained SOM network;

performing clustering analysis on the data set to be processed according to the available SOM network, and determining a clustering result of the data set to be processed; generating a data set to be processed according to user data to be processed;

In one embodiment, the processor, when executing the computer program, further performs the steps of:

acquiring a preset training end condition; the training end condition is that the error limit of the weight obtained by two continuous training reaches a preset threshold;

acquiring user data of each sample in the sample data set, and carrying out normalization processing on the user data of each sample;

determining first sample data from the sample user data after normalization processing, and determining the best matching node of the first sample data from each node of the output layer of the original SOM network;

acquiring any topological neighborhood of the best matching node, and determining a best matching neighborhood centered on the best matching node from the topological neighborhood;

contracting the initial neighborhood according to the training time and the position of the best matching node in the initial neighborhood, and determining the best matching neighborhood centered on the best matching node;

adjusting the weights of the best matching node and each node in the corresponding initial neighborhood by adopting the following formula:

m_i(t+1) = m_i(t) + α(t) h_ci(t) [x(t) - m_i(t)];

acquiring user data to be processed, and determining the data type of each user data to be processed; the data type comprises a continuous variable data type and a category variable data type;

according to the data type, carrying out data preprocessing on user data to be processed to obtain initial user data; the preprocessing comprises data normalization processing corresponding to continuous variable data types and label coding processing corresponding to category variable data types;

determining a data missing type of initial user data; the data loss type comprises information loss and behavior loss, and the information loss comprises continuous variable data loss and category variable data loss;

or determining the mode of the initial user data belonging to the category variable data type, and performing missing value filling on the initial user data which belongs to the category variable data missing according to the mode to generate a data set to be processed;

or a constant is newly established, and missing value filling is carried out on initial user data belonging to behavior missing according to the constant to generate a data set to be processed.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

m_i(t+1) = m_i(t) + α(t) h_ci(t) [x(t) - m_i(t)];

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of the technical features, they should be considered to be within the scope of the present specification.

The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of user classification, the method comprising:

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein the obtaining any topological neighborhood of the best matching node and determining a best matching neighborhood centered around the best matching node from the topological neighborhoods comprises:

4. The method of claim 3, further comprising: adjusting the weights of the best matching node and each node in the corresponding initial neighborhood by adopting the following formula:

m_i(t+1) = m_i(t) + α(t) h_ci(t) [x(t) - m_i(t)];

5. The method of claim 1, further comprising, before performing cluster analysis on the to-be-processed data set according to the available SOM network and determining a clustering result of the to-be-processed data set:

6. The method according to claim 5, wherein the performing missing value padding processing on the initial user data to generate a to-be-processed data set comprises:

or

7. An apparatus for classifying a user, the apparatus comprising:

8. The apparatus of claim 7, further comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.