WO2021203854A1

WO2021203854A1 - User classification method and apparatus, computer device and storage medium

Info

Publication number: WO2021203854A1
Application number: PCT/CN2021/077380
Authority: WO
Inventors: 孔清扬
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2020-04-09
Filing date: 2021-02-23
Publication date: 2021-10-14
Also published as: CN111553390A

Abstract

An intelligent decision-based user classification method and apparatus, a computer device and a storage medium. Said method comprises: acquiring a cluster distribution feature map of an SOM network trained according to a sample data set, the sample data set comprising sample user data; determining the number of clusters of the trained SOM network on the basis of the cluster distribution feature map; adjusting the number of initial nodes of an original SOM network according to the number of clusters, so as to determine the number of available SOM network nodes, and obtain an available SOM network; performing, according to the available SOM network, clustering analysis on a data set to be processed, and determining a clustering result of said data set, said data set being generated according to user data to be processed; and obtaining a corresponding user classification result according to the clustering result. The method implements the self-organizing mapping function using the SOM network, and digs out the cluster distribution contained in high-dimensional data, thereby improving the efficiency and accuracy of the whole clustering process, and achieving the accurate classification of users.

Description

User classification method, device, computer equipment and storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 9, 2020, with application number 202010273736.1 and the invention title "User Classification Method, Device, Computer Equipment and Storage Medium", the entire content of which is incorporated by reference in this application.

Technical field

This application relates to the field of artificial intelligence technology, in particular to a user classification method, device, computer equipment, and storage medium.

Background art

With the development of computer technology and the wide application of the Internet and mobile terminals in people's lives and work, more and more people obtain the data and information they want through the Internet. Because different users exhibit different Internet access behaviors, industries such as finance and telecommunications can achieve precision marketing by analyzing user behavior. The key to precision marketing is effective user information segmentation, and the key issue in user information segmentation is how to mine the user characteristics hidden in the data and classify users according to the mined characteristics.

The inventor found that traditional user information segmentation methods are often based on one-dimensional user attributes. For example, in the financial industry, users can be divided into high-end, mid-range, and low-end customers according to the amount of their assets, and the target group of a marketing campaign can then be chosen according to the marketing resource budget. However, with the increasing diversification of user needs and the continuous innovation of enterprise products, the inventor realized that even among high-end users, different users' needs for the same product or service differ markedly. Therefore, the traditional segmentation method based on one-dimensional user information cannot reflect the varied characteristics of users and user information, cannot dynamically track changes in user behavior, and thus cannot meet users' diversified needs for products or services, resulting in inaccurate user information classification.

Summary of the invention

Based on this, it is necessary to provide a user classification method, device, computer equipment, and storage medium that can improve the accuracy of user classification in response to the above technical problems.

A user classification method, the method includes:

Obtain a clustering distribution feature map of the SOM network trained according to a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution feature map is used to reflect the number and distribution of clusters included in the trained SOM network;

Determine the number of clusters of the SOM network after training based on the cluster distribution feature map;

According to the number of clusters, adjust the number of initial nodes of the original SOM network, determine the number of nodes of the available SOM network, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, consistent with the number of clusters, obtained by adjusting the number of initial nodes;

Perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

A corresponding user classification result is obtained according to the clustering result.

A user classification device, the device includes:

The cluster distribution feature map acquisition module is used to acquire the cluster distribution feature map of the SOM network trained according to the sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the cluster distribution feature map is used to reflect the number and distribution of clusters included in the trained SOM network;

A cluster number determination module, configured to determine the number of clusters of the SOM network after training based on the cluster distribution feature map;

The available SOM network generation module is used to adjust the number of initial nodes of the original SOM network according to the number of clusters, determine the number of nodes of the available SOM network, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, consistent with the number of clusters, obtained by adjusting the number of initial nodes;

A clustering result determination module, configured to perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

The user classification result generating module is used to obtain the corresponding user classification result according to the clustering result.

A computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when the processor executes the computer program:

A computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the following steps are implemented:

This application uses the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the entire clustering process and achieves accurate classification of users.

Description of the drawings

FIG. 1 is an application scenario diagram of a user classification method in an embodiment;

FIG. 2 is a schematic flowchart of a user classification method in an embodiment;

FIG. 3 is a schematic diagram of the clustering distribution characteristics of the SOM network in an embodiment;

FIG. 4 is a schematic flowchart of a user classification method in another embodiment;

FIG. 5 is a schematic flowchart of a user classification method in another embodiment;

FIG. 6 is a structural block diagram of a user classification device in an embodiment;

FIG. 7 is a structural block diagram of a user classification device in another embodiment;

FIG. 8 is an internal structure diagram of a computer device in an embodiment.

Detailed description

In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

The technical solution of the present application relates to the field of artificial intelligence and/or big data technology, such as neural network technology, to realize intelligent user classification. Optionally, the data involved in this application, such as sample user data and/or classification results, can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain, which is not limited in this application.

The user classification method provided in this application can be applied to the application environment shown in FIG. 1, which includes a terminal 102 and a server 104 that communicate through a network; specifically, the method can be applied to the server 104. The server 104 obtains the clustering distribution feature map of the SOM network trained according to the sample data set, and determines the number of clusters of the trained SOM network based on the clustering distribution feature map. The sample data set includes sample user data, and the clustering distribution feature map reflects the number and distribution of clusters included in the trained SOM network. The server 104 adjusts the number of initial nodes of the original SOM network according to the number of clusters, determines the number of nodes of the available SOM network, and obtains the available SOM network, where the number of nodes of the available SOM network is the number of nodes, consistent with the number of clusters, obtained by adjusting the number of initial nodes. The server 104 then performs cluster analysis on the data set to be processed according to the available SOM network and determines the clustering result of the data set to be processed, where the data set to be processed is generated based on the user data to be processed. Finally, a corresponding user classification result is obtained according to the clustering result and sent to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.

In an embodiment, as shown in FIG. 2, a method for user classification is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:

Step S202: Obtain a clustering distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data.

Specifically, the original SOM network is trained using the sample data set obtained from the sample user data, yielding the trained SOM network, and the clustering distribution feature map of the trained SOM network is then obtained. The SOM network is a self-organizing mapping network. Clustering here means that, in order to improve the query speed for a certain attribute or attribute group, tuples with the same value on that attribute (the clustering code) are stored together in a contiguous physical block. The clustering distribution feature map reflects the number of clusters and the distribution of each cluster in the trained SOM network.

Furthermore, because the SOM network structure emphasizes the proximity relationship between cluster centers, adjacent clusters are more strongly correlated, and the map can be analyzed according to its color boundaries: regions of different colors form multiple clusters, from which the cluster distribution feature map of the trained SOM network is obtained.

In this embodiment, referring to FIG. 3, multiple clusters can be formed according to the trained SOM network, including a first cluster 301, a second cluster 302, a third cluster 303, a fourth cluster 304, a fifth cluster 305, and a sixth cluster 306, which together give the cluster distribution feature map of the trained SOM network.

The SOM network, short for self-organizing map (SOM) network, is an artificial neural network that simulates how the human brain processes signals. It can map arbitrary input patterns onto one-dimensional, two-dimensional, or even higher-dimensional discrete graphics in the output layer while keeping their topological structure unchanged. It consists of an input layer and an output (competitive) layer: the input layer has n neurons, and the output layer is a one-dimensional or two-dimensional planar array of m neurons. The network is fully connected, that is, each input node is connected to all output nodes. Through repeated learning of the input patterns, the SOM network makes the weight vector space converge toward the probability distribution of the input patterns, that is, probability retention. Each neuron in the output layer competes for the opportunity to respond to an input pattern, and the weights related to the winning neuron are adjusted in a direction more favorable to its winning. With the winning neuron as the center, nearby neurons receive excitatory feedback and distant neurons receive inhibitory feedback; that is, near neighbors excite one another and distant neighbors inhibit one another.

Step S204: Determine the number of clusters of the trained SOM network based on the cluster distribution feature map.

Specifically, the cluster distribution feature map is analyzed according to its color boundaries. When the color boundaries of the output image are clear, they reflect the number of categories, where the number of categories is the number of clusters. Regions with similar colors are grouped into one cluster, so that multiple clusters are formed according to the different colors.

In this embodiment, referring to FIG. 3, by analyzing the cluster distribution feature map of the trained SOM network according to the distribution of different fillings or colors, six clusters can be obtained, namely the first cluster 301, the second cluster 302, the third cluster 303, the fourth cluster 304, the fifth cluster 305, and the sixth cluster 306, and the number of clusters of the trained SOM network is thus determined to be six.

Further, for some complex data sets, when the number of clusters cannot be read clearly from the cluster distribution feature map, multiple SOM networks are designed according to the possible numbers of categories, cluster analysis is performed with each of them, and the silhouette coefficient of each SOM network is calculated to determine the specific number of clusters of the trained SOM network.
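The selection among candidate cluster counts described above can be sketched with the silhouette coefficient, a standard cluster-quality score. The following is a minimal illustration, not the method of this application: the function names, the toy points, the use of Euclidean distance, and the hand-written candidate labelings (which stand in for the outputs of SOM networks with different node counts) are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (value in [-1, 1])."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0  # the score is undefined for a single cluster
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data with three visually separate groups (assumed for illustration).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
# Hypothetical labelings produced by candidate networks with 2 and 3 nodes.
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
scores = {k: mean_silhouette(points, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)
```

The candidate with the highest mean score is kept as the number of clusters; other selection rules are possible.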

Step S206: According to the number of clusters, the initial number of nodes of the original SOM network is adjusted to determine the number of available SOM network nodes to obtain the available SOM network.

Specifically, the initial number of nodes of the original SOM network is obtained and adjusted according to the number of clusters to obtain the number of nodes of the available SOM network. The number of nodes of the available SOM network is the number of nodes, consistent with the number of clusters, obtained by adjusting the number of initial nodes. Because the initial number of nodes of the original SOM network does not match the number of clusters, it needs to be adjusted. When the adjusted number of nodes is consistent with the number of clusters, the available SOM network nodes are obtained, and the available SOM network is formed from these nodes.

Step S208: Perform cluster analysis on the data set to be processed according to the available SOM network, and determine the clustering result of the data set to be processed, where the data set to be processed is generated based on the user data to be processed.

Specifically, the to-be-processed data set generated according to the to-be-processed user data is obtained, and clustering analysis is performed on the to-be-processed data based on the available SOM network to obtain the clustering result of the to-be-processed data. Among them, the clustering result includes multiple clusters and the composition of each different cluster.
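As a rough sketch of this step, each record can be assigned to the output node whose weight vector is closest to it, so that the node index serves as the cluster label. The weights below are made-up values standing in for an already trained available SOM network with three nodes, the records are assumed to contain only normalized continuous variables, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def nearest_node(record, node_weights):
    """Index of the node whose weight vector is closest to the record."""
    distances = [math.dist(record, w) for w in node_weights]
    return distances.index(min(distances))

def cluster_records(records, node_weights):
    """Group record indices by the node (cluster) they map to."""
    clusters = defaultdict(list)
    for idx, record in enumerate(records):
        clusters[nearest_node(record, node_weights)].append(idx)
    return dict(clusters)

# Hypothetical weights of an available SOM with three nodes, one per cluster.
node_weights = [(0.1, 0.1), (0.5, 0.9), (0.9, 0.2)]
records = [(0.0, 0.2), (0.6, 0.8), (0.95, 0.1), (0.15, 0.05)]
clusters = cluster_records(records, node_weights)
```

The resulting mapping gives both the clusters that were found and the composition of each one.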

Step S210: Obtain a corresponding user classification result according to the clustering result.

Specifically, the clustering result is analyzed to determine the number of clusters it contains and the size of each cluster, the statistical indicators of each cluster in each dimension are calculated, and the needs of the users in the different clusters are determined based on the statistical indicators.

The clustering result includes the number of clusters and the size of each cluster. For example, the clustering result may include 10 clusters, such as cluster a, cluster b, and cluster c, where cluster a has a size of 10,000 customers and cluster b has a size of 5,000 customers.

Further, the statistical indicators of each cluster in each dimension are calculated. For example, in cluster a the age range is 18-25, 65% of the users have a high school education, the average purchase amount is 10,000 yuan, and 90% of the users have purchased small-amount wealth management products. Based on the calculated statistical indicators, the needs of users in the different clusters can be determined. For example, the statistical indicators show that cluster a consists of young people with a need for small-amount wealth management, so small-amount wealth management products can be pushed to the users of cluster a to achieve precision marketing.

In the above user classification method, the clustering distribution feature map of the SOM network trained according to the sample data set is obtained, where the sample data set includes sample user data, and the number of clusters of the trained SOM network is determined based on the clustering distribution feature map. According to the number of clusters, the initial number of nodes of the original SOM network is adjusted, the number of nodes of the available SOM network is determined, and the available SOM network is obtained. Cluster analysis is then performed on the data set to be processed according to the available SOM network to determine its clustering result, where the data set to be processed is generated based on the user data to be processed, and the corresponding user classification result is obtained according to the clustering result. The method uses the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the entire clustering process and achieves accurate classification of users.
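The per-cluster indicators described above can be sketched as follows. This is only an illustration: the field names, the toy users, and the choice of the mean for numeric fields and the most frequent value for categorical fields are assumptions.

```python
from collections import Counter
from statistics import mean

def profile_clusters(users, labels):
    """Summarise each cluster: size, mean of numeric fields, mode of other fields."""
    grouped = {}
    for user, label in zip(users, labels):
        grouped.setdefault(label, []).append(user)
    profiles = {}
    for label, members in grouped.items():
        profile = {"size": len(members)}
        for field in members[0]:
            values = [m[field] for m in members]
            if all(isinstance(v, (int, float)) for v in values):
                profile[field] = mean(values)  # continuous variable
            else:
                profile[field] = Counter(values).most_common(1)[0][0]  # categorical
        profiles[label] = profile
    return profiles

# Hypothetical users and the cluster label assigned to each of them.
users = [
    {"age": 20, "education": "high school", "purchase_amount": 9000},
    {"age": 24, "education": "high school", "purchase_amount": 11000},
    {"age": 45, "education": "university", "purchase_amount": 80000},
]
labels = ["a", "a", "b"]
profiles = profile_clusters(users, labels)
```

A profile such as `profiles["a"]` can then be read as a description of the typical user in that cluster when deciding which products to recommend.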

In an embodiment, as shown in FIG. 4, a user classification method is provided, which specifically includes the following steps:

In step S402, each node of the output layer of the original SOM network is obtained, and each node is initialized.

Specifically, each node of the output layer of the original SOM network is obtained and assigned an initial weight; the initialization covers the connection weights, the learning rate, and the neighborhood of the SOM network.

The initial weight values are obtained by the initialization operation, that is, each node randomly initializes its parameters, and the initial neighborhood is a relatively large area that includes multiple nodes.

Step S404: Acquire preset training ending conditions.

The training end condition is that the weight error limit between two consecutive trainings reaches the preset threshold. The weight error limit of two consecutive trainings can be obtained, and the training end condition determined from it. A predefined training length may also be obtained.

Specifically, the training end condition sets a preset threshold on the weight error between two consecutive trainings, and the training ends when that error reaches the threshold. For example, the training end condition may be that the weight error between two consecutive training processes is less than 0.03; when the weight error is 0.02, which is less than 0.03, the training ends.
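A minimal sketch of such an end condition is shown below. The text does not specify how the weight error is measured; here it is assumed to be the largest absolute change of any weight component between two consecutive trainings, and the function names are hypothetical.

```python
def max_weight_change(previous, current):
    """Largest absolute change of any weight component between two trainings."""
    return max(
        abs(c - p)
        for prev_vec, cur_vec in zip(previous, current)
        for p, c in zip(prev_vec, cur_vec)
    )

def training_finished(previous, current, threshold=0.03):
    """True once the weight error between consecutive trainings is below the threshold."""
    return max_weight_change(previous, current) < threshold

# Weights of two nodes before and after one training pass (made-up values).
weights_before = [[0.50, 0.20], [0.10, 0.90]]
weights_after = [[0.52, 0.20], [0.10, 0.89]]
done = training_finished(weights_before, weights_after)
```

With the example threshold of 0.03, a largest change of about 0.02 ends the training.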

Step S406: Obtain each sample user data in the sample data set, and perform normalization processing on each sample user data.

Specifically, each sample user data in the sample data set is obtained and its data type is determined. User data of the continuous variable data type, which may include deposit amount and age, is normalized to ensure that continuous variables share a consistent scale, since the scale of the deposit amount is larger than that of age. Categorical variable data is label-encoded, for example by encoding primary school, middle school, and university in the education field as 0, 1, and 2, for the subsequent calculation of the Hamming distance of categorical variables.

Among them, the user data specifically includes age, gender, education, occupation, region, and behavioral data, including the types of products purchased, the total amount of purchased products, and the number of times purchased products. Among them, age, the total amount of purchased products, and the number of purchases of products are determined as continuous variables, and other data, including gender, education, occupation, region, and type of purchased products, are all determined as categorical variables.
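The two preprocessing operations can be sketched as follows. The text only says that continuous variables are normalized; min-max scaling to [0, 1] is an assumption made here, as are the function names and the sample values.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - low) / (high - low) for v in values]

def label_encode(values, order=None):
    """Map each category to an integer code; `order` fixes the code assignment."""
    categories = order if order is not None else sorted(set(values))
    codes = {category: code for code, category in enumerate(categories)}
    return [codes[v] for v in values]

# Made-up sample user data.
ages = [18, 25, 60]
deposits = [1_000, 50_000, 1_000_000]
education = ["primary school", "university", "middle school"]

ages_scaled = min_max_normalize(ages)
deposits_scaled = min_max_normalize(deposits)
education_codes = label_encode(
    education, order=["primary school", "middle school", "university"]
)
```

After scaling, age and deposit amount both lie in [0, 1], so neither dominates a distance calculation merely because of its unit.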

Step S408: Determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network.

Specifically, one sample user data is randomly selected from the normalized sample user data and determined as the first sample data. The distance between the first sample data and each node of the SOM network output layer is calculated, and the node with the smallest distance from the first sample data is selected as the best matching node of the input sample data.

The node with the smallest distance from the sample data is called the best matching unit (BMU) of the input sample data. The distance between the sample data and each output node can be calculated with a distance formula for mixed-variable attributes: the distances over the continuous variables and over the categorical variables are calculated separately and then added to obtain the distance between the sample data and each output node. Euclidean distance is used for continuous variables and Hamming distance for categorical variables.

Further, different distance formulas are used in the process of determining the best matching node; for continuous variables the distance is the Euclidean distance. The neighborhood function is used to determine whether a node lies within the best matching neighborhood: the neighborhood radius is calculated, all nodes are traversed to check whether they lie within that radius, and the weight vector of each node in the best matching neighborhood is updated.
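The mixed-variable distance and the search for the best matching node can be sketched as below. This is an illustration under stated assumptions: the field layout, the node weights, and the function names are hypothetical, and the categorical components of a node's weight vector are simply stored as integer codes.

```python
import math

def mixed_distance(sample, node, continuous_idx, categorical_idx):
    """Euclidean distance on continuous fields plus Hamming distance on categorical fields."""
    euclid = math.sqrt(sum((sample[i] - node[i]) ** 2 for i in continuous_idx))
    hamming = sum(1 for i in categorical_idx if sample[i] != node[i])
    return euclid + hamming

def best_matching_unit(sample, nodes, continuous_idx, categorical_idx):
    """Index of the output node with the smallest mixed distance to the sample."""
    distances = [
        mixed_distance(sample, node, continuous_idx, categorical_idx) for node in nodes
    ]
    return distances.index(min(distances))

# Assumed layout: [age (scaled), purchase total (scaled), gender code, education code]
continuous_idx, categorical_idx = (0, 1), (2, 3)
nodes = [
    [0.2, 0.1, 0, 1],
    [0.8, 0.9, 1, 2],
    [0.3, 0.2, 1, 1],
]
sample = [0.25, 0.15, 0, 1]
bmu = best_matching_unit(sample, nodes, continuous_idx, categorical_idx)
```

Because the two partial distances are simply added, normalizing the continuous fields beforehand keeps them on a scale comparable to the per-field 0/1 contribution of the Hamming distance.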

Step S410: Obtain any topological neighborhood of the best matching node, and determine the best matching neighborhood centered on the best matching node from the topological neighborhood.

Specifically, any topological neighborhood of the best matching node is obtained, where the topological neighborhood is an initial neighborhood with a preset range size. According to the training time and the position of the best matching node in the initial neighborhood, the initial neighborhood is contracted to determine the best matching neighborhood centered on the best matching node.

Further, any topological neighborhood of the best matching node is predefined, and the best matching neighborhood centered on the best matching node is then determined. Because the best matching neighborhood differs at different times during training, a larger initial neighborhood can be preset and then shrunk to obtain the best matching neighborhood.
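One way to picture the shrinking neighborhood is sketched below. The text does not give a contraction schedule; the geometric decay from an initial radius to a final radius, the 3x3 output grid, and all names are assumptions made for illustration.

```python
import math

def neighborhood_radius(step, total_steps, initial_radius=2.0, final_radius=0.5):
    """Radius that shrinks geometrically from initial_radius to final_radius."""
    fraction = step / total_steps
    return initial_radius * (final_radius / initial_radius) ** fraction

def nodes_in_neighborhood(bmu_position, grid_positions, radius):
    """Indices of grid nodes whose distance to the best matching node is within the radius."""
    return [
        idx
        for idx, position in enumerate(grid_positions)
        if math.dist(position, bmu_position) <= radius
    ]

grid_positions = [(row, col) for row in range(3) for col in range(3)]  # 3x3 output layer
bmu_position = (1, 1)  # centre node, index 4
early = nodes_in_neighborhood(
    bmu_position, grid_positions, neighborhood_radius(0, 100)
)
late = nodes_in_neighborhood(
    bmu_position, grid_positions, neighborhood_radius(100, 100)
)
```

At the start of training the neighborhood covers the whole grid, while at the end it contains only the best matching node itself.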

In one implementation, the following formula is used to adjust the weights of the best matching node and each node in the corresponding initial neighborhood:

$$m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$$

Here, t is the training step, i is the node index, $m_i(t)$ is the weight of node i at step t, and $\alpha(t)$ is the learning rate, a monotonically decreasing coefficient with $0 < \alpha(t) < 1$; $h_{ci}(t)$ is the neighborhood function, and $x(t)$ is the input vector.
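The update rule can be sketched in a few lines. The form of the neighborhood function is not specified in the text; a Gaussian of the grid distance to the best matching node is assumed here, only continuous weights are handled, and the function names and numbers are illustrative.

```python
import math

def gaussian_neighborhood(node_pos, bmu_pos, radius):
    """h_ci(t): 1 at the best matching node, decaying with grid distance from it."""
    d2 = sum((a - b) ** 2 for a, b in zip(node_pos, bmu_pos))
    return math.exp(-d2 / (2 * radius ** 2))

def update_weights(weights, positions, bmu_index, x, alpha, radius):
    """Apply m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)] to every node."""
    bmu_pos = positions[bmu_index]
    updated = []
    for m_i, pos in zip(weights, positions):
        h = gaussian_neighborhood(pos, bmu_pos, radius)
        updated.append([m + alpha * h * (xv - m) for m, xv in zip(m_i, x)])
    return updated

positions = [(0, 0), (0, 1), (0, 2)]  # a 1x3 output layer
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
x = [0.0, 1.0]  # current input vector
new_weights = update_weights(
    weights, positions, bmu_index=0, x=x, alpha=0.5, radius=1.0
)
```

The best matching node moves halfway toward the input, its neighbor moves less, and the most distant node barely changes.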

Step S412, when the training end condition is not reached, return to the step of determining the first sample data from the normalized sample user data. When the training end condition is reached, the SOM network trained according to the sample data set is obtained.

Specifically, the training end condition is that the weight error limit obtained from two consecutive trainings reaches a preset threshold; when it does, the training end condition is reached and the training ends. When the weight error limit of two consecutive trainings has not reached the preset threshold, the training end condition is not met, and the training process is repeated: the first sample data is again determined from the normalized sample user data, the best matching node of the first sample data is determined, any topological neighborhood of the best matching node is obtained, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood, until the training end condition is reached.

In the user classification method described above, each sample user data in the sample data set is normalized, and the first sample data is determined from the normalized sample user data. The best matching node of the first sample data is determined from the nodes of the original SOM network output layer, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood of the best matching node, until the training end condition is reached and the SOM network trained according to the sample data set is obtained. The trained SOM network can be used to determine the number of clusters corresponding to the user data to be processed, and the original SOM network can be adjusted to meet the clustering requirements of the user data to be processed, further improving the accuracy of user classification.

In an embodiment, as shown in FIG. 5, a user classification method is provided, which specifically includes the following steps:

Step S502: Obtain user data to be processed, and determine the data type of each user data to be processed.

Specifically, the user data to be processed is acquired, and the data type of each user data to be processed is determined. Among them, the data types include continuous variable data types and categorical variable data types. User data specifically includes age, gender, education, occupation, region, and behavioral data including the types of products purchased, the total amount of products purchased, and the number of times the products are purchased. Among them, age, the total amount of purchased products, and the number of purchases of products are determined as continuous variables, and other data, including gender, education, occupation, region, and type of purchased products, are all determined as categorical variables.

Step S504: Perform data preprocessing on each user data to be processed according to the data type to obtain initial user data.

Specifically, the preprocessing includes data normalization for the continuous variable data type and label encoding for the categorical variable data type. User data of the continuous variable data type, which may include deposit amount and age, is normalized to ensure that continuous variables share a consistent scale, since the scale of the deposit amount is larger than that of age. Categorical variable data is label-encoded, for example by encoding primary school, middle school, and university in the education field as 0, 1, and 2, for the subsequent calculation of the Hamming distance of categorical variables.

In one embodiment, the data preprocessing also includes variable filtering, which specifically includes:

User attributes and behavior data are obtained and determined as variables. User attributes include gender, age, and educational background, and behavior data include deposit amount, loan amount, and client login times. Each variable is then filtered: variables whose missing values exceed the preset missing value threshold are removed, the variance of each variable is calculated, and variables whose variance is lower than the preset variance threshold are deleted.
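The filtering described above can be sketched as follows. The thresholds, the representation of a missing value as `None`, the measurement of missingness as a ratio, and the restriction to numeric columns are all assumptions made for illustration.

```python
from statistics import pvariance

def filter_variables(columns, max_missing_ratio=0.5, min_variance=1e-3):
    """Keep columns with few missing values (None) and non-negligible variance."""
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > max_missing_ratio:
            continue  # too many missing values
        if len(present) < 2 or pvariance(present) < min_variance:
            continue  # (almost) constant, so it cannot separate users
        kept[name] = values
    return kept

# Made-up numeric columns, one list entry per user.
columns = {
    "age": [23, 35, None, 51],
    "loan_amount": [None, None, None, 2000],
    "login_count": [3, 3, 3, 3],
}
kept = filter_variables(columns)
```

In this toy input, `loan_amount` is dropped for missing values and `login_count` for zero variance, leaving only `age`.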

Step S506: Perform missing value filling processing on the initial user data to generate a data set to be processed.

Specifically, the data missing type of the initial user data is determined, and the corresponding initial user data is filled with missing values according to the data missing type. Among them, the types of data missing include information missing and behavior missing, and information missing includes continuous variable data missing and categorical variable data missing.

Further, when it is determined that the initial user data has missing continuous variable data, the mean of the initial user data of the corresponding continuous variable data type is determined, and the missing values are filled with that mean to generate the data set to be processed.

When it is determined that the initial user data has missing categorical variable data, the mode of the initial user data of the corresponding categorical variable data type is determined, and the missing values are filled with that mode to generate the data set to be processed.

When it is determined that the initial user data has missing behavior data, a constant is created, and the missing values are filled with that constant to generate the data set to be processed.
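The three filling rules can be sketched as a single helper. The `kind` labels, the use of `None` for a missing value, and the default constant of 0 are assumptions; the text only states that a mean, a mode, or a constant is used.

```python
from statistics import mean, mode

def fill_missing(values, kind, constant=0):
    """Fill None entries: mean for continuous, mode for categorical, constant for behavior."""
    present = [v for v in values if v is not None]
    if kind == "continuous":
        fill = mean(present)
    elif kind == "categorical":
        fill = mode(present)
    elif kind == "behavior":
        fill = constant
    else:
        raise ValueError(f"unknown kind: {kind}")
    return [fill if v is None else v for v in values]

# Made-up columns with one missing entry each.
ages = fill_missing([20, None, 40], "continuous")
regions = fill_missing(["north", "south", None, "south"], "categorical")
purchases = fill_missing([2, None, 5], "behavior", constant=0)
```

Filling a missing behavior with a constant such as 0 treats the absence as "no recorded activity" rather than as an unknown value.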

In the above user classification method, the user data to be processed is obtained and the data type of each user data to be processed is determined; data preprocessing is performed on each user data to be processed according to its data type to obtain the initial user data; and missing value filling is performed on the initial user data to generate the data set to be processed. This preprocesses the user data and fills its missing values, avoids processing invalid data in the subsequent classification process, reduces the workload of classification, and further improves the efficiency of user classification.

It should be understood that although the steps in the flowcharts of FIGS. 2, 4, and 5 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIGS. 2, 4, and 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but can be executed at different times, and they are not necessarily performed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 6, a user classification device is provided, including: a cluster distribution feature map acquisition module 602, a cluster number determination module 604, an available SOM network generation module 606, a clustering result determination module 608, and a user classification result generation module 610, wherein:

The cluster distribution feature map acquisition module 602 is used to acquire the clustering distribution feature map of the SOM network trained according to the sample data set. The sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network.

The cluster number determining module 604 is configured to determine the number of clusters of the trained SOM network based on the cluster distribution feature map.

The available SOM network generation module 606 is used to adjust the initial number of nodes of the original SOM network according to the number of clusters, determine the number of available SOM network nodes, and obtain the available SOM network. Among them, the number of nodes of the available SOM network represents the number of nodes consistent with the number of clusters obtained by adjusting the number of initial nodes.

The clustering result determination module 608 is configured to perform cluster analysis on the data set to be processed according to the available SOM network, and determine the clustering result of the data set to be processed, and the data set to be processed is generated based on the user data to be processed.

The user classification result generating module 610 is configured to obtain a corresponding user classification result according to the clustering result.

The user classification device described above obtains a cluster distribution feature map of the SOM network trained on a sample data set, where the sample data set includes sample user data, and determines the number of clusters of the trained SOM network based on the cluster distribution feature map. According to the number of clusters, it adjusts the number of initial nodes of the original SOM network, determines the number of nodes of the available SOM network, and obtains the available SOM network. It then performs cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, where the data set to be processed is generated based on the user data to be processed. The corresponding user classification result is obtained according to the clustering result. The self-organizing mapping function of the SOM network is thus used to mine the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the entire clustering process and achieves accurate classification of users.
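
As a rough illustration of the last two modules, the sketch below assigns each user to the nearest node of an already trained available SOM network and groups user identifiers by node index. It assumes that each node stands for one cluster, that proximity is measured by Euclidean distance, and that user vectors are already preprocessed; the name `classify_users` and the data layout are hypothetical.

```python
import math
from collections import defaultdict


def classify_users(user_vectors, node_weights):
    """Group user ids by the index of their nearest SOM node (cluster id).

    user_vectors maps a user id to its preprocessed feature vector;
    node_weights is a list of node weight vectors, one per cluster.
    """
    groups = defaultdict(list)
    for user_id, x in user_vectors.items():
        distances = [math.dist(x, w) for w in node_weights]
        groups[distances.index(min(distances))].append(user_id)
    return dict(groups)


node_weights = [[0.0, 0.0], [1.0, 1.0]]
users = {"u1": [0.1, 0.2], "u2": [0.9, 0.8], "u3": [0.4, 0.1]}
print(classify_users(users, node_weights))
```

With these illustrative values, `u1` and `u3` fall into cluster 0 and `u2` into cluster 1.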

In one embodiment, as shown in FIG. 7, a user classification device is provided, which includes: a node acquisition module 702, a training end condition acquisition module 704, a sample user data acquisition module 706, a best matching node determination module 708, a best matching neighborhood determination module 710, and a trained SOM network generation module 712, wherein:

The node acquisition module 702 is used to obtain each node of the output layer of the original SOM network and initialize each node.

The training end condition obtaining module 704 is used to obtain preset training end conditions.

The sample user data acquisition module 706 is used to acquire each sample user data in the sample data set, and perform normalization processing on each sample user data.

The best matching node determination module 708 is used to determine the first sample data from the normalized sample user data, and to determine the best matching node of the first sample data from each node in the output layer of the original SOM network.

The best matching neighborhood determining module 710 is configured to obtain any topological neighborhood of the best matching node, and determine the best matching neighborhood centered on the best matching node from the topological neighborhood.

The trained SOM network generation module 712 is used to return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached to obtain the SOM network trained according to the sample data set.

The user classification device described above normalizes each sample user data in the sample data set and determines the first sample data from the normalized sample user data. It determines the best matching node of the first sample data from the nodes of the output layer of the original SOM network, and determines the best matching neighborhood centered on the best matching node from the topological neighborhood of the best matching node, repeating until the training end condition is reached, at which point the SOM network trained according to the sample data set is obtained. The trained SOM network can be used to determine the number of clusters corresponding to the user data to be processed, so that the original SOM network can be adjusted to meet the clustering requirements of the user data to be processed, further improving the accuracy of user classification.

In one embodiment, a user classification device is provided, which further includes:

The data type determination module is used to obtain the user data to be processed and determine the data type of each user data to be processed. The data type includes the continuous variable data type and the categorical variable data type.

The data preprocessing module is used to preprocess the user data to be processed according to the data type to obtain the initial user data. The preprocessing includes data normalization processing corresponding to the continuous variable data type and label encoding processing corresponding to the categorical variable data type.

The to-be-processed data set generation module is used to perform missing value filling processing on the initial user data to generate the to-be-processed data set.

The above-mentioned user classification device obtains the user data to be processed and determines the data type of each user data to be processed, performs data preprocessing on each user data to be processed according to its data type to obtain the initial user data, and then performs missing value filling on the initial user data to generate the data set to be processed. This preprocesses the user data and fills its missing values, avoids processing invalid data in the subsequent classification process, reduces the workload of classification, and further improves the efficiency of user classification.
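
The two preprocessing branches handled by the data preprocessing module can be sketched as below. The embodiment does not state which normalization is used, so min-max scaling to [0, 1] is an assumption, as are the function names and the choice to number categories in order of first appearance.

```python
def min_max_normalize(values):
    """Scale a continuous variable to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(values):
    """Replace each category with an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


print(min_max_normalize([20, 30, 40]))
print(label_encode(["bachelor", "master", "bachelor", "phd"]))
```

Scaling continuous variables to a common range keeps variables with large magnitudes, such as deposit amount, from dominating the distance computations of the SOM network.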

For the specific limitations of the user classification device, refer to the limitations of the user classification method above, which are not repeated here. Each module in the above user classification device can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 8. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used to store user data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to realize a user classification method.

Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor implements the following steps when executing the computer program:

Obtain the clustering distribution feature map of the SOM network trained according to the sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network;

Based on the cluster distribution feature map, determine the number of clusters of the trained SOM network;

According to the number of clusters, adjust the number of initial nodes of the original SOM network, determine the number of available SOM network nodes, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;

Perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

According to the clustering result, the corresponding user classification result is obtained.

In an embodiment, the processor further implements the following steps when executing the computer program:

Obtain each node of the output layer of the original SOM network, and initialize each node;

Obtain a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive trainings reaches the preset threshold;

Obtain each sample user data in the sample data set, and normalize each sample user data;

Determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network;

Obtain any topological neighborhood of the best matching node, and determine the best matching neighborhood centered on the best matching node from the topological neighborhood;

Return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached, and the SOM network trained according to the sample data set is obtained.
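
One way to read the training end condition is sketched below: training stops once the largest change in any weight component between two consecutive training passes is no greater than the preset threshold. How the weight error is actually measured is not specified here, so the maximum absolute difference and the function names are assumptions.

```python
def max_weight_change(previous, current):
    """Largest absolute change of any weight component between two passes."""
    return max(
        abs(c - p)
        for prev_node, cur_node in zip(previous, current)
        for p, c in zip(prev_node, cur_node)
    )


def training_finished(previous, current, threshold):
    """True when the weights have changed by no more than `threshold`."""
    return max_weight_change(previous, current) <= threshold


before = [[0.0, 0.0], [1.0, 1.0]]   # node weights after one pass
after = [[0.0, 0.001], [1.0, 0.999]]  # node weights after the next pass
print(training_finished(before, after, threshold=0.01))
```

A training loop would call `training_finished` after each pass over the normalized sample user data and return to the sample selection step while it is false.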

Obtain any topological neighborhood of the best matching node, where the topological neighborhood is an initial neighborhood with a preset range size;

According to the training time and the position of the best matching node in the initial neighborhood, the initial neighborhood is shrunk, and the best matching neighborhood centered on the best matching node is determined.
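
The shrinking of the initial neighborhood can be illustrated as follows. The exponential decay schedule, the rectangular output grid, the Euclidean grid distance, and the parameter `time_constant` are assumptions chosen for the sketch; the text only requires that the neighborhood contract with training time around the best matching node.

```python
import math


def neighborhood_radius(initial_radius, step, time_constant):
    """Radius after `step` training steps, shrinking exponentially (assumed schedule)."""
    return initial_radius * math.exp(-step / time_constant)


def best_matching_neighborhood(bmu, grid_shape, radius):
    """Grid coordinates of the nodes within `radius` of the best matching node."""
    rows, cols = grid_shape
    return [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if math.dist((r, c), bmu) <= radius
    ]


radius = neighborhood_radius(initial_radius=3.0, step=10, time_constant=10.0)
print(best_matching_neighborhood(bmu=(1, 1), grid_shape=(3, 3), radius=radius))
```

As `step` grows the radius falls, so fewer nodes around the best matching node remain in the neighborhood.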

The following formula is used to adjust the weights of the best matching node and the corresponding nodes in the initial neighborhood:

$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$;

Here, t is the training step, i is the node index, $m_i(t)$ is the weight of node i at step t, $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ represents the output vector.
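
A small sketch of one application of this update rule follows. It assumes a Gaussian neighborhood function over grid distance, which the text does not specify, and it treats x(t) as the sample vector presented at step t; the names `gaussian_neighborhood` and `update_weights` and the dictionary layout are hypothetical.

```python
import math


def gaussian_neighborhood(bmu, node, radius):
    """h_ci(t): 1 at the best matching node, decaying with grid distance (assumed form)."""
    d = math.dist(bmu, node)
    return math.exp(-(d * d) / (2 * radius * radius))


def update_weights(weights, x, bmu, alpha, radius):
    """Apply m_i(t+1) = m_i(t) + alpha * h_ci * (x - m_i) to every node.

    weights maps a grid coordinate (row, col) to that node's weight vector.
    """
    updated = {}
    for node, m in weights.items():
        h = gaussian_neighborhood(bmu, node, radius)
        updated[node] = [mi + alpha * h * (xi - mi) for mi, xi in zip(m, x)]
    return updated


weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
new_weights = update_weights(weights, x=[1.0, 0.0], bmu=(0, 0), alpha=0.5, radius=1.0)
print(new_weights)
```

The best matching node moves halfway toward the sample because its neighborhood value is 1, while the adjacent node moves by a smaller amount scaled by the neighborhood function.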

Obtain the user data to be processed, and determine the data type of each user data to be processed; the data types include continuous variable data types and categorical variable data types;

According to the data type, perform data preprocessing on each user data to be processed to obtain the initial user data; the preprocessing includes data normalization processing corresponding to the continuous variable data type, and label encoding processing corresponding to the categorical variable data type;

The initial user data is filled with missing values to generate a data set to be processed.

Determine the type of data missing for initial user data; types of missing data include missing information and missing behavior, missing information includes missing continuous variable data and missing data for categorical variables;

Determine the mean value of the initial user data belonging to the continuous variable data type, fill in the initial user data with missing continuous variable data according to the mean value, and generate a data set to be processed;

Or determine the mode of the initial user data belonging to the categorical variable data type, and fill in the missing values of the initial user data with missing data belonging to the categorical variable according to the mode to generate a data set to be processed;

Or create a new constant, and fill in the initial user data with missing behavior based on the constant to generate a data set to be processed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

In an embodiment, when the computer program is executed by the processor, the following steps are further implemented:

$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$;

Optionally, the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

A person of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and the description is relatively specific and detailed, but it should not be understood as a limitation on the scope of the invention patent. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A user classification method, the method includes:

Obtain a clustering distribution feature map of the SOM network trained according to a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network;

Determine the number of clusters of the SOM network after training based on the cluster distribution feature map;

According to the number of clusters, the number of initial nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;

Perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

A corresponding user classification result is obtained according to the clustering result.
The method according to claim 1, wherein the method further comprises:

Obtain each node of the output layer of the original SOM network, and initialize each node;

Acquiring a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive trainings reaches a preset threshold;

Acquiring each sample user data in the sample data set, and normalizing each of the sample user data;

Determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network;

Acquiring any topological neighborhood of the best matching node, and determining the best matching neighborhood centered on the best matching node from the topological neighborhood;

Return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached, and the SOM network trained according to the sample data set is obtained.
The method according to claim 2, wherein the obtaining any topological neighborhood of the best matching node, and determining the best matching neighborhood centered on the best matching node from the topological neighborhood, comprises:

Acquiring any topological neighborhood of the best matching node, where the topological neighborhood is an initial neighborhood with a preset range size;

According to the training time and the position of the best matching node in the initial neighborhood, the initial neighborhood is shrunk, and the best matching neighborhood centered on the best matching node is determined.
The method according to claim 3, wherein the method further comprises: using the following formula to adjust the weights of the best matching node and each node in the corresponding initial neighborhood:

$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$;

Here, t is the training step, i is the node index, $m_i(t)$ is the weight of node i at step t, and $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$; $h_{ci}(t)$ is the neighborhood function, and $x(t)$ represents the output vector.
The method according to claim 1, wherein before performing cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, the method further comprises:

Obtain the user data to be processed, and determine the data type of each user data to be processed; the data type includes a continuous variable data type and a categorical variable data type;

According to the data type, perform data preprocessing on each of the user data to be processed to obtain initial user data; the preprocessing includes data normalization processing corresponding to the continuous variable data type, and label encoding processing corresponding to the categorical variable data type;

Perform missing value filling processing on the initial user data to generate a data set to be processed.
The method according to claim 5, wherein said performing missing value filling processing on said initial user data to generate a data set to be processed comprises:

Determine the data missing type of the initial user data; the data missing type includes information missing and behavior missing, and the information missing includes continuous variable data missing and categorical variable data missing;

Determining the mean value of the initial user data belonging to the continuous variable data type, and filling in the initial user data belonging to the continuous variable data missing with missing values according to the mean value, to generate a data set to be processed;

or

Determine the mode of the initial user data belonging to the categorical variable data type, and fill in the initial user data with missing data belonging to the categorical variable with missing values according to the mode, to generate a data set to be processed;

or

Create a new constant, fill in the initial user data with behavior missing according to the constant, and generate a data set to be processed.
A user classification device, wherein the device includes:

The cluster distribution feature map acquisition module is used to acquire the cluster distribution feature map of the SOM network trained according to the sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the cluster distribution map is used to reflect the number and distribution of clusters included in the trained SOM network;

A cluster number determination module, configured to determine the number of clusters of the SOM network after training based on the cluster distribution feature map;

The available SOM network generation module is used to adjust the initial node number of the original SOM network according to the number of clusters, determine the number of available SOM network nodes, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;

The clustering result determination module is configured to perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

The user classification result generating module is used to obtain the corresponding user classification result according to the clustering result.
The device according to claim 7, wherein the device further comprises:

The node acquisition module is used to acquire each node in the output layer of the original SOM network and initialize each node.

The training end condition acquisition module is used to acquire a preset training end condition.

The sample user data acquisition module is used to acquire each sample user data in the sample data set, and perform normalization processing on each sample user data.

The best matching node determination module is used to determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network.

The best matching neighborhood determining module is used to obtain any topological neighborhood of the best matching node, and to determine the best matching neighborhood centered on the best matching node from the topological neighborhood.

The trained SOM network generation module is used to return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached to obtain the SOM network trained according to the sample data set.
A computer device includes a memory and a processor, the memory stores a computer program, wherein the processor implements a user classification method when the computer program is executed, and the user classification method includes the following steps:

Obtain a clustering distribution feature map of the SOM network trained according to a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network;

Determine the number of clusters of the SOM network after training based on the cluster distribution feature map;

According to the number of clusters, the number of initial nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;

Perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

A corresponding user classification result is obtained according to the clustering result.
The computer device according to claim 9, wherein when the processor executes the user classification method, it further comprises:

Obtain each node of the output layer of the original SOM network, and initialize each node;

Acquiring a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive trainings reaches a preset threshold;

Acquiring each sample user data in the sample data set, and normalizing each of the sample user data;

Determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network;

Acquiring any topological neighborhood of the best matching node, and determining the best matching neighborhood centered on the best matching node from the topological neighborhood;

Return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached, and the SOM network trained according to the sample data set is obtained.
The computer device according to claim 10, wherein the obtaining any topological neighborhood of the best matching node, and determining the best matching neighborhood centered on the best matching node from the topological neighborhood, comprises:

Acquiring any topological neighborhood of the best matching node, where the topological neighborhood is an initial neighborhood with a preset range size;

According to the training time and the position of the best matching node in the initial neighborhood, the initial neighborhood is shrunk, and the best matching neighborhood centered on the best matching node is determined.
The computer device according to claim 11, wherein, when the processor executes the user classification method, further comprising:

The following formula is used to adjust the weights of the best matching node and each node in the corresponding initial neighborhood:

$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$;

Here, t is the training step, i is the node index, $m_i(t)$ is the weight of node i at step t, and $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$; $h_{ci}(t)$ is the neighborhood function, and $x(t)$ represents the output vector.
The computer device according to claim 9, wherein before the processor performs clustering analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, the method further comprises:

Obtain the user data to be processed, and determine the data type of each user data to be processed; the data type includes a continuous variable data type and a categorical variable data type;

According to the data type, perform data preprocessing on each of the user data to be processed to obtain initial user data; the preprocessing includes data normalization processing corresponding to the continuous variable data type, and label encoding processing corresponding to the categorical variable data type;

Perform missing value filling processing on the initial user data to generate a data set to be processed.
The computer device according to claim 13, wherein, when performing the missing value filling processing on the initial user data to generate the to-be-processed data set, it comprises:

Determine the data missing type of the initial user data; the data missing type includes information missing and behavior missing, and the information missing includes continuous variable data missing and categorical variable data missing;

Determining the mean value of the initial user data belonging to the continuous variable data type, and filling in the initial user data belonging to the continuous variable data missing with missing values according to the mean value, to generate a data set to be processed;

or

Determine the mode of the initial user data belonging to the categorical variable data type, and fill in the initial user data with missing data belonging to the categorical variable with missing values according to the mode, to generate a data set to be processed;

or

Create a new constant, fill in the initial user data with behavior missing according to the constant, and generate a data set to be processed.
A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, a user classification method is implemented, and the user classification method includes the following steps:

Obtain a clustering distribution feature map of the SOM network trained according to a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network;

Determine the number of clusters of the SOM network after training based on the cluster distribution feature map;

According to the number of clusters, adjust the number of initial nodes of the original SOM network, determine the number of nodes of the available SOM network, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;

Perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;

A corresponding user classification result is obtained according to the clustering result.
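The last two steps above (clustering with the available SOM network, then mapping clusters to user classes) might be sketched as follows. This is an illustrative assumption rather than the claimed method: the node weights, the class names, and the helper names `best_matching_node` and `classify_users` are hypothetical, and the example assumes the feature map indicated three clusters, so the available SOM network has three nodes.

```python
import math

def best_matching_node(sample, weights):
    """Index of the node whose weight vector is closest (Euclidean) to the sample."""
    return min(range(len(weights)), key=lambda i: math.dist(sample, weights[i]))

def classify_users(dataset, weights, class_names):
    """Assign each sample to its best matching node and return that node's class name."""
    return [class_names[best_matching_node(x, weights)] for x in dataset]

# Hypothetical trained weights of an available SOM network with 3 nodes.
weights = [[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]]
class_names = ["low activity", "high value", "new user"]  # hypothetical labels
dataset = [[0.15, 0.05], [0.8, 0.3], [0.45, 0.95]]
labels = classify_users(dataset, weights, class_names)
```

Because the number of nodes equals the number of clusters, each node index can serve directly as a cluster identifier in this sketch.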
The computer-readable storage medium according to claim 15, wherein, when the computer program is executed by a processor to implement the user classification method, the method further comprises:

Obtain each node of the output layer of the original SOM network, and initialize each node;

Acquiring a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive trainings reaches a preset threshold;

Acquiring each sample user data in the sample data set, and normalizing each of the sample user data;

Determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network;

Acquiring any topological neighborhood of the best matching node, and determining the best matching neighborhood centered on the best matching node from the topological neighborhood;

Return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached, and the SOM network trained according to the sample data set is obtained.
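The training steps above can be sketched in plain Python under several stated assumptions: the output layer is a one-dimensional chain of nodes, the neighborhood is Gaussian, the learning rate and radius decay linearly, and the end condition is read as "the largest weight change between two consecutive passes falls below `tol`". The names `train_som`, `tol`, and `max_epochs` are hypothetical.

```python
import math
import random

def train_som(samples, n_nodes, tol=1e-3, max_epochs=200, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    # Min-max normalize each feature of the sample user data to [0, 1].
    lo = [min(s[d] for s in samples) for d in range(dim)]
    hi = [max(s[d] for s in samples) for d in range(dim)]
    data = [[(s[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0) for d in range(dim)]
            for s in samples]
    # Initialize every output-layer node with random weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(max_epochs):
        decay = 1 - epoch / max_epochs
        alpha = 0.5 * decay                        # decreasing learning rate, 0 < alpha < 1
        radius = max(1.0, (n_nodes / 2) * decay)   # shrinking neighborhood width
        before = [w[:] for w in weights]
        for x in rng.sample(data, len(data)):      # visit the samples in random order
            # Best matching node: the node whose weights are closest to the sample.
            bmu = min(range(n_nodes), key=lambda i: math.dist(x, weights[i]))
            for i in range(n_nodes):
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                weights[i] = [w + alpha * h * (xd - w)
                              for w, xd in zip(weights[i], x)]
        # Assumed end condition: weight change between consecutive passes is small.
        change = max(abs(a - b)
                     for wa, wb in zip(weights, before)
                     for a, b in zip(wa, wb))
        if change < tol:
            break
    return weights

# Hypothetical sample user data with two features per user.
weights = train_som([[1.0, 20.0], [1.2, 22.0], [9.0, 80.0], [9.5, 78.0]], n_nodes=2)
```

The fixed seed only makes the sketch reproducible; it is not part of the described method.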
The computer-readable storage medium according to claim 16, wherein acquiring any topological neighborhood of the best matching node, and determining the best matching neighborhood centered on the best matching node from the topological neighborhood, comprises:

Acquiring any topological neighborhood of the best matching node, where the topological neighborhood is an initial neighborhood with a preset range size;

According to the training time and the position of the best matching node in the initial neighborhood, the initial neighborhood is shrunk, and the best matching neighborhood centered on the best matching node is determined.
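One possible reading of the shrinking step is sketched below. The linear decay schedule, the square (Chebyshev-distance) neighborhood on a two-dimensional grid, and the function names are assumptions for illustration; the claim itself does not fix a particular schedule or neighborhood shape.

```python
def neighborhood_radius(initial_radius, step, total_steps):
    """Shrink the radius linearly from initial_radius toward 0 as training proceeds."""
    return initial_radius * (1 - step / total_steps)

def best_matching_neighborhood(bmu, grid_shape, radius):
    """Grid coordinates within int(radius) rows/columns of the best matching node."""
    rows, cols = grid_shape
    r = int(radius)
    return [
        (i, j)
        for i in range(max(0, bmu[0] - r), min(rows, bmu[0] + r + 1))
        for j in range(max(0, bmu[1] - r), min(cols, bmu[1] + r + 1))
    ]

# Hypothetical 5x5 output grid, best matching node at the center, initial radius 2.
early = best_matching_neighborhood((2, 2), (5, 5), neighborhood_radius(2, 0, 10))
middle = best_matching_neighborhood((2, 2), (5, 5), neighborhood_radius(2, 5, 10))
late = best_matching_neighborhood((2, 2), (5, 5), neighborhood_radius(2, 9, 10))
```

In this sketch the neighborhood starts as the whole 5x5 grid, shrinks to the 3x3 block around the best matching node halfway through, and finally contains only the best matching node itself.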
The computer-readable storage medium according to claim 17, wherein, when the computer program is executed by a processor to implement the user classification method, the method further comprises:

The following formula is used to adjust the weights of the best matching node and each node in the corresponding initial neighborhood:

$$m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)];$$

where t is the training step, i is the node index, $m_i(t)$ is the weight of node i at step t, $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ represents the output vector.
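A single application of the update formula can be sketched as follows. The Gaussian form of the neighborhood function, the parameter `sigma`, and the function name `update_weights` are assumptions; the claim only requires some neighborhood function h_ci(t). Here `x` stands for the vector x(t) of the formula and `bmu` for the index c of the best matching node.

```python
import math

def update_weights(weights, positions, x, bmu, alpha, sigma):
    """One step of m_i(t+1) = m_i(t) + alpha * h_ci * (x - m_i(t)) for every node i."""
    updated = []
    for m_i, pos in zip(weights, positions):
        # Squared grid distance between node i and the best matching node c.
        d2 = sum((p - q) ** 2 for p, q in zip(pos, positions[bmu]))
        h_ci = math.exp(-d2 / (2 * sigma ** 2))  # assumed Gaussian neighborhood function
        updated.append([m + alpha * h_ci * (xd - m) for m, xd in zip(m_i, x)])
    return updated

# Hypothetical two-node map on a line; node 0 is the best matching node.
new_weights = update_weights(
    weights=[[0.0, 0.0], [1.0, 1.0]],
    positions=[(0,), (1,)],
    x=[1.0, 0.0],
    bmu=0,
    alpha=0.5,
    sigma=1.0,
)
```

The best matching node has h_ci = 1 and moves halfway toward x, while its neighbor moves by the smaller factor 0.5 * exp(-0.5).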
The computer-readable storage medium according to claim 15, wherein before the processor performs cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, the method further comprises:

Obtain the user data to be processed, and determine the data type of each user data to be processed; the data type includes a continuous variable data type and a categorical variable data type;

According to the data type, perform data preprocessing on each piece of the user data to be processed to obtain initial user data; the preprocessing includes normalizing data corresponding to the continuous variable data type, and label-encoding data corresponding to the categorical variable data type;

Perform missing value filling processing on the initial user data to generate a data set to be processed.
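The two preprocessing operations named above can be sketched in Python as follows. Min-max scaling is one common normalization and is an assumption here, as are the function names, the sample values, and the use of `None` for values that are still missing at this stage and are filled in the subsequent step.

```python
def min_max_normalize(values):
    """Scale the present values of a continuous variable to [0, 1]; keep None as is."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [None if v is None else (v - lo) / span for v in values]

def label_encode(values):
    """Map each distinct category to an integer code in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v is None:
            encoded.append(None)
        else:
            encoded.append(codes.setdefault(v, len(codes)))
    return encoded, codes

# Hypothetical columns of user data to be processed.
ages = min_max_normalize([20, 30, None, 40])
cities, city_codes = label_encode(["SZ", "BJ", "SZ", None])
```

Keeping missing entries as `None` lets the later missing value filling step decide how each one is filled.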
The computer-readable storage medium according to claim 19, wherein performing the missing value filling processing on the initial user data to generate the data set to be processed comprises:

Determine the data missing type of the initial user data; the data missing type includes information missing and behavior missing, and the information missing includes continuous variable data missing and categorical variable data missing;

Determine the mean value of the initial user data belonging to the continuous variable data type, and fill the missing values of the initial user data whose missing type is continuous variable data missing with the mean value, to generate a data set to be processed;

or

Determine the mode of the initial user data belonging to the categorical variable data type, and fill the missing values of the initial user data whose missing type is categorical variable data missing with the mode, to generate a data set to be processed;

or

Create a new constant, and fill the missing values of the initial user data whose missing type is behavior missing with the constant, to generate a data set to be processed.