WO2021203854A1 - User classification method and apparatus, computer device and storage medium - Google Patents

User classification method and apparatus, computer device and storage medium

Info

Publication number
WO2021203854A1
WO2021203854A1 PCT/CN2021/077380 CN2021077380W WO2021203854A1 WO 2021203854 A1 WO2021203854 A1 WO 2021203854A1 CN 2021077380 W CN2021077380 W CN 2021077380W WO 2021203854 A1 WO2021203854 A1 WO 2021203854A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processed
user data
som network
sample
Prior art date
Application number
PCT/CN2021/077380
Other languages
French (fr)
Chinese (zh)
Inventor
孔清扬
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021203854A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to a user classification method, device, computer equipment, and storage medium.
  • The traditional segmentation method based on one-dimensional user information cannot reflect the varied characteristics of users and user information, nor can it dynamically track changes in user behavior. It therefore cannot meet users' diversified needs for products or services, which results in inaccurate user information classification.
  • a user classification method includes:
  • the sample data set includes sample user data
  • the SOM network is a self-organizing mapping network
  • the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network
  • the number of initial nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
  • a corresponding user classification result is obtained according to the clustering result.
  • a user classification device includes:
  • the cluster distribution feature map acquisition module is used to acquire the cluster distribution feature map of the SOM network trained according to the sample data set;
  • the sample data set includes sample user data, the SOM network is a self-organizing mapping network, and the cluster distribution map is used to reflect the number and distribution of clusters included in the trained SOM network;
  • a cluster number determination module configured to determine the number of clusters of the SOM network after training based on the cluster distribution feature map
  • the available SOM network generation module is used to adjust the initial node number of the original SOM network according to the number of clusters, determine the number of available SOM network nodes, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
  • a clustering result determination module configured to perform cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated based on the user data to be processed;
  • the user classification result generating module is used to obtain the corresponding user classification result according to the clustering result.
  • a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when the processor executes the computer program:
  • the sample data set includes sample user data
  • the SOM network is a self-organizing mapping network
  • the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network
  • the number of initial nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
  • a corresponding user classification result is obtained according to the clustering result.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the following steps are implemented:
  • the sample data set includes sample user data
  • the SOM network is a self-organizing mapping network
  • the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network
  • the number of initial nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
  • a corresponding user classification result is obtained according to the clustering result.
  • This application uses the self-organizing mapping function of the SOM network to mine the distribution of clusters contained in high-dimensional data, improves the efficiency and accuracy of the entire clustering process, and achieves accurate classification of users.
  • FIG. 1 is an application scenario diagram of a user classification method in an embodiment;
  • FIG. 2 is a schematic flowchart of a user classification method in an embodiment;
  • FIG. 3 is a schematic diagram of the clustering distribution characteristics of the SOM network in an embodiment;
  • FIG. 4 is a schematic flowchart of a user classification method in another embodiment;
  • FIG. 5 is a schematic flowchart of a user classification method in another embodiment;
  • FIG. 6 is a structural block diagram of a user classification device in an embodiment;
  • FIG. 7 is a structural block diagram of a user classification device in another embodiment;
  • FIG. 8 is an internal structure diagram of a computer device in an embodiment.
  • the technical solution of the present application relates to the field of artificial intelligence and/or big data technology, such as neural network technology, to realize intelligent user classification.
  • the data involved in this application can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain, which is not limited in this application.
  • the user classification method provided in this application can be applied to the application environment shown in FIG. 1, including the terminal 102 and the server 104, and specifically can be applied to the server 104, where the terminal 102 and the server 104 communicate through a network.
  • the server 104 obtains the clustering distribution feature map of the SOM network trained according to the sample data set, and determines the number of clusters of the trained SOM network based on the clustering distribution feature map.
  • the sample data set includes sample user data
  • the cluster distribution map is used to reflect the number and distribution of clusters included in the trained SOM network.
  • the server 104 adjusts the number of initial nodes of the original SOM network according to the number of clusters, determines the number of available SOM network nodes, and obtains the available SOM network, where the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for user classification is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
  • Step S202 Obtain a clustering distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data.
  • the original SOM network is trained to obtain the trained SOM network, and the clustering distribution feature map of the trained SOM network is obtained.
  • the SOM network is a self-organizing mapping network, and clustering means that in order to improve the query speed of a certain attribute or a certain attribute group, tuples with the same value on the same attribute, that is, the clustering code, are concentrated in a continuous physical block.
  • the cluster distribution map is used to reflect the number of clusters and the distribution of each cluster in the SOM network after training.
  • the SOM network structure emphasizes the proximity relationship between cluster centers, and the correlation between adjacent clusters is stronger. The map can be analyzed according to its color boundaries, with different colors forming multiple clusters, so that the cluster distribution feature map of the trained SOM network can be obtained.
  • multiple clusters can be formed according to the trained SOM network, including a first cluster 301, a second cluster 302, a third cluster 303, a fourth cluster 304, a fifth cluster 305, and a sixth cluster 306, so that the cluster distribution feature map of the trained SOM network is obtained.
  • SOM stands for Self-Organizing Map, an artificial neural network that simulates how the human brain processes signals. It can map arbitrary input patterns onto one-dimensional, two-dimensional, or even higher-dimensional discrete maps in the output layer while keeping the topological structure unchanged. It is composed of an input layer and an output layer (competitive layer); the number of neurons in the input layer is n, and the output layer is a one-dimensional or two-dimensional planar array composed of m neurons.
  • the network is fully connected, that is, each input node is connected to all output nodes.
  • through repeated learning of the input patterns, the SOM network can make the weight vector space converge toward the probability distribution of the input patterns, that is, probability preservation.
  • each neuron in the output layer of the SOM network competes for the opportunity to respond to the input pattern, and the weights related to the winning neuron are adjusted in a direction that favors its competition. That is, with the winning neuron as the center, excitatory feedback is applied to nearby neurons and inhibitory feedback is applied to distant neurons: near neighbors excite each other, and distant neighbors inhibit each other.
  • Step S204 Determine the number of clusters of the trained SOM network based on the cluster distribution feature map.
  • the cluster distribution feature map is analyzed according to its color boundaries.
  • the map reflects the number of categories, where the number of categories is the number of clusters. Regions with similar colors are grouped into one cluster, so that different colors form multiple clusters.
  • for example, 6 clusters can be obtained, namely the first cluster 301 through the sixth cluster 306.
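The source does not say how the colored map is computed. One common choice for visualizing cluster boundaries in a SOM is a unified distance matrix (U-matrix), so the following is only a sketch under that assumption. The function names `u_matrix` and `count_low_regions`, the 4-connected grid, and the threshold value are all hypothetical and not taken from the application.

```python
import math


def u_matrix(weights):
    """weights[r][c] is the weight vector of the node at grid row r, column c.

    Returns a grid holding, for each node, the mean Euclidean distance to its
    4-connected grid neighbors. Large values mark boundaries between clusters.
    """
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result


def count_low_regions(umat, threshold):
    """Count connected regions whose U-matrix value is below `threshold`.

    Assumption: each such region is read as one cluster, in the same way a
    reader would count same-colored areas on the map.
    """
    rows, cols = len(umat), len(umat[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if umat[r][c] >= threshold or (r, c) in seen:
                continue
            regions += 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:  # flood fill over neighboring low-valued cells
                cr, cc = stack.pop()
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and umat[nr][nc] < threshold):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
    return regions
```

With a 1 x 4 grid whose first two nodes sit near the origin and whose last two sit near (5, 5), `count_low_regions(u_matrix(grid), 1.0)` reports two regions. The threshold would have to be tuned per data set.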
  • Step S206 According to the number of clusters, the initial number of nodes of the original SOM network is adjusted to determine the number of available SOM network nodes to obtain the available SOM network.
  • the initial number of nodes of the original SOM network is obtained, and the initial number of nodes of the original SOM network is adjusted according to the number of clusters to obtain the number of available SOM network nodes.
  • the number of nodes of the available SOM network represents the number of nodes consistent with the number of clusters obtained by adjusting the number of initial nodes. Since the initial number of nodes of the original SOM network does not match the number of clusters, the initial number of nodes of the original SOM network needs to be adjusted.
  • available SOM network nodes are obtained, and an available SOM network is formed from the available SOM network nodes.
  • Step S208 Perform cluster analysis on the data set to be processed according to the available SOM network, and determine the clustering result of the data set to be processed, where the data set to be processed is generated based on the user data to be processed.
  • the to-be-processed data set generated according to the to-be-processed user data is obtained, and clustering analysis is performed on the to-be-processed data based on the available SOM network to obtain the clustering result of the to-be-processed data.
  • the clustering result includes multiple clusters and the composition of each different cluster.
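As a rough illustration of step S208, the sketch below assigns each record to its nearest node in an already trained available SOM. It assumes purely numeric, normalized features and Euclidean distance; `assign_clusters` and its arguments are hypothetical names, not part of the application.

```python
import math
from collections import defaultdict


def assign_clusters(samples, node_weights):
    """Map each sample index to the index of its closest SOM node.

    samples: list of equal-length numeric feature vectors.
    node_weights: one weight vector per node of the available SOM
                  (assumed here to have one node per cluster).
    Returns {node_index: [sample_index, ...]}.
    """
    clusters = defaultdict(list)
    for idx, x in enumerate(samples):
        best = min(range(len(node_weights)),
                   key=lambda j: math.dist(x, node_weights[j]))
        clusters[best].append(idx)
    return dict(clusters)
```

The returned dictionary gives both the clusters and the composition of each one, which is the shape of the clustering result described above.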
  • Step S210 Obtain a corresponding user classification result according to the clustering result.
  • the clustering result is analyzed, the number of clusters it includes and the size of each cluster are determined, the statistical indicators of each cluster in each dimension are calculated, and the user needs in different clusters are determined based on the statistical indicators.
  • the clustering result includes the number of clusters and the size of each cluster. For example, the clustering result may include 10 clusters, specifically cluster a, cluster b, cluster c, and so on, where cluster a has a total size of 10,000 customers and cluster b has a size of 5,000 customers.
  • the needs of users in different clusters can be determined by analyzing these results. For example, the statistical indicators may show that cluster a consists of young people with small-amount financial needs, so small-amount financial products can be pushed to the users of cluster a based on their needs to achieve precision marketing.
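The source does not name the statistical indicators used. The sketch below assumes the simplest option, the per-dimension mean, together with the cluster size. `cluster_profile` and the example field layout are hypothetical.

```python
from statistics import mean


def cluster_profile(samples, labels):
    """Summarize each cluster by its size and per-dimension mean.

    samples: list of numeric feature vectors (for example [age, amount]).
    labels:  cluster label of each sample, in the same order.
    Returns {label: {"size": int, "means": [float, ...]}}.
    """
    grouped = {}
    for x, lab in zip(samples, labels):
        grouped.setdefault(lab, []).append(x)
    return {
        lab: {"size": len(rows), "means": [mean(col) for col in zip(*rows)]}
        for lab, rows in grouped.items()
    }
```

A cluster whose mean age and mean purchase amount are both low would, under this reading, correspond to the "young people with small-amount financial needs" example.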
  • in the above user classification method, the clustering distribution feature map of the SOM network trained according to the sample data set is obtained, where the sample data set includes sample user data, and the number of clusters of the trained SOM network is determined based on the clustering distribution feature map.
  • according to the number of clusters, the initial number of nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained, so that cluster analysis is performed on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, where the data set to be processed is generated based on the user data to be processed.
  • according to the clustering result, the corresponding user classification result is obtained. The self-organizing mapping function of the SOM network is thereby used to mine the distribution of clusters contained in high-dimensional data, which improves the efficiency and accuracy of the entire clustering process and achieves accurate classification of users.
  • a user classification method which specifically includes the following steps:
  • step S402 each node of the output layer of the original SOM network is obtained, and each node is initialized.
  • specifically, each node of the output layer of the original SOM network is obtained, and an initial weight is assigned to each node, which includes initializing the connection weights, learning rate, and neighborhood of the SOM network.
  • the initial weight value is obtained from the initialization operation, that is, each node randomly initializes its parameters, and the initial neighborhood is a relatively large area that includes multiple nodes.
  • Step S404 Acquire preset training ending conditions.
  • the training end condition is that the weight error limit obtained from two consecutive trainings reaches the preset threshold.
  • the weight error limit of two consecutive trainings can be obtained, and the training end condition can be determined from it. A predefined training length may also be obtained.
  • the training end condition refers to setting the error limit of two consecutive training weights to a preset threshold, and the training ends when the error limit of the two consecutive training weights reaches the preset threshold.
  • for example, the training end condition may be that the weight error between two consecutive training processes is less than 0.03.
  • when the weight error between two consecutive training processes is 0.02, which is less than 0.03, the training ends.
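The source does not define how the weight error between two consecutive trainings is measured. The sketch below assumes the largest absolute change of any single weight component; the function names are hypothetical, and 0.03 is the example threshold from the text.

```python
def max_weight_change(old_weights, new_weights):
    """Largest absolute difference between corresponding weight components."""
    return max(
        abs(n - o)
        for old, new in zip(old_weights, new_weights)
        for o, n in zip(old, new)
    )


def training_finished(old_weights, new_weights, threshold=0.03):
    """True when the weight error of two consecutive trainings is below threshold."""
    return max_weight_change(old_weights, new_weights) < threshold
```

Other measures, such as the mean or the Euclidean norm of the change, would fit the description equally well.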
  • Step S406 Obtain each sample user data in the sample data set, and perform normalization processing on each sample user data.
  • the user data corresponding to the continuous variable data type is normalized, where the continuous variable data type may include deposit amount and age.
  • label encoding is performed on the categorical variable data, for example encoding primary school, middle school, and university in educational background as 0, 1, and 2, for subsequent calculation of the Hamming distance between categorical variables.
  • the user data specifically includes age, gender, education, occupation, region, and behavioral data, including the types of products purchased, the total amount of purchased products, and the number of times purchased products.
  • age, the total amount of purchased products, and the number of purchases of products are determined as continuous variables, and other data, including gender, education, occupation, region, and type of purchased products, are all determined as categorical variables.
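A minimal sketch of the two preprocessing operations follows. The source does not state which normalization is used, so min-max scaling to [0, 1] is an assumption; `min_max_normalize`, `label_encode`, and the `order` parameter are hypothetical names.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(values, order=None):
    """Replace each category by an integer code.

    order: optional explicit category order; otherwise categories are sorted.
    Returns (codes, mapping).
    """
    categories = order if order is not None else sorted(set(values))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping
```

Passing `order=["primary school", "middle school", "university"]` reproduces the 0, 1, 2 coding given as the example in the text.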
  • Step S408 Determine the first sample data from the normalized sample user data, and determine the best matching node of the first sample data from each node in the output layer of the original SOM network.
  • specifically, one sample user data is randomly selected from the normalized sample user data and determined as the first sample data, the distance between the first sample data and each node of the SOM network output layer is calculated, and the node with the smallest distance from the first sample data is selected and determined as the best matching node of the input sample data.
  • the node with the smallest distance from the sample data is called the best matching node (Best Matching Unit, or BMU) of the input sample data.
  • the distance between the sample data and each output node can be calculated with a distance formula for mixed-variable attributes, that is, the distances for the continuous variables and for the categorical variables are calculated separately and then added to obtain the distance between the sample data and each output node.
  • the distance between continuous variables is the Euclidean distance.
  • the neighborhood function is used to determine whether a node lies in the best matching neighborhood and whether its information matches. That is, the radius of the neighborhood is calculated, all nodes are traversed to check whether they lie within that radius, and the weight vector of each node in the best matching neighborhood is updated.
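The mixed-variable distance and the BMU search can be sketched as follows. This assumes Euclidean distance over the continuous attributes and Hamming distance (a count of mismatches) over the label-encoded categorical attributes, added without weighting. It also assumes each node stores a category code at the categorical positions. All names are hypothetical.

```python
import math


def mixed_distance(sample, node, continuous_idx, categorical_idx):
    """Euclidean distance on continuous attributes plus Hamming distance on
    categorical attributes, added together."""
    euclid = math.sqrt(sum((sample[i] - node[i]) ** 2 for i in continuous_idx))
    hamming = sum(1 for i in categorical_idx if sample[i] != node[i])
    return euclid + hamming


def best_matching_unit(sample, nodes, continuous_idx, categorical_idx):
    """Index of the node with the smallest mixed distance to the sample."""
    return min(
        range(len(nodes)),
        key=lambda j: mixed_distance(sample, nodes[j],
                                     continuous_idx, categorical_idx),
    )
```

Because the continuous attributes are normalized to a bounded range while each categorical mismatch contributes 1, a practical implementation might weight the two terms; the source gives no such weighting.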
  • Step S410 Obtain any topological neighborhood of the best matching node, and determine the best matching neighborhood centered on the best matching node from the topological neighborhood.
  • the topological neighborhood is an initial neighborhood with a preset range size, and the initial neighborhood is contracted to determine the best matching neighborhood centered on the best matching node.
  • specifically, any topological neighborhood of the best matching node is predefined, and the best matching neighborhood centered on the best matching node is then determined.
  • an initial neighborhood with a larger range can be preset, and the initial neighborhood is shrunk to obtain the best matching neighborhood.
  • the following formula is used to adjust the weights of the best matching node and each node in the corresponding initial neighborhood:
  • $m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$;
  • $t$ is the training step
  • $i$ is the node index
  • $m_i(t)$ is the weight of node $i$ at step $t$
  • $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ is the input vector.
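The update rule translates directly into code. The source does not give the form of the neighborhood function, so the Gaussian below is an assumption, as are the function names.

```python
import math


def gaussian_neighborhood(grid_distance, radius):
    """Assumed neighborhood function: 1 at the BMU, decaying with the grid
    distance between node i and the BMU c."""
    return math.exp(-(grid_distance ** 2) / (2 * radius ** 2))


def update_weight(m_i, x, alpha, h_ci):
    """One application of m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * (x(t) - m_i(t))."""
    return [m + alpha * h_ci * (xv - m) for m, xv in zip(m_i, x)]
```

For example, with a learning rate of 0.5 and a neighborhood value of 1, the weight `[0.0, 1.0]` moves halfway toward the input `[1.0, 1.0]` and becomes `[0.5, 1.0]`.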
  • Step S412 when the training end condition is not reached, return to the step of determining the first sample data from the normalized sample user data.
  • the SOM network trained according to the sample data set is obtained.
  • the training end condition is that the weight error limit obtained from two consecutive trainings reaches a preset threshold, and when the weight error limit of two consecutive trainings reaches the preset threshold, the training end condition is reached, and the training ends.
  • otherwise, the first sample data is again determined from the normalized sample user data, and the training process is repeated: the best matching node of the first sample data is determined, any topological neighborhood of the best matching node is obtained, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood, until the training end condition is reached.
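Steps S402 to S412 can be put together as the following minimal training loop. It is a sketch under several assumptions that the source does not state: a one-dimensional output layer, continuous features only, linearly decaying learning rate and neighborhood radius, a Gaussian neighborhood, and one "training" meaning one shuffled pass over the samples. `train_som` and its parameters are hypothetical.

```python
import math
import random


def train_som(samples, n_nodes, tol=0.03, max_epochs=200, seed=0):
    """Train a 1-D SOM and return its node weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # S402: random initial weights for every output node.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(max_epochs):
        frac = epoch / max_epochs
        alpha = 0.5 * (1.0 - frac)                       # decreasing, in (0, 1)
        radius = max(n_nodes / 2.0 * (1.0 - frac), 0.5)  # shrinking neighborhood
        previous = [w[:] for w in weights]
        for x in rng.sample(samples, len(samples)):      # random sample order
            # S408: best matching node is the node closest to the sample.
            bmu = min(range(n_nodes), key=lambda j: math.dist(x, weights[j]))
            # S410: update only nodes inside the current neighborhood.
            for i in range(n_nodes):
                if abs(i - bmu) > radius:
                    continue
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                weights[i] = [m + alpha * h * (xv - m)
                              for m, xv in zip(weights[i], x)]
        # S404/S412: stop when the weights barely moved between two passes.
        change = max(abs(a - b)
                     for old, new in zip(previous, weights)
                     for a, b in zip(old, new))
        if change < tol:
            break
    return weights
```

Because every update is a convex combination of the old weight and the sample, weights initialized in [0, 1] stay in [0, 1] when the samples are normalized to that range.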
  • the user classification method described above normalizes each sample user data in the sample data set and determines the first sample data from the normalized sample user data. The best matching node of the first sample data is determined from the nodes of the original SOM network output layer, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood of the best matching node, until the training end condition is reached and the SOM network trained according to the sample data set is obtained.
  • the trained SOM network can be used to determine the number of clusters corresponding to the user data to be processed, and the original SOM network can be adjusted to meet the clustering requirements of the user data to be processed, which further improves the accuracy of user classification.
  • a user classification method which specifically includes the following steps:
  • Step S502 Obtain user data to be processed, and determine the data type of each user data to be processed.
  • the user data to be processed is acquired, and the data type of each user data to be processed is determined.
  • the data types include continuous variable data types and categorical variable data types.
  • User data specifically includes age, gender, education, occupation, region, and behavioral data including the types of products purchased, the total amount of products purchased, and the number of times the products are purchased.
  • age, the total amount of purchased products, and the number of purchases of products are determined as continuous variables, and other data, including gender, education, occupation, region, and type of purchased products, are all determined as categorical variables.
  • Step S504 Perform data preprocessing on each user data to be processed according to the data type to obtain initial user data.
  • the preprocessing includes data normalization processing corresponding to the continuous variable data type, and label encoding processing corresponding to the categorical variable data type.
  • the user data corresponding to the continuous variable data type is normalized.
  • the continuous variable data type can include deposit amount and age.
  • the continuous variable data is normalized to ensure that the continuous variables are on a consistent scale, since the scale of the deposit amount is much larger than that of age.
  • Perform label coding processing on the categorical variable data such as the primary school, middle school and university in the academic qualifications coded into 0, 1, 2 for subsequent calculation of the Hamming distance of the categorical variable.
  • variable filtering is also performed, which specifically includes:
  • obtaining user attributes and behavior data and determining them as variables.
  • user attributes include gender, age, and educational background;
  • behavior data includes deposit amount, loan amount, and client login count.
  • Step S506 Perform missing value filling processing on the initial user data to generate a data set to be processed.
  • the data missing type of the initial user data is determined, and the corresponding initial user data is filled with missing values according to the data missing type.
  • the types of data missing include information missing and behavior missing, and information missing includes continuous variable data missing and categorical variable data missing.
  • when continuous variable data is missing, the mean value of the initial user data of the corresponding continuous variable data type is determined, and the missing continuous variable data is filled with that mean to generate the data set to be processed.
  • when the initial user data has a behavior missing, a constant is created, and the missing behavior data is filled with that constant to generate the data set to be processed.
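The two filling rules that the text spells out (mean for missing continuous data, a constant for missing behavior data) might look like the sketch below. It assumes records are dictionaries with `None` marking a missing value and uses 0 as the constant; these choices and the function name `fill_missing` are hypothetical. Missing categorical data is not handled because the source does not describe a rule for it.

```python
from statistics import mean


def fill_missing(records, continuous_fields, behavior_fields, behavior_constant=0):
    """Return copies of `records` with missing values filled.

    continuous_fields: filled with the mean of the observed values.
    behavior_fields:   filled with `behavior_constant`.
    """
    filled = [dict(r) for r in records]  # leave the input untouched
    for field in continuous_fields:
        observed = [r[field] for r in records if r.get(field) is not None]
        fill_value = mean(observed) if observed else 0.0
        for r in filled:
            if r.get(field) is None:
                r[field] = fill_value
    for field in behavior_fields:
        for r in filled:
            if r.get(field) is None:
                r[field] = behavior_constant
    return filled
```

A record with a missing age between two users aged 20 and 40 would receive an age of 30, and a missing purchase count would become 0.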
  • in the above user classification method, the user data to be processed is obtained, the data type of each user data to be processed is determined, data preprocessing is performed on each user data to be processed according to its data type to obtain the initial user data, and missing value filling is performed on the initial user data to generate the data set to be processed. This preprocessing and missing value filling avoids processing invalid data in the subsequent classification process, reduces the workload of classification, and further improves the efficiency of user classification.
  • a user classification device including: a cluster distribution feature map acquisition module 602, a cluster number determination module 604, an available SOM network generation module 606, a clustering result determination module 608, and a user classification result generation module 610, wherein:
  • the clustering distribution feature map acquisition module 602 is used to acquire the clustering distribution feature map of the SOM network trained according to the sample data set.
  • the sample data set includes sample user data.
  • the SOM network is a self-organizing mapping network, and the clustering distribution map is used to reflect the number and distribution of clusters included in the trained SOM network.
  • the cluster number determining module 604 is configured to determine the number of clusters of the trained SOM network based on the cluster distribution feature map.
  • the available SOM network generation module 606 is used to adjust the initial number of nodes of the original SOM network according to the number of clusters, determine the number of available SOM network nodes, and obtain the available SOM network. Among them, the number of nodes of the available SOM network represents the number of nodes consistent with the number of clusters obtained by adjusting the number of initial nodes.
  • the clustering result determination module 608 is configured to perform cluster analysis on the data set to be processed according to the available SOM network, and determine the clustering result of the data set to be processed, and the data set to be processed is generated based on the user data to be processed.
  • the user classification result generating module 610 is configured to obtain a corresponding user classification result according to the clustering result.
  • the user classification device described above obtains a clustering distribution feature map of the SOM network trained on a sample data set, where the sample data set includes sample user data, and determines the number of clusters of the trained SOM network based on the clustering distribution feature map. According to the number of clusters, the initial number of nodes of the original SOM network is adjusted, the number of available SOM network nodes is determined, and the available SOM network is obtained, so that cluster analysis is performed on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, where the data set to be processed is generated based on the user data to be processed. According to the clustering result, the corresponding user classification result is obtained. The self-organizing mapping function of the SOM network is thereby used to mine the distribution of clusters contained in high-dimensional data, which improves the efficiency and accuracy of the entire clustering process and achieves accurate classification of users.
  • a user classification device which includes: a node acquisition module 702, a training end condition acquisition module 704, a sample user data acquisition module 706, a best matching node determination module 708, a best matching neighborhood determination module 710, and a trained SOM network generation module 712, where:
  • the node obtaining module 702 obtains each node of the output layer of the original SOM network, and initializes each node.
  • the training end condition obtaining module 704 is used to obtain preset training end conditions.
  • the sample user data acquisition module 706 is used to acquire each sample user data in the sample data set, and perform normalization processing on each sample user data.
  • the best matching node determination module 708 is used to determine the first sample data from the normalized sample user data, and to determine the best matching node of the first sample data from each node in the output layer of the original SOM network .
  • the best matching neighborhood determining module 710 is configured to obtain any topological neighborhood of the best matching node, and determine the best matching neighborhood centered on the best matching node from the topological neighborhood.
  • the trained SOM network generation module 712 is used to return to the step of determining the first sample data from the normalized sample user data until the training end condition is reached to obtain the SOM network trained according to the sample data set.
  • the user classification device described above performs normalization processing on each sample user data in the sample data set and determines the first sample data from the normalized sample user data. It then determines the best matching node of the first sample data from the nodes of the output layer of the original SOM network, and determines the best matching neighborhood centered on the best matching node from the topological neighborhood of that node, repeating these steps until the training end condition is reached and the SOM network trained according to the sample data set is obtained.
  • the trained SOM network can be used to determine the number of clusters corresponding to the user data to be processed, and the original SOM network can be adjusted to meet the clustering requirements of the user data to be processed, which further improves the accuracy of user classification.
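  • The training flow carried out by modules 702 to 712 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the applicant's implementation: the function names, the min-max normalization, the Gaussian neighborhood, the linear decay schedules, and the reading of the end condition as "the largest weight change between two consecutive passes falls below a threshold" are all assumptions made for the example.

```python
import math
import random


def normalize(samples):
    """Min-max scale every feature column to [0, 1] (assumed normalization)."""
    cols = list(zip(*samples))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in samples]


def best_matching_node(weights, x):
    """Index of the output node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_som(samples, rows, cols, max_epochs=100, tol=1e-4, seed=0):
    rng = random.Random(seed)
    data = normalize(samples)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]      # output-layer nodes
    weights = [[rng.random() for _ in range(dim)] for _ in grid]   # initialize each node
    radius0 = max(rows, cols) / 2.0                                # initial neighborhood
    for epoch in range(max_epochs):
        frac = epoch / max_epochs
        alpha = 0.5 * (1.0 - frac)                  # monotonically decreasing rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighborhood shrinks over time
        before = [w[:] for w in weights]
        for x in rng.sample(data, len(data)):       # pick samples in random order
            c = best_matching_node(weights, x)
            for i, pos in enumerate(grid):
                d = math.dist(pos, grid[c])
                if d <= radius:                     # node lies in the neighborhood
                    h = math.exp(-(d * d) / (2.0 * radius * radius))
                    weights[i] = [w + alpha * h * (xv - w)
                                  for w, xv in zip(weights[i], x)]
        change = max(math.dist(a, b) for a, b in zip(before, weights))
        if change < tol:                            # assumed training end condition
            break
    return weights, grid
```

  • Calling `train_som(samples, rows, cols)` returns one weight vector per output node together with the node positions on the grid; both the grid shape and the threshold `tol` would need tuning for real user data.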
  • a user classification device which further includes:
  • the data type determination module is used to obtain the user data to be processed and determine the data type of each user data to be processed.
  • the data type includes the continuous variable data type and the categorical variable data type.
  • the data preprocessing module is used to preprocess the user data to be processed according to the data type to obtain the initial user data.
  • the preprocessing includes data normalization processing corresponding to the continuous variable data type and label encoding processing corresponding to the categorical variable data type.
  • the to-be-processed data set generation module is used to perform missing value filling processing on the initial user data to generate the to-be-processed data set.
  • the above-mentioned user classification device acquires the user data to be processed, determines the data type of each user data to be processed, performs data preprocessing on each user data to be processed according to its data type to obtain the initial user data, and then performs missing value filling processing on the initial user data to generate the data set to be processed. This realizes the preprocessing of user data and the filling of missing values, avoids processing invalid data in the subsequent classification process, reduces the workload of the classification process, and further improves the efficiency of user classification.
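  • As a rough illustration of the three modules above, the sketch below detects the type of each column, normalizes continuous columns, label-encodes categorical columns, and then fills missing values. The rule used to tell continuous from categorical data, the min-max scaling, the mean and most-frequent-label fills, and all function names are assumptions; the text here does not specify them.

```python
def is_continuous(values):
    """Treat a column as continuous when every present value is a number."""
    present = [v for v in values if v is not None]
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present)


def preprocess_column(values):
    """Normalize or label-encode one column, then fill its missing entries."""
    present = [v for v in values if v is not None]
    if not present:                       # nothing observed: fall back to zeros
        return [0.0] * len(values)
    if is_continuous(values):
        lo, hi = min(present), max(present)
        span = (hi - lo) or 1.0
        scaled = [None if v is None else (v - lo) / span for v in values]
        known = [v for v in scaled if v is not None]
        fill = sum(known) / len(known)    # assumed fill: column mean
        return [fill if v is None else v for v in scaled]
    # Categorical column: map each distinct label to an integer code.
    codes = {label: i for i, label in enumerate(sorted(set(present)))}
    encoded = [None if v is None else codes[v] for v in values]
    most_common = max(codes.values(), key=encoded.count)  # assumed fill: mode
    return [most_common if v is None else v for v in encoded]


def preprocess(rows):
    """Apply preprocess_column to every column of a row-oriented table."""
    cols = [preprocess_column(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


users = [[20, "high school"], [40, "bachelor"], [None, "high school"], [30, None]]
print(preprocess(users))
```

  • Missing entries are represented as `None` in this sketch; a real pipeline would first have to map its own missing-value markers to that form.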
  • Each module in the above user classification device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store user data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a user classification method.
  • FIG. 8 is only a block diagram of the part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where the memory stores a computer program and the processor implements the following steps when executing the computer program:
  • the sample data set includes sample user data
  • the SOM network is a self-organizing map network
  • the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network
  • according to the number of clusters, adjust the number of initial nodes of the original SOM network, determine the number of available SOM network nodes, and obtain the available SOM network; the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
  • the processor further implements the following steps when executing the computer program:
  • the training end condition is that the weight error limit obtained from two consecutive trainings reaches the preset threshold
  • the processor further implements the following steps when executing the computer program:
  • topological neighborhood of the best matching node where the topological neighborhood is an initial neighborhood with a preset range size
  • the initial neighborhood is shrunk, and the best matching neighborhood centered on the best matching node is determined.
  • the processor further implements the following steps when executing the computer program:
  • $m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
  • $t$ is the step
  • $i$ is the node
  • $m_i(t)$ is the weight of node $i$ at step $t$
  • $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$; $h_{ci}(t)$ is the neighborhood function; and $x(t)$ represents the output vector.
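  • A single application of the weight update can be worked through numerically. In the sketch below the weight of node i moves toward x by the fraction given by the product of the learning rate and the neighborhood value. The Gaussian form of the neighborhood function and the function names are assumptions, and x is treated as the sample vector presented at step t, which is how the standard Kohonen rule reads.

```python
import math


def update_weight(m_i, x, alpha, h_ci):
    """Apply m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)] once."""
    return [m + alpha * h_ci * (xv - m) for m, xv in zip(m_i, x)]


def gaussian_neighborhood(grid_distance, radius):
    """Assumed neighborhood function: 1 at the best matching node, decaying with distance."""
    return math.exp(-(grid_distance ** 2) / (2.0 * radius ** 2))


h = gaussian_neighborhood(0.0, radius=1.0)   # the best matching node itself, h = 1.0
print(update_weight([0.0, 1.0], [1.0, 0.0], alpha=0.5, h_ci=h))   # [0.5, 0.5]
```

  • With a learning rate of 0.5 and a neighborhood value of 1, the weight moves exactly halfway to x; a node outside the neighborhood (neighborhood value 0) keeps its weight unchanged.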
  • the processor further implements the following steps when executing the computer program:
  • the data types include continuous variable data types and categorical variable data types
  • the preprocessing includes data normalization processing corresponding to the continuous variable data type, and label encoding processing corresponding to the categorical variable data type;
  • the initial user data is filled with missing values to generate a data set to be processed.
  • the processor further implements the following steps when executing the computer program:
  • types of missing data include missing information and missing behavior; missing information includes missing continuous variable data and missing categorical variable data;
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:
  • the sample data set includes sample user data
  • the SOM network is a self-organizing map network
  • the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network
  • according to the number of clusters, adjust the number of initial nodes of the original SOM network, determine the number of available SOM network nodes, and obtain the available SOM network; the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
  • the training end condition is that the weight error limit obtained from two consecutive trainings reaches the preset threshold
  • topological neighborhood of the best matching node where the topological neighborhood is an initial neighborhood with a preset range size
  • the initial neighborhood is shrunk, and the best matching neighborhood centered on the best matching node is determined.
  • $m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
  • $t$ is the step
  • $i$ is the node
  • $m_i(t)$ is the weight of node $i$ at step $t$
  • $\alpha(t)$ is the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$; $h_{ci}(t)$ is the neighborhood function; and $x(t)$ represents the output vector.
  • the data types include continuous variable data types and categorical variable data types
  • the preprocessing includes data normalization processing corresponding to the continuous variable data type, and label encoding processing corresponding to the categorical variable data type;
  • the initial user data is filled with missing values to generate a data set to be processed.
  • types of missing data include missing information and missing behavior; missing information includes missing continuous variable data and missing categorical variable data;
  • the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).


Abstract

An intelligent decision-based user classification method and apparatus, a computer device, and a storage medium. The method comprises: acquiring a cluster distribution feature map of an SOM network trained according to a sample data set, the sample data set comprising sample user data; determining the number of clusters of the trained SOM network on the basis of the cluster distribution feature map; adjusting the number of initial nodes of an original SOM network according to the number of clusters, so as to determine the number of available SOM network nodes and obtain an available SOM network; performing, according to the available SOM network, clustering analysis on a data set to be processed, and determining a clustering result of the data set, the data set being generated according to user data to be processed; and obtaining a corresponding user classification result according to the clustering result. The method uses the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, thereby improving the efficiency and accuracy of the whole clustering process and achieving accurate classification of users.

Description

User classification method, device, computer equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 9, 2020, with application number 202010273736.1 and the invention title "User Classification Method, Device, Computer Equipment and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a user classification method, device, computer equipment, and storage medium.
Background
With the development of computer technology and the wide application of the Internet and mobile terminals in people's lives and work, more and more people choose to obtain the data and information they want through the network. Because different users exhibit different Internet access behaviors, both the financial industry and the telecommunications industry can achieve precision marketing by analyzing user behavior. The key to precision marketing is effective user information segmentation, and the key issue in user information segmentation is how to mine the user characteristics hidden in the data and classify users according to the mined characteristics to obtain a classification of user information.
The inventor found that traditional user information segmentation is often based on a one-dimensional attribute of the user. For example, in the financial industry, users can be divided into high-end, mid-range, and low-end customers according to the amount of their assets; this segmentation method makes it possible to select the target groups of a marketing campaign according to the marketing resource budget. However, as user needs become increasingly diverse and enterprise products continue to innovate, the inventor realized that even among high-end users, different users have clearly different needs for the same product or service. Traditional segmentation based on one-dimensional user information therefore cannot reflect the multifaceted characteristics and information of users, cannot dynamically track changes in user behavior, and in turn cannot meet users' diversified needs for products or services, so the resulting classification of user information is not accurate enough.
Summary of the invention
Based on this, it is necessary to address the above technical problems by providing a user classification method, device, computer equipment, and storage medium that can improve the accuracy of user classification.
A user classification method, the method including:
obtaining a cluster distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network;
determining the number of clusters of the trained SOM network based on the cluster distribution feature map;
adjusting the number of initial nodes of the original SOM network according to the number of clusters, determining the number of available SOM network nodes, and obtaining the available SOM network, where the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
performing cluster analysis on the data set to be processed according to the available SOM network and determining the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed; and
obtaining a corresponding user classification result according to the clustering result.
A user classification device, the device including:
a cluster distribution feature map acquisition module, used to obtain a cluster distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network;
a cluster number determination module, used to determine the number of clusters of the trained SOM network based on the cluster distribution feature map;
an available SOM network generation module, used to adjust the number of initial nodes of the original SOM network according to the number of clusters, determine the number of available SOM network nodes, and obtain the available SOM network, where the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
a clustering result determination module, used to perform cluster analysis on the data set to be processed according to the available SOM network and determine the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed; and
a user classification result generation module, used to obtain a corresponding user classification result according to the clustering result.
A computer device, including a memory and a processor, where the memory stores a computer program and the processor implements the following steps when executing the computer program:
obtaining a cluster distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network;
determining the number of clusters of the trained SOM network based on the cluster distribution feature map;
adjusting the number of initial nodes of the original SOM network according to the number of clusters, determining the number of available SOM network nodes, and obtaining the available SOM network, where the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
performing cluster analysis on the data set to be processed according to the available SOM network and determining the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed; and
obtaining a corresponding user classification result according to the clustering result.
A computer-readable storage medium on which a computer program is stored, where the following steps are implemented when the computer program is executed by a processor:
obtaining a cluster distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network;
determining the number of clusters of the trained SOM network based on the cluster distribution feature map;
adjusting the number of initial nodes of the original SOM network according to the number of clusters, determining the number of available SOM network nodes, and obtaining the available SOM network, where the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters;
performing cluster analysis on the data set to be processed according to the available SOM network and determining the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed; and
obtaining a corresponding user classification result according to the clustering result.
This application uses the self-organizing mapping function of the SOM network to mine the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the entire clustering process and achieves accurate classification of users.
Description of the drawings
FIG. 1 is an application scenario diagram of a user classification method in an embodiment;
FIG. 2 is a schematic flowchart of a user classification method in an embodiment;
FIG. 3 is a schematic diagram of the cluster distribution features of the SOM network in an embodiment;
FIG. 4 is a schematic flowchart of a user classification method in another embodiment;
FIG. 5 is a schematic flowchart of a user classification method in yet another embodiment;
FIG. 6 is a structural block diagram of a user classification device in an embodiment;
FIG. 7 is a structural block diagram of a user classification device in another embodiment;
FIG. 8 is an internal structure diagram of a computer device in an embodiment.
Detailed description
In order to make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
The technical solution of this application relates to the field of artificial intelligence and/or big data technology, and may specifically involve neural network technology, to realize intelligent user classification. Optionally, the data involved in this application, such as sample user data and/or classification results, may be stored in a database or in a blockchain, for example through blockchain-based distributed storage, which is not limited in this application.
The user classification method provided in this application can be applied in the application environment shown in FIG. 1, which includes a terminal 102 and a server 104 that communicate through a network; specifically, the method can be applied to the server 104. The server 104 obtains the cluster distribution feature map of the SOM network trained according to the sample data set, and determines the number of clusters of the trained SOM network based on that map. The sample data set includes sample user data, and the cluster distribution map is used to reflect the number and distribution of the clusters included in the trained SOM network. The server 104 adjusts the number of initial nodes of the original SOM network according to the number of clusters, determines the number of available SOM network nodes, and obtains the available SOM network, where the number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters. The server 104 then performs cluster analysis on the data set to be processed according to the available SOM network and determines the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed. Finally, a corresponding user classification result is obtained according to the clustering result and sent to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a user classification method is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
Step S202: obtain a cluster distribution feature map of the SOM network trained according to a sample data set, where the sample data set includes sample user data.
Specifically, the original SOM network is trained with the sample data set obtained from the sample user data to obtain the trained SOM network, and the cluster distribution feature map of the trained SOM network is obtained. The SOM network is a self-organizing map network. Clustering here means that, in order to improve the query speed for a certain attribute or attribute group, tuples that have the same value on that attribute (the cluster key) are stored together in contiguous physical blocks. The cluster distribution map is used to reflect the number of clusters included in the trained SOM network and the distribution of each cluster.
Further, because the SOM network structure emphasizes the proximity relationships between cluster centers, adjacent clusters are more strongly correlated. The map can be analyzed along its color boundaries: regions of similar color are selected to form one cluster, and multiple clusters are formed from the different colors, which yields the cluster distribution feature map of the trained SOM network.
In this embodiment, referring to FIG. 3, multiple clusters can be formed from the trained SOM network, including a first cluster 301, a second cluster 302, a third cluster 303, a fourth cluster 304, a fifth cluster 305, and a sixth cluster 306, which gives the cluster distribution feature map of the trained SOM network.
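One common way to render a trained SOM so that cluster boundaries appear as color changes is the U-matrix, in which each node is shaded by the average distance between its weight vector and those of its grid neighbors. The source does not name the rendering it uses, so treating the cluster distribution feature map as a U-matrix is an assumption, as are the function name and the row-major layout of `weights` in this minimal sketch.

```python
import math


def u_matrix(weights, rows, cols):
    """Mean distance from each node's weight vector to its 4-connected grid neighbors.

    `weights` is a flat, row-major list with rows * cols weight vectors.
    """
    def w(r, c):
        return weights[r * cols + c]

    result = []
    for r in range(rows):
        line = []
        for c in range(cols):
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            total = sum(math.dist(w(r, c), w(nr, nc)) for nr, nc in neighbors)
            line.append(total / max(len(neighbors), 1))  # guard for a 1x1 grid
        result.append(line)
    return result


# Three nodes in a row: the first two are identical, the third is far away.
print(u_matrix([[0.0], [0.0], [1.0]], rows=1, cols=3))   # [[0.0, 0.5, 1.0]]
```

Low values mark nodes that sit inside a cluster and high values mark boundaries between clusters, which matches the description of reading clusters off regions of similar color.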
Here, the SOM network is a self-organizing map (Self-organizing Maps, SOM) network, an artificial neural network that simulates how the human brain processes signals. It can map input patterns to one-dimensional, two-dimensional, or even higher-dimensional discrete patterns in the output layer while keeping their topological structure unchanged. It consists of an input layer and an output layer (the competitive layer): the input layer has n neurons, and the output layer is a one-dimensional or two-dimensional planar array of m neurons. The network is fully connected, that is, every input node is connected to all output nodes. Through repeated learning of the input patterns, the SOM network makes the weight vector space converge toward the probability distribution of the input patterns, a property known as probability preservation. The neurons of the output layer compete for the opportunity to respond to an input pattern, and the weights associated with the winning neuron are adjusted in a direction that favors it in the competition: with the winning neuron as the center, nearby neurons receive excitatory lateral feedback and distant neurons receive inhibitory lateral feedback, so that near neighbors excite each other and distant neighbors inhibit each other.
Step S204: determine the number of clusters of the trained SOM network based on the cluster distribution feature map.
Specifically, the cluster distribution feature map is analyzed along its color boundaries. When the color boundaries of the output image are clear, they directly reflect the number of categories, which is the number of clusters. Regions of similar color are selected to form one cluster, and multiple clusters are formed from the different colors.
In this embodiment, referring to FIG. 3, by analyzing the cluster distribution feature map of the trained SOM network according to the distribution of the different fills or colors, six clusters are obtained: the first cluster 301, the second cluster 302, the third cluster 303, the fourth cluster 304, the fifth cluster 305, and the sixth cluster 306. The number of clusters of the trained SOM network is therefore determined to be six.
Further, for some complex data sets where the number of clusters cannot be determined clearly from the cluster distribution feature map, multiple SOM networks are designed according to the possible numbers of categories, cluster analysis is performed with each of them, and the silhouette coefficient of each SOM network is calculated to determine the specific number of clusters of the trained SOM network.
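The fallback described above can be sketched as follows: each candidate number of clusters yields one labeling of the data, the mean silhouette coefficient of each labeling is computed, and the candidate with the highest value is kept. The Euclidean distance, the "highest mean wins" selection rule, and the function names are assumptions for illustration; the labelings themselves would come from the candidate SOM networks.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (closer to 1 is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)          # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pick_cluster_count(points, labelings):
    """`labelings` maps each candidate cluster count to the labels it produced."""
    return max(labelings, key=lambda k: silhouette(points, labelings[k]))


points = [[0.0], [0.1], [1.0], [1.1]]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
print(pick_cluster_count(points, candidates))   # 2
```

In the example the two-cluster labeling separates the tight pairs cleanly, so it scores higher than the three-cluster labeling that splits one pair into singletons.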
Step S206: adjust the number of initial nodes of the original SOM network according to the number of clusters, determine the number of available SOM network nodes, and obtain the available SOM network.
Specifically, the number of initial nodes of the original SOM network is obtained and adjusted according to the number of clusters to obtain the number of available SOM network nodes. The number of available SOM network nodes is the number of nodes, obtained by adjusting the number of initial nodes, that is consistent with the number of clusters. Because the number of initial nodes of the original SOM network may not match the number of clusters, it needs to be adjusted; when the adjusted number of nodes is consistent with the number of clusters, the available SOM network nodes are obtained, and the available SOM network is formed from them.
Step S208: perform cluster analysis on the data set to be processed according to the available SOM network and determine the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed.
Specifically, the data set to be processed, generated from the user data to be processed, is obtained, and cluster analysis is performed on it with the available SOM network to obtain its clustering result. The clustering result includes multiple clusters and the composition of each cluster.
Step S210: obtain the corresponding user classification result according to the clustering result.
Specifically, the clustering result is parsed to determine the number of clusters it contains and the size of each cluster, statistical indicators of each cluster are calculated in each dimension, and the user needs in the different clusters are determined from these indicators.
The clustering result includes the number of clusters and the size of each cluster. For example, the clustering result may contain 10 clusters, namely cluster a, cluster b, cluster c, and so on, where cluster a contains 10,000 customers, cluster b contains 5,000 customers, and so on.
Further, statistical indicators of each cluster are calculated in each dimension. For example, in cluster a the age range is 18 to 25, 65% of the users have a high-school education, the average purchase amount is 10,000 yuan, and 90% of the users have purchased small-amount wealth management products. Based on the calculated indicators, the user needs in the different clusters can be determined. For example, if the indicators show that cluster a consists of young people with a demand for small-amount wealth management, small-amount wealth management products can be pushed to the users of cluster a according to that need, achieving precision marketing.
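The per-cluster statistics described above can be sketched as follows. This is a minimal illustration only: the function name `profile_clusters` and the record fields `age`, `education`, and `purchase_amount` are assumptions chosen to mirror the example of cluster a, not names taken from the application.

```python
from collections import Counter, defaultdict


def profile_clusters(records, labels):
    """Summarize each cluster: size, age range, dominant education level
    and its share, and mean purchase amount (hypothetical field names)."""
    groups = defaultdict(list)
    for rec, lab in zip(records, labels):
        groups[lab].append(rec)
    profiles = {}
    for lab, members in groups.items():
        ages = [m["age"] for m in members]
        edu, edu_count = Counter(m["education"] for m in members).most_common(1)[0]
        profiles[lab] = {
            "size": len(members),
            "age_range": (min(ages), max(ages)),
            "top_education": edu,
            "top_education_share": edu_count / len(members),
            "mean_purchase_amount": sum(m["purchase_amount"] for m in members) / len(members),
        }
    return profiles
```

A profile such as an age range of 18 to 25 together with a high share of small-amount purchases would then be read as the kind of user need described for cluster a.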
In the above user classification method, the cluster distribution feature map of the SOM network trained on the sample data set is obtained, where the sample data set includes sample user data, and the number of clusters of the trained SOM network is determined from the cluster distribution feature map. According to the number of clusters, the initial number of nodes of the original SOM network is adjusted to determine the number of nodes of the available SOM network and obtain the available SOM network. Cluster analysis is then performed on the data set to be processed with the available SOM network to determine its clustering result, where the data set to be processed is generated from the user data to be processed, and the corresponding user classification result is obtained from the clustering result. The self-organizing mapping function of the SOM network is thus used to uncover the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the whole clustering process and achieves accurate classification of users.
In one embodiment, as shown in FIG. 4, a user classification method is provided, which specifically includes the following steps:
Step S402: obtain the nodes of the output layer of the original SOM network and initialize each node.
Specifically, the nodes of the output layer of the original SOM network are obtained and each node is assigned an initial weight. This includes initializing the connection weights, the learning rate, and the neighborhood of the SOM network.
The initial weights are obtained by the initialization operation, that is, each node randomly initializes its parameters, and the initial neighborhood is a relatively large region containing multiple nodes.
Step S404: obtain a preset training end condition.
The training end condition is that the weight error between two consecutive training rounds reaches a preset threshold. The weight error between two consecutive training rounds can be obtained and the training end condition determined from it, which in effect yields the predefined training length.
Specifically, the training end condition sets a preset threshold on the weight error between two consecutive training rounds; when that error reaches the threshold, training ends. For example, the condition may be that the weight error between two consecutive training rounds is less than 0.03; when the error is 0.02, which is less than 0.03, training ends.
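The end condition above can be sketched as a simple check between two consecutive rounds. The text does not define how the weight error is measured, so this sketch assumes it is the largest absolute change of any single weight; the names `max_weight_change` and `should_stop` are hypothetical.

```python
def max_weight_change(prev_weights, new_weights):
    """Largest absolute change of any weight between two consecutive rounds.
    Assumption: this stands in for the 'weight error' of the text."""
    return max(
        abs(new - old)
        for prev_row, new_row in zip(prev_weights, new_weights)
        for old, new in zip(prev_row, new_row)
    )


def should_stop(prev_weights, new_weights, threshold=0.03):
    """Training ends once the weight error falls below the preset threshold."""
    return max_weight_change(prev_weights, new_weights) < threshold
```

With a threshold of 0.03, a change of 0.02 between two rounds ends training, matching the example in the text; a norm over whole weight vectors would be an equally plausible error measure.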
Step S406: obtain each item of sample user data in the sample data set and normalize it.
Specifically, each item of sample user data in the sample data set is obtained and its data type is determined. User data of a continuous-variable type is normalized. Continuous variables may include deposit amount and age, and normalization makes their scales consistent, since the scale of the deposit amount is larger than that of age. Categorical-variable data is label-encoded; for example, the education levels primary school, middle school, and university are encoded as 0, 1, 2, for the later calculation of the Hamming distance between categorical variables.
The user data specifically includes age, gender, education, occupation, and region, as well as behavioral data, including the types of products purchased, the total purchase amount, and the number of purchases. Age, total purchase amount, and number of purchases are treated as continuous variables; the other data, including gender, education, occupation, region, and types of products purchased, are treated as categorical variables.
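The two preprocessing operations described above can be sketched as follows. The text does not name a normalization scheme, so min-max scaling to [0, 1] is an assumption here, and the function names `min_max_normalize` and `label_encode` are hypothetical.

```python
def min_max_normalize(values):
    """Scale a continuous variable to [0, 1] (min-max scaling is assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(values, order=None):
    """Map each category to an integer code. `order` fixes the code
    assignment, e.g. primary school -> 0, middle school -> 1, university -> 2."""
    categories = order if order is not None else sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping
```

After scaling, age and deposit amount both lie in [0, 1], so neither dominates the Euclidean part of the distance, while the integer codes are only compared for equality in the Hamming part.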
Step S408: determine first sample data from the normalized sample user data, and determine the best matching node of the first sample data from the nodes of the output layer of the original SOM network.
Specifically, one item of sample user data is selected at random from the normalized sample user data and taken as the first sample data. The distance between the first sample data and every node of the SOM network output layer is calculated, and the node with the smallest distance to the first sample data is taken as the best matching node of the input sample.
The node with the smallest distance to the sample data is called the best matching unit (BMU) of the input sample. The distance between the sample data and each output node can be calculated with a distance formula for mixed-type attributes: the distances for the continuous variables and for the categorical variables are calculated separately and then added. Euclidean distance is used for the continuous variables and Hamming distance for the categorical variables.
Further, different distance formulas are used in determining the best matching node; for the continuous variables the measure is the Euclidean distance. The neighborhood function is used to judge whether a node lies within the best matching neighborhood and whether it is consistent with the retrieved information. That is, the neighborhood radius is calculated, all nodes are traversed to check whether each lies within the radius, and the weight vectors of the nodes within the best matching neighborhood are updated.
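The mixed distance and the search for the best matching unit can be sketched as follows. This is an illustration under assumptions: each node is assumed to store a scaled value for every continuous feature and a category code for every categorical feature, and the names `mixed_distance` and `best_matching_unit` are hypothetical.

```python
import math


def mixed_distance(sample, node, continuous_idx, categorical_idx):
    """Euclidean distance over the continuous features plus Hamming
    distance (count of mismatches) over the categorical features."""
    euclid = math.sqrt(sum((sample[i] - node[i]) ** 2 for i in continuous_idx))
    hamming = sum(1 for i in categorical_idx if sample[i] != node[i])
    return euclid + hamming


def best_matching_unit(sample, nodes, continuous_idx, categorical_idx):
    """Index of the output node with the smallest mixed distance to the sample."""
    return min(
        range(len(nodes)),
        key=lambda k: mixed_distance(sample, nodes[k], continuous_idx, categorical_idx),
    )
```

Because the continuous features are normalized to a common scale, a single categorical mismatch weighs roughly as much as the largest possible difference in one continuous feature; a weighted sum would be one way to change that balance.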
Step S410: obtain any topological neighborhood of the best matching node, and determine from it the best matching neighborhood centered on the best matching node.
Specifically, any topological neighborhood of the best matching node is obtained, where the topological neighborhood is an initial neighborhood of preset size. The initial neighborhood is then shrunk according to the training time and the position of the best matching node within it, to determine the best matching neighborhood centered on the best matching node.
Further, any topological neighborhood of the best matching node is predefined, and the best matching neighborhood centered on the best matching node is determined from it. Because the best matching neighborhood differs from moment to moment during training, a relatively large initial neighborhood can be preset; during training it is shrunk around the best matching node according to the training time and the position of that node, yielding the neighborhood of the best matching node.
In one embodiment, the weights of the best matching node and of the nodes in the corresponding initial neighborhood are adjusted with the following formula:
$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$$
where t is the training step, i denotes a node, m_i(t) is the weight of node i at step t, α(t) is the learning rate, a monotonically decreasing coefficient with 0 < α(t) < 1, h_ci(t) is the neighborhood function, and x(t) is the input vector.
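One step of the update rule above can be sketched as follows. The formula fixes neither the shape of α(t) nor that of h_ci(t), so an exponentially decaying learning rate and a Gaussian neighborhood with a shrinking radius are assumptions here, as are all names and default values (`learning_rate`, `gaussian_neighborhood`, `update_weights`, `alpha0`, `sigma0`, `tau`).

```python
import math


def learning_rate(t, alpha0=0.5, tau=100.0):
    """Monotonically decreasing alpha(t) with 0 < alpha(t) < 1 (assumed form)."""
    return alpha0 * math.exp(-t / tau)


def gaussian_neighborhood(bmu_pos, node_pos, t, sigma0=2.0, tau=100.0):
    """h_ci(t): Gaussian of the grid distance between node i and the
    best matching node c, with a radius that shrinks over time (assumed form)."""
    sigma = sigma0 * math.exp(-t / tau)
    d2 = sum((a - b) ** 2 for a, b in zip(bmu_pos, node_pos))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def update_weights(weights, positions, bmu, x, t):
    """Apply m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)] to every node."""
    alpha = learning_rate(t)
    for i, m_i in enumerate(weights):
        h = gaussian_neighborhood(positions[bmu], positions[i], t)
        weights[i] = [m + alpha * h * (xv - m) for m, xv in zip(m_i, x)]
    return weights
```

The best matching node itself has h_ci(t) = 1 and moves furthest toward the input, while nodes farther away on the grid move less; with a hard neighborhood radius, h_ci(t) would instead be 1 inside the radius and 0 outside.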
Step S412: when the training end condition is not met, return to the step of determining the first sample data from the normalized sample user data. When the training end condition is met, the SOM network trained on the sample data set is obtained.
Specifically, the training end condition is that the weight error between two consecutive training rounds reaches a preset threshold; when it does, the training end condition is met and training ends. When the weight error between two consecutive training rounds has not reached the preset threshold, the training end condition is not met. In that case the first sample data is determined again from the normalized sample user data, and the training process is repeated: determining the best matching node of the first sample data, obtaining any topological neighborhood of the best matching node, and determining from it the best matching neighborhood centered on the best matching node, until the training end condition is met.
In the above user classification method, each item of sample user data in the sample data set is normalized, and the first sample data is determined from the normalized sample user data. The best matching node of the first sample data is determined from the nodes of the output layer of the original SOM network, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood of that node, until the training end condition is met and the SOM network trained on the sample data set is obtained. The trained SOM network can then be used to determine the number of clusters corresponding to the user data to be processed, and the original SOM network is adjusted to meet the clustering requirements of the user data to be processed, further improving the accuracy of user classification.
In one embodiment, as shown in FIG. 5, a user classification method is provided, which specifically includes the following steps:
Step S502: obtain the user data to be processed and determine the data type of each item of user data to be processed.
Specifically, the user data to be processed is obtained and the data type of each item is determined. The data types include a continuous-variable data type and a categorical-variable data type. The user data specifically includes age, gender, education, occupation, and region, as well as behavioral data including the types of products purchased, the total purchase amount, and the number of purchases. Age, total purchase amount, and number of purchases are treated as continuous variables; the other data, including gender, education, occupation, region, and types of products purchased, are treated as categorical variables.
Step S504: perform data preprocessing on each item of user data to be processed according to its data type to obtain initial user data.
Specifically, the preprocessing includes data normalization for the continuous-variable data type and label encoding for the categorical-variable data type. User data of the continuous-variable type is normalized. Continuous variables may include deposit amount and age, and normalization makes their scales consistent, since the scale of the deposit amount is larger than that of age. Categorical-variable data is label-encoded; for example, the education levels primary school, middle school, and university are encoded as 0, 1, 2, for the later calculation of the Hamming distance between categorical variables.
In one embodiment, the data preprocessing further includes variable screening, which specifically includes:
User attributes and behavior data are obtained and treated as variables. The user attributes include gender, age, and education, and the behavior data includes deposit amount, loan amount, number of client logins, and the like. The variables are screened: variables whose missing values exceed a preset missing-value threshold are removed, the variance of each variable is calculated, and variables whose variance is below a preset variance threshold are deleted.
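The variable screening described above can be sketched as follows. This sketch assumes that the missing-value threshold is a ratio of missing entries per variable, that missing entries are represented as `None`, and that all columns are numeric (or already label-encoded); the function name `screen_variables` and the default thresholds are hypothetical.

```python
from statistics import pvariance


def screen_variables(columns, max_missing_ratio=0.5, min_variance=1e-3):
    """Keep only variables that have few missing values and enough variance.
    `columns` maps a variable name to a non-empty list of numbers or None."""
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1.0 - len(present) / len(values)
        if missing_ratio > max_missing_ratio:
            continue  # too many missing values
        if len(present) < 2 or pvariance(present) < min_variance:
            continue  # (almost) constant variable carries little information
        kept[name] = values
    return kept
```

The variance test is scale dependent, so applying it after normalization, or using a per-variable threshold, may be preferable in practice.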
Step S506: fill in the missing values of the initial user data to generate the data set to be processed.
Specifically, the data-missing type of the initial user data is determined, and the missing values of the corresponding initial user data are filled in according to that type. The data-missing types include missing information and missing behavior, and missing information includes missing continuous-variable data and missing categorical-variable data.
Further, when the initial user data is determined to have missing continuous-variable data, the mean of the initial user data of the corresponding continuous-variable data type is determined, and the missing values are filled in with that mean to generate the data set to be processed.
When the initial user data is determined to have missing categorical-variable data, the mode of the initial user data of the corresponding categorical-variable data type is determined, and the missing values are filled in with that mode to generate the data set to be processed.
When the initial user data is determined to have missing behavior, a new constant is created, and the missing values are filled in with that constant to generate the data set to be processed.
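The three filling rules above can be sketched as follows. Missing entries are assumed to be represented as `None`; the function name `fill_missing`, the `kind` labels, and the default constant 0 are hypothetical.

```python
from statistics import mean, mode


def fill_missing(values, kind, constant=0):
    """Fill None entries: mean for continuous variables, mode for
    categorical variables, a fixed constant for missing behavior."""
    present = [v for v in values if v is not None]
    if kind == "continuous":
        fill = mean(present)
    elif kind == "categorical":
        fill = mode(present)
    elif kind == "behavior":
        fill = constant
    else:
        raise ValueError(f"unknown data-missing type: {kind}")
    return [fill if v is None else v for v in values]
```

For example, a user with no recorded purchases would receive the constant in the purchase-count column, which keeps "no behavior" distinguishable from an average behavior.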
In the above user classification method, the user data to be processed is obtained, the data type of each item is determined, data preprocessing is performed on each item according to its data type to obtain initial user data, and the missing values of the initial user data are filled in to generate the data set to be processed. The user data is thus preprocessed and its missing values filled in, which avoids processing invalid data in the subsequent classification, reduces the workload of the classification process, and further improves the efficiency of user classification.
It should be understood that although the steps in the flowcharts of FIG. 2, FIG. 4, and FIG. 5 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2, FIG. 4, and FIG. 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different moments, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 6, a user classification apparatus is provided, including a cluster distribution feature map acquisition module 602, a cluster number determination module 604, an available SOM network generation module 606, a clustering result determination module 608, and a user classification result generation module 610, where:
The cluster distribution feature map acquisition module 602 is configured to obtain the cluster distribution feature map of the SOM network trained on the sample data set. The sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters contained in the trained SOM network.
The cluster number determination module 604 is configured to determine the number of clusters of the trained SOM network based on the cluster distribution feature map.
The available SOM network generation module 606 is configured to adjust the initial number of nodes of the original SOM network according to the number of clusters, determine the number of nodes of the available SOM network, and obtain the available SOM network. The number of nodes of the available SOM network is the node count, obtained by adjusting the initial node count, that equals the number of clusters.
The clustering result determination module 608 is configured to perform cluster analysis on the data set to be processed using the available SOM network and determine the clustering result of the data set to be processed, where the data set to be processed is generated from the user data to be processed.
The user classification result generation module 610 is configured to obtain the corresponding user classification result according to the clustering result.
The above user classification apparatus obtains the cluster distribution feature map of the SOM network trained on the sample data set, where the sample data set includes sample user data, and determines the number of clusters of the trained SOM network from the cluster distribution feature map. According to the number of clusters, the initial number of nodes of the original SOM network is adjusted to determine the number of nodes of the available SOM network and obtain the available SOM network. Cluster analysis is then performed on the data set to be processed with the available SOM network to determine its clustering result, where the data set to be processed is generated from the user data to be processed, and the corresponding user classification result is obtained from the clustering result. The self-organizing mapping function of the SOM network is thus used to uncover the cluster distribution contained in high-dimensional data, which improves the efficiency and accuracy of the whole clustering process and achieves accurate classification of users.
In one embodiment, as shown in FIG. 7, a user classification apparatus is provided, including a node acquisition module 702, a training end condition acquisition module 704, a sample user data acquisition module 706, a best matching node determination module 708, a best matching neighborhood determination module 710, and a trained SOM network generation module 712, where:
The node acquisition module 702 is configured to obtain the nodes of the output layer of the original SOM network and initialize each node.
The training end condition acquisition module 704 is configured to obtain a preset training end condition.
The sample user data acquisition module 706 is configured to obtain each item of sample user data in the sample data set and normalize it.
The best matching node determination module 708 is configured to determine first sample data from the normalized sample user data and to determine the best matching node of the first sample data from the nodes of the output layer of the original SOM network.
The best matching neighborhood determination module 710 is configured to obtain any topological neighborhood of the best matching node and determine from it the best matching neighborhood centered on the best matching node.
The trained SOM network generation module 712 is configured to return to the step of determining the first sample data from the normalized sample user data until the training end condition is met, to obtain the SOM network trained on the sample data set.
The above user classification apparatus normalizes each item of sample user data in the sample data set and determines the first sample data from the normalized sample user data. The best matching node of the first sample data is determined from the nodes of the output layer of the original SOM network, and the best matching neighborhood centered on the best matching node is determined from the topological neighborhood of that node, until the training end condition is met and the SOM network trained on the sample data set is obtained. The trained SOM network can then be used to determine the number of clusters corresponding to the user data to be processed, and the original SOM network is adjusted to meet the clustering requirements of the user data to be processed, further improving the accuracy of user classification.
In one embodiment, a user classification apparatus is provided, which further includes:
A data type determination module, configured to obtain the user data to be processed and determine the data type of each item of user data to be processed, where the data types include a continuous-variable data type and a categorical-variable data type.
A data preprocessing module, configured to perform data preprocessing on each item of user data to be processed according to its data type to obtain initial user data, where the preprocessing includes data normalization for the continuous-variable data type and label encoding for the categorical-variable data type.
A to-be-processed data set generation module, configured to fill in the missing values of the initial user data to generate the data set to be processed.
The above user classification apparatus obtains the user data to be processed, determines the data type of each item, performs data preprocessing on each item according to its data type to obtain initial user data, and fills in the missing values of the initial user data to generate the data set to be processed. The user data is thus preprocessed and its missing values filled in, which avoids processing invalid data in the subsequent classification, reduces the workload of the classification process, and further improves the efficiency of user classification.
For specific limitations on the user classification apparatus, refer to the limitations on the user classification method above, which are not repeated here. Each module in the above user classification apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores user data. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements a user classification method.
Those skilled in the art will understand that the structure shown in FIG. 8 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor implements the following steps when executing the computer program:
Obtain a cluster distribution feature map of a SOM network trained on a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters included in the trained SOM network;
Determine the number of clusters of the trained SOM network based on the cluster distribution feature map;
Adjust the initial number of nodes of the original SOM network according to the number of clusters, determine the number of nodes of the available SOM network, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the initial number of nodes, that is consistent with the number of clusters;
Perform cluster analysis on the data set to be processed using the available SOM network, and determine the clustering result of the data set to be processed; the data set to be processed is generated from the user data to be processed;
Obtain the corresponding user classification result from the clustering result.
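By way of non-limiting illustration, the final clustering step above may be sketched as follows, assuming the available SOM network is represented simply as a list of node weight vectors whose length equals the determined number of clusters. The function names `best_matching_node` and `assign_clusters`, the Euclidean distance, and the example weights and records are hypothetical and are not taken from this application.

```python
import math

def best_matching_node(weights, x):
    """Return the index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def assign_clusters(weights, records):
    """Map each preprocessed user record to the index of its best matching node."""
    return [best_matching_node(weights, r) for r in records]

# Hypothetical available SOM with two nodes (two clusters) over two features.
weights = [[0.0, 0.0], [1.0, 1.0]]
records = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.1]]
labels = assign_clusters(weights, records)  # one cluster index per user record
```

Under this assumption each node index serves directly as a cluster label, from which a user classification result may then be derived.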
In one embodiment, the processor further implements the following steps when executing the computer program:
Obtain each node of the output layer of the original SOM network, and initialize each node;
Obtain a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive training rounds reaches a preset threshold;
Obtain each item of sample user data in the sample data set, and normalize each item of sample user data;
Determine first sample data from the normalized sample user data, and determine the best matching node of the first sample data from among the nodes of the output layer of the original SOM network;
Obtain any topological neighborhood of the best matching node, and determine, from the topological neighborhood, the best matching neighborhood centered on the best matching node;
Return to the step of determining first sample data from the normalized sample user data until the training end condition is reached, thereby obtaining the SOM network trained on the sample data set.
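The training procedure above may be sketched as follows. This is a minimal sketch under several assumptions that this application does not state: the output layer is a one-dimensional lattice, normalization is min-max scaling, the learning rate and neighborhood radius decay linearly, and the "weight error limit" is read as the largest weight change between two consecutive passes over the samples. The names `min_max_normalize`, `train_som`, `tol`, and `max_epochs` are hypothetical.

```python
import math
import random

def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]

def train_som(rows, n_nodes, tol=1e-3, max_epochs=200, seed=0):
    """Train a 1-D SOM until the largest weight change between two
    consecutive passes drops below tol, or max_epochs is reached."""
    rng = random.Random(seed)
    data = min_max_normalize(rows)
    dim = len(data[0])
    # Initialize each output-layer node with random weights.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(max_epochs):
        previous = [w[:] for w in weights]
        decay = 1.0 - epoch / max_epochs
        alpha = 0.5 * decay                      # decreasing learning rate in (0, 1)
        radius = int((n_nodes / 2) * decay)      # shrinking neighborhood radius
        for x in rng.sample(data, len(data)):
            # Best matching node: the node whose weights are closest to the sample.
            bmu = min(range(n_nodes), key=lambda i: math.dist(weights[i], x))
            # Move the best matching node and its lattice neighbors toward the sample.
            for i in range(max(0, bmu - radius), min(n_nodes, bmu + radius + 1)):
                weights[i] = [w + alpha * (xi - w) for w, xi in zip(weights[i], x)]
        change = max(abs(a - b) for p, w in zip(previous, weights) for a, b in zip(p, w))
        if change < tol:
            break
    return weights

# Hypothetical sample user data with two features.
sample_rows = [[1.0, 1.0], [1.2, 0.9], [9.0, 9.0], [9.5, 8.8]]
trained = train_som(sample_rows, n_nodes=2)
```

The `max_epochs` bound is a safeguard added for this sketch so that training terminates even if the threshold is never reached.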
In one embodiment, the processor further implements the following steps when executing the computer program:
Obtain any topological neighborhood of the best matching node, the topological neighborhood being an initial neighborhood of a preset size;
Shrink the initial neighborhood according to the training time and the position of the best matching node in the initial neighborhood, and determine the best matching neighborhood centered on the best matching node.
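One possible way to shrink the neighborhood over training time may be sketched as follows, assuming a two-dimensional grid of nodes and an exponentially decaying radius. The decay law, the parameters `initial_radius` and `time_constant`, and the function names are assumptions made for illustration only; this application does not specify them.

```python
import math

def neighborhood_radius(step, initial_radius, time_constant):
    """Exponentially shrink the neighborhood radius as training proceeds."""
    return initial_radius * math.exp(-step / time_constant)

def best_matching_neighborhood(bmu, step, grid_shape,
                               initial_radius=3.0, time_constant=50.0):
    """Return the grid coordinates within the shrunken radius of the
    best matching node `bmu`, so the neighborhood stays centered on it."""
    rows, cols = grid_shape
    radius = neighborhood_radius(step, initial_radius, time_constant)
    return [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if math.hypot(r - bmu[0], c - bmu[1]) <= radius
    ]

# Early in training the neighborhood covers the whole 5x5 grid;
# late in training it has contracted to the best matching node alone.
early = best_matching_neighborhood((2, 2), step=0, grid_shape=(5, 5))
late = best_matching_neighborhood((2, 2), step=100, grid_shape=(5, 5))
```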
In one embodiment, the processor further implements the following steps when executing the computer program:
Adjust the weights of the best matching node and of each node in the corresponding initial neighborhood using the following formula:
$m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
where $t$ is the training step, $i$ denotes a node, $m_i(t)$ denotes the weight of node $i$ at step $t$, $\alpha(t)$ denotes the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ denotes the output vector.
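The weight update above may be illustrated with a short numerical sketch. The Gaussian form of the neighborhood function is an assumption, since this application does not give the form of $h_{ci}(t)$, and the names `gaussian_neighborhood` and `update_weight` are hypothetical. In the usual formulation of the self-organizing map, $x(t)$ is the sample presented to the network at step $t$, and the sketch treats it that way.

```python
import math

def gaussian_neighborhood(dist_to_bmu, radius):
    """Assumed neighborhood function h_ci: 1 at the best matching node,
    decaying with lattice distance from it."""
    return math.exp(-(dist_to_bmu ** 2) / (2.0 * radius ** 2))

def update_weight(m_i, x, alpha, h_ci):
    """Apply m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)]."""
    return [m + alpha * h_ci * (xv - m) for m, xv in zip(m_i, x)]

x = [1.0, 0.0]
# The best matching node (distance 0, so h_ci = 1) moves halfway toward x.
m_bmu = update_weight([0.2, 0.4], x, alpha=0.5, h_ci=gaussian_neighborhood(0.0, 1.0))
# A node two lattice steps away moves in the same direction, but much less.
m_far = update_weight([0.2, 0.4], x, alpha=0.5, h_ci=gaussian_neighborhood(2.0, 1.0))
```

With these values `m_bmu` is approximately `[0.6, 0.2]`, while `m_far` stays close to its starting point because its neighborhood factor is small.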
In one embodiment, the processor further implements the following steps when executing the computer program:
Obtain the user data to be processed, and determine the data type of each item of user data to be processed; the data types include a continuous variable data type and a categorical variable data type;
Preprocess each item of user data to be processed according to its data type to obtain initial user data; the preprocessing includes data normalization for the continuous variable data type and label encoding for the categorical variable data type;
Fill in missing values in the initial user data to generate the data set to be processed.
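The type-dependent preprocessing above may be sketched as follows, assuming min-max scaling as the normalization and integer codes assigned in order of first appearance as the label encoding. Missing entries are represented as `None` and left for the subsequent filling step. The function names and the example columns are hypothetical.

```python
def min_max_scale(values):
    """Scale numeric values to [0, 1], leaving missing entries (None) untouched."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1.0
    return [None if v is None else (v - lo) / span for v in values]

def label_encode(values):
    """Replace each category with an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v is None:
            encoded.append(None)          # keep missing entries for later filling
        else:
            encoded.append(codes.setdefault(v, len(codes)))
    return encoded, codes

# Hypothetical continuous column (age) and categorical column (city).
ages = min_max_scale([20, 30, None, 40])
cities, mapping = label_encode(["SZ", "BJ", None, "SZ"])
```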
In one embodiment, the processor further implements the following steps when executing the computer program:
Determine the data missing type of the initial user data; the data missing types include missing information and missing behavior, and missing information includes missing continuous variable data and missing categorical variable data;
Determine the mean of the initial user data of the continuous variable data type, and fill in the missing values of the initial user data with missing continuous variable data using the mean, to generate the data set to be processed;
or determine the mode of the initial user data of the categorical variable data type, and fill in the missing values of the initial user data with missing categorical variable data using the mode, to generate the data set to be processed;
or create a new constant, and fill in the missing values of the initial user data with missing behavior using the constant, to generate the data set to be processed.
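The three filling strategies above may be sketched as follows, assuming missing entries are represented as `None`. The function name `fill_missing`, the `kind` labels, and the default constant `-1` are hypothetical; this application does not specify the value of the newly created constant.

```python
from statistics import mean, mode

def fill_missing(values, kind, constant=-1):
    """Fill None entries by mean (continuous), mode (categorical),
    or a newly created constant (missing behavior)."""
    present = [v for v in values if v is not None]
    if kind == "continuous":
        fill = mean(present)
    elif kind == "categorical":
        fill = mode(present)
    elif kind == "behavior":
        fill = constant
    else:
        raise ValueError(f"unknown missing-data kind: {kind}")
    return [fill if v is None else v for v in values]

filled_age = fill_missing([0.0, 0.5, None, 1.0], "continuous")
filled_city = fill_missing([0, 1, None, 0], "categorical")
filled_clicks = fill_missing([3, None, 5], "behavior")
```

Using a dedicated constant for missing behavior keeps "no recorded behavior" distinguishable from any observed value, which appears to be the intent of treating it separately from missing information.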
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps:
Obtain a cluster distribution feature map of a SOM network trained on a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters included in the trained SOM network;
Determine the number of clusters of the trained SOM network based on the cluster distribution feature map;
Adjust the initial number of nodes of the original SOM network according to the number of clusters, determine the number of nodes of the available SOM network, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the initial number of nodes, that is consistent with the number of clusters;
Perform cluster analysis on the data set to be processed using the available SOM network, and determine the clustering result of the data set to be processed; the data set to be processed is generated from the user data to be processed;
Obtain the corresponding user classification result from the clustering result.
In one embodiment, the computer program further implements the following steps when executed by the processor:
Obtain each node of the output layer of the original SOM network, and initialize each node;
Obtain a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive training rounds reaches a preset threshold;
Obtain each item of sample user data in the sample data set, and normalize each item of sample user data;
Determine first sample data from the normalized sample user data, and determine the best matching node of the first sample data from among the nodes of the output layer of the original SOM network;
Obtain any topological neighborhood of the best matching node, and determine, from the topological neighborhood, the best matching neighborhood centered on the best matching node;
Return to the step of determining first sample data from the normalized sample user data until the training end condition is reached, thereby obtaining the SOM network trained on the sample data set.
In one embodiment, the computer program further implements the following steps when executed by the processor:
Obtain any topological neighborhood of the best matching node, the topological neighborhood being an initial neighborhood of a preset size;
Shrink the initial neighborhood according to the training time and the position of the best matching node in the initial neighborhood, and determine the best matching neighborhood centered on the best matching node.
In one embodiment, the computer program further implements the following steps when executed by the processor:
Adjust the weights of the best matching node and of each node in the corresponding initial neighborhood using the following formula:
$m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
where $t$ is the training step, $i$ denotes a node, $m_i(t)$ denotes the weight of node $i$ at step $t$, $\alpha(t)$ denotes the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ denotes the output vector.
In one embodiment, the computer program further implements the following steps when executed by the processor:
Obtain the user data to be processed, and determine the data type of each item of user data to be processed; the data types include a continuous variable data type and a categorical variable data type;
Preprocess each item of user data to be processed according to its data type to obtain initial user data; the preprocessing includes data normalization for the continuous variable data type and label encoding for the categorical variable data type;
Fill in missing values in the initial user data to generate the data set to be processed.
In one embodiment, the computer program further implements the following steps when executed by the processor:
Determine the data missing type of the initial user data; the data missing types include missing information and missing behavior, and missing information includes missing continuous variable data and missing categorical variable data;
Determine the mean of the initial user data of the continuous variable data type, and fill in the missing values of the initial user data with missing continuous variable data using the mean, to generate the data set to be processed;
or determine the mode of the initial user data of the categorical variable data type, and fill in the missing values of the initial user data with missing categorical variable data using the mode, to generate the data set to be processed;
or create a new constant, and fill in the missing values of the initial user data with missing behavior using the constant, to generate the data set to be processed.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several modifications and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A user classification method, the method comprising:
    obtaining a cluster distribution feature map of a SOM network trained on a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters included in the trained SOM network;
    determining the number of clusters of the trained SOM network based on the cluster distribution feature map;
    adjusting the initial number of nodes of an original SOM network according to the number of clusters, determining the number of nodes of an available SOM network, and obtaining the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the initial number of nodes, that is consistent with the number of clusters;
    performing cluster analysis on a data set to be processed using the available SOM network, and determining a clustering result of the data set to be processed; the data set to be processed is generated from user data to be processed;
    obtaining a corresponding user classification result from the clustering result.
  2. The method according to claim 1, wherein the method further comprises:
    obtaining each node of the output layer of the original SOM network, and initializing each of the nodes;
    obtaining a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive training rounds reaches a preset threshold;
    obtaining each item of sample user data in the sample data set, and normalizing each item of the sample user data;
    determining first sample data from the normalized sample user data, and determining the best matching node of the first sample data from among the nodes of the output layer of the original SOM network;
    obtaining any topological neighborhood of the best matching node, and determining, from the topological neighborhood, the best matching neighborhood centered on the best matching node;
    returning to the step of determining first sample data from the normalized sample user data until the training end condition is reached, thereby obtaining the SOM network trained on the sample data set.
  3. The method according to claim 2, wherein obtaining any topological neighborhood of the best matching node and determining, from the topological neighborhood, the best matching neighborhood centered on the best matching node comprises:
    obtaining any topological neighborhood of the best matching node, the topological neighborhood being an initial neighborhood of a preset size;
    shrinking the initial neighborhood according to the training time and the position of the best matching node in the initial neighborhood, and determining the best matching neighborhood centered on the best matching node.
  4. The method according to claim 3, wherein the method further comprises: adjusting the weights of the best matching node and of each node in the corresponding initial neighborhood using the following formula:
    $m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
    where $t$ is the training step, $i$ denotes a node, $m_i(t)$ denotes the weight of node $i$ at step $t$, $\alpha(t)$ denotes the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ denotes the output vector.
  5. The method according to claim 1, wherein, before performing cluster analysis on the data set to be processed using the available SOM network and determining the clustering result of the data set to be processed, the method further comprises:
    obtaining the user data to be processed, and determining the data type of each item of the user data to be processed; the data types include a continuous variable data type and a categorical variable data type;
    preprocessing each item of the user data to be processed according to the data type to obtain initial user data; the preprocessing includes normalizing data of the continuous variable data type and label-encoding data of the categorical variable data type;
    filling in missing values in the initial user data to generate the data set to be processed.
  6. The method according to claim 5, wherein filling in missing values in the initial user data to generate the data set to be processed comprises:
    determining the data missing type of the initial user data; the data missing types include missing information and missing behavior, and the missing information includes missing continuous variable data and missing categorical variable data;
    determining the mean of the initial user data of the continuous variable data type, and filling in the missing values of the initial user data with missing continuous variable data using the mean, to generate the data set to be processed;
    or
    determining the mode of the initial user data of the categorical variable data type, and filling in the missing values of the initial user data with missing categorical variable data using the mode, to generate the data set to be processed;
    or
    creating a new constant, and filling in the missing values of the initial user data with missing behavior using the constant, to generate the data set to be processed.
  7. A user classification apparatus, wherein the apparatus comprises:
    a cluster distribution feature map acquisition module, configured to obtain a cluster distribution feature map of a SOM network trained on a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters included in the trained SOM network;
    a cluster number determination module, configured to determine the number of clusters of the trained SOM network based on the cluster distribution feature map;
    an available SOM network generation module, configured to adjust the initial number of nodes of an original SOM network according to the number of clusters, determine the number of nodes of an available SOM network, and obtain the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the initial number of nodes, that is consistent with the number of clusters;
    a clustering result determination module, configured to perform cluster analysis on a data set to be processed using the available SOM network and determine a clustering result of the data set to be processed; the data set to be processed is generated from user data to be processed;
    a user classification result generation module, configured to obtain a corresponding user classification result from the clustering result.
  8. The apparatus according to claim 7, wherein the apparatus further comprises:
    a node acquisition module, configured to obtain each node of the output layer of the original SOM network and initialize each node;
    a training end condition acquisition module, configured to obtain a preset training end condition;
    a sample user data acquisition module, configured to obtain each item of sample user data in the sample data set and normalize each item of sample user data;
    a best matching node determination module, configured to determine first sample data from the normalized sample user data, and determine the best matching node of the first sample data from among the nodes of the output layer of the original SOM network;
    a best matching neighborhood determination module, configured to obtain any topological neighborhood of the best matching node, and determine, from the topological neighborhood, the best matching neighborhood centered on the best matching node;
    a trained SOM network generation module, configured to return to the step of determining first sample data from the normalized sample user data until the training end condition is reached, thereby obtaining the SOM network trained on the sample data set.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements a user classification method when executing the computer program, the user classification method comprising the following steps:
    obtaining a cluster distribution feature map of a SOM network trained on a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters included in the trained SOM network;
    determining the number of clusters of the trained SOM network based on the cluster distribution feature map;
    adjusting the initial number of nodes of an original SOM network according to the number of clusters, determining the number of nodes of an available SOM network, and obtaining the available SOM network; the number of nodes of the available SOM network is the number of nodes, obtained by adjusting the initial number of nodes, that is consistent with the number of clusters;
    performing cluster analysis on a data set to be processed using the available SOM network, and determining a clustering result of the data set to be processed; the data set to be processed is generated from user data to be processed;
    obtaining a corresponding user classification result from the clustering result.
  10. The computer device according to claim 9, wherein, when the processor executes the user classification method, the method further comprises:
    obtaining each node of the output layer of the original SOM network, and initializing each of the nodes;
    obtaining a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive training rounds reaches a preset threshold;
    obtaining each item of sample user data in the sample data set, and normalizing each item of the sample user data;
    determining first sample data from the normalized sample user data, and determining the best matching node of the first sample data from among the nodes of the output layer of the original SOM network;
    obtaining any topological neighborhood of the best matching node, and determining, from the topological neighborhood, the best matching neighborhood centered on the best matching node;
    returning to the step of determining first sample data from the normalized sample user data until the training end condition is reached, thereby obtaining the SOM network trained on the sample data set.
  11. The computer device according to claim 10, wherein obtaining any topological neighborhood of the best matching node and determining, from the topological neighborhood, the best matching neighborhood centered on the best matching node comprises:
    obtaining any topological neighborhood of the best matching node, the topological neighborhood being an initial neighborhood of a preset size;
    shrinking the initial neighborhood according to the training time and the position of the best matching node in the initial neighborhood, and determining the best matching neighborhood centered on the best matching node.
  12. The computer device according to claim 11, wherein, when the processor executes the user classification method, the method further comprises:
    adjusting the weights of the best matching node and of each node in the corresponding initial neighborhood using the following formula:
    $m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
    where $t$ is the training step, $i$ denotes a node, $m_i(t)$ denotes the weight of node $i$ at step $t$, $\alpha(t)$ denotes the learning rate, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ denotes the output vector.
  13. The computer device according to claim 9, wherein, before the processor performs cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, the method further comprises:
    Acquiring user data to be processed, and determining the data type of each item of the user data to be processed; the data types include a continuous variable data type and a categorical variable data type;
    Preprocessing each item of the user data to be processed according to its data type to obtain initial user data; the preprocessing includes normalizing the data of the continuous variable data type and label-encoding the data of the categorical variable data type;
    Performing missing value filling on the initial user data to generate the data set to be processed.
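The two preprocessing operations in claim 13 might look like the following sketch. The claim only says "normalizing"; min-max scaling to [0, 1] is an assumption here, as are the function names and the convention that `None` marks a missing value that is left for the later filling step.

```python
from typing import Dict, List, Optional


def min_max_normalize(values: List[Optional[float]]) -> List[Optional[float]]:
    """Scale continuous values to [0, 1]; missing entries (None) pass through."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    # A constant column has zero span; map every present value to 0.0.
    return [None if v is None else (0.0 if span == 0 else (v - lo) / span)
            for v in values]


def label_encode(values: List[Optional[str]]) -> List[Optional[int]]:
    """Map each distinct category to an integer in order of first appearance."""
    codes: Dict[str, int] = {}
    encoded: List[Optional[int]] = []
    for v in values:
        if v is None:
            encoded.append(None)
        else:
            encoded.append(codes.setdefault(v, len(codes)))
    return encoded
```

Keeping missing entries as `None` lets the filling step distinguish them from real values after encoding.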
  14. The computer device according to claim 13, wherein performing missing value filling on the initial user data to generate the data set to be processed comprises:
    Determining the data missing type of the initial user data; the data missing types include information missing and behavior missing, and information missing includes continuous variable data missing and categorical variable data missing;
    Determining the mean of the initial user data of the continuous variable data type, and filling the missing values of the initial user data whose missing type is continuous variable data missing with the mean, to generate the data set to be processed;
    or
    Determining the mode of the initial user data of the categorical variable data type, and filling the missing values of the initial user data whose missing type is categorical variable data missing with the mode, to generate the data set to be processed;
    or
    Creating a new constant, and filling the missing values of the initial user data whose missing type is behavior missing with the constant, to generate the data set to be processed.
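The three filling strategies in claim 14 can be sketched with the Python standard library. The function names, the use of `None` as the missing marker, and the default constant `-1` are assumptions for illustration; the application does not state which constant is created.

```python
from statistics import mean, mode
from typing import List, Optional


def fill_continuous(values: List[Optional[float]]) -> List[float]:
    """Fill missing continuous values with the mean of the present values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def fill_categorical(values: List[Optional[int]]) -> List[int]:
    """Fill missing categorical (label-encoded) values with the mode."""
    md = mode(v for v in values if v is not None)
    return [md if v is None else v for v in values]


def fill_behavior(values: List[Optional[float]], constant: float = -1) -> List[float]:
    """Fill behavior-missing values with a newly created constant.

    The constant marks "no such behavior recorded" rather than estimating
    a value; -1 is only a placeholder choice.
    """
    return [constant if v is None else v for v in values]
```

Both `mean` and `mode` raise `statistics.StatisticsError` when every value in the column is missing, so a caller would need to handle that case separately.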
  15. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, a user classification method is implemented, the user classification method comprising the following steps:
    Acquiring a cluster distribution feature map of an SOM network trained on a sample data set; the sample data set includes sample user data, the SOM network is a self-organizing map network, and the cluster distribution feature map reflects the number and distribution of the clusters contained in the trained SOM network;
    Determining the number of clusters of the trained SOM network based on the cluster distribution feature map;
    Adjusting the initial number of nodes of the original SOM network according to the number of clusters, determining the number of nodes of the available SOM network, and obtaining the available SOM network; the number of nodes of the available SOM network is the number of nodes, consistent with the number of clusters, obtained by adjusting the initial number of nodes;
    Performing cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed; the data set to be processed is generated from the user data to be processed;
    Obtaining the corresponding user classification result according to the clustering result.
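Once the available SOM network has as many nodes as there are clusters, the cluster analysis step can be read as assigning each record to its best matching node. The sketch below illustrates that reading only. Euclidean distance is an assumption, the names `best_matching_node` and `cluster` are hypothetical, and the mapping from cluster index to a user class is not specified in the claim, so it is not shown.

```python
import math
from typing import List, Sequence


def best_matching_node(x: Sequence[float],
                       weights: Sequence[Sequence[float]]) -> int:
    """Index of the node whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


def cluster(data: Sequence[Sequence[float]],
            weights: Sequence[Sequence[float]]) -> List[int]:
    """Assign every record to a cluster, one cluster per SOM node."""
    return [best_matching_node(x, weights) for x in data]
```

For example, with two nodes at `[0.0, 0.0]` and `[1.0, 1.0]`, the records `[0.1, 0.2]`, `[0.9, 0.8]` and `[0.4, 0.4]` fall into clusters 0, 1 and 0.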
  16. The computer-readable storage medium according to claim 15, wherein, when the computer program is executed by a processor to implement the user classification method, the method further comprises:
    Acquiring each node of the output layer of the original SOM network, and initializing each of the nodes;
    Acquiring a preset training end condition; the training end condition is that the weight error limit obtained from two consecutive training passes reaches a preset threshold;
    Acquiring each item of sample user data in the sample data set, and normalizing each item of the sample user data;
    Determining first sample data from the normalized sample user data, and determining the best matching node of the first sample data from the nodes of the output layer of the original SOM network;
    Acquiring any topological neighborhood of the best matching node, and determining, from the topological neighborhood, the best matching neighborhood centered on the best matching node;
    Returning to the step of determining the first sample data from the normalized sample user data until the training end condition is reached, to obtain the SOM network trained on the sample data set.
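The training steps of claim 16 can be put together in a small loop. This is a sketch under several assumptions that are not in the application: the output layer is a one-dimensional chain of nodes, the neighborhood function is Gaussian, the learning coefficient and radius decay linearly, the stop test compares the largest weight change between two consecutive passes against `threshold`, and the samples are already normalized to [0, 1]. All names and default values are illustrative.

```python
import math
import random
from typing import List, Sequence


def train_som(samples: Sequence[Sequence[float]], n_nodes: int,
              threshold: float = 1e-4, max_epochs: int = 200,
              seed: int = 0) -> List[List[float]]:
    """Train a 1-D SOM and return the node weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Initialize every output-layer node with random weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(max_epochs):
        previous = [w[:] for w in weights]
        decay = 1.0 - epoch / max_epochs
        alpha = 0.5 * decay                        # decreasing, stays in (0, 1)
        radius = max(1.0, (n_nodes / 2.0) * decay)  # shrinking neighborhood
        for x in samples:
            # Best matching node: the closest weight vector to the sample.
            c = min(range(n_nodes), key=lambda i: math.dist(x, weights[i]))
            for i in range(n_nodes):
                h = math.exp(-((i - c) ** 2) / (2.0 * radius ** 2))
                weights[i] = [w + alpha * h * (xv - w)
                              for w, xv in zip(weights[i], x)]
        # Training end condition: weights barely moved over this pass.
        change = max(abs(a - b)
                     for new, old in zip(weights, previous)
                     for a, b in zip(new, old))
        if change < threshold:
            break
    return weights
```

Fixing the seed keeps the result reproducible, which makes it easier to compare the cluster distribution obtained from different initial node counts.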
  17. The computer-readable storage medium according to claim 16, wherein acquiring any topological neighborhood of the best matching node and determining, from the topological neighborhood, the best matching neighborhood centered on the best matching node comprises:
    Acquiring any topological neighborhood of the best matching node, the topological neighborhood being an initial neighborhood of a preset range size;
    Shrinking the initial neighborhood according to the training time and the position of the best matching node in the initial neighborhood, to determine the best matching neighborhood centered on the best matching node.
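One way to picture the shrinking neighborhood of claim 17 is a square window on a two-dimensional node grid whose radius decreases with training time. The linear schedule, the square (Chebyshev) window, the clipping at the grid border, and both function names are assumptions for illustration; the claim does not fix any of them.

```python
from typing import List, Tuple


def shrunk_radius(initial_radius: int, t: int, total_steps: int) -> int:
    """Shrink the preset radius linearly toward 0 as training step t grows."""
    remaining = 1.0 - min(t, total_steps) / total_steps
    return int(round(initial_radius * remaining))


def neighborhood(bmu: Tuple[int, int], radius: int,
                 rows: int, cols: int) -> List[Tuple[int, int]]:
    """Grid nodes within `radius` of the best matching node, clipped to the grid."""
    r0, c0 = bmu
    return [(r, c)
            for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1))
            for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1))]
```

On a 3x3 grid, a radius of 1 around the center node covers all nine nodes, while a radius of 0 leaves only the best matching node itself, so late in training only that node is adjusted.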
  18. The computer-readable storage medium according to claim 17, wherein, when the computer program is executed by a processor to implement the user classification method, the method further comprises:
    Adjusting the weights of the best matching node and of each node in the corresponding initial neighborhood using the following formula:
    $m_i(t+1) = m_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - m_i(t)]$;
    where t is the step, i denotes the node, $m_i(t)$ is the weight of node i at step t, $\alpha(t)$ is the learning efficiency, a monotonically decreasing learning coefficient with $0 < \alpha(t) < 1$, $h_{ci}(t)$ is the neighborhood function, and $x(t)$ represents the output vector.
  19. The computer-readable storage medium according to claim 15, wherein, before the processor performs cluster analysis on the data set to be processed according to the available SOM network to determine the clustering result of the data set to be processed, the method further comprises:
    Acquiring user data to be processed, and determining the data type of each item of the user data to be processed; the data types include a continuous variable data type and a categorical variable data type;
    Preprocessing each item of the user data to be processed according to its data type to obtain initial user data; the preprocessing includes normalizing the data of the continuous variable data type and label-encoding the data of the categorical variable data type;
    Performing missing value filling on the initial user data to generate the data set to be processed.
  20. The computer-readable storage medium according to claim 19, wherein performing missing value filling on the initial user data to generate the data set to be processed comprises:
    Determining the data missing type of the initial user data; the data missing types include information missing and behavior missing, and information missing includes continuous variable data missing and categorical variable data missing;
    Determining the mean of the initial user data of the continuous variable data type, and filling the missing values of the initial user data whose missing type is continuous variable data missing with the mean, to generate the data set to be processed;
    or
    Determining the mode of the initial user data of the categorical variable data type, and filling the missing values of the initial user data whose missing type is categorical variable data missing with the mode, to generate the data set to be processed;
    or
    Creating a new constant, and filling the missing values of the initial user data whose missing type is behavior missing with the constant, to generate the data set to be processed.
PCT/CN2021/077380 2020-04-09 2021-02-23 User classification method and apparatus, computer device and storage medium WO2021203854A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010273736.1 2020-04-09
CN202010273736.1A CN111553390A (en) 2020-04-09 2020-04-09 User classification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021203854A1 true WO2021203854A1 (en) 2021-10-14

Family

ID=72007376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077380 WO2021203854A1 (en) 2020-04-09 2021-02-23 User classification method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN111553390A (en)
WO (1) WO2021203854A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988205A (en) * 2021-11-08 2022-01-28 福建龙净环保股份有限公司 Method and system for judging electric precipitation working condition
CN114386502A (en) * 2022-01-07 2022-04-22 北京点众科技股份有限公司 Method, apparatus and storage medium for cluster analysis of fast-applying users
CN116523320A (en) * 2023-07-04 2023-08-01 山东省标准化研究院(Wto/Tbt山东咨询工作站) Intellectual property risk intelligent analysis method based on Internet big data
CN117455555A (en) * 2023-12-25 2024-01-26 厦门理工学院 Big data-based electric business portrait analysis method and system
CN117593034A (en) * 2024-01-17 2024-02-23 湖南三湘银行股份有限公司 User classification method based on computer

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553390A (en) * 2020-04-09 2020-08-18 深圳壹账通智能科技有限公司 User classification method and device, computer equipment and storage medium
CN112464059B (en) * 2020-12-08 2024-03-22 深圳供电局有限公司 Distribution network user classification method, device, computer equipment and storage medium
CN118116610A (en) * 2024-04-28 2024-05-31 济宁职业技术学院 Data mining analysis method based on vision screening big data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100194742A1 (en) * 2009-02-03 2010-08-05 Xerox Corporation Adaptive grand tour
CN104182889A (en) * 2014-08-18 2014-12-03 国家电网公司 Method for processing data and identifying fluctuations of historical wind power output
CN111553390A (en) * 2020-04-09 2020-08-18 深圳壹账通智能科技有限公司 User classification method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG LIGANG: "Data Mining Based on SOM Clustering and Its Application", 1 August 2006 (2006-08-01), pages 1 - 55, XP055858159, [retrieved on 20211104] *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988205A (en) * 2021-11-08 2022-01-28 福建龙净环保股份有限公司 Method and system for judging electric precipitation working condition
CN113988205B (en) * 2021-11-08 2022-09-20 福建龙净环保股份有限公司 Method and system for judging electric precipitation working condition
CN114386502A (en) * 2022-01-07 2022-04-22 北京点众科技股份有限公司 Method, apparatus and storage medium for cluster analysis of fast-applying users
CN116523320A (en) * 2023-07-04 2023-08-01 山东省标准化研究院(Wto/Tbt山东咨询工作站) Intellectual property risk intelligent analysis method based on Internet big data
CN116523320B (en) * 2023-07-04 2023-09-12 山东省标准化研究院(Wto/Tbt山东咨询工作站) Intellectual Property Risk Intelligent Analysis Method Based on Internet Big Data
CN117455555A (en) * 2023-12-25 2024-01-26 厦门理工学院 Big data-based electric business portrait analysis method and system
CN117455555B (en) * 2023-12-25 2024-03-08 厦门理工学院 Big data-based electric business portrait analysis method and system
CN117593034A (en) * 2024-01-17 2024-02-23 湖南三湘银行股份有限公司 User classification method based on computer
CN117593034B (en) * 2024-01-17 2024-06-07 湖南三湘银行股份有限公司 User classification method based on computer

Also Published As

Publication number Publication date
CN111553390A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
WO2021203854A1 (en) User classification method and apparatus, computer device and storage medium
Gurumoorthy et al. Efficient data representation by selecting prototypes with importance weights
US20210256403A1 (en) Recommendation method and apparatus
Gibert et al. Graph embedding in vector spaces by node attribute statistics
US7139739B2 (en) Method, system, and computer program product for representing object relationships in a multidimensional space
CN110796190A (en) Exponential modeling with deep learning features
US9536201B2 (en) Identifying associations in data and performing data analysis using a normalized highest mutual information score
CN112765480B (en) Information pushing method and device and computer readable storage medium
CN109766454A (en) A kind of investor's classification method, device, equipment and medium
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN108549909B (en) Object classification method and object classification system based on crowdsourcing
CN112131261B (en) Community query method and device based on community network and computer equipment
CN112231592A (en) Network community discovery method, device, equipment and storage medium based on graph
CN112258250A (en) Target user identification method and device based on network hotspot and computer equipment
Wang et al. Efficient multi-modal hypergraph learning for social image classification with complex label correlations
US11775813B2 (en) Generating a recommended target audience based on determining a predicted attendance utilizing a machine learning approach
CN113656699B (en) User feature vector determining method, related equipment and medium
Zou et al. NCRL: Neighborhood-based collaborative residual learning for adaptive QoS prediction
Goyal et al. Leaf Bagging: A novel meta heuristic optimization based framework for leaf identification
CN108876422B (en) Method and device for information popularization, electronic equipment and computer readable medium
CN111291795A (en) Crowd characteristic analysis method and device, storage medium and computer equipment
CN115564532A (en) Training method and device of sequence recommendation model
CN115659005A (en) Product pushing method and device, computer equipment and storage medium
CN114298118B (en) Data processing method based on deep learning, related equipment and storage medium
Liu et al. Attentive-feature transfer based on mapping for cross-domain recommendation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21785348

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.02.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21785348

Country of ref document: EP

Kind code of ref document: A1