CN110837862B - User classification method and device - Google Patents

User classification method and device

Info

Publication number
CN110837862B
Authority
CN
China
Prior art keywords
user
function
application
sequence
average value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911078245.5A
Other languages
Chinese (zh)
Other versions
CN110837862A (en)
Inventor
邱鑫
吴春成
邱泰生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911078245.5A
Publication of CN110837862A
Application granted
Publication of CN110837862B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention provides a user classification method, a user classification apparatus, an electronic device, and a storage medium. The method comprises: acquiring operation data of a user in an application, and analyzing the operation data to obtain a function sequence consisting of the functions sequentially used by the user in the application; performing word embedding processing on the name of each function in the function sequence of the user to obtain a vector corresponding to each function; combining, in order, the vectors corresponding to the functions in the function sequence to obtain a function sequence matrix corresponding to the user; and clustering the function sequence matrices respectively corresponding to a plurality of users to obtain the category to which the user corresponding to each function sequence matrix belongs. The invention allows users to be classified accurately according to the function sequences they use.

Description

User classification method and device
Technical Field
The present invention relates to the field of data mining technologies, and in particular, to a user classification method and apparatus, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) is a comprehensive discipline of computer science that studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision making. Artificial intelligence covers a wide range of fields, such as natural language processing and machine learning/deep learning. As the technology develops, artificial intelligence will be applied in more fields and deliver increasing value.
Data mining is an important application field of artificial intelligence; its purpose is to find, by means of algorithms, information hidden in large amounts of data. User classification is an important application of data mining: users are divided into different categories, the needs of the different user groups are determined, and corresponding information is pushed to each group, so that information is pushed accurately and in a targeted manner.
In user classification, products usually profile and classify users statically based on the users' demographic information (such as age, gender, occupation, and education level), or define categories based on specific behaviors of users in the product (for example, selecting users who reached level 10 or above in a past period of time and sent paid gifts more than 10 times).
However, the user classification methods provided by the related art classify users mainly on the basis of user attribute information, which often leads to inaccurate classification results.
Disclosure of Invention
The embodiment of the invention provides a user classification method, a user classification device, electronic equipment and a storage medium, which can accurately classify users according to function sequences used by the users.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a user classification method, which comprises the following steps:
acquiring operation data of a user in an application, and analyzing to obtain a function sequence consisting of functions sequentially used by the user in the application;
performing word embedding processing on the name of each function in the function sequence of the user to obtain a vector corresponding to each function;
combining, in order, the vectors corresponding to the functions in the function sequence to obtain a function sequence matrix corresponding to the user;
and clustering the function sequence matrices respectively corresponding to a plurality of users to obtain the category to which the user corresponding to each function sequence matrix belongs.
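By way of non-limiting illustration only, the combining step above can be sketched as stacking the per-function vectors, in the order the functions were used, into a matrix with one row per function. The embedding table, the function names, and the helper `build_sequence_matrix` are hypothetical and are not taken from the embodiments.

```python
from typing import Dict, List

# Hypothetical embedding table: function name -> dense vector.
# Names and values are illustrative assumptions only.
EMBEDDINGS: Dict[str, List[float]] = {
    "home": [0.1, 0.3],
    "search": [0.7, 0.2],
    "pay": [0.9, 0.8],
}


def build_sequence_matrix(
    function_sequence: List[str],
    embeddings: Dict[str, List[float]],
) -> List[List[float]]:
    """Stack the vectors of the functions, in usage order, one row per function."""
    return [list(embeddings[name]) for name in function_sequence]


# A user who used home -> search -> pay -> home yields a 4 x 2 matrix.
matrix = build_sequence_matrix(["home", "search", "pay", "home"], EMBEDDINGS)
```

Because the rows keep the usage order, two users who used the same functions in a different order obtain different matrices.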
An embodiment of the present invention provides a user classification apparatus, including:
the acquisition module is used for acquiring operation data of a user in an application;
the analysis module is used for analyzing the operation data acquired by the acquisition module to obtain a function sequence consisting of functions sequentially used by the user in the application;
the word embedding module is used for carrying out word embedding processing on the name of each function in the function sequence of the user obtained by the analysis module to obtain a vector corresponding to each function;
the combination module is used for combining, in order, the vectors obtained by the word embedding module for the functions in the function sequence to obtain a function sequence matrix corresponding to the user;
and the clustering module is used for clustering the function sequence matrices, obtained by the combination module, that respectively correspond to a plurality of users, to obtain the category of the user corresponding to each function sequence matrix.
In the foregoing solution, the acquisition module is further configured to obtain an average value of the usage times of the application corresponding to different users, where the usage time is the time interval between when a user starts using the application and when the user stops using the application;
dividing the average value of the usage time into different time periods, and acquiring the user scale of the application in the different time periods;
acquiring, for a time period in which the user scale exceeds a scale threshold, operation data of the users online in that time period;
wherein the scenarios of starting to use the application include: the process of the application is started, and the application is switched from the background to the foreground; the scenarios of stopping using the application include: the process of the application is ended, and the application is switched from the foreground to the background.
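As one possible reading of the period selection above, and only as a simplified sketch, the following counts the distinct users online in each hour-long period and keeps the periods whose user scale exceeds a threshold. The hour granularity, the `Session` tuple layout, and the function names are assumptions made for illustration; the averaging of usage times is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical session record: (user_id, start_hour, end_hour), hours 0..23,
# where start is "application started or moved to foreground" and
# end is "application ended or moved to background" (end hour inclusive).
Session = Tuple[str, int, int]


def users_per_hour(sessions: List[Session]) -> Dict[int, int]:
    """Count the distinct users online in each hour-long period."""
    online: Dict[int, Set[str]] = defaultdict(set)
    for user_id, start_hour, end_hour in sessions:
        for hour in range(start_hour, end_hour + 1):
            online[hour].add(user_id)
    return {hour: len(users) for hour, users in online.items()}


def peak_hours(sessions: List[Session], scale_threshold: int) -> List[int]:
    """Return the periods whose user scale exceeds the scale threshold."""
    counts = users_per_hour(sessions)
    return sorted(hour for hour, n in counts.items() if n > scale_threshold)


sessions = [("u1", 9, 10), ("u2", 10, 11), ("u3", 10, 10), ("u1", 20, 21)]
selected = peak_hours(sessions, scale_threshold=2)  # only hour 10 has 3 users
```

Operation data would then be collected only for the users online during the selected periods.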
In the foregoing solution, the analysis module is further configured to select, from the operation data of the user, functions that satisfy at least one of the following conditions: the usage frequency of the function exceeds a frequency threshold; the usage path depth of the function exceeds a depth threshold;
and combining the selected functions according to the sequence used by the user to obtain a function sequence corresponding to the user.
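The selection and ordering described above may be sketched as follows, purely as an assumption-laden example. The `Event` record, its `path_depth` field, and the choice to count frequency within a single user's events are hypothetical; the embodiments may measure frequency and path depth differently.

```python
from collections import Counter
from typing import List, NamedTuple


class Event(NamedTuple):
    function: str    # name of the function the user used
    path_depth: int  # assumed: number of levels at which the function sits in the application


def build_function_sequence(
    events: List[Event], freq_threshold: int, depth_threshold: int
) -> List[str]:
    """Keep functions whose frequency OR path depth exceeds its threshold, in usage order."""
    frequency = Counter(event.function for event in events)
    return [
        event.function
        for event in events
        if frequency[event.function] > freq_threshold
        or event.path_depth > depth_threshold
    ]


events = [
    Event("home", 1), Event("search", 2), Event("home", 1),
    Event("settings", 2), Event("gift_shop", 4), Event("home", 1),
]
# "home" passes on frequency (3 > 2); "gift_shop" passes on depth (4 > 3).
sequence = build_function_sequence(events, freq_threshold=2, depth_threshold=3)
```

The events are assumed to be already sorted by time, so the filtered list preserves the order in which the user used the functions.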
In the above scheme, the word embedding module is further configured to determine the size of a sliding window for training a skip-gram model;
obtaining training sample pairs according to the size of the sliding window, wherein each training sample pair comprises an input sample and an output sample;
training the skip-gram model according to the training sample pairs to obtain the parameters of the hidden layer of the skip-gram model;
and performing word embedding processing on the function sequence based on the trained skip-gram model to obtain a vector corresponding to each function in the function sequence.
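As a minimal sketch of how training sample pairs can be derived from a sliding window for a skip-gram (word skipping) model, the following treats each function in the sequence as the input sample and each neighbor within the window as an output sample. The function names and the window size are illustrative assumptions; the training of the network itself is not shown.

```python
from typing import List, Tuple


def skipgram_pairs(sequence: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (input, output) pairs: each function paired with its neighbors within the window."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a function is not paired with itself at the same position
                pairs.append((center, sequence[j]))
    return pairs


pairs = skipgram_pairs(["home", "search", "detail", "pay"], window=1)
# With window=1 each function is paired only with its immediate neighbors.
```

In practice the pairs would be fed to a two-layer network, and the learned hidden-layer weights would serve as the function vectors. An existing Word2vec implementation, for example the `Word2Vec` class of the gensim library with `sg=1`, could be used instead of hand-written training, although the embodiments do not name a particular library.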
In the foregoing solution, the clustering module is further configured to determine an average value of vectors corresponding to a plurality of functions in the function sequence matrix;
clustering according to the average values respectively corresponding to the users to obtain a plurality of average value combinations corresponding to different categories, wherein each average value combination comprises the average values corresponding to part of the users;
and determining the category to which the corresponding user belongs according to the category corresponding to the average value combination to which the average value of each user belongs.
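The averaging step above can be illustrated as reducing each user's function sequence matrix to a single vector by taking the column-wise mean of its rows. This is a sketch under the assumption that every row has the same dimension; the user identifiers and values are invented for the example.

```python
from typing import Dict, List


def mean_vector(matrix: List[List[float]]) -> List[float]:
    """Column-wise mean of a function sequence matrix (one row per function)."""
    if not matrix:
        raise ValueError("a function sequence matrix needs at least one row")
    dim = len(matrix[0])
    return [sum(row[d] for row in matrix) / len(matrix) for d in range(dim)]


# Hypothetical matrices for two users; sequences may differ in length.
user_matrices: Dict[str, List[List[float]]] = {
    "u1": [[1.0, 0.0], [3.0, 2.0]],
    "u2": [[0.0, 4.0]],
}
user_means = {user: mean_vector(m) for user, m in user_matrices.items()}
```

Averaging gives every user a fixed-length vector regardless of how many functions the user used, which is what allows users with sequences of different lengths to be clustered together.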
In the foregoing solution, the clustering module is further configured to randomly allocate the average values corresponding to the multiple users to k average value combinations;
when the k average value combinations do not meet the convergence condition, iteratively updating the average values included in the k average value combinations until the convergence condition is met;
wherein k represents the number of the plurality of average value combinations, and k is an integer greater than or equal to 1;
the convergence condition includes at least one of: the similarity between the average values in the average value combination of the same category is greater than a first similarity threshold, and the similarity of the average values in the average value combination of different categories is less than a second similarity threshold; wherein the first similarity threshold is greater than the second similarity threshold.
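The random allocation and iterative update above resemble K-means clustering, which may be sketched as follows. Note the substitutions: this example uses squared Euclidean distance and stops when the assignments no longer change, rather than the similarity thresholds recited above, and it re-seeds an empty group with a random point. All names are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Vector]) -> Vector:
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points: List[Vector], k: int, seed: int = 0, max_iter: int = 100) -> List[int]:
    """Return a group label in 0..k-1 for each point."""
    rng = random.Random(seed)
    # Random initial allocation; labels are spread evenly so no group starts empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: an empty group is re-seeded with a random point.
            centers.append(centroid(members) if members else list(rng.choice(points)))
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # simplified convergence: assignments are stable
            break
        labels = new_labels
    return labels


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = kmeans(points, k=2)
```

For the four sample points, the two points near the origin end up in one group and the two points near (10, 10) in the other; which group receives label 0 depends on the random allocation.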
In the foregoing solution, the clustering module is further configured to traverse values of k to determine a relationship curve between k and the error of grouping the average values of the plurality of users based on k;
and determining the value of k corresponding to the inflection point of the relation curve as the final value of the number of the average value combinations.
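One simple way to locate the inflection point of such a curve, offered only as an assumed heuristic, is to pick the k at which the drop in error slows down the most (the largest second difference). The error values below are invented for illustration and could stand for a within-cluster sum of squared errors; the embodiments do not prescribe this particular rule.

```python
from typing import Dict


def elbow_k(errors: Dict[int, float]) -> int:
    """Pick the k where the error curve bends most sharply (largest second difference)."""
    ks = sorted(errors)
    if len(ks) < 3:
        raise ValueError("need errors for at least three values of k")
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Drop in error before k minus drop in error after k.
        bend = (errors[prev_k] - errors[k]) - (errors[k] - errors[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Illustrative (made-up) grouping errors for k = 1..5.
errors = {1: 100.0, 2: 60.0, 3: 15.0, 4: 12.0, 5: 10.0}
chosen_k = elbow_k(errors)  # the curve flattens after k = 3
```

The first and last values of k cannot be chosen by this rule because a second difference needs a neighbor on both sides.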
In the above scheme, the apparatus further includes a determining module, configured to determine, for each category of user group, the preference degree of the user group for a preset function;
wherein determining the preference degree of the user group for the preset function comprises:
determining the ratio of the number of people in the user group who use the preset function to the total number of people who use the preset function;
determining the proportion of the user group among all users;
determining the quotient obtained by dividing the ratio by the proportion as the preference degree of the user group for the preset function;
the preset function is a function satisfying the following conditions:
the frequency of use exceeds a frequency threshold and the path depth in the application exceeds a depth threshold.
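The preference degree computation above can be worked through with a short example. The parameter names and the counts are hypothetical; the calculation itself follows the two ratios and the division described above.

```python
def preference_degree(
    group_users_of_function: int,
    total_users_of_function: int,
    group_size: int,
    total_users: int,
) -> float:
    """Share of the function's users who belong to the group, divided by the group's share of all users."""
    ratio = group_users_of_function / total_users_of_function
    quantity_proportion = group_size / total_users
    return ratio / quantity_proportion


# Illustrative counts: 50 of the 100 users of the function are in the group,
# while the group is 250 of 1000 users overall.
degree = preference_degree(50, 100, 250, 1000)  # 0.5 / 0.25 = 2.0
```

A value above 1 indicates that the group is over-represented among the users of the function relative to its size; multiplying by 100 puts the value on the same scale as a Target Group Index, where 100 is the average level.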
In the above scheme, the acquisition module is further configured to obtain the operation data of the user in the application from a database, and obtain a hash corresponding to the operation data from a blockchain network;
and when the hash of the operation data is consistent with the hash obtained from the blockchain network, determining that the operation data is credible.
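The credibility check above may be sketched as recomputing the hash of the operation data read from the database and comparing it with the hash recorded on the blockchain network. SHA-256 is an assumption, since no hash algorithm is specified; the dictionary `ledger` merely stands in for a query to the blockchain network, and the record identifier is hypothetical.

```python
import hashlib
import hmac
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Hex digest of the operation data (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def is_trusted(operation_data: bytes, chain_hash: str) -> bool:
    """True when the recomputed hash matches the hash stored on the chain."""
    return hmac.compare_digest(sha256_hex(operation_data), chain_hash)


# Stand-in for the blockchain network: record id -> hash stored at upload time.
record = b'{"user": "u1", "functions": ["home", "search", "pay"]}'
ledger: Dict[str, str] = {"rec-1": sha256_hex(record)}

ok = is_trusted(record, ledger["rec-1"])            # unchanged data matches
tampered = is_trusted(b"tampered", ledger["rec-1"])  # modified data does not
```

Only data that passes the check would be passed on to the analysis of function sequences.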
An embodiment of the present invention provides a user classification device, including:
a memory for storing executable instructions;
and the processor is used for realizing the user classification method provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute so as to realize the user classification method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention takes the order in which functions are used as an important indicator for user classification, retains the latent information carried by what the user does before and after using each function, and classifies users based on the function sequences they use, thereby improving the accuracy of user classification.
Drawings
FIG. 1 is an alternative architecture diagram of a user classification system provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative architecture of a user classification system according to an embodiment of the present invention;
FIG. 3 is an alternative structural diagram of a server according to an embodiment of the present invention;
FIG. 4 is an alternative flow chart of a user classification method according to an embodiment of the present invention;
FIG. 5A is a schematic diagram of a skip-gram model provided by an embodiment of the present invention;
FIG. 5B is a schematic diagram of a skip-gram model combined with a Huffman tree according to an embodiment of the present invention;
FIG. 5C is a schematic diagram of a continuous bag of words model provided by an embodiment of the invention;
FIG. 6A is a schematic diagram of a process for classification based on a K-means clustering model according to an embodiment of the present invention;
FIG. 6B is a schematic diagram of a process for classifying based on a Mean-Shift clustering model according to an embodiment of the present invention;
FIG. 6C is a schematic diagram of a process for classification based on a hierarchical clustering model according to an embodiment of the present invention;
FIG. 7A is a schematic diagram of a specific application scenario of the user classification method according to the embodiment of the present invention;
FIG. 7B is a schematic diagram of another specific application scenario of the user classification method according to the embodiment of the present invention;
FIG. 7C is a schematic diagram of yet another specific application scenario of the user classification method according to the embodiment of the present invention;
FIG. 8 is an alternative flow chart of a user classification method according to an embodiment of the present invention;
FIG. 9 is a graph of the traffic variation of application A over one day according to an embodiment of the present invention;
FIG. 10 is a graph of the correspondence between K values and WSSSE provided by an embodiment of the present invention;
FIG. 11 is a schematic diagram of the preference degrees of different user groups for different main function points according to the embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments are explained; the following explanations apply to them.
1) Word Embedding: a general term for models that vectorize words; the core idea is to map each word to a dense vector in a low-dimensional space.
For example, Word2vec adopts the N-Gram Model assumption, that is, a word is assumed to be related only to the N words surrounding it and not to the other words in the text. Word2vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and has to guess the words in adjacent positions; after training is completed, the Word2vec model can map each word to a vector, and the vectors can represent the relationships between words.
2) Clustering: the process of dividing a data set into groups or clusters of similar objects, so that the similarity between objects in the same group is maximized and the similarity between objects in different groups is minimized. In other words, a cluster is a set of objects that are similar to one another, while objects in different clusters are usually dissimilar or have a low degree of similarity.
The clustering methods provided by the related art include partitioning methods, hierarchical methods, density-based methods, and the like. A partitioning method divides the data set into a plurality of clusters, using distance as the similarity measure between data items in the data set, for example the K-Means clustering algorithm; a hierarchical method decomposes a given data set hierarchically to form a tree-shaped clustering result, for example the Divisive Analysis clustering algorithm (DIANA); a density-based method clusters on the basis of density, for example Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
3) Target Group Index (TGI): characterizes how the concerns of users with different characteristics differ. A TGI equal to 100 represents the average level, and a TGI higher than 100 indicates that users of the category pay more attention to the matter than users overall. TGI = [proportion of the target group that has a certain characteristic / proportion of the overall population that has the same characteristic] × 100. Depending on the research objective, the TGI is also referred to as preference, core user, and so on.
4) Blockchain: an encrypted, chained transaction storage structure formed of blocks.
5) Blockchain Network: the set of nodes that incorporate new blocks into the blockchain by consensus.
In the process of implementing the embodiments of the present invention, the inventors found that, in the related art, products usually profile and classify users statically based on the users' demographic attribute information (such as age, gender, occupation, education level, and city), or classify users by when they were newly added, how active they are, the channel they came from, and the like.
In addition, the related art defines categories based on specific behaviors of users in the product (for example, selecting users who have performed "leave a message" and "like" actions in the past 30 days, have reached level 10 or above, and have sent paid gifts more than 10 times).
However, the user classification methods provided by the related art cannot dynamically perceive, through users' preferences for the function points they use, how the product differs from one user to another. They also lose the latent information carried by what users do before and after using a function, and analyze the influence of one or two functions on users in isolation, so the classification results are not accurate enough.
To address this, the embodiments of the present invention move away from studying each function in isolation and take the order in which functions are used as an important indicator for user classification. Operation data of a user in an application (which may be of various types, such as a client program, a web page program, or an applet) is acquired and analyzed to obtain a function sequence consisting of the functions sequentially used by the user in the application; word embedding processing is performed on the name of each function in the function sequence of the user to obtain a vector corresponding to each function; the vectors corresponding to the functions in the function sequence are combined in order to obtain a function sequence matrix corresponding to the user; and the function sequence matrices respectively corresponding to a plurality of users are clustered to obtain the category to which the user corresponding to each function sequence matrix belongs.
In view of this, embodiments of the present invention provide a user classification method, apparatus, electronic device, and storage medium, which can dynamically profile and classify users and improve classification accuracy.
An exemplary application of the user classification device provided in the embodiment of the present invention is described below. The user classification device may be implemented as a server or a server cluster, or may be implemented by a user terminal and a server in cooperation. In the following, an exemplary application in which the user classification device is implemented as a server is explained.
Referring to FIG. 1, FIG. 1 is an alternative architecture diagram of a user classification system 100 according to an embodiment of the present invention. To implement user classification, a terminal 400 (terminals 400-1 and 400-2 are exemplarily shown) is connected to a server 200 and a database 500 through a network 300, and the server 200 is also connected to the database 500. The network 300 may be a wide area network, a local area network, or a combination of the two.
As shown in FIG. 1, an application 410 on the terminal 400 (an application 410-1 on the terminal 400-1 and an application 410-2 on the terminal 400-2 are exemplarily shown) records operation data of a user operating the application 410 over a certain period of time, and transmits the recorded operation data through the network 300 to the database 500 for storage. The server 200 obtains the operation data of the user in the application 410 from the database 500, and analyzes the obtained operation data to obtain a function sequence composed of the functions sequentially used by the user in the application 410. Then, the server 200 performs word embedding processing on the name of each function in the function sequence of the user to obtain a vector corresponding to each function, and combines, in order, the vectors corresponding to the functions in the function sequence to obtain a function sequence matrix corresponding to the user. Subsequently, the server 200 clusters the function sequence matrices respectively corresponding to a plurality of users to obtain the category to which the user corresponding to each function sequence matrix belongs. Finally, after dividing the users into categories, the server 200 may customize different recommended content for different user groups.
The embodiments of the present invention can also be implemented in combination with blockchain technology. A blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including generating and maintaining public and private keys (account management), managing keys, and maintaining the correspondence between users' real identities and blockchain addresses (authority management); where authorized, it also supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and, after consensus is reached on a valid request, record it to storage; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits the encrypted information completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for registering, issuing, triggering, and executing contracts; developers can define contract logic in a programming language and publish it to the blockchain (contract registration), and execution is triggered by keys or other events according to the logic of the contract terms to complete the contract logic; the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings, and cloud adaptation during product release, and for the visual output of real-time status during product operation, such as alarms, monitoring of network conditions, and monitoring of the health status of node devices.
Referring to FIG. 2, FIG. 2 is a schematic diagram of another alternative architecture of the user classification system 101 according to the embodiment of the present invention. As shown in FIG. 2, the user classification system 101 includes a terminal 400 (terminals 400-1 and 400-2 are exemplarily shown), a network 300, a server 200, a database 500, and a blockchain network 600 (exemplarily shown as including a node 610-1, a node 610-2, and a node 610-3). After recording operation data of the user operating the application 410 over a certain period of time, the application 410 on the terminal 400 (the application 410-1 and the application 410-2 are exemplarily shown) sends the operation data through the network 300 to the database 500 for storage, and sends the hash corresponding to the operation data to the blockchain network 600 for storage. After obtaining the operation data of the user in the application 410 from the database 500, the server 200 requests the hash corresponding to the operation data from the blockchain network 600 and receives the hash returned by the blockchain network 600. The server 200 compares the hash of the operation data with the hash returned by the blockchain network 600 to determine whether the operation data is trustworthy, and performs the subsequent steps only after determining that the acquired operation data is trustworthy.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a server 200 according to an embodiment of the present invention. The server 200 shown in FIG. 3 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in the server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communication among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as the bus system 240 in FIG. 3.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the user classifying device provided by the embodiments of the present invention may be implemented in software. Fig. 3 illustrates the user classifying device 255 stored in the memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: an acquisition module 2551, an analysis module 2552, a word embedding module 2553, a combination module 2554, a clustering module 2555 and a determination module 2556. These modules are logical, and may therefore be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the user classifying device provided in the embodiments of the present invention may be implemented in hardware. As an example, the user classifying device may be a processor in the form of a hardware decoding processor, which is programmed to execute the user classification method provided in the embodiments of the present invention. For example, the processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The following describes a user classification method provided in the embodiment of the present invention with reference to an exemplary application of the user classification device provided in the embodiment of the present invention when implemented as a server.
Referring to fig. 4, fig. 4 is an alternative flowchart of a user classification method according to an embodiment of the present invention, which will be described with reference to the steps shown in fig. 4.
In step S401, the server acquires operation data of the user in the application from the database.
Here, the applications include various types of applications such as a social application, a security application, an audio playback application, and a video playback application.
In some embodiments, the server first counts the usage duration of a certain application by different users and determines the average value of the usage duration, where the usage duration refers to the time interval from when the user starts using the application to when the user stops using it. Then, a day is divided into different time periods based on the average value of the usage duration, and the user scale of the application in each time period is acquired. For each time period in which the user scale exceeds a scale threshold, the operation data of the users online in that time period is acquired. The scenarios of starting to use the application include: the process of the application is started, or the application is switched from the background to the foreground; the scenarios of ending the use of the application include: the process of the application is ended, or the application is switched from the foreground to the background.
For example, taking the application as the social application A, the server first counts the usage duration of application A by different users, calculates the average value, and finds that the average value does not exceed 3 hours. Then, the server divides one day into 8 periods with 3 hours as the basic unit, and counts the user scale in each period. For example, the server observes one week of data and finds that the user scale peaks between 10:00 and 13:00, so the operation data of the users online between 10:00 and 13:00 is acquired.
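The period selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the session tuples `(user, start_hour, end_hour)`, the rounding of the average duration up to a whole hour, and the names `split_day` and `peak_periods` are all hypothetical.

```python
import math
from collections import defaultdict

def split_day(avg_duration_hours):
    """Divide a 24-hour day into periods whose length is the average usage duration."""
    slot = max(1, math.ceil(avg_duration_hours))  # assumption: round up to whole hours
    return [(start, min(start + slot, 24)) for start in range(0, 24, slot)]

def peak_periods(sessions, scale_threshold):
    """Return the periods in which the number of distinct online users exceeds the threshold."""
    avg = sum(end - start for _, start, end in sessions) / len(sessions)
    periods = split_day(avg)
    users = defaultdict(set)
    for user, start, end in sessions:
        for period in periods:
            # A user counts as online in a period if the session overlaps it.
            if start < period[1] and end > period[0]:
                users[period].add(user)
    return [period for period in periods if len(users[period]) > scale_threshold]

sessions = [("u1", 10, 12), ("u2", 9, 12), ("u3", 10, 11.5), ("u4", 20, 23)]
busy = peak_periods(sessions, scale_threshold=2)
```

The operation data of the users online in the returned periods would then be fetched from the database.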
The method for selecting the operation data of the active user based on the average value of the use time length can ensure the effectiveness of the acquired operation data.
In other embodiments, the operation data may also be operation data from a plurality of different applications, and the plurality of applications may be continuous or discontinuous in the usage sequence.
For example, if the user uses the application 1, the application 2, the application 3, and the application 4 in sequence, the server may acquire operation data of the user operating the applications 1 to 4, or may acquire only operation data of the applications 1, 2, and 4.
For example, if the user uses application 1 and application 2 alternately, the server obtains operation data of the user using application 1 and application 2 alternately for a certain period of time.
In other embodiments, the server obtains the operation data of the user on the application from the database, obtains the hash corresponding to the operation data from the blockchain network, and determines whether the hash of the operation data is consistent with the hash returned by the blockchain network, so as to verify the reliability of the operation data.
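The hash comparison can be sketched as below. The text does not name a hash algorithm or a serialization format, so SHA-256 over a key-sorted JSON encoding is an assumption made purely for illustration, as are the function names; the hash returned by the blockchain network is represented by a plain string.

```python
import hashlib
import json

def operation_hash(operation_data):
    """Hash the operation data deterministically (assumed: SHA-256 over sorted JSON)."""
    payload = json.dumps(operation_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def is_reliable(operation_data, chain_hash):
    """The data is treated as reliable only if its hash matches the on-chain hash."""
    return operation_hash(operation_data) == chain_hash

record = {"user": "u1", "functions": ["home", "chat", "pay"]}
stored_hash = operation_hash(record)  # stands in for the hash kept on the blockchain
tampered = {"user": "u1", "functions": ["home", "chat"]}
```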
In step S402, the server selects functions satisfying a preset condition from the acquired operation data, and combines the selected functions according to the sequence used by the user to obtain a function sequence corresponding to the user.
Here, the preset condition includes that the frequency of use of the function exceeds a frequency threshold, or the depth of the path where the function is located exceeds a depth threshold, that is, the selected function can cover all the main paths.
In some embodiments, an application involves a large number of functions, some of which are used so infrequently that they have little effect on user classification. Therefore, functions used fewer than a certain number of times can be filtered out first, and the path depths of the functions can then be taken into account to ensure that the selected functions cover all the main paths. The selected functions are then combined in the order in which the user used them to obtain the function sequence corresponding to the user.
By way of example, taking the application as the social application A, suppose the social application A has 1000 functions in total. The usage frequency of the 1000 functions by different users is counted, and functions whose usage frequency is below the frequency threshold (for example, functions used fewer than 1000 times) are filtered out. The path depths of the functions are then taken into account to ensure that the selected functions cover all the main paths, and 26 main functions are finally selected from the 1000 functions. Next, the operation data of the user in the social application A is analyzed, the functions used by the user are matched against the 26 main functions, only the functions among the 26 main functions are retained, and the retained functions are combined in the order in which the user used them to obtain the function sequence corresponding to the user.
For example, assuming that the functions used by the user when operating the social application a include the functions 1 to 50, and the functions 23 to 50 do not belong to the 26 main functions, only the functions 1 to 22 are reserved, and the functions 1 to 22 are combined according to the sequence used by the user to obtain the function sequence corresponding to the user.
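The screening and sequence construction of step S402 can be sketched as follows. The data layout (one list of function names per user log, and a `path_depth` dictionary) and the function names are hypothetical; the condition follows the text, keeping a function when its usage frequency exceeds the frequency threshold or its path depth exceeds the depth threshold.

```python
from collections import Counter

def select_main_functions(all_logs, freq_threshold, path_depth, depth_threshold):
    """Keep functions that are used often enough or that sit deep enough in a path."""
    counts = Counter(name for log in all_logs for name in log)
    return {
        name
        for name, used in counts.items()
        if used > freq_threshold or path_depth.get(name, 0) > depth_threshold
    }

def function_sequence(user_log, main_functions):
    """Retain only the main functions, preserving the order in which the user used them."""
    return [name for name in user_log if name in main_functions]

logs = [["home", "chat", "pay"], ["home", "chat"], ["home", "settings"]]
main = select_main_functions(logs, freq_threshold=1,
                             path_depth={"pay": 3}, depth_threshold=2)
sequence = function_sequence(["home", "settings", "pay", "chat"], main)
```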
By the method for screening the main function points, the data calculation amount is greatly reduced and the system resources are saved while the classification accuracy is not influenced.
In step S403, the server constructs and trains a word embedding model.
By way of example, the word embedding models include the Skip-Gram model and the Continuous Bag of Words (CBOW) model.
In some embodiments, the server may build and train a Skip-Gram model.
For example, referring to fig. 5A, fig. 5A is a schematic diagram of a Skip-Gram model provided by an embodiment of the present invention. As shown in fig. 5A, the basic idea of the Skip-Gram model is to use each central function to predict the window functions that surround it in the order of use, and to modify the vector of the central function according to the prediction result. When training the Skip-Gram model, the size of the sliding window used for training first needs to be determined, and training sample pairs are obtained according to the determined sliding window size, each training sample pair comprising an input sample and an output sample. The Skip-Gram model is then trained on the training sample pairs to obtain the parameters of its hidden layer. The training goal of the Skip-Gram model is to learn the distribution of the function representation vectors, and the optimization goal is to maximize the following likelihood function given the vector of the central function:
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c\le j\le c,\ j\ne 0}\log p(w_{t+j}\mid w_t) \quad (1)$$
wherein T represents the total number of functions, t represents the serial number of a certain function among the T functions, $w_1,\dots,w_T$ is a function sequence, $w_t$ represents a central function, $w_{t+j}$ ($j\in[-c,c]$, $j\ne 0$) represents a window function within the usage-order window of size c, and $p(w_{t+j}\mid w_t)$ represents the conditional probability of the window function $w_{t+j}$ given the central function $w_t$.
Here, the conditional probability of each window function $w_i$ given a central function $w_j$ is calculated in a form similar to the Softmax function (which corresponds to a high-dimensional extension of the Sigmoid function), as follows:
$$p(w_i\mid w_j)=\frac{\exp\left(u_{w_i}^{T}v_{w_j}\right)}{\sum_{l=1}^{V}\exp\left(u_{l}^{T}v_{w_j}\right)} \quad (2)$$
wherein $u_{w_i}$ represents the vector corresponding to the window function $w_i$, the superscript T represents the transpose of a vector, $v_{w_j}$ represents the vector corresponding to the central function $w_j$, and V represents the total number of functions.
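The sliding-window sample pairs and the Softmax-style conditional probability described above can be sketched as follows. This is an illustrative sketch only: the names `training_pairs`, `conditional_probability`, `in_vecs` (central-function vectors) and `out_vecs` (window-function vectors) are hypothetical, and the vectors are fixed by hand rather than learned.

```python
import math

def training_pairs(sequence, window):
    """Build (input, output) pairs: each central function paired with its window functions."""
    pairs = []
    for t, center in enumerate(sequence):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(sequence):
                pairs.append((center, sequence[t + j]))
    return pairs

def conditional_probability(context, center, in_vecs, out_vecs):
    """Softmax over all functions of the dot product with the central function's vector."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {name: dot(vec, in_vecs[center]) for name, vec in out_vecs.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    denominator = sum(math.exp(score - top) for score in scores.values())
    return math.exp(scores[context] - top) / denominator

pairs = training_pairs(["f1", "f2", "f3"], window=1)
in_vecs = {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "f3": [1.0, 1.0]}
out_vecs = {"f1": [0.5, 0.1], "f2": [0.2, 0.9], "f3": [0.3, 0.3]}
total = sum(conditional_probability(name, "f2", in_vecs, out_vecs) for name in out_vecs)
```

Note that the denominator sums over all V functions, which is the cost that the hierarchical Softmax method is intended to avoid.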
Since the Softmax calculation used by the Skip-Gram model is computationally expensive, in some embodiments, the hierarchical Softmax method may be combined with a Huffman tree to reduce the computational complexity.
For example, referring to fig. 5B, fig. 5B is a schematic diagram of a Skip-Gram model combined with a Huffman tree according to an embodiment of the present invention. As shown in fig. 5B, in the training process of the model, a Huffman tree is constructed through Huffman coding and the calculation is optimized through the hierarchical Softmax method, so that the computational complexity of calculating the conditional probability is reduced from O(V) to O(log(V)).
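To illustrate why the tree shortens the computation, the sketch below builds Huffman codes for a handful of functions. Weighting the tree by usage counts is an assumption made for illustration, and the name `huffman_codes` is hypothetical. In hierarchical Softmax, each bit of a function's code corresponds to one binary decision along the path from the root to that function's leaf, so the code length is the number of terms evaluated instead of all V functions.

```python
import heapq
from itertools import count

def huffman_codes(frequencies):
    """Return a binary Huffman code for each function name, given its usage count."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(freq, next(tiebreak), {name: ""}) for name, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_a, _, left = heapq.heappop(heap)
        freq_b, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code beneath them.
        merged = {name: "0" + code for name, code in left.items()}
        merged.update({name: "1" + code for name, code in right.items()})
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"home": 5, "chat": 2, "pay": 1, "settings": 1})
```

Frequently used functions receive shorter codes, so the most common predictions need the fewest binary decisions.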
In other embodiments, the server may also build and train a continuous bag of words model.
For example, referring to fig. 5C, fig. 5C is a schematic diagram of a continuous bag-of-words model provided by an embodiment of the present invention. As shown in fig. 5C, the basic idea of the continuous bag-of-words model is to predict the vector of each central function from the vectors of the window functions that surround it in the order of use. The process of training the continuous bag-of-words model is basically similar to that of training the Skip-Gram model, and is not repeated herein.
In step S404, the server inputs the function sequence of the user into the trained word embedding model, and outputs a vector corresponding to each function in the function sequence.
In some embodiments, the server may input the function sequence of the user into a trained Skip-Gram model, which performs word embedding processing on the name of each function in the function sequence of the user and outputs a vector corresponding to each function.
In other embodiments, the server may also input the function sequence of the user into a trained continuous bag-of-words model, and the continuous bag-of-words model performs word embedding processing on the name of each function in the function sequence of the user and outputs a vector corresponding to each function.
For example, assuming that the function sequence of a certain user is {function 1, function 2, function 3, function 4, function 5}, after the function sequence is input into the trained Skip-Gram model, the Skip-Gram model outputs vector 1, vector 2, vector 3, vector 4, and vector 5 for the respective functions in the function sequence.
In step S405, the server combines, in sequence, the vectors corresponding to each function in the function sequence to obtain a function sequence matrix corresponding to the user.
Here, after obtaining the vector corresponding to each function in the function sequence, the server combines, for each user, the vectors corresponding to the functions in that user's function sequence, in sequence, to obtain the function sequence matrix corresponding to the user. In this way, the function sequence of the user is quantified as a function sequence matrix formed by sequentially combining the vectors corresponding to each function, for use in the subsequent clustering processing.
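A minimal sketch of this combination step, assuming the trained word embedding model is available as a plain dictionary from function name to vector (the dictionary and the name `function_sequence_matrix` are hypothetical):

```python
def function_sequence_matrix(function_sequence, embeddings):
    """Stack the vector of each function, one row per function, in order of use."""
    return [list(embeddings[name]) for name in function_sequence]

embeddings = {"f1": [1.0, 0.0], "f2": [0.0, 1.0]}
matrix = function_sequence_matrix(["f1", "f2", "f1"], embeddings)
```

A function that the user used twice contributes two rows, so the matrix preserves both the order and the repetition of the sequence.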
In step S406, the server constructs a clustering model.
Here, the clustering models include a K-means clustering model, a Mean-Shift clustering model, a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) model, an expectation-maximization clustering model based on a Gaussian mixture model, and a hierarchical clustering model.
For example, referring to fig. 6A, fig. 6A is a schematic diagram of a process of classifying based on a K-means clustering model according to an embodiment of the present invention. As shown in fig. 6A, the K-means clustering model first needs to determine the number of clusters and randomly initialize their respective center points. To determine the number of clusters, the data may first be inspected in an attempt to identify distinct groupings. Each center point is a vector of the same length as each data point vector, and is marked "X" in fig. 6A. Each data point is classified by calculating the distance between the point and the center of each group, and is assigned to the group whose center is closest. Based on the classification results, the average of all points in each group is calculated as the new group center. The above steps are repeated for a set number of iterations, or until the group centers change little between iterations.
For example, referring to fig. 6B, fig. 6B is a schematic diagram of a process of classifying based on a Mean-Shift clustering model according to an embodiment of the present invention. The Mean-Shift clustering model is a sliding-window-based classification method that attempts to find regions with dense data points. It is a centroid-based algorithm, that is, the centroid of each group or class is located by updating the centroid candidate to the mean of the points within the sliding window. These candidate sliding windows are then filtered in a post-processing stage to eliminate near-duplicates, resulting in a final set of center points and their corresponding groups.
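The sliding-window update can be sketched as follows. This is a simplified illustration with a flat (uniform) window of radius `bandwidth`; the window shape, the merge distance of half the bandwidth, and the function name are assumptions not taken from the text.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Shift one candidate center per point to the mean of the points inside its window."""
    centers = [list(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        shifted = []
        for center in centers:
            window = [p for p in points if math.dist(p, center) <= bandwidth]
            mean = [sum(column) / len(window) for column in zip(*window)]
            largest_move = max(largest_move, math.dist(mean, center))
            shifted.append(mean)
        centers = shifted
        if largest_move < tol:
            break
    # Post-processing: keep one center among candidates that converged together.
    merged = []
    for center in centers:
        if all(math.dist(center, kept) > bandwidth / 2 for kept in merged):
            merged.append(center)
    return merged

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
centers = mean_shift(points, bandwidth=1.0)
```

Unlike K-means, the number of clusters is not supplied in advance; it follows from how many distinct centers survive the merge.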
For example, the DBSCAN model is a density-based classification method similar to the Mean-Shift clustering model. It starts with an arbitrary data point that has not been visited. The neighborhood of this point is extracted using a distance epsilon, and if there are a sufficient number of points in the neighborhood, the clustering process starts and the current data point becomes the first point in a new cluster. Otherwise, the point is marked as noise (this noise point may later become part of a cluster). In both cases, the point is marked as "visited". For the first point in the new cluster, the points in its epsilon neighborhood also become part of the same cluster. This process of assigning all points in the epsilon neighborhood to the same cluster is then repeated for every new point just added to the cluster, until all points in the cluster are determined, that is, until all points within the epsilon neighborhoods of the cluster have been visited and labeled. After the current cluster is completed, a new unvisited point is retrieved and processed, leading to the discovery of the next cluster or of noise. This process is repeated until all points are marked as visited, at which point each point is marked either as belonging to a cluster or as noise.
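The procedure above can be sketched as follows. The parameter names `eps` and `min_pts`, the label `-1` for noise, and the brute-force neighborhood search are illustrative choices, not details from the text.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE if it joins no cluster."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be absorbed by a later cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # keep expanding from dense points
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```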
For example, the expectation-maximization clustering model based on the Gaussian mixture model is a classification method that is more flexible than the K-means clustering model. The number of clusters is set first, and the Gaussian distribution parameters of each cluster are then randomly initialized; a good initial guess for the parameters may also be obtained by briefly inspecting the data. Given the Gaussian distribution of each cluster, the probability that each data point belongs to a particular cluster is calculated: the closer a point is to the center of a Gaussian, the more likely it is to belong to that cluster. Based on these probabilities, a new set of parameters is computed for the Gaussian distributions so as to maximize the probability of the data points in the clusters. These new parameters are calculated using a weighted sum of the data point positions, where the weight is the probability that the data point belongs to the particular cluster. These steps are repeated until convergence.
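A one-dimensional sketch of these two alternating steps is given below. Restricting the data to one dimension, the fixed iteration count, the variance floor, and the function names are simplifying assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Alternate between soft assignment (E-step) and weighted re-estimation (M-step)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: probability that each data point belongs to each cluster.
        resp = []
        for x in data:
            probs = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: new parameters from probability-weighted sums of the data points.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / mass
            spread = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / mass
            variances[c] = max(spread, 1e-6)  # floor avoids a degenerate Gaussian
            weights[c] = mass / len(data)
    return means, variances, weights

data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
means, variances, weights = em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

Each point keeps a probability for every cluster rather than a single hard label, which is the source of the added flexibility compared with K-means.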
For example, referring to fig. 6C, fig. 6C is a schematic diagram of a process of classifying based on an agglomerative hierarchical clustering model according to an embodiment of the present invention. Hierarchical clustering models can be divided into two categories: top-down and bottom-up. Top-down hierarchical clustering treats all points as a single cluster in the initial stage and then splits one cluster at a time until only single-point clusters remain; bottom-up (agglomerative) hierarchical clustering treats each point as its own cluster in the initial stage and then merges the two nearest clusters at each step.
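The bottom-up variant can be sketched as follows. Measuring the distance between two clusters by their closest pair of points (single linkage) and stopping at a target number of clusters are assumptions made for this illustration; the text does not fix either choice.

```python
import math

def agglomerative(points, target_clusters):
    """Start with one cluster per point and repeatedly merge the two nearest clusters."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of points.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > target_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
clusters = agglomerative(points, target_clusters=2)
```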
In some embodiments, the server may build a K-means clustering model. The K-means clustering model clusters samples with more similarity and smaller difference into one class (cluster) according to the distance or similarity between the samples, and finally forms a plurality of clusters, so that the samples in the same cluster have high similarity and the difference between different clusters is high. In the K-means clustering model, the K value represents the number of clusters to be obtained, the centroid represents the mean vector of each cluster, namely, all dimensions of the vectors are averaged, and the distance measurement is characterized by the Euclidean distance and the cosine similarity. When comparing the similarity between two vectors, the following distances can also be used for characterization:
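Euclidean distance:
$$d_{12}=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2} \quad (3)$$
Manhattan distance:
$$d_{12}=|x_1-x_2|+|y_1-y_2| \quad (4)$$
Chebyshev distance:
$$d_{12}=\max(|x_1-x_2|,\ |y_1-y_2|) \quad (5)$$
Cosine similarity:
$$\cos\theta=\frac{x_1x_2+y_1y_2}{\sqrt{x_1^2+y_1^2}\,\sqrt{x_2^2+y_2^2}} \quad (6)$$
Jaccard similarity coefficient:
$$J(A,B)=\frac{|A\cap B|}{|A\cup B|} \quad (7)$$
Correlation coefficient:
$$\rho_{XY}=\frac{Cov(X,Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}=\frac{E\big((X-EX)(Y-EY)\big)}{\sqrt{D(X)}\,\sqrt{D(Y)}} \quad (8)$$
wherein $x_1$, $x_2$, $y_1$ and $y_2$ represent the components of the two vectors being compared, namely $(x_1, y_1)$ and $(x_2, y_2)$; (A, B) and (X, Y) each represent two features; Cov represents the covariance, D the variance, and E the mean.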
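For reference, the measures listed above can be written as short functions. The formulas in the text are given for two components; generalizing them to vectors of any length, and the function names, are choices made for this sketch.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)
```

Euclidean, Manhattan and Chebyshev values grow as vectors become less alike, whereas the cosine, Jaccard and correlation values grow as they become more alike, so the direction of comparison must be chosen accordingly.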
The clustering process of the K-means clustering model is as follows:
(1) First, a K value is determined, that is, K sets are to be obtained by clustering the data set.
(2) K data points are randomly selected from the data set as centroids.
(3) For each point in the data set, its distance from each centroid is calculated (any of the above distances may be used), and the point is assigned to the set of the nearest centroid.
(4) After all the data points have been assigned, there are K sets in total, and the centroid of each set is then recalculated.
(5) If the distance between the recalculated centroid and the original centroid is less than a set threshold (indicating that the position of the recalculated centroid changes little and tends to be stable, that is, convergent), the clustering is considered to have reached the desired result and the process terminates.
(6) If the distance between the new centroid and the original centroid changes significantly, steps (3) to (5) are iterated.
For example, referring to fig. 6A, K is first set to 3, and 3 data points are randomly selected from the data set as centroids (i.e., the symbol "X" in fig. 6A). The distances from every sample point to the 3 centroids are then calculated, and each sample is labeled with the category of the centroid closest to it. As this process is repeated, that is, as all samples are labeled with the category of the closest centroid and new centroids are calculated, the centroid positions keep moving, and the final clustering result is shown in diagram (f) of fig. 6A.
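Steps (1) to (6) can be sketched as follows, using the Euclidean distance. The seeded random generator, the iteration cap, and the handling of an empty set (its centroid is left unchanged) are assumptions added so that the sketch is reproducible and cannot loop forever.

```python
import math
import random

def k_means(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Return (centroids, clusters) after iterating assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step (2): random data points as centroids
    for _ in range(max_iter):
        # Step (3): assign every point to the set of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Step (4): recompute each centroid as the mean of its set.
        updated = [
            tuple(sum(column) / len(members) for column in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        # Steps (5) and (6): stop once no centroid moves more than the threshold.
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift < threshold:
            break
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
```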
In other embodiments, the K value may be determined by the elbow method rather than being preset. The core indicator of the elbow method is the Sum of Squared Errors (SSE), as shown in the following equation:
$$SSE=\sum_{i=1}^{K}\ \sum_{p\in C_i}\left|p-m_i\right|^{2} \quad (9)$$
wherein $C_i$ is the i-th cluster, p is a sample point in $C_i$, $m_i$ is the centroid of $C_i$, and SSE is the clustering error of all samples, which indicates how good the clustering result is.
The core idea of the elbow method is as follows: as the cluster number K increases, the samples are divided more finely, the degree of aggregation of each cluster gradually increases, and the sum of squared errors SSE naturally becomes smaller. When K is smaller than the true cluster number, increasing K greatly increases the degree of aggregation of each cluster, so the SSE decreases sharply. Once K reaches the true cluster number, the gain in aggregation obtained by further increasing K drops rapidly, so the decrease in SSE slows abruptly and then levels off as K continues to increase. In other words, the graph of SSE against K has the shape of an elbow, and the K value at the elbow is the true cluster number of the data.
For example, the K-means clustering model may be trained in the following manner to select a suitable K value:
Let K range from 1 to a suitable upper limit (the upper limit can be determined according to actual conditions and is generally between 8 and 10), perform clustering for each K value and record the corresponding SSE, then plot the curve of SSE against K and select the K at the elbow as the final cluster number.
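The SSE curve can be computed as sketched below. To keep the example deterministic, the helper `simple_k_means` initializes the centroids with the first K points instead of random ones, and `elbow_k` picks the K at which the drop in SSE slows the most; both are illustrative shortcuts rather than part of the described method, which reads the elbow from the plotted curve.

```python
import math

def sse(centroids, clusters):
    """Sum of squared distances from every sample point to the centroid of its cluster."""
    return sum(math.dist(p, m) ** 2 for m, members in zip(centroids, clusters) for p in members)

def simple_k_means(points, k, iterations=20):
    centroids = list(points[:k])  # deterministic start, for illustration only
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        centroids = [
            tuple(sum(column) / len(members) for column in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids, clusters

def sse_curve(points, max_k):
    """Cluster once per K value and record the corresponding SSE."""
    return {k: sse(*simple_k_means(points, k)) for k in range(1, max_k + 1)}

def elbow_k(curve):
    """Heuristic: the K where the SSE decrease before it most exceeds the decrease after it."""
    ks = sorted(curve)
    return max(ks[1:-1], key=lambda k: (curve[k - 1] - curve[k]) - (curve[k] - curve[k + 1]))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
curve = sse_curve(points, max_k=3)
best_k = elbow_k(curve)
```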
In step S407, the server performs clustering processing on the function sequence matrices respectively corresponding to the multiple users based on the constructed clustering model, to obtain a category to which the user corresponding to each function sequence matrix belongs.
In some embodiments, before performing clustering processing on the function sequence matrixes respectively corresponding to the plurality of users, the server first determines an average value of vectors corresponding to the plurality of functions in the function sequence matrix, performs clustering processing according to the average values respectively corresponding to the plurality of users to obtain a plurality of average value combinations corresponding to different categories, and determines the category to which the corresponding user belongs according to the category corresponding to the average value combination to which the average value of each user belongs. Wherein the obtaining of a plurality of combinations of average values corresponding to different categories comprises: obtaining a plurality of combinations of the average values satisfying the following conditions: the similarity between the average values within the average value combination of the same category is higher than a first similarity threshold; the similarity of the mean values between the combinations of mean values of different classes is lower than a second similarity threshold, wherein the first similarity threshold is higher than the second similarity threshold.
For example, assuming that the function sequence matrix corresponding to user 1 is {vector 1, vector 2, vector 4, vector 5}, the server first sums and averages vector 1, vector 2, vector 4, and vector 5 contained in the function sequence matrix, and uses the resulting average value 1 as the feature of user 1 for the subsequent clustering processing. The same processing is performed for other users such as user 2 and user 3 to obtain average value 2, average value 3, and so on. The server then clusters average value 1, average value 2, average value 3, and so on, to obtain the average value combination to which each average value belongs, where each average value combination contains the average values of some of the users. For example, after the clustering processing, average value combination 1 contains average value 1 and average value 3, and average value combination 2 contains average value 2 and average value 4; that is, average value 1 and average value 3 fall into one category, and average value 2 and average value 4 fall into another. Since each average value corresponds to a user, user 1 and user 3 are classified into one category, and user 2 and user 4 into another.
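The averaging and the mapping from average values back to users can be sketched as follows. The sketch assumes the cluster centroids have already been produced by a clustering model and only shows how each user's average value is placed in a category; the two-dimensional vectors and the names are hypothetical.

```python
import math

def mean_vector(matrix):
    """Sum the rows of a function sequence matrix and divide by the number of rows."""
    return [sum(column) / len(matrix) for column in zip(*matrix)]

def group_users(user_matrices, centroids):
    """Place each user in the category whose centroid is nearest to the user's average value."""
    groups = {index: [] for index in range(len(centroids))}
    for user, matrix in user_matrices.items():
        feature = mean_vector(matrix)
        nearest = min(groups, key=lambda i: math.dist(feature, centroids[i]))
        groups[nearest].append(user)
    return groups

user_matrices = {
    "user1": [[0.0, 0.0], [0.2, 0.0]],
    "user2": [[5.0, 5.0], [5.0, 5.2]],
    "user3": [[0.0, 0.2], [0.0, 0.0]],
    "user4": [[5.2, 5.0], [5.0, 5.0]],
}
groups = group_users(user_matrices, centroids=[[0.0, 0.0], [5.0, 5.0]])
```

Averaging reduces matrices with different numbers of rows to vectors of one fixed length, which is what allows users with sequences of different lengths to be clustered together.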
In some embodiments, after determining the category to which each user belongs, the server may further determine, for each category of user group, the degree of preference of the user group for preset functions, so as to mine the behavior preferences and potential needs of the user group and lay the groundwork for subsequent information pushing by an operator. The preset functions include functions whose use frequency exceeds a frequency threshold and functions whose path depth in the application exceeds a depth threshold.
The user classification method provided by the embodiments of the present invention has a wide range of application scenarios.
In some embodiments, active users may be effectively grouped, and a recommendation scheme best suited to each user group may be customized.
The active users can be determined according to parameters such as the number of logins and the usage duration. Different categories can be represented by the vectors of their category names, and the recommendation information is determined by calculating the similarity between these vectors and the vector representations of candidate recommendation information. Referring to fig. 7A, fig. 7A is a schematic diagram of a specific application scenario of the user classification method according to the embodiment of the present invention.
For example, as shown in diagram (a) of fig. 7A, the recommendation information may be advertisement information, and the vector representation of the name/advertisement word of the advertisement object and the vector representation of the category name are subjected to similarity calculation, so as to recommend the best matching advertisement information for the user groups of different categories.
For example, as shown in fig. 7A (b), the recommendation information may also be news information, and the vector representation of the title/keyword of the news and the vector representation of the category name are subjected to similarity calculation to recommend the best matching news information for the user groups of different categories.
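The matching of recommendation information to a category can be sketched as follows. The text does not state which similarity measure is used, so cosine similarity is an assumption here, and the vectors and candidate names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_recommendation(category_vector, candidates):
    """Return the candidate whose vector is most similar to the category name vector."""
    return max(candidates, key=lambda name: cosine_similarity(category_vector, candidates[name]))

candidates = {"ad_game": [1.0, 0.1], "ad_finance": [0.0, 1.0]}
choice = best_recommendation([0.9, 0.2], candidates)
```

The same function applies to news information by replacing the candidate vectors with the vector representations of news titles or keywords.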
In other embodiments, the arrival cost of the user may be reduced by recording the functions the user prefers to use and providing quick access to them. Referring to fig. 7B, fig. 7B is a schematic diagram of another specific application scenario of the user classification method according to the embodiment of the present invention.
For example, as shown in diagram (a) of fig. 7B, an entry of a user preferred use function is shown in a prominent position of a first screen of the application (e.g., at the top of the interface), and the user can jump to the function by clicking the entry, thereby reducing the arrival cost of the user.
For example, as shown in diagram (b) of fig. 7B, the use habits of the user (for example, the sequence of functions used by the user) are learned based on artificial intelligence, and the function the user is about to use is predicted from the function currently in use. When that function is determined to be a recorded preferred function, a prompt is given by a popup window or in another manner and a quick entry is displayed. If the user clicks the entry, the application jumps to the corresponding function interface; if the user does not click within a certain time, the prompt is closed. In this way, the arrival cost of the user is reduced.
In other embodiments, the recommendation of the function may be made based on the correlation between the functions in the application. Referring to fig. 7C, fig. 7C is a schematic diagram of another specific application scenario of the user classification method according to the embodiment of the present invention.
For example, as shown in diagram (a) of fig. 7C, the user uses function a in application 1, if function B in application 1 is related to function a, function B of application 1 may be recommended to the user, and by clicking a corresponding recommendation window, application 1 jumps from the use interface of function a to the use interface of function B, so as to implement recommendation of functions.
By way of example, cross-application function recommendations may also be made. As shown in fig. 7C, the user uses function A in application 1 and then switches to application 2. When function B in application 2 is related to function A in application 1, a prompt may be given in the interface of application 2 while the user is using application 2, and a quick entry to function B is shown. If the user clicks the entry, the application jumps to the corresponding function interface; if the user does not click within a certain time, the prompt is closed.
Continuing with the exemplary structure of the user classifying means 255 provided by the embodiment of the present invention implemented as software modules, in some embodiments, as shown in fig. 3, the software modules stored in the user classifying means 255 of the memory 250 may include: an acquisition module 2551, an analysis module 2552, a word embedding module 2553, a combination module 2554, a clustering module 2555, and a determination module 2556.
The acquisition module 2551 is configured to acquire operation data of a user in an application; the analysis module 2552 is configured to analyze the operation data acquired by the acquisition module to obtain a function sequence composed of functions sequentially used by the user in the application; the word embedding module 2553 is configured to perform word embedding processing on the name of each function in the function sequence of the user obtained by the analysis module to obtain a vector corresponding to each function; the combination module 2554 is configured to combine, in sequence, the vectors corresponding to each function in the function sequence obtained by the word embedding module to obtain a function sequence matrix corresponding to the user; the clustering module 2555 is configured to perform clustering on the function sequence matrices respectively corresponding to the multiple users obtained by the combination module to obtain a category to which the user corresponding to each function sequence matrix belongs.
In some embodiments, the acquisition module 2551 is further configured to obtain an average value of the usage durations of the application for different users, where a usage duration is the time interval from when a user starts using the application to when the user stops using it; divide a day into different time periods based on the average value of the usage durations, and acquire the user scale of the application in each time period; and, for a time period in which the user scale exceeds a scale threshold, acquire operation data of the users online in that time period. Scenarios of starting to use the application include: the process of the application is started, and the application is switched from the background to the foreground. Scenarios of ending use of the application include: the process of the application is ended, and the application is switched from the foreground to the background.
In some embodiments, the analysis module 2552 is further configured to select, from the operation data of the user, functions satisfying at least one of the following conditions: the usage frequency exceeds a frequency threshold, and the usage path depth exceeds a depth threshold; and to combine the selected functions in the order in which the user used them to obtain a function sequence corresponding to the user.
In some embodiments, the word embedding module 2553 is further configured to determine a sliding window size for training a skip-gram model; obtain training sample pairs according to the size of the sliding window, wherein each training sample pair comprises an input sample and an output sample; train the skip-gram model on the training sample pairs to obtain the parameters of the hidden layer of the skip-gram model; and perform word embedding processing on the function sequence based on the trained skip-gram model to obtain a vector corresponding to each function in the function sequence.
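As an illustration of how training sample pairs might be derived from a sliding window, the following is a minimal sketch and not the patented implementation. The helper name `skipgram_pairs` and the function names in `seq` are hypothetical; each pair consists of a central function (input sample) and one neighbor within the window (output sample).

```python
from typing import List, Tuple


def skipgram_pairs(sequence: List[str], window: int) -> List[Tuple[str, str]]:
    """Pair every central function with each neighbor at most `window` steps away."""
    pairs = []
    for t, center in enumerate(sequence):
        lo = max(0, t - window)
        hi = min(len(sequence), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the central function is not paired with itself
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical function names, used only for illustration.
seq = ["open_window", "clean", "fast_clean", "launch"]
pairs = skipgram_pairs(seq, window=1)
for input_sample, output_sample in pairs:
    print(input_sample, "->", output_sample)
```

A larger window produces more pairs per central function and therefore relates functions that are further apart in the usage sequence.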
In some embodiments, the clustering module 2555 is further configured to determine an average value of vectors corresponding to a plurality of functions in the function sequence matrix; clustering according to the average values respectively corresponding to the users to obtain a plurality of average value combinations corresponding to different categories, wherein each average value combination comprises the average values corresponding to part of the users; and determining the category to which the corresponding user belongs according to the category corresponding to the average value combination to which the average value of each user belongs.
In some embodiments, the clustering module 2555 is further configured to randomly allocate the average values corresponding to the plurality of users to k average value combinations; when the k average value combinations do not meet the convergence condition, iteratively updating the average values included in the k average value combinations until the convergence condition is met; wherein k represents the number of the plurality of average value combinations, and k is an integer greater than or equal to 1; the convergence condition includes at least one of: the similarity between the average values in the average value combination of the same category is greater than a first similarity threshold, and the similarity of the average values in the average value combination of different categories is less than a second similarity threshold; wherein the first similarity threshold is greater than the second similarity threshold.
In some embodiments, the clustering module 2555 is further configured to traverse candidate values of k to determine a relationship curve between k and the error of grouping the average values of the plurality of users based on that k; and to determine the value of k corresponding to the inflection point of the relationship curve as the final number of average value combinations.
In some embodiments, the apparatus further comprises a determining module 2556, configured to determine, for each type of user group, the preference degree of the user group for a preset function. Determining the preference degree of the user group for the preset function comprises: determining the ratio of the number of users in the user group who use the preset function to the total number of users who use the preset function; determining the proportion of the user group among all users; and determining the result of dividing the ratio by the proportion as the preference degree of the user group for the preset function. The preset function is a function satisfying the following conditions:
the usage frequency exceeds a frequency threshold, and the path depth in the application exceeds a depth threshold.
In some embodiments, the obtaining module 2551 is further configured to obtain operation data of the user in the application from a database, and obtain a hash corresponding to the operation data from a blockchain network; and when the hash of the operation data is consistent with the hash obtained from the blockchain network, determining that the operation data is credible.
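The hash comparison described above can be sketched as follows. This is only an assumption-laden illustration: the source does not name a hash algorithm or a serialization format, so SHA-256 over canonical JSON is assumed here, and the value `on_chain` merely stands in for a hash that would be read from the blockchain network.

```python
import hashlib
import json


def operation_data_hash(records):
    """Hash a canonical JSON serialization of the records (SHA-256 is an assumption)."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_trusted(records, hash_from_chain):
    """The operation data is treated as credible only when both hashes match."""
    return operation_data_hash(records) == hash_from_chain


# Hypothetical records; `on_chain` stands in for the hash stored on the blockchain.
records = [{"user_id": "u1", "function": "clean", "ts": 1}]
on_chain = operation_data_hash(records)
tampered = [{"user_id": "u1", "function": "clean", "ts": 2}]

print(is_trusted(records, on_chain))
print(is_trusted(tampered, on_chain))
```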
It should be noted that the description of the apparatus according to the embodiment of the present invention is similar to the description of the method embodiment and has similar beneficial effects, so it is not repeated here. Technical details not exhaustively described for the user classification apparatus provided by the embodiment of the invention can be understood from the description of any one of figs. 4 to 11.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
When grouping users, a product usually characterizes and classifies users statically based on demographic attributes (e.g., age, gender, occupation, education level, city), or groups users by dimensions such as acquisition time, activity level, and source channel. In addition, the related art provides group definitions based on certain user behaviors in the product (for example, screening out users who, within the past 30 days, had a rank above 10, performed "comment" or "like" actions, and sent paid gifts more than 10 times).
However, the user grouping schemes provided by the related art cannot dynamically perceive differences in how users use the product from their preferences for function points. They also lose the latent information carried by the order in which users use product functions, and they analyze the influence of one or two functions on users in isolation.
The embodiment of the invention provides a user classification method that abandons the approach of studying each function point in isolation and treats the order in which function points are used as an important indicator. First, the sequence of product functions used by a user is taken as the feature, and a Word2vec model is used to construct vectors that characterize the relevance between each function point and its adjacent function points. Then, the vectorized function usage sequence matrix of each user is clustered with the K-means clustering algorithm to obtain the category of each function usage sequence; since each user corresponds to one function usage sequence, the category of each user is obtained. Finally, the preference degree of each category for the function points is determined, so that users are grouped according to their function usage sequences.
The user classification method provided by the embodiment of the invention enables effective grouping of active users, so that different recommendation scenarios can be customized for different user groups. Function points can also be surfaced quickly according to recorded preferences, reducing the cost of reaching users. In addition, based on the strong correlation among function points within the same category, information can be shared between such function points: for example, if a user uses function A, function B can be recommended to that user, thereby realizing content customization.
Referring to fig. 8, fig. 8 is an optional flowchart of a user classification method according to an embodiment of the present invention, and as shown in fig. 8, the method includes the following steps: data selection, feature selection, acquisition of a function use sequence of a user, vectorization of the function use sequence, K-means clustering and determination of preference degrees of different user groups to function points. Each step will be specifically described below.
1) Data selection
Taking application A as an example, the duration of each single use of application A (from the start of use to the end of use) is counted first, and no single use is found to exceed 3 hours. Data observed over one week then shows that the user scale peaks between 10:00 and 12:00.
Referring to fig. 9, fig. 9 is a graph of the traffic variation of application A over one day. As shown in fig. 9, the user scale peaks between 10:00 and 12:00. Therefore, the embodiment of the invention selects the users active between 10:00 and 12:00 as the research sample.
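The data selection step can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the helper `busy_periods`, the 2-hour period length, and the threshold value are hypothetical choices, and the user scale of a period is approximated by the number of distinct users who started a session in it.

```python
from collections import defaultdict


def busy_periods(sessions, period_hours, scale_threshold):
    """Return start hours of the periods whose distinct-user count exceeds the threshold.

    sessions: iterable of (user_id, start_hour) with start_hour in [0, 24).
    """
    users_by_period = defaultdict(set)
    for user_id, start_hour in sessions:
        # Map the session start to the beginning of its fixed-length period.
        period_start = int(start_hour // period_hours) * period_hours
        users_by_period[period_start].add(user_id)
    return sorted(p for p, users in users_by_period.items() if len(users) > scale_threshold)


# Hypothetical sessions: three distinct users are active between 10:00 and 12:00.
sessions = [("u1", 10.5), ("u2", 11.0), ("u3", 11.9), ("u1", 15.0), ("u4", 21.2)]
peak = busy_periods(sessions, period_hours=2, scale_threshold=2)
print(peak)
```

Operation data would then be collected only for users online in the returned periods.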
2) Feature selection
Still taking application A as an example, application A involves a large number of function points (as many as 1000 in total), and some of them are used so infrequently that they have little influence on user grouping. Function points whose usage frequency is below a preset frequency can therefore be filtered out (for example, function points used fewer than 1000 times). The path depth of the function points is then taken into account to ensure that the selected function points cover all main paths, and finally 26 main function points are selected.
It should be noted that, a person skilled in the art may determine the number of the main functional points according to actual situations, and the embodiment of the present invention is not specifically limited herein.
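A minimal sketch of such a filter is given below. It follows the "at least one condition" wording used elsewhere in this document; the helper name `select_functions`, the statistics, and the threshold values are hypothetical.

```python
def select_functions(stats, freq_threshold, depth_threshold):
    """Keep functions whose usage count or path depth exceeds its threshold.

    stats maps a function name to a (usage_count, path_depth) tuple.
    """
    return sorted(
        name
        for name, (count, depth) in stats.items()
        if count > freq_threshold or depth > depth_threshold
    )


# Hypothetical statistics for three function points.
stats = {"clean": (5200, 1), "deep_scan": (300, 4), "about_page": (40, 2)}
selected = select_functions(stats, freq_threshold=1000, depth_threshold=3)
print(selected)
```

Replacing `or` with `and` gives the stricter variant in which a function must satisfy both conditions.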
3) Obtaining a sequence of functional uses of a user
This step mainly obtains the mapping between each user and the sequence of functions that the user used. For each user identity (i.e., user ID), the functions used by the user are ordered chronologically by operation time.
For example, the sequence of functions used by each user can be obtained with the collect_list function of Hive (a data warehouse tool built on Hadoop for processing structured data); the resulting "user ID-function usage sequence" matrix is shown in table 1.
Figure BDA0002263155640000261
TABLE 1
The meanings of the individual functions in table 1 are as follows:
EMID_Secure_Assistant_Mini_Open_Big_Window represents clicking the small floating window to expand the large floating window;
EMID_Secure_Clean_AllEnter_In_Count represents garbage cleaning;
EMID_Secure_CleanSpacemanager_AllEnter_In_Count represents phone cleaning, acceleration, and slimming;
EMID_Secure_Click_Desk_FastClean_Total represents clicking the desktop fast-clean shortcut;
EMID_Secure_Assistant_RocklLance_Success represents that the rocket is successfully launched.
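The construction of the "user ID-function usage sequence" mapping can be sketched in plain Python as follows. This is an illustrative stand-in for the Hive query, not the query itself; the helper `build_sequences` and the event tuples are hypothetical, and the events are sorted explicitly so that each sequence is in chronological order.

```python
from collections import defaultdict


def build_sequences(events):
    """Group events by user and order each user's functions by operation time.

    events: iterable of (user_id, timestamp, function_name).
    """
    by_user = defaultdict(list)
    for user_id, _ts, name in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user_id].append(name)
    return dict(by_user)


# Hypothetical operation records, deliberately out of order.
events = [
    ("u1", 3, "fast_clean"),
    ("u2", 1, "clean"),
    ("u1", 1, "open_window"),
    ("u1", 2, "clean"),
]
sequences = build_sequences(events)
print(sequences)
```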
4) Vectorization of functional usage sequences
In some embodiments, Word vectors of the function points may be obtained by using a Word2vec model, and then the function point sequence is vectorized to obtain a vectorized function use sequence matrix.
Word2vec is a word embedding scheme that computes a distributed word vector (also simply called a word vector) for each word from its context in a given corpus. In the embodiment of the invention, each function point is treated as a word, and the function point vector obtained by the Word2vec model describes the semantics of each function point to a certain extent. The Word2vec model includes two important models: the Continuous Bag of Words model (CBOW) and the Skip-Gram model. The CBOW model predicts the vector of the central function point from the vectors of the window function points around it in the usage sequence; the Skip-Gram model predicts the window function points around each central function point in the usage sequence and corrects the vector of the central function point according to the prediction result.
Since the Skip-Gram model can extract more information from a larger data set, in some embodiments the function usage sequence of the user can be vectorized based on the Skip-Gram model. The training goal of the Skip-Gram model is to learn distributed vector representations of the function points; given a function sequence, the optimization goal is to maximize the following average log-likelihood:
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{j=-c,\ j\neq 0}^{c}\log p(w_{t+j}\mid w_t)$$
where $T$ is the total number of functions in the sequence, $t$ is the index of a function among the $T$ functions, $w_1, \ldots, w_T$ is the function sequence, $w_t$ is the central function, $w_{t+j}$ ($j\in[-c,c]$, $j\neq 0$) is a function within the window of size $c$ before or after the central function, and $p(w_{t+j}\mid w_t)$ is the conditional probability of the window function $w_{t+j}$ given the central function $w_t$.
Here, the conditional probability of each window function point $w_i$ given a central function point $w_j$ is calculated in a form similar to the Softmax function (which corresponds to a high-dimensional extension of the Sigmoid function), as follows:
$$p(w_i\mid w_j)=\frac{\exp\left(u_{w_i}^{\top}v_{w_j}\right)}{\sum_{l=1}^{V}\exp\left(u_{l}^{\top}v_{w_j}\right)}$$
where $u_{w_i}$ is the vector corresponding to the window function $w_i$, $\top$ denotes the vector transpose, $v_{w_j}$ is the vector corresponding to the central function $w_j$, and $V$ is the total number of functions.
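To make the Softmax-style conditional probability concrete, the following sketch evaluates it on toy vectors. The names `u` (one output vector per function) and `v_center` (the vector of the central function) are assumptions chosen for illustration, and the values are arbitrary.

```python
import math


def softmax_prob(u, v_center, i):
    """Probability of window function i given the central function's vector.

    u holds one output vector per function; v_center is the central function's vector.
    """
    scores = [sum(a * b for a, b in zip(u_l, v_center)) for u_l in u]
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    return exps[i] / sum(exps)


# Toy vectors for V = 3 functions (arbitrary values, illustration only).
u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [0.5, -0.5]
probs = [softmax_prob(u, v, i) for i in range(len(u))]
print(probs)
```

Every evaluation sums over all V functions in the denominator, which is why the cost of one probability grows linearly with the number of functions.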
Since the Softmax used by the Skip-Gram model is computationally expensive, in some embodiments hierarchical Softmax based on a Huffman tree may be used as an optimization, so that the complexity of computing $\log p(w_i\mid w_j)$ is reduced from $O(V)$ to $O(\log V)$.
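The role of the Huffman tree can be illustrated with a short sketch that only computes the depth of each function in the tree; it does not implement hierarchical Softmax training. The helper `huffman_code_lengths` and the usage counts are hypothetical. With hierarchical Softmax, evaluating a probability follows one root-to-leaf path, so the depth of a leaf bounds the work for that function, and frequently used functions sit closer to the root.

```python
import heapq
import itertools


def huffman_code_lengths(freqs):
    """Return the depth of each leaf in a Huffman tree built from usage counts."""
    tie_breaker = itertools.count()  # keeps heap entries comparable when counts are equal
    heap = [(f, next(tie_breaker), [name]) for name, f in freqs.items()]
    heapq.heapify(heap)
    depth = {name: 0 for name in freqs}
    while len(heap) > 1:
        f1, _, names1 = heapq.heappop(heap)
        f2, _, names2 = heapq.heappop(heap)
        for name in names1 + names2:
            depth[name] += 1  # every leaf under the merged node moves one level down
        heapq.heappush(heap, (f1 + f2, next(tie_breaker), names1 + names2))
    return depth


# Hypothetical usage counts for five functions.
freqs = {"clean": 8, "fast_clean": 4, "open_window": 2, "launch": 1, "scan": 1}
lengths = huffman_code_lengths(freqs)
print(lengths)
```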
Each function point in the function usage sequence is mapped to a fixed-size vector, giving its word vector. The Word2vec model then converts each function usage sequence into a single vector by averaging the vectors of all function points in the sequence, and the vectorized function usage sequences are used as the input of the clustering algorithm, which calculates the similarity between different function usage sequences.
For example, after the "user ID-function usage sequence" matrix in table 1 is processed by the Word2vec model, the resulting vectorized function usage sequence matrix is shown in table 2.
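The averaging step can be sketched as follows, assuming the per-function vectors are already available. The `embeddings` dictionary and its values are hypothetical toy data rather than the output of a trained model.

```python
def mean_vector(sequence, embeddings):
    """Average the vectors of all functions in a usage sequence, dimension by dimension."""
    vectors = [embeddings[name] for name in sequence]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy two-dimensional function vectors (illustration only).
embeddings = {"clean": [1.0, 0.0], "fast_clean": [0.0, 1.0], "launch": [1.0, 1.0]}
user_vec = mean_vector(["clean", "fast_clean", "launch", "clean"], embeddings)
print(user_vec)
```

A function that appears several times in a sequence contributes once per occurrence, so heavily used functions pull the user's vector toward their own.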
Figure BDA0002263155640000281
TABLE 2
5) K-means clustering
Here, clustering processing is performed on the vectorized function usage sequences.
In some embodiments, the output of the Word2vec model may be used as the input for training a K-means clustering model; K is iterated over candidate values, and the value of K is selected according to the Within Set Sum of Squared Errors (WSSSE).
For example, referring to fig. 10, fig. 10 is a graph of the correspondence between the K value and the WSSSE provided by an embodiment of the present invention. As shown in fig. 10, the curve has a fairly clear "elbow" shape with the inflection point at K = 7, so clustering is performed with K equal to 7; that is, the vectorized function usage sequences are divided into 7 categories, thereby obtaining the mapping "vectorized function usage sequence-category identification information (category ID)".
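The relation between K and the WSSSE can be explored with a small self-contained sketch. It is not the implementation used in the embodiment: the K-means routine below is a plain Lloyd iteration, the initial centers are picked at evenly spaced positions in the input (an assumption made for reproducibility, in place of random allocation), and the "largest second difference" elbow rule is one simple heuristic among several.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c])) for p in points]


def kmeans_wssse(points, k, iters=50):
    """Run Lloyd's algorithm and return (labels, WSSSE)."""
    centers = [points[i * len(points) // k] for i in range(k)]  # deterministic init
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    labels = assign(points, centers)
    return labels, sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def elbow(curve):
    """Pick the K at which the decrease in WSSSE slows down the most."""
    ks = sorted(curve)
    return max(ks[1:-1], key=lambda k: (curve[k - 1] - curve[k]) - (curve[k] - curve[k + 1]))


# Toy user vectors forming three well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [20, 0], [20, 1], [21, 0]]
curve = {k: kmeans_wssse(points, k)[1] for k in range(1, 6)}
best_k = elbow(curve)
print(curve, best_k)
```

On this toy data the WSSSE falls sharply until K reaches 3 and then flattens, so the heuristic selects 3; on real data the curve is usually inspected visually as well.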
Based on the mapping relation between the user ID-function use sequence and the vectorized function use sequence-category ID, the user ID-category ID can be obtained without difficulty, and therefore user grouping is achieved.
6) Determining preference degrees of different user groups for function points
After grouping the users, the meaning of each user group can be determined according to the preference of different user groups to the function points.
For example, the preference of different user groups for the 26 main function points can be calculated by the Target Group Index (TGI). A TGI of 100 represents the average level, and a TGI above 100 indicates that users in the group pay more attention to the item in question than users overall. The TGI of user group $c_i$ for function point $f_i$ is calculated as follows:
$$\mathrm{TGI}(c_i,f_i)=\frac{N(c_i,f_i)\,/\,N(f_i)}{N(c_i)\,/\,N}\times 100$$
where the numerator is the number of users in user group $c_i$ who use function point $f_i$, written $N(c_i,f_i)$, divided by the total number of users who use function point $f_i$, written $N(f_i)$; and the denominator is the proportion of user group $c_i$ among all users, $N(c_i)/N$.
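A worked example of the index is sketched below. The helper `tgi` and the user sets are hypothetical, and the factor of 100 is an assumption made so that a value of 100 corresponds to the average level stated above.

```python
def tgi(group_users, function_users, all_users):
    """Target Group Index of one user group for one function point."""
    # Share of the function's users who belong to the group.
    share_of_function_users = len(group_users & function_users) / len(function_users)
    # Share of all users who belong to the group.
    share_of_all_users = len(group_users) / len(all_users)
    return share_of_function_users / share_of_all_users * 100


# Hypothetical data: the group holds 2 of 8 users but 2 of the 4 users of "clean".
all_users = {"u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"}
group = {"u1", "u2"}
uses_clean = {"u1", "u2", "u3", "u4"}
score = tgi(group, uses_clean, all_users)
print(score)
```

Here the group makes up 25% of all users but 50% of the users of the function, so its index is twice the average level.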
For example, referring to fig. 11, fig. 11 is a schematic diagram illustrating the preference degrees of different user groups for different main function points according to the embodiment of the present invention. Fig. 11 shows the TGI indexes of the 7 categories of user groups for each of the 26 main function points. The TGI indexes are calculated, and the preference degrees of the categories for each main function point are compared horizontally using a color scale (for example, for each main function point, the user group with the largest TGI index is shown in green); the larger the TGI index, the more pronounced the group's preference for that main function point. As can be seen from fig. 11, the 7 types of user groups can be identified as the 7 user types shown in table 3 according to the preference degrees of the different user groups for the different main function points.
Figure BDA0002263155640000292
Figure BDA0002263155640000301
TABLE 3
Taking application A as an example, the user classification method provided by the embodiment of the present invention can classify users of the mobile phone manager application into 7 different categories, namely "commuting type, silent type, exploration type, saving type + network lacking type, safe type, light cleaning type and deep cleaning type". Knowing the potential needs and behavior preferences of different user groups lays a foundation for information pushing by operators, and further refined operation can then be carried out for the different user groups.
Embodiments of the present invention provide a storage medium having stored therein executable instructions that, when executed by a processor, will cause the processor to perform a method provided by embodiments of the present invention, for example, a user classification method as shown in fig. 4 or 8.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the invention has the following beneficial effects:
The user classification method provided by the embodiment of the invention abandons the approach of studying each function independently and treats the order in which functions are used as an important indicator. First, the sequence of product functions used by a user is taken as the feature, and word embedding processing is performed to construct a function sequence matrix. Then, clustering processing is performed on the function sequence matrices respectively corresponding to a plurality of users to obtain the category of each function sequence matrix; since each user corresponds to one function sequence matrix, the category of each user is obtained. Finally, the preference degrees of different user groups for the function points are calculated, thereby determining the potential needs and behavior preferences of the different user groups. This lays a foundation for information pushing by operators, allows different recommended content to be customized for different user groups, enables refined operation services, and improves user experience.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (11)

1. A method for classifying a user, the method comprising:
counting the use duration of different users aiming at the application, and determining the average value of the use duration;
dividing a day into different time periods based on the average value of the usage time periods, and acquiring the user scale of the application in the different time periods;
acquiring operation data generated by the user operating on the application in a corresponding time period aiming at the time period when the user scale exceeds the scale threshold;
selecting, from the operation data, functions satisfying at least one of the following conditions: a function whose usage frequency exceeds the frequency threshold, and a function whose usage path depth exceeds the depth threshold;
combining the selected functions according to the sequence of the user in the application to obtain a function sequence corresponding to the user;
performing word embedding processing on the name of each function in the function sequence of the user to obtain a vector corresponding to each function;
combining the vector sequence corresponding to each function in the function sequence to obtain a function sequence matrix corresponding to the user, and determining the average value of the vectors corresponding to a plurality of functions in the function sequence matrix;
clustering the average values of the vectors respectively corresponding to the plurality of users to obtain the category of the user corresponding to each functional sequence matrix;
for each type of user group, respectively determining the preference degree of the user group for a preset function, wherein the preset function comprises a function that the use frequency exceeds a frequency threshold value, and the path depth in the application exceeds a depth threshold value;
and recommending related functions to the user group based on the preference degree.
2. The method of claim 1, wherein the usage duration is a time interval from the beginning of the usage of the application by the user to the end of the usage of the application;
wherein the scenario of starting to use the application includes: the process of the application is started, and the application is switched from a background to a foreground; the scene of ending using the application includes: the process of the application is ended and the application is switched from foreground to background.
3. The method according to claim 1, wherein said performing word embedding processing on the name of each function in the function sequence of the user to obtain a vector corresponding to each function comprises:
determining the size of a sliding window for training a skip-gram model;
obtaining training sample pairs according to the size of the sliding window, wherein each training sample pair comprises an input sample and an output sample;
training the skip-gram model according to the training sample pairs to obtain parameters of a hidden layer of the skip-gram model;
and performing word embedding processing on the function sequence based on the trained skip-gram model to obtain a vector corresponding to each function in the function sequence.
4. The method according to claim 1, wherein the clustering the average values of the vectors respectively corresponding to the plurality of users to obtain the category to which the user corresponding to each functional sequence matrix belongs comprises:
clustering according to the average values of the vectors respectively corresponding to the users to obtain a plurality of average value combinations corresponding to different categories, wherein each average value combination comprises the average values corresponding to part of the users;
and determining the category to which the corresponding user belongs according to the category corresponding to the average value combination to which the average value of each user belongs.
5. The method according to claim 4, wherein the clustering according to the average value of the vectors respectively corresponding to the plurality of users comprises:
randomly distributing the average values of the vectors respectively corresponding to the plurality of users to k average value combinations;
when the k average value combinations do not meet the convergence condition, iteratively updating the average values of the vectors included in the k average value combinations until the convergence condition is met;
wherein k represents the number of the plurality of average value combinations, and k is an integer greater than or equal to 1;
the convergence condition includes at least one of: the similarity between the average values of the vectors in the average value combination of the same category is greater than a first similarity threshold, and the similarity of the average values of the vectors between the average value combinations of different categories is less than a second similarity threshold; wherein the first similarity threshold is greater than the second similarity threshold.
6. The method according to claim 5, wherein before the clustering process is performed according to the average values of the vectors corresponding to the plurality of users, the method further comprises:
traversing k to determine a relationship curve between the k and an error grouping the average of the vectors corresponding to the plurality of users based on the k;
and determining the value of k corresponding to the inflection point of the relation curve as the final value of the number of the average value combinations.
7. The method according to any one of claims 1 to 6, wherein the determining, for each type of user group, a preference degree of the user group for a preset function respectively comprises:
determining the ratio of the number of users in the user group who use the preset function to the total number of users who use the preset function;
determining the quantity proportion of the user group in all users;
and determining the division operation result between the ratio and the quantity ratio as the preference degree of the user group for the preset function.
8. The method according to any one of claims 1 to 6, wherein the obtaining, for a time period in which the scale of the user exceeds a scale threshold, operation data generated by the user operating on the application in a corresponding time period comprises:
acquiring, from a database, operation data generated by the user operating on the application in the time period in which the user scale exceeds the scale threshold, and acquiring a hash corresponding to the operation data from a blockchain network;
and when the hash of the operation data is consistent with the hash obtained from the blockchain network, determining that the operation data is credible.
9. An apparatus for classifying a user, the apparatus comprising:
the acquisition module is used for counting the use duration of different users aiming at the application and determining the average value of the use duration; dividing a day into different time periods based on the average value of the usage time periods, and acquiring the user scale of the application in the different time periods; acquiring operation data generated by the user operating on the application in a corresponding time period aiming at the time period when the user scale exceeds the scale threshold;
an analysis module for selecting, from the operation data, functions satisfying at least one of the following conditions: a function whose usage frequency exceeds the frequency threshold, and a function whose usage path depth exceeds the depth threshold;
and for combining the selected functions in the order in which the user used them in the application to obtain the function sequence corresponding to the user;
the word embedding module is used for carrying out word embedding processing on the name of each function in the function sequence of the user obtained by the analysis module to obtain a vector corresponding to each function;
the combination module is used for combining the vector sequence corresponding to each function in the function sequence obtained by the word embedding module to obtain a function sequence matrix corresponding to the user, and determining the average value of the vectors corresponding to a plurality of functions in the function sequence matrix;
the clustering module is used for clustering the average values of the vectors respectively corresponding to the plurality of users obtained by the combination module to obtain the category of the user corresponding to each functional sequence matrix;
the determining module is used for respectively determining the preference degree of each user group for preset functions, wherein the preset functions comprise functions that the use frequency exceeds a frequency threshold value, and the path depth in the application exceeds a depth threshold value; and recommending related functions to the user group based on the preference degree.
10. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the user classification method of any one of claims 1 to 8 when executing executable instructions stored in the memory.
11. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, carry out the user classification method of any one of claims 1 to 8.
CN201911078245.5A 2019-11-06 2019-11-06 User classification method and device Active CN110837862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911078245.5A CN110837862B (en) 2019-11-06 2019-11-06 User classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911078245.5A CN110837862B (en) 2019-11-06 2019-11-06 User classification method and device

Publications (2)

Publication Number Publication Date
CN110837862A CN110837862A (en) 2020-02-25
CN110837862B true CN110837862B (en) 2021-10-01

Family

ID=69576291

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201911078245.5A Active CN110837862B (en) 2019-11-06 2019-11-06 User classification method and device

Country Status (1)

Country Link
CN (1) CN110837862B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950598A (en) * 2020-07-19 2020-11-17 中国海洋大学 Method for individually classifying swimming crab groups based on K-Means algorithm and application
TWI801767B (en) * 2020-11-09 2023-05-11 財團法人工業技術研究院 Adjusting method and training system of machine learning classification model and user interface
CN112560910B (en) * 2020-12-02 2024-03-01 中国联合网络通信集团有限公司 User classification method and device
CN112465043A (en) * 2020-12-02 2021-03-09 平安科技(深圳)有限公司 Model training method, device and equipment
CN112488765A (en) * 2020-12-08 2021-03-12 深圳市欢太科技有限公司 Advertisement anti-cheating method, advertisement anti-cheating device, electronic equipment and storage medium
CN112612974A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Friend recommendation method and system based on path sorting
CN113569910A (en) * 2021-06-25 2021-10-29 石化盈科信息技术有限责任公司 Account type identification method and device, computer equipment and storage medium
CN113836370B (en) * 2021-11-25 2022-03-01 上海观安信息技术股份有限公司 User group classification method and device, storage medium and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577415A (en) * 2012-07-20 2014-02-12 百度在线网络技术(北京)有限公司 Method and device for updating search configuration corresponding to mobile search application
CN106845644A (en) * 2015-12-10 2017-06-13 Tcl集团股份有限公司 A kind of heterogeneous network of the contact for learning user and Mobile solution by correlation
CN106951230A (en) * 2017-02-27 2017-07-14 深圳市金立通信设备有限公司 A kind of feature list of application program provides method and background service terminal
CN107818334A (en) * 2017-09-29 2018-03-20 北京邮电大学 A kind of mobile Internet user access pattern characterizes and clustering method
CN108711006A (en) * 2018-05-15 2018-10-26 腾讯科技(深圳)有限公司 Revenue control method, management node, system and storage device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7565346B2 (en) * 2004-05-31 2009-07-21 International Business Machines Corporation System and method for sequence-based subspace pattern clustering
CN101504737A (en) * 2008-02-05 2009-08-12 上海西门子医疗器械有限公司 Method and system for analyzing use condition of products
US9569785B2 (en) * 2012-11-21 2017-02-14 Marketo, Inc. Method for adjusting content of a webpage in real time based on users online behavior and profile
US20170169520A1 (en) * 2015-12-14 2017-06-15 Pelorus Technology Llc Time Tracking System and Method
KR101980977B1 (en) * 2017-11-23 2019-05-21 성균관대학교산학협력단 Method for User based Application Grouping under Multi-User Environment and Table Top Display Apparatus for Performing the Same
CN110309188A (en) * 2018-03-08 2019-10-08 优酷网络技术(北京)有限公司 Content clustering method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
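User attribute mining based on smartphone application data; 陶建容; China Master's Theses Full-text Database, Information Science and Technology series; 20180115; I138-877 *
Analysis of user app usage behavior; edcba123; http://blog.sina.com.cn/s/blog_7ed6001f0102wzgg.html; 20170117; pages 4-6, item 4 *
Analysis of user usage behavior of social mobile apps; 陈梅 et al.; 现代情报; 20150930; Vol. 35 (No. 9); 171-177 *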

Also Published As

Publication number Publication date
CN110837862A (en) 2020-02-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40021584; Country of ref document: HK)
GR01 Patent grant