CN111311292B

CN111311292B - User classification method and system

Info

Publication number: CN111311292B
Application number: CN201811514736.5A
Authority: CN
Inventors: 谢梁; 李盼
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2023-08-04
Anticipated expiration: 2038-12-12
Also published as: CN111311292A

Abstract

An embodiment of the present application discloses a user classification method. The user classification method comprises the following steps: acquiring a plurality of sets of user data; generating at least one identification vector based on the plurality of sets of user data, wherein each identification vector represents a data distribution type; determining a correlation index for each set of user data, the correlation index reflecting the correlation between that set of user data and the at least one identification vector; and classifying the plurality of users according to the correlation indexes of the plurality of sets of user data. By combining matrix decomposition with a clustering algorithm, the method and device can classify users more effectively, quickly, and accurately.

Description

User classification method and system

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a user classification method and system.

Background

With the advent of the big data era, classifying and labeling users in order to provide them with quality service has become a hot topic. Traditional classification methods can be inefficient when processing massive amounts of data, and their classification accuracy cannot be guaranteed. Therefore, there is a need for a more efficient and accurate method of user classification.

Disclosure of Invention

One embodiment of the application provides a user classification method. The user classification method comprises the following steps: acquiring a plurality of groups of user data; generating at least one identification vector based on the plurality of sets of user data; wherein each of the identification vectors represents a data distribution type; determining a relevance index for each set of user data, the relevance index reflecting a relevance between the set of user data and the at least one identification vector; and classifying the plurality of users according to the correlation indexes of the plurality of groups of user data.

In some embodiments, generating at least one identification vector based on the plurality of sets of user data comprises: preprocessing the plurality of sets of user data to obtain a user data matrix; performing singular value decomposition on the user data matrix; and selecting at least one singular vector as the at least one identification vector.

In some embodiments, selecting at least one singular vector as the at least one identification vector comprises: selecting, from the singular values obtained by the decomposition, a plurality of singular values whose sum of squares, as a proportion of the sum of squares of all the singular values, is greater than a preset threshold; and taking the singular vectors corresponding to the selected singular values as the identification vectors.

In some embodiments, selecting at least one singular vector as the at least one identification vector comprises: selecting, from the singular matrix obtained by the decomposition, singular vectors whose elements are periodically distributed as the identification vectors.

In some embodiments, the relevance index reflects a similarity between the user data and the identification vector.

In some embodiments, classifying the users according to the correlation indexes of the plurality of sets of user data includes: performing a clustering operation on the correlation indexes of the plurality of sets of user data, thereby classifying the users.

In some embodiments, the method further comprises: determining a classification number K based on the plurality of sets of user data; and clustering the correlation indexes of the plurality of sets of user data to divide the users into K classes.

In some embodiments, the user data reflects the status of the user for different preset time periods within a preset time range; wherein the preset time range includes a plurality of the preset time periods.

In some embodiments, the user includes a driver; the user data reflects the departure time of the driver in different preset time periods within a preset time range.

In some embodiments, the predetermined time range includes at least one of a day, a month, a quarter, a half year, or a year.

In some embodiments, the preset time period comprises at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, half a month, or one month.

One of the embodiments of the present application provides a user classification system, including: the device comprises an acquisition module, a determination module and a classification module; the acquisition module is used for acquiring a plurality of groups of user data; the determining module is used for generating at least one identification vector based on the plurality of groups of user data; wherein each of the identification vectors represents a data distribution type; and determining a relevance index for each set of user data, the relevance index reflecting a relevance between the set of user data and the at least one identification vector; and the classification module is used for classifying the plurality of users according to the correlation indexes of the plurality of groups of user data.

In some embodiments, the determining module comprises: a matrix generation unit, a decomposition unit and an identification vector determination unit; the matrix generating unit is used for preprocessing the plurality of groups of user data to obtain a user data matrix; the decomposition unit is used for carrying out singular value decomposition on the user data matrix; and the identification vector determining unit is used for selecting at least one singular vector as the at least one identification vector.

In some embodiments, the identifying vector determining unit is further configured to select, from the singular values obtained by the decomposition, a plurality of singular values having a ratio of a sum of squares to a sum of squares of all the singular values greater than a preset threshold; and taking singular vectors corresponding to the singular values as the identification vectors.

In some embodiments, the identifying vector determining unit is further configured to select, from the singular matrices obtained by decomposition, singular vectors whose elements are periodically distributed as the identifying vector.

In some embodiments, the classification module is further configured to perform a clustering operation on the correlation indexes of the multiple sets of user data, so as to classify the users.

In some embodiments, the classification module comprises: a classification number determining unit and a clustering unit; the classification number determining unit is used for determining a classification number K based on the plurality of groups of user data; and the clustering unit is used for carrying out clustering operation on the correlation indexes of the plurality of groups of user data so as to divide the users into K classes.

One of the embodiments of the present application provides a user classification device, including a processor, where the processor is configured to perform the foregoing user classification method.

One of the embodiments of the present application provides a computer-readable storage medium storing computer instructions, which when read by a computer, perform the aforementioned user classification method.

Drawings

The present application will be further illustrated by way of example embodiments, which will be described in detail with reference to the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:

FIG. 1 is an exemplary flow chart of a user classification method according to some embodiments of the present application;

FIG. 2 is an exemplary flow chart for selecting an identifying vector according to some embodiments of the present application;

FIG. 3 is an exemplary flow chart of classification using a clustering algorithm according to some embodiments of the present application;

FIG. 4 is a block diagram of a user classification system according to some embodiments of the present application; and

FIG. 5 is a schematic diagram of the periodicity of different identification vectors according to some embodiments of the present application.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and it is obvious to those skilled in the art that the present application may be applied to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.

It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies of different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.

As used in this application and in the claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.

Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to these processes, or one or more operations may be removed from them.

Embodiments of the present application may be applied to different transportation systems, including but not limited to one or a combination of land, sea, aviation, and aerospace transportation. For example, the embodiments may be applied to transportation systems that manage and/or dispatch taxis, chauffeured cars, ride-sharing (hitch) cars, buses, designated driving, trains, bullet trains, high-speed rail, ships, airplanes, hot air balloons, unmanned vehicles, express pickup/delivery, and the like. The application scenarios of the different embodiments of the present application include, but are not limited to, one or a combination of web pages, browser plug-ins, clients, customized systems, in-house enterprise analysis systems, artificial intelligence robots, and the like. It should be understood that the application scenarios of the systems and methods of the present application are merely some examples or embodiments, and that one of ordinary skill in the art can apply the present application to other similar scenarios according to these drawings without inventive effort, for example, to other similar systems that guide users in parking.

The terms "passenger," "passenger side," "user terminal," "customer," "demander," "service demander," "consumer," "user demander," and the like are used interchangeably herein and refer to the party that needs or subscribes to a service, which may be a person or a tool. Likewise, "driver," "driver side," "provider," "supplier," "service provider," "server," "service party," and the like are used interchangeably herein and refer to a person, tool, or other entity that provides or assists in providing a service. In addition, the "user" described herein may be a party that needs or subscribes to a service, or may be a party that provides a service or assists in providing a service.

FIG. 1 is an exemplary flow chart illustrating a user classification method according to some embodiments of the present application. As shown in FIG. 1, the user classification method 100 may include:

Step 102: Data of a plurality of users is acquired.

In some embodiments, step 102 may be performed by the acquisition module 402.

In some embodiments, the user data acquired by the acquisition module 402 includes, but is not limited to, one or more of the user's order data (e.g., order placing data, order receiving data), user information data (e.g., information data of the order placer (also called the service initiator), information data of the order receiver (also called the service provider)), the user's online data (e.g., order placer online data, order receiver online data), and the like. Taking online car-hailing as an example, the order placing data may include one or any combination of the passenger's order placing time, current position, boarding place, destination, departure time, number of passengers, and the like. The order receiving data may include one or any combination of the driver's current position, vehicle travel track, estimated time of arrival, and the like. The order placer information data may include personal information of the passenger, such as name, gender, age, user name, nickname, preferences, number of historical orders, historical score, etc. The order receiver information data may include personal information of the driver, such as name, gender, age, user name, nickname, preferences, number of historical orders, historical score, etc. The order placer online data may include one or any combination of the passenger's online time, offline time, online duration, order placing frequency, and the like. The order receiver online data may include one or any combination of the driver's online time, offline time, online duration, order receiving frequency, the time periods in which the driver executes orders (i.e., is out driving), and the like.

In some embodiments, the user data may be stored in any component of the user classification system that has a storage function, such as a database. When performing step 102, the acquisition module 402 may acquire the user data directly from that component. In some embodiments, the user data may be stored in any system or device with a storage function outside the user classification system, which is connected to the user classification system through a network. When performing step 102, the acquisition module 402 may then acquire the user data from that external system or device through the network. In some embodiments, the external system or device with a storage function may be a database, server, user terminal, etc. outside the user classification system. The user terminal may include a tablet computer, a laptop computer, a mobile phone, a Personal Digital Assistant (PDA), a smart watch, a point of sale (POS) device, an on-board computer, an on-board television, a wearable device, and the like, or any combination thereof.

In some embodiments, the network may be any type of wired or wireless network or combination thereof. By way of example only, the network may include a cable network, a fiber optic network, a telecommunications network, an intranet, the Internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, and the like, or any combination thereof. In some embodiments, the network may include one or more network access points. For example, the network may include wired or wireless network access points, such as base stations and/or network switching points, via which one or more components of the user classification system may connect to the network to exchange data and/or information.

Step 104: at least one identification vector is generated based on the plurality of sets of user data.

In some embodiments, step 104 may be performed by determination module 404.

In some embodiments, after acquiring a plurality of user data, the determining module 404 may perform preprocessing on the user data to obtain a user data matrix. Further, singular value decomposition is performed on the user data matrix, and at least one singular vector is selected as the at least one identification vector based on the result of the singular value decomposition.

In some embodiments, a user data matrix may be generated based on the acquired plurality of user data. The user data matrix may reflect the behavior or state of the users in each preset time period within a preset time range. Taking drivers' departure data as an example, the user data matrix can reflect each driver's behavior in each preset time period, namely whether the driver is out driving, the departure duration, and similar information. If there are T preset time periods within the preset time range, the departure data of each driver can be represented by a T-dimensional vector. If the number of drivers is N, each driver corresponds to one T-dimensional vector, and an N×T data matrix can represent the behavior of the N drivers over the T preset time periods. For example, a departure data matrix for each day of the week can reflect a driver's behavior in each preset time period of each day of the week, namely whether the driver is out driving and for how long. For example only, a day may be divided into 144 ten-minute periods, so that the departure data of each driver over the preset time range can be represented by a 144-dimensional vector, each element of which corresponds to the driver's departure behavior in the corresponding period. If the number of drivers is 200, each driver corresponds to a 144-dimensional vector, and a 200×144 data matrix reflects the behavior of the 200 drivers in the 144 ten-minute periods of the day, i.e., whether each driver is out driving and for how long.

Specific methods of user data matrix generation may be referred to in the description of the remainder of this disclosure (e.g., in connection with step 202 of fig. 2).

After preprocessing the plurality of user data to obtain a user data matrix, singular value decomposition (SVD) may be performed on the user data matrix.

For example, the user data matrix may be represented as an N×T matrix A, where T is the number of preset time periods and N is the number of drivers. Performing SVD on A splits it into three matrices U, S, and V, i.e., A = U·S·V. U is the left singular matrix, an N×N square matrix whose column vectors may be referred to as left singular vectors. In some embodiments, the left singular vectors are mutually orthogonal, i.e., the dot product of any two different left singular vectors is 0. V is the right singular matrix, a T×T square matrix whose row vectors may be referred to as right singular vectors. In some embodiments, the right singular vectors are also mutually orthogonal, and each may abstract a behavior pattern of the drivers' departure periods and durations. S is the singular value matrix, an N×T matrix whose diagonal elements are the singular values and whose off-diagonal elements are all 0; the singular values are arranged in descending order from top to bottom.
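The decomposition described above can be sketched with NumPy as follows. This is only an illustrative sketch: the toy matrix (5 drivers, 8 time periods) and all variable names are assumptions, not part of the application. Note that `numpy.linalg.svd` returns the right singular vectors as the rows of its third output, which matches the row convention used for V in the text.

```python
import numpy as np

# Hypothetical toy data: N = 5 drivers, T = 8 preset time periods.
rng = np.random.default_rng(0)
A = rng.random((5, 8))
A = A / A.sum(axis=1, keepdims=True)  # each row sums to 1

# NumPy returns U (N x N), the singular values s in descending order,
# and Vh (T x T), whose ROWS are the right singular vectors.
U, s, Vh = np.linalg.svd(A, full_matrices=True)

# Rebuild the N x T singular value matrix S from the 1-D array s.
N, T = A.shape
S = np.zeros((N, T))
S[: len(s), : len(s)] = np.diag(s)

# Corresponds to A = U·S·V in the text.
reconstructed = U @ S @ Vh
```

Multiplying the three factors back together reproduces A up to floating-point error, which is a quick sanity check on the shapes and conventions.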

At least one singular vector may be selected as the at least one identification vector based on a result of singular value decomposition of the user data matrix.

In some embodiments, the ratio of the sum of squares of the first M singular values to the sum of squares of all singular values may be compared with a preset proportion threshold; when this ratio is greater than the threshold, the first M right singular vectors are selected as the M identification vectors.

In some embodiments, the periodicity of the right singular vectors may be observed, and the first M right singular vectors with periodicity are selected as M identifying vectors.

For a specific selection of the identification vector, reference may be made to the description of the rest of the disclosure (e.g., related to step 206 in fig. 2).

Step 106: a correlation indicator for each set of user data is determined.

In some embodiments, step 106 may be performed by the determination module 404.

In some embodiments, after selecting the at least one identification vector, the determination module 404 may determine the correlation index of the user data according to a vector correlation algorithm. The correlation index reflects the correlation between a set of user data and the at least one identification vector. For example, the departure data of 200 drivers over the 144 ten-minute preset time periods of a day yields a 200×144 departure data matrix; after singular value decomposition, the first 6 right singular vectors are selected as the identification vectors, each of which is a 144-dimensional vector. The correlation between each row vector of the 200×144 departure data matrix (the 144-dimensional vector corresponding to each driver, abbreviated as the driver vector) and each of the 6 selected identification vectors is then computed. Thus, 6 correlation indexes are obtained for each driver.

In some embodiments, the correlation index may be determined by a correlation algorithm. The correlation algorithm may include a Euclidean distance algorithm, a Manhattan distance algorithm, a Chebyshev distance algorithm, a Minkowski distance algorithm, a standardized Euclidean distance algorithm, a Mahalanobis distance algorithm, a cosine similarity algorithm, a Hamming distance algorithm, a Jaccard similarity coefficient algorithm, a Pearson correlation coefficient algorithm, a correlation coefficient and correlation distance algorithm, or the like, or any combination thereof.

In some embodiments, taking the Euclidean distance algorithm as an example, the identification vector and the driver vector are both T-dimensional vectors. Assuming that the identification vector is V = (V_1, V_2, V_3, …, V_T) and the driver vector is A = (A_1, A_2, A_3, …, A_T), the Euclidean distance between the driver vector and the identification vector can be calculated by formula (1):

$$\mathrm{Dist}(V, A) = \sqrt{\sum_{i=1}^{T} (V_i - A_i)^2} \tag{1}$$

where i denotes the i-th element of a vector, and V_i − A_i denotes the difference between the i-th element of the identification vector V and the i-th element of the driver vector A.

For each driver vector A and the M identification vectors V, M Euclidean distances can be obtained, denoted D_1, D_2, …, D_M. The M distances may be further normalized to obtain the final correlation indexes. By way of example only, the correlation index may be calculated by the following formula (2),

where C_i denotes the i-th correlation index of a driver vector, and D_i denotes the Euclidean distance between that driver vector and the i-th identification vector.

From the above, the larger the Euclidean distance Dist(V, A) between the driver vector and the identification vector, the larger the normalized correlation index C_i, and the lower the similarity between the driver vector and the identification vector.
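As a minimal sketch of the distance-based index, the following pure-Python example computes the Euclidean distance of formula (1) between one driver vector and each identification vector. The normalization step is an assumption made only for illustration: each distance is divided by the sum of the M distances, which keeps the index increasing with distance as stated above, but it is not presented as the application's formula (2). The function names and toy vectors are hypothetical.

```python
import math


def euclidean_distance(v, a):
    """Euclidean distance between two equal-length vectors, as in formula (1)."""
    if len(v) != len(a):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((vi - ai) ** 2 for vi, ai in zip(v, a)))


def distance_indexes(driver_vector, identification_vectors):
    """Return one normalized distance index per identification vector.

    ASSUMPTION: normalization divides each distance by the sum of all
    M distances; this choice is illustrative only.
    """
    distances = [euclidean_distance(v, driver_vector) for v in identification_vectors]
    total = sum(distances)
    if total == 0:
        return [0.0] * len(distances)
    return [d / total for d in distances]


# Toy example with T = 2 and M = 2 (hypothetical values).
identification_vectors = [[1.0, 0.0], [0.0, 1.0]]
driver = [1.0, 0.0]
indexes = distance_indexes(driver, identification_vectors)
```

In this toy case the driver vector coincides with the first identification vector, so the first index is 0 (most similar) and the second takes the whole normalized weight.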

In some embodiments, taking the cosine similarity algorithm as an example, the identification vector and the driver vector are both T-dimensional vectors. Assuming that the identification vector is V = (V_1, V_2, V_3, …, V_T) and the driver vector is A = (A_1, A_2, A_3, …, A_T), the cosine of the angle between the driver vector and the identification vector can be calculated by formula (3):

$$\cos\theta = \frac{V \cdot A}{\lVert V \rVert \, \lVert A \rVert} = \frac{\sum_{i=1}^{T} V_i A_i}{\sqrt{\sum_{i=1}^{T} V_i^2}\,\sqrt{\sum_{i=1}^{T} A_i^2}} \tag{3}$$

where θ is the angle between the driver vector A and the identification vector V, V·A denotes the dot product of the driver vector A and the identification vector V, ‖V‖ denotes the modulus of the identification vector V, and ‖A‖ denotes the modulus of the driver vector A.

For each driver vector A and the M identification vectors V, M angle cosines can be obtained, denoted cos θ_1, cos θ_2, …, cos θ_M. After normalization, the correlation indexes are obtained. The correlation index can be calculated by the following formula (4),

where C_i denotes the i-th correlation index of a driver vector, and cos θ_i denotes the cosine of the angle between that driver vector and the i-th identification vector.

From the above, the smaller the angle between the driver vector and the identification vector, the larger the cosine cos θ_i, the larger the normalized correlation index C_i, and the greater the similarity between the driver vector and the identification vector.
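The cosine computation of formula (3) can be sketched in pure Python as below. The function name, the toy vectors, and the choice to return 0.0 for a zero-length vector are assumptions for illustration; the subsequent normalization step is left out because its exact form is not assumed here.

```python
import math


def cosine_similarity(v, a):
    """Cosine of the angle between two equal-length vectors, as in formula (3)."""
    if len(v) != len(a):
        raise ValueError("vectors must have the same dimension")
    dot = sum(vi * ai for vi, ai in zip(v, a))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    # ASSUMPTION: a zero vector has no direction, so report 0.0 similarity.
    if norm_v == 0 or norm_a == 0:
        return 0.0
    return dot / (norm_v * norm_a)


# Toy example with T = 2 and M = 3 identification vectors (hypothetical values).
identification_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
driver = [2.0, 0.0]
cosines = [cosine_similarity(v, driver) for v in identification_vectors]
```

Because the cosine depends only on direction, scaling the driver vector (here by 2) does not change the result for the first identification vector.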

Step 108: and classifying the plurality of users according to the correlation indexes of the plurality of groups of user data.

In some embodiments, step 108 may be performed by classification module 406.

In some embodiments, the plurality of users may be classified using a clustering algorithm. Typical clustering algorithms may include K-Means clustering, AP (affinity propagation) clustering, mean-shift clustering, spectral clustering, hierarchical clustering (including the bottom-up agglomerative method AGNES and the top-down divisive method DIANA), and the like, or any combination thereof.

In some embodiments, clustering may be performed using the K-Means clustering algorithm. The M correlation indexes corresponding to each driver form an M-dimensional correlation index vector for that driver, so N drivers have N M-dimensional correlation index vectors. Clustering these N correlation index vectors with the K-Means clustering algorithm classifies the corresponding drivers.
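A minimal pure-Python sketch of K-Means (Lloyd's iteration) over such index vectors is shown below. It is an illustration under stated assumptions, not the application's implementation: the initial centroids are chosen by a simple deterministic farthest-point rule instead of random initialization, and the function names and the toy index vectors (N = 6, M = 2) are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def kmeans(points, k, max_iters=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    # ASSUMPTION: deterministic farthest-point initialization.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy index vectors for N = 6 drivers and M = 2 identification vectors.
index_vectors = [
    [0.90, 0.10], [0.85, 0.15], [0.88, 0.12],
    [0.10, 0.90], [0.15, 0.85], [0.12, 0.88],
]
labels, centroids = kmeans(index_vectors, k=2)
```

With these values the first three drivers fall into one class and the last three into the other.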

Methods of clustering using the K-Means clustering algorithm may be referred to in the description of the remainder of this disclosure (e.g., in connection with step 304 of FIG. 3).

FIG. 2 is an exemplary flow chart for selecting an identifying vector according to some embodiments of the present application.

As shown in FIG. 2, the process 200 of selecting an identification vector may include:

Step 202: The plurality of sets of user data are preprocessed to obtain a user data matrix.

In some embodiments, step 202 may be performed by a matrix generation unit in the determination module 404.

In some embodiments, taking the departure data of multiple drivers as an example, departure data in multiple (e.g., P) preset time ranges is acquired for each driver, and each time range may be split into multiple (e.g., T) preset time periods. The departure data of a driver can be represented by the following vector:

[a_i1, …, a_it, …, a_iT]

where i denotes driver i, t denotes the t-th preset time period, and a_it denotes the departure time of driver i in the t-th preset time period, i.e., the length of time the driver spends out driving in that period.

In some embodiments, the preset time range may include at least one of a day, a month, a quarter, a half year, or a year. The preset time period may include at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, half a month, or one month. For example, if one week of departure data is obtained for driver i, the preset time range may be one day and the preset time period may be 10 minutes, in which case P equals 7 and T equals 144. Suppose the driver is out driving from 8 a.m. to 5 p.m. Correspondingly, the departure data for each preset time period (10 minutes) in the driver's preset time range (one day) can be expressed as:

[a_i1,…,a_it,…,a_i144]

Where a_i1 to a_i48 are 0, a_i49 to a_i102 are 10 minutes, and a_i103 to a_i144 are 0.

In some embodiments, driver departure data for each preset time period may be obtained over P preset time ranges (cycles). The departure time a_it of driver i in the t-th preset time period may then be the average over the P cycles. For example, with a preset time range of 1 day and a preset time period of 10 minutes, if driver i is out driving during the 8:00 to 8:10 a.m. period from Monday to Saturday but not on Sunday, then a_it = (10×6 + 0)/7 ≈ 8.57 minutes.

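The data of the N drivers can be processed to obtain an N×T driver departure data matrix. The data matrix may be expressed as:

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1t} & \cdots & a_{1T} \\ \vdots & & \vdots & & \vdots \\ a_{i1} & \cdots & a_{it} & \cdots & a_{iT} \\ \vdots & & \vdots & & \vdots \\ a_{N1} & \cdots & a_{Nt} & \cdots & a_{NT} \end{bmatrix}$$

where each row represents the departure times of one driver in the T different preset time periods, and each column represents the departure times of different drivers in the same preset time period.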

Further, the matrix may be normalized, that is, each value in a row of the matrix A may be divided by the sum of the elements of that row, so that each row sums to 1 and each value in a row represents the distribution of that driver's departure time over the preset time periods. The normalized matrix can be expressed as:

$$A' = \begin{bmatrix} a_{11}/a_1 & \cdots & a_{1t}/a_1 & \cdots & a_{1T}/a_1 \\ \vdots & & \vdots & & \vdots \\ a_{N1}/a_N & \cdots & a_{Nt}/a_N & \cdots & a_{NT}/a_N \end{bmatrix}$$

where a_n = a_n1 + … + a_nT represents the sum of the departure times of driver n in all preset time periods.
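The preprocessing described in step 202 (averaging over P cycles, then row normalization) can be sketched in pure Python as follows. The function names and the single-driver example are hypothetical; the example reuses the values from the text (T = 144 ten-minute periods, P = 7 days, driving from 8 a.m. to 5 p.m. on six days and not at all on the seventh). Leaving an all-zero row unchanged is an assumption made to avoid division by zero.

```python
def average_over_cycles(cycle_vectors):
    """Average P vectors of length T element-wise into one T-dimensional vector."""
    p = len(cycle_vectors)
    return [sum(column) / p for column in zip(*cycle_vectors)]


def normalize_rows(matrix):
    """Divide each row by its sum so that every non-zero row sums to 1."""
    normalized = []
    for row in matrix:
        row_sum = sum(row)
        # ASSUMPTION: a driver with no departure time at all keeps a zero row.
        normalized.append([x / row_sum for x in row] if row_sum > 0 else list(row))
    return normalized


T = 144  # ten-minute periods per day
working_day = [0.0] * T
for t in range(48, 102):  # periods 49..102 cover 8:00 to 17:00
    working_day[t] = 10.0
rest_day = [0.0] * T

week = [working_day] * 6 + [rest_day]      # P = 7 cycles
driver_vector = average_over_cycles(week)  # averaged a_it values
data_matrix = normalize_rows([driver_vector])  # N = 1 driver in this toy case
```

Here the averaged value for the 8:00 to 8:10 period is 60/7 minutes, and after normalization each of the 54 active periods carries a share of 1/54.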

Step 204: singular value decomposition of the user data matrix

In some embodiments, step 204 may be performed by a decomposition unit in determination module 402.

In some embodiments, after preprocessing a plurality of user data to obtain a user data matrix, singular Value Decomposition (SVD) may be performed on the user data matrix. Since the user data matrix a is data of N pieces of driver data within each preset period of T, a is an n×t matrix. Matrix a may be split into three matrices U, S, V, a=u·s·v, where U is a left singular matrix, which is an n×n square, the row vectors of each row of which may be referred to as left singular vectors, which may be orthogonal in some embodiments, i.e., any two left singular vector point products of 1.V is a right singular matrix, which is a T x T square matrix, and the row vectors of each row can be called right singular vectors, in some embodiments, left singular vectors can be orthogonal, and a behavior pattern of the departure time period and duration of a driver can be abstracted; s is a singular value matrix, which is an NxT matrix, diagonal elements of the matrix are nonzero singular values, the elements of the non-diagonal are all 0, and the singular values are arranged from large to small from top to bottom.

Step 206: at least one singular vector is selected as the at least one identified vector.

In some embodiments, step 206 may be performed by an identification vector determination unit in determination module 402.

In some embodiments, the determination module 404 may select the first M right singular vectors as the identification vectors. For example only, the number of identification vectors may be determined by calculating the ratio of the sum of squares of the first M singular values to the sum of squares of all singular values and comparing the ratio with a preset proportion threshold. For example, when the ratio for the first M−1 singular values is less than the proportion threshold and the ratio for the first M singular values is greater than the proportion threshold, the determination module 404 may determine the number of identification vectors to be M. For example, suppose the singular value matrix S has four singular values, 4, 2, 2, and 1, and the preset proportion threshold is 90%. The sum of squares of the first 2 singular values is 80% of the sum of squares of all 4 singular values (20/25), which is less than the 90% threshold. The sum of squares of the first 3 singular values is 96% of the total (24/25), which is greater than the 90% threshold. The determination module 404 may therefore determine that the number of identification vectors is 3.
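The threshold rule can be sketched as a short pure-Python function. The function name `select_num_vectors` is hypothetical; the example reuses the singular values 4, 2, 2, 1 and the 90% threshold from the paragraph above.

```python
def select_num_vectors(singular_values, threshold):
    """Return the smallest M whose leading squared singular values exceed
    `threshold` as a share of the total sum of squares.

    `singular_values` is assumed to be sorted in descending order.
    """
    total = sum(s * s for s in singular_values)
    if total == 0:
        raise ValueError("all singular values are zero")
    running = 0.0
    for m, s in enumerate(singular_values, start=1):
        running += s * s
        if running / total > threshold:
            return m
    # Only reached when threshold >= 1: fall back to all vectors.
    return len(singular_values)


# Shares are 16/25 = 64%, 20/25 = 80%, 24/25 = 96%, 25/25 = 100%.
num_vectors = select_num_vectors([4, 2, 2, 1], threshold=0.9)
```

The cumulative share first exceeds 90% at M = 3, matching the worked example.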

In some embodiments, the number of selected identification vectors may also be derived by observing the distribution of each right singular vector (each row) of the right singular matrix V. Generally, the first M right singular vectors, which cover the information of the source data, exhibit periodicity, while the right singular vectors after the first M rows exhibit a random distribution with high-amplitude oscillation. For example, as shown in FIG. 5, the right singular matrix V contains 6 right singular vectors V₀, V₁, V₂, V₃, V₄, and V₅. The right singular vectors V₀, V₁, V₂, and V₃ exhibit significant periodicity, while the right singular vectors V₄ and V₅ exhibit a random distribution with high-amplitude oscillation. Therefore, the number of selected identification vectors can be determined to be 4, and the first 4 right singular vectors are selected as the identification vectors.

It should be noted that the above description of the process 200 is for purposes of illustration and description only and is not intended to limit the scope of applicability of the application. Various modifications and changes to the process 200 may be made by those skilled in the art in light of the present application. However, such modifications and changes remain within the scope of the present application. For example, steps 202 and 204 may be combined into one step. As another example, a pre-screening step may be included prior to step 202 to remove, according to certain rules, some clearly unreasonable noisy user data, e.g., data of drivers who are online but have not taken any orders.

FIG. 3 is an exemplary flow chart of classification using a clustering algorithm, shown according to some embodiments of the present application.

As shown in FIG. 3, the classification process 300 using the clustering algorithm may include:

Step 302: determine a number of classifications based on the plurality of sets of user data.

In some embodiments, step 302 may be performed by a classification number determination unit in classification module 406.

In some embodiments, the number of classifications K may be determined by one or more of a root mean square algorithm, an elbow method, a silhouette coefficient method, or a graphical method. For example, the number of classifications K is determined using the root mean square algorithm. Since there are N M-dimensional similarity index vectors, $K = \sqrt{N/2}$. For example, when N = 200, K = 10. If $\sqrt{N/2}$ is not an integer, K is the largest integer less than $\sqrt{N/2}$. For example, when N = 500, $\sqrt{N/2}$ is about 15.81, so K = 15.
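As a minimal sketch of the root mean square rule above (the function name `rule_of_thumb_k` is hypothetical, and the lower bound of 1 is an added safeguard rather than part of the description):

```python
import math

def rule_of_thumb_k(n_users):
    """Number of classifications K = floor(sqrt(N / 2)), never below 1."""
    return max(1, math.floor(math.sqrt(n_users / 2)))

k_for_200 = rule_of_thumb_k(200)  # sqrt(100) = 10
k_for_500 = rule_of_thumb_k(500)  # sqrt(250) is about 15.81, rounded down
```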

In some embodiments, the number of classifications K may be determined using an elbow method. The criterion used in the elbow method to evaluate the number of classifications K is the sum of squared errors (SSE), which can be calculated by formula (5):

$$SSE = \sum_{i=1}^{K} \sum_{p \in B_i} |p - m_i|^2 \quad (5)$$

where $B_i$ represents the i-th cluster, p is a sample point in cluster $B_i$, and $m_i$ is the centroid of cluster $B_i$.

Starting from K = 1, as K increases the samples are divided more finely and the degree of aggregation of each cluster gradually increases, so the SSE naturally decreases. When K is smaller than the optimal number of classifications, increasing K greatly increases the degree of aggregation of each cluster, so the SSE drops sharply. Once K reaches the optimal number of classifications, the gain in aggregation obtained by increasing K further diminishes rapidly, so the decrease in SSE shrinks abruptly and then levels off as K continues to increase. The plot of SSE against K therefore has the shape of an elbow, and the K value at the elbow (the turning point) is the optimal number of classifications for the data. On this basis, the number of classifications K can be determined using the elbow method.
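The following sketch illustrates the SSE criterion and one possible way to locate the elbow automatically. The description does not specify how the turning point is detected, so the largest-second-difference heuristic in `elbow_k` is an assumption, as are the function names and the sample SSE values.

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum over all clusters of the squared distances from each sample
    point to the centroid of the cluster it is assigned to."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = points - centroids[np.asarray(labels)]
    return float(np.sum(diffs ** 2))

def elbow_k(k_values, sse_values):
    """Assumed heuristic: choose the K at which the SSE curve bends most,
    measured by the largest second difference of the SSE values."""
    sse_values = np.asarray(sse_values, dtype=float)
    second_diff = sse_values[:-2] - 2 * sse_values[1:-1] + sse_values[2:]
    return k_values[int(np.argmax(second_diff)) + 1]

# Hypothetical SSE curve for K = 1..5: steep drop up to K = 3, then flat.
best_k = elbow_k([1, 2, 3, 4, 5], [100.0, 50.0, 10.0, 8.0, 7.0])
```

In practice the SSE values would come from running the clustering once per candidate K; a visual inspection of the SSE-K plot, as the description suggests, remains an equally valid way to pick the elbow.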

Step 304: perform a clustering operation on the correlation indexes of the plurality of sets of user data to classify the users.

In some embodiments, step 304 may be performed by a clustering unit in classification module 406.

In some embodiments, the classification module 406 may use a clustering algorithm to cluster the correlation indexes of the plurality of sets of user data. Typical clustering algorithms may include K-Means clustering, AP (affinity propagation) clustering, Mean-Shift clustering, spectral clustering, hierarchical clustering (including the bottom-up agglomerative method AGNES and the top-down divisive method DIANA), and the like, or any combination thereof.

In some embodiments, the clustering may be performed using a K-Means clustering algorithm. For each driver, the corresponding M similarity indexes may be represented by an M-dimensional similarity index vector Ci. The similarity index vector Ci of the i-th driver can be expressed as:

Ci = [Ci_1, …, Ci_m, …, Ci_M]

where Ci_m represents the similarity index of the i-th driver relative to the m-th identification vector.

N drivers have N similarity index vectors Ci. By clustering the N similarity index vectors Ci with the K-Means clustering algorithm, the corresponding drivers can be classified.

Specifically, after the number of classifications K is determined, K vectors may be randomly selected from the N similarity index vectors Ci as the initial centers. The distances from each of the other N-K similarity index vectors to these K centers are then calculated. The distance is determined according to the following formula (6):

$$Dist(Ci, Center_K) = \sqrt{\sum_{m=1}^{M} (Ci_{m} - CenterK_{m})^2} \quad (6)$$

where Ci_m represents the m-th element of the vector Ci, and CenterK_m represents the m-th element of Center_K.

For each driver's similarity index vector Ci, K distances Dist(Ci, Center_K) are calculated, representing its distances to the K different centers. These K distances are compared, and Ci is assigned to the same class as the Center_K with the smallest distance to it. In this way, all similarity index vectors Ci may be clustered into K clusters.

For each of the K clusters, the center of the cluster is then updated. The updated center can be determined by the following formula (7):

$$Center_K = \frac{1}{|C_K|} \sum_{Ci \in C_K} Ci \quad (7)$$

where $C_K$ represents the K-th cluster, $|C_K|$ is the number of data objects in the K-th cluster, and the summation is the sum of all elements in each column of attributes in cluster $C_K$. Thus Center_K is also a vector containing M attributes, expressed as:

Center_K = (CenterK_1, CenterK_2, ..., CenterK_M)

The above operations are iterated, and the iteration stops when the number of iterations exceeds a preset number, when the cluster centers no longer change, or when the loss function falls below a preset threshold. At this point, every correlation index vector has been assigned to one of the K clusters, and the purpose of classifying the users is achieved.
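The K-Means procedure described above (random initial centers, assignment to the nearest center, center update, iteration until the centers stop changing or a preset iteration count is reached) might be sketched as follows. This is a minimal illustration under stated assumptions: the Euclidean distance is assumed for the distance computation, the function name `k_means`, its parameters, and the toy vectors are hypothetical, and the loss-threshold stopping condition is omitted for brevity.

```python
import numpy as np

def k_means(vectors, k, max_iter=100, seed=0):
    """Cluster N similarity index vectors (rows of `vectors`) into k clusters."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    # Randomly pick k of the N vectors as the initial centers.
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(max_iter):
        # Distance from every vector to every center, shape (N, k).
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        # Assign each vector to the center with the smallest distance.
        labels = np.argmin(dists, axis=1)
        # Update each center to the column-wise mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2-dimensional similarity index vectors for six drivers.
C = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [10.0, 10.0], [10.3, 10.1], [10.1, 10.4]]
labels, centers = k_means(C, k=2)
```

Here the first three drivers and the last three drivers end up in two different clusters; which cluster receives label 0 depends on the random initialization.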

It should be noted that the above description of the process 300 is for purposes of illustration and description only and is not intended to limit the scope of applicability of the application. Various modifications and changes to flow 300 will be apparent to those skilled in the art in light of the teachings of this application. However, such modifications and variations are still within the scope of the present application. For example, the number of classifications K in step 302 may be empirically preset.

FIG. 4 is a block diagram of a user classification system according to some embodiments of the application.

As shown in fig. 4, the user classification system may include an acquisition module 402, a determination module 404, a classification module 406, and a storage module 408.

The acquisition module 402 may be used to acquire multiple sets of user data.

The determination module 404 may be configured to generate at least one identification vector based on the plurality of sets of user data, wherein each of the identification vectors represents a data distribution type. The determination module 404 may also be configured to determine a relevance index for each set of user data, the relevance index reflecting a relevance between the set of user data and each of the at least one identification vector.

In some embodiments, the determination module 404 may include a matrix generation unit, a decomposition unit, and an identification vector determination unit. The matrix generating unit may be configured to pre-process the plurality of sets of user data to obtain a user data matrix. The decomposition unit may be configured to perform singular value decomposition on the user data matrix. The identification vector determination unit may be configured to select at least one singular vector as the at least one identification vector. The identifying vector determining unit may be configured to select, from singular values obtained by decomposition, a plurality of singular values having a ratio of a sum of squares to a sum of squares of all singular values greater than a preset threshold, and use right singular vectors corresponding to the plurality of singular values as the identifying vector. The identification vector determining unit may be further configured to select, from the right singular matrix obtained by decomposition, a right singular vector whose elements are periodically distributed as the identification vector.

The classification module 406 may be configured to classify the plurality of users based on correlation indicators for the plurality of sets of user data. The classification module 406 may be further configured to perform a clustering operation on the correlation indexes of the multiple sets of user data, so as to classify the users.

In some embodiments, the classification module 406 may include a classification number determination unit and a clustering unit. The number of classifications determining unit may be configured to determine the number of classifications K based on the plurality of sets of user data. The clustering unit may be configured to perform a clustering operation on the correlation indexes of the plurality of groups of user data, so as to classify the users into K classes.

The storage module 408 may be used to store service request related data, control parameters, processed service request related data, and the like, or any combination thereof. In some embodiments, the storage module 408 may store one or more programs and/or instructions that may be executed by the processor to implement the exemplary methods described herein. For example, the storage module 408 may store programs and/or instructions executable by the processor to obtain a plurality of user data based on which the plurality of users may be categorized.

It should be understood that the system shown in fig. 4 and its modules may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may then be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules of the present application may be implemented not only with hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also with software, such as executed by various types of processors, and with a combination of the above hardware circuitry and software (e.g., firmware).

It should be noted that the above description of the user classification system and its modules is for descriptive convenience only and is not intended to limit the application to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily, or a subsystem may be constructed in connection with other modules, without departing from such principles. For example, in some embodiments, the acquisition module 402, the determination module 404, the classification module 406, and the storage module 408 disclosed in FIG. 4 may be different modules in one system, or one module may implement the functions of two or more of the modules described above. For example, the determination module 404 and the classification module 406 may be two modules, or may be one module having both determining and classifying functions. For example, the modules may share one storage module, or each module may have its own storage module. Such variations are within the scope of the present application.

Possible beneficial effects of embodiments of the present application include, but are not limited to: (1) the data can be classified effectively by performing class aggregation after filtering; (2) users can be flexibly stratified along different time dimensions; (3) singular value decomposition reduces the dimensionality of the data and, combined with a class aggregation algorithm, allows users to be classified more efficiently and directly. It should be noted that different embodiments may produce different advantages, and in different embodiments the advantages produced may be any one or a combination of several of the above, or any other advantage that may be obtained.

While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly stated herein, those skilled in the art may make various modifications, improvements, and adaptations to the present application. Such modifications, improvements, and adaptations are suggested by this application and therefore remain within the spirit and scope of the exemplary embodiments of this application.

Meanwhile, the present application uses specific words to describe embodiments of the present application. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present application may be combined as suitable.

Furthermore, those skilled in the art will appreciate that the various aspects of the application may be illustrated and described in the context of a number of patentable categories or circumstances, including any novel and useful process, machine, product, or material, or any novel and useful improvement thereof. Accordingly, aspects of the present application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.

The computer storage medium may contain a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take on a variety of forms, including electro-magnetic, optical, etc., or any suitable combination thereof. A computer storage medium may be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or a combination of any of the foregoing.

The computer program code necessary for operation of portions of the present application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or offered as a service such as software as a service (SaaS).

Furthermore, unless explicitly recited in the claims, the order in which elements and sequences are processed, the use of numbers and letters, and the use of other designations in the application are not intended to limit the order of the processes and methods of the application. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present application. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.

Likewise, it should be noted that, in order to simplify the presentation disclosed herein and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, does not imply that the subject matter of the application requires more features than are recited in the claims. Indeed, the claimed subject matter may lie in less than all of the features of a single embodiment disclosed above.

In some embodiments, numbers describing quantities of components and attributes are used. It should be understood that such numbers used in the description of embodiments are, in some examples, modified by the terms "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows for a 20% variation. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought to be obtained by the individual embodiments. In some embodiments, the numerical parameters should be construed in light of the specified number of significant digits and by applying ordinary rounding techniques. Although the numerical ranges and parameters used to confirm the breadth of the range in some embodiments are approximations, in particular embodiments such numerical values are set as precisely as practicable.

Each patent, patent application, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this application is hereby incorporated by reference in its entirety, except for any application history documents that are inconsistent or in conflict with the present application, and any documents, now or later associated with this application, that may limit the broadest scope of the claims of the present application. It is noted that if the descriptions, definitions, and/or use of terms in the material incorporated into this application are inconsistent or in conflict with those set forth in this application, the descriptions, definitions, and/or use of terms in this application shall prevail.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present application. Other variations are also possible within the scope of this application. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present application may be considered in keeping with the teachings of the present application. Accordingly, embodiments of the present application are not limited to only the embodiments explicitly described and depicted herein.

Claims

1. A method of classifying users, comprising:

acquiring a plurality of groups of user data;

generating at least one identification vector based on the plurality of sets of user data, including:

preprocessing the plurality of groups of user data to obtain a user data matrix;

singular value decomposition is carried out on the user data matrix;

selecting at least one singular vector as the at least one identification vector, wherein the singular vector is a right singular vector, and the right singular vector is a row vector of a right singular matrix obtained by carrying out the singular value decomposition on the user data matrix;

wherein each of the identification vectors represents a data distribution type;

determining a relevance index for each set of user data, the relevance index reflecting a relevance between the set of user data and the at least one identification vector;

and classifying the plurality of users according to the correlation indexes of the plurality of groups of user data.

2. The method of claim 1, wherein selecting at least one singular vector as the at least one identification vector comprises:

selecting a plurality of singular values with the proportion of the square sum to the square sum of all the singular values being greater than a preset threshold value from singular values obtained by decomposition;

and taking singular vectors corresponding to the singular values as the identification vectors.

3. The method of claim 1, wherein selecting at least one singular vector as the at least one identification vector comprises:

and selecting singular vectors with periodically distributed elements from the singular matrix obtained by decomposition as the identification vectors.

4. The method of claim 1, wherein the relevance index reflects a degree of similarity between user data and an identification vector.

5. The method of claim 1, wherein classifying the users according to the correlation indicators of the plurality of sets of user data comprises:

and carrying out clustering operation on the correlation indexes of the plurality of groups of user data, and further classifying the users.

6. The method as recited in claim 5, further comprising:

determining a classification number K based on the plurality of sets of user data;

and clustering the correlation indexes of the plurality of groups of user data, so as to divide the users into K classes.

7. The method of claim 1, wherein the user data reflects the status of the user for different preset time periods within a preset time range; wherein the preset time range includes a plurality of the preset time periods.

8. The method of claim 7, wherein the user comprises a driver; the user data reflects the departure time of the driver in different preset time periods within a preset time range.

9. The method of claim 7, wherein the predetermined time range comprises at least one of a day, a month, a quarter, a half year, or a year.

10. The method of claim 7, wherein the predetermined period of time comprises at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, half a month, or one month.

11. A user classification system, comprising: an acquisition module, a determination module and a classification module, wherein the determination module comprises: a matrix generation unit, a decomposition unit and an identification vector determination unit;

the acquisition module is used for acquiring a plurality of groups of user data;

the determining module is used for generating at least one identification vector based on the plurality of groups of user data; wherein each of the identification vectors represents a data distribution type; and determining a relevance index for each set of user data, the relevance index reflecting a relevance between the set of user data and the at least one identification vector;

the matrix generating unit is used for preprocessing the plurality of groups of user data to obtain a user data matrix;

the decomposition unit is used for carrying out singular value decomposition on the user data matrix; and

the identification vector determining unit is used for selecting at least one singular vector as the at least one identification vector, wherein the singular vector comprises a right singular vector which is a row vector of each row of a right singular matrix obtained by carrying out the singular value decomposition on the user data matrix;

the classification module is used for classifying the plurality of users according to the correlation indexes of the plurality of groups of user data.

12. The system according to claim 11, wherein the identifying vector determining unit is further configured to select, among the singular values obtained by the decomposition, a number of singular values whose ratio of the sum of squares to the sum of squares of all the singular values is greater than a preset threshold; and taking singular vectors corresponding to the singular values as the identification vectors.

13. The system according to claim 11, wherein the identification vector determining unit is further configured to select, as the identification vector, a singular vector in which elements are periodically distributed among the singular matrices obtained by the decomposition.

14. The system of claim 11, wherein the relevance index reflects a degree of similarity between user data and an identification vector.

15. The system of claim 11, wherein the classification module is further configured to perform a clustering operation on the correlation indicators of the plurality of sets of user data, and further classify the users.

16. The system of claim 15, wherein the classification module comprises: a classification number determining unit and a clustering unit;

the classification number determining unit is used for determining a classification number K based on the plurality of groups of user data; and

the clustering unit is used for carrying out clustering operation on the correlation indexes of the plurality of groups of user data, and further classifying the users into K types.

17. The system of claim 11, wherein the user data reflects the status of the user for different preset time periods within a preset time frame; wherein the preset time range includes a plurality of the preset time periods.

18. The system of claim 17, wherein the user comprises a driver; the user data reflects the departure time of the driver in different preset time periods within a preset time range.

19. The system of claim 17, wherein the predetermined time range comprises at least one of a day, a month, a quarter, a half year, or a year.

20. The system of claim 17, wherein the predetermined period of time comprises at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, half a month, or one month.

21. A computer readable storage medium for storing computer instructions which, when executed, implement a user classification method according to any one of claims 1-10.

22. A user classification device comprising at least one processor and at least one storage medium;

the at least one storage medium is for storing computer instructions;

the at least one processor is configured to execute the computer instructions to implement the user classification method of any of claims 1-10.