CN111311292A

CN111311292A - User classification method and system

Info

Publication number: CN111311292A
Application number: CN201811514736.5A
Authority: CN
Inventors: 谢梁; 李盼
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2020-06-19
Anticipated expiration: 2038-12-12
Also published as: CN111311292B

Abstract

The embodiment of the application discloses a user classification method. The user classification method comprises the following steps: acquiring a plurality of groups of user data; generating at least one identifying vector based on the plurality of sets of user data; wherein each identifying vector represents a data distribution type; determining a relevance indicator for each set of user data, the relevance indicator reflecting a relevance between the set of user data and the at least one identifying vector; and classifying the plurality of users according to the relevance indexes of the plurality of groups of user data. The technical scheme of combining matrix decomposition and the clustering algorithm is adopted, so that the users can be classified more effectively, quickly and accurately.

Description

User classification method and system

Technical Field

The present application relates to the field of data processing, and in particular, to a user classification method and system.

Background

With the advent of the big data era, classifying and labeling users in order to provide them with high-quality services has become a hot topic. When traditional classification methods are used to process massive amounts of data, efficiency is low and classification accuracy cannot be guaranteed. Therefore, it is necessary to provide a more effective and accurate user classification method.

Disclosure of Invention

One embodiment of the present application provides a user classification method. The user classification method comprises the following steps: acquiring a plurality of groups of user data; generating at least one identifying vector based on the plurality of sets of user data; wherein each identifying vector represents a data distribution type; determining a relevance indicator for each set of user data, the relevance indicator reflecting a relevance between the set of user data and the at least one identifying vector; and classifying the plurality of users according to the relevance indexes of the plurality of groups of user data.

In some embodiments, said generating at least one identifying vector based on said plurality of sets of user data comprises: preprocessing the multiple groups of user data to obtain a user data matrix; performing singular value decomposition on the user data matrix; selecting at least one singular vector as the at least one identifying vector.

In some embodiments, said selecting at least one singular vector as said at least one identifying vector comprises: selecting a plurality of singular values of which the ratio of the sum of squares to the sum of squares of all singular values is greater than a preset threshold value from the singular values obtained by decomposition; and taking singular vectors corresponding to the singular values as the identification vectors.

In some embodiments, said selecting at least one singular vector as said at least one identifying vector comprises: selecting, from the singular matrix obtained by decomposition, singular vectors whose elements are periodically distributed as the identifying vectors.

In some embodiments, the relevance indicator reflects a degree of similarity between the user data and the identifying vector.

In some embodiments, the classifying the users according to the relevance indicators of the plurality of sets of user data includes: and clustering the correlation indexes of the multiple groups of user data, and further classifying the users.

In some embodiments, the method further comprises: determining a number of classifications K based on the plurality of sets of user data; and clustering the correlation indexes of the multiple groups of user data to classify the users into K classes.

In some embodiments, the user data reflects the status of the user at different preset time periods within a preset time range; wherein the preset time range includes a plurality of the preset time periods.

In some embodiments, the user comprises a driver; the user data reflects the departure time of the driver in different preset time periods within the preset time range.

In some embodiments, the preset time range includes at least one of a day, a month, a quarter, a half year, or a year.

In some embodiments, the preset time period comprises at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, half a month, or one month.

One of embodiments of the present application provides a user classification system, where the user classification system includes: the device comprises an acquisition module, a determination module and a classification module; the acquisition module is used for acquiring a plurality of groups of user data; the determination module is configured to generate at least one identifying vector based on the plurality of sets of user data; wherein each identifying vector represents a data distribution type; and determining a relevance indicator for each set of user data, the relevance indicator reflecting a relevance between the set of user data and the at least one identifying vector; and the classification module is used for classifying a plurality of users according to the relevance indexes of the plurality of groups of user data.

In some embodiments, the determining module comprises: the device comprises a matrix generation unit, a decomposition unit and an identification vector determination unit; the matrix generation unit is used for preprocessing the multiple groups of user data to obtain a user data matrix; the decomposition unit is used for carrying out singular value decomposition on the user data matrix; and the identifying vector determining unit is used for selecting at least one singular vector as the at least one identifying vector.

In some embodiments, the identifying vector determining unit is further configured to select, from among singular values obtained by decomposition, singular values whose ratio of a sum of squares to a sum of squares of all singular values is greater than a preset threshold; and taking singular vectors corresponding to the singular values as the identification vectors.

In some embodiments, the identifying vector determining unit is further configured to select, as the identifying vector, a singular vector whose elements are periodically distributed in a singular matrix obtained by decomposition.

In some embodiments, the classification module is further configured to perform clustering operation on the correlation indexes of the multiple sets of user data, so as to classify the users.

In some embodiments, the classification module comprises: a classification number determining unit and a clustering unit; the classification number determining unit is used for determining a classification number K based on the plurality of groups of user data; and the clustering unit is used for clustering the correlation indexes of the multiple groups of user data so as to divide the users into K classes.

One of the embodiments of the present application provides a user classifying device, which includes a processor, where the processor is configured to execute the foregoing user classifying method.

One of the embodiments of the present application provides a computer-readable storage medium, where the storage medium stores computer instructions, and after the computer reads the computer instructions in the storage medium, the computer executes the foregoing user classification method.

Drawings

The present application will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is an exemplary flow chart of a user classification method according to some embodiments of the present application;

FIG. 2 is an exemplary flow diagram illustrating the selection of an identifying vector according to some embodiments of the present application;

FIG. 3 is an exemplary flow chart illustrating classification using a clustering algorithm according to some embodiments of the present application;

FIG. 4 is a block diagram of a user categorization system shown in accordance with some embodiments of the present application; and

FIG. 5 is a schematic diagram of the periodicity of different identification vectors shown in accordance with some embodiments of the present application.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only examples or embodiments of the application, from which the application can also be applied to other similar scenarios without inventive effort for a person skilled in the art. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "device", "unit" and/or "module" as used herein are one way of distinguishing different components, elements, parts, portions or assemblies at different levels. However, these words may be replaced by other expressions if those expressions accomplish the same purpose.

As used in this application and the appended claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that steps and elements are included which are explicitly identified, that the steps and elements do not form an exclusive list, and that a method or apparatus may include other steps or elements.

Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are presented. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.

Embodiments of the present application may be applied to different transportation systems including, but not limited to, one or a combination of terrestrial, marine, aeronautical, aerospace, and the like. Examples include taxis, special cars, hitch rides, buses, designated driving, trains, railcars, high-speed rail, ships, airplanes, hot air balloons, unmanned vehicles, courier pick-up/delivery, and other transportation systems that employ management and/or distribution. The application scenarios of the different embodiments of the present application include, but are not limited to, one or a combination of several of a web page, a browser plug-in, a client, a customization system, an intra-enterprise analysis system, an artificial intelligence robot, and the like. It should be understood that the application scenarios of the system and method of the present application are merely examples or embodiments of the present application, and those skilled in the art can also apply the present application to other similar scenarios, for example other similar systems for guiding users to park, without inventive effort based on these figures.

The terms "passenger", "passenger end", "user terminal", "customer", "demander", "service demander", "consumer", "user demander" and the like are used interchangeably and refer to a party that needs or orders a service, either a person or a tool. Similarly, "driver," "provider," "service provider," "server," and the like, as described herein, are interchangeable and refer to an individual, tool, or other entity that provides a service or assists in providing a service. In addition, a "user" as described herein may be a party that needs or subscribes to a service, or a party that provides or assists in providing a service.

FIG. 1 illustrates an exemplary flow chart of a user classification method according to some embodiments of the present application. As shown in fig. 1, the method 100 of user classification may include:

Step 102: data for a plurality of users is acquired.

In some embodiments, step 102 may be performed by acquisition module 402.

In some embodiments, the user data obtained by the obtaining module 402 includes, but is not limited to, one or any combination of order data of the user (e.g., order placing data, order taking data), user information data (e.g., information data of the order placer (also called the service initiator), information data of the order taker (also called the service provider)), online data of the user (e.g., order placer online data, order taker online data), and the like. Taking an online ride-hailing order as an example, the order placing data may include one or any combination of the order placing time, the current position, the pick-up location, the destination, the departure time, the number of passengers, and the like. The order taking data may include one or any combination of the current position of the driver, the vehicle travel track, the predicted arrival time, and the like. The order placer information data may include personal information of the passenger, such as name, gender, age, user name, nickname, preferences, historical number of orders placed, historical rating, and the like. The order taker information data may include personal information of the driver, such as name, gender, age, user name, nickname, preferences, historical number of orders taken, historical rating, and the like. The order placer online data may include one or any combination of the passenger's online time, the passenger's offline time, the passenger's online duration, the passenger's order placing frequency, and the like. The order taker online data may include one or any combination of the driver's online time, the driver's offline time, the driver's online duration, the driver's order taking frequency, the length of time the driver spends executing orders (or out driving), and the like.

In some embodiments, the user data may be stored in any component of the user classification system having a storage function, such as a database. In executing step 102, the obtaining module 402 may obtain the user data directly from the component with the storage function. In some embodiments, the user data may be stored in any system or device with storage function outside the user classification system, and the system or device with storage function outside the user classification system is connected to the user classification system through a network. In executing step 102, the obtaining module 402 may obtain the user data from any device with storage function outside the user classification system through a network. In some embodiments, any device with a storage function outside the user classification system may be a database, a server, or a user terminal outside the user analysis system. The user terminal may include a tablet, a laptop, a mobile phone, a Personal Digital Assistant (PDA), a smart watch, a point of sale (POS) device, an on-board computer, an on-board television, a wearable device, and the like, or any combination thereof.

In some embodiments, the network may be any type of wired or wireless network or combination thereof. By way of example only, the network may include a cable network, a fiber optic network, a telecommunications network, an intranet, the Internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, and the like, or any combination thereof. In some embodiments, the network may include one or more network access points. For example, the network may include wired or wireless network access points, such as base stations and/or network switching points, via which one or more components of the user classification system may connect to the network for the exchange of data and/or information.

Step 104: at least one identifying vector is generated based on the sets of user data.

In some embodiments, step 104 may be performed by determination module 404.

In some embodiments, after obtaining the plurality of user data, the determining module 404 may perform preprocessing on the user data to obtain a user data matrix. Further, singular value decomposition is performed on the user data matrix, and at least one singular vector is selected as the at least one identifying vector based on a result after the singular value decomposition.

In some embodiments, a user data matrix may be generated based on the acquired plurality of user data. The user data matrix may reflect the behavior or state of the users in each preset time period within a preset time range. Taking drivers' departure data as an example, the user data matrix can reflect each driver's behavior in each preset time period, such as whether the driver is out driving and for how long. If there are T preset time periods within the preset time range, the departure data of each driver can be represented by a T-dimensional vector. If the number of drivers is N, each driver corresponds to one T-dimensional vector, and the behavior of the N drivers over the T preset time periods can be represented by an N × T data matrix. For example, a data matrix of the drivers' daily departure durations over one week can reflect the behavior of each driver in each preset time period of each day of that week, i.e., whether the driver is out driving and for how long. By way of example only, a day may be divided into 144 periods of 10 minutes each, so that the departure data of each driver in the preset time range can be represented by a 144-dimensional vector, each element of which corresponds to the driver's departure behavior in the corresponding time period. If the number of drivers is 200, each driver corresponds to a 144-dimensional vector. A 200 × 144 data matrix can then be used to reflect the behavior of the 200 drivers in the 144 ten-minute time periods of the day, i.e., whether each driver is out driving and for how long.
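The construction of an N × T matrix described above can be sketched as follows. This is only a minimal illustration: the record format `(driver_index, period_index, minutes)` and the function name `build_departure_matrix` are assumptions made for the example and do not come from the application.

```python
import numpy as np

def build_departure_matrix(records, n_drivers, n_periods):
    """Accumulate (driver, period, minutes) records into an N x T matrix.

    The record layout is a hypothetical one chosen for illustration.
    """
    matrix = np.zeros((n_drivers, n_periods))
    for driver, period, minutes in records:
        matrix[driver, period] += minutes
    return matrix

# Hypothetical records: driver 0 is out in periods 48 and 49, driver 1 in period 100.
records = [(0, 48, 10.0), (0, 49, 10.0), (1, 100, 5.0)]
A = build_departure_matrix(records, n_drivers=200, n_periods=144)
print(A.shape)  # (200, 144)
```

Each row of `A` is one driver vector and each column is one ten-minute period, matching the 200 × 144 example in the text.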

Specific methods for generating the user data matrix may be described in other parts of this disclosure (e.g., in relation to step 202 of fig. 2).

After preprocessing the plurality of user data to obtain a user data matrix, Singular Value Decomposition (SVD) may be performed on the user data matrix.

For example, the user data matrix may be represented as an N × T matrix A, where T is the number of preset time periods and N is the number of drivers. Performing SVD on the matrix A splits it into three matrices U, S and V, i.e., A = U · S · V. Here U is the left singular matrix, an N × N square matrix whose column vectors may be referred to as left singular vectors. In some embodiments, the left singular vectors may be mutually orthogonal, i.e., the dot product of any two different left singular vectors is 0. V is the right singular matrix, a T × T square matrix, and the row vector of each row is referred to as a right singular vector. In some embodiments, the right singular vectors may also be mutually orthogonal, and they may abstract the behavior patterns of the drivers' departure periods and durations. S is the singular value matrix, an N × T matrix whose diagonal elements are the singular values and whose off-diagonal elements are all 0; the singular values may be arranged from largest to smallest from top to bottom.
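As a rough illustration of the decomposition, the following sketch uses NumPy's `numpy.linalg.svd` on a small random matrix. The toy sizes (5 × 8) and the variable names are assumptions for the example. Note that NumPy returns the right singular matrix already arranged so that its rows are the right singular vectors, which matches the convention A = U · S · V used here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 8                      # toy sizes; the text uses e.g. 200 x 144
A = rng.random((N, T))           # stand-in for the user data matrix

# full_matrices=True gives U as N x N and V as T x T.
U, s, V = np.linalg.svd(A, full_matrices=True)

# Rebuild the N x T singular value matrix S from the 1-D array s.
S = np.zeros((N, T))
k = min(N, T)
S[:k, :k] = np.diag(s)

print(np.allclose(U @ S @ V, A))        # True: A = U . S . V
print(np.allclose(V @ V.T, np.eye(T)))  # True: rows of V are orthonormal
```

The singular values in `s` come back sorted from largest to smallest, so the first rows of `V` correspond to the largest singular values.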

Based on the result of the singular value decomposition of the user data matrix, at least one singular vector may be selected as the at least one identifying vector.

In some embodiments, the ratio of the sum of squares of the first M singular values to the sum of squares of all singular values may be compared with a preset ratio threshold. When this ratio is greater than the threshold, the first M right singular vectors are selected as the M identifying vectors.

In some embodiments, the periodicity of the right singular vectors may be observed, and the first M periodic right singular vectors are selected as the M identifying vectors.

The specific selection method of the identification vector can refer to the description of other parts of the disclosure (for example, related content of step 206 in fig. 2).

Step 106: a relevance indicator is determined for each set of user data.

In some embodiments, step 106 may be performed by determination module 404.

In some embodiments, after selecting at least one identifying vector, the determining module 404 may determine a relevance indicator of the user data according to a vector correlation algorithm. The relevance indicator reflects a relevance between the set of user data and the at least one identifying vector. For example, suppose 200 groups of driver departure data are obtained and the preset time periods are the 144 ten-minute periods of a day, giving a 200 × 144 driver departure data matrix. After singular value decomposition, the first 6 right singular vectors are selected as identifying vectors, each of which is a 144-dimensional vector. A correlation calculation is then performed between each row vector of the 200 × 144 matrix (the 144-dimensional vector corresponding to each driver, referred to simply as a driver vector) and each of the 6 selected identifying vectors. In this way, each driver obtains 6 corresponding relevance indicators.

In some embodiments, the correlation indicator may be determined by a correlation algorithm. The correlation algorithm may include a Euclidean distance algorithm, a Manhattan distance algorithm, a Chebyshev distance algorithm, a Minkowski distance algorithm, a standardized Euclidean distance algorithm, a Mahalanobis distance algorithm, a cosine similarity algorithm, a Hamming distance algorithm, a Jaccard similarity coefficient algorithm, a Pearson correlation coefficient algorithm, a correlation coefficient and correlation distance algorithm, and the like, or any combination thereof.

In some embodiments, taking the Euclidean distance algorithm as an example, the identifying vector and the driver vector are both T-dimensional vectors. Assuming that the identifying vector is $V = (V_1, V_2, V_3, \ldots, V_T)$ and the driver vector is $A = (A_1, A_2, A_3, \ldots, A_T)$, the Euclidean distance between the driver vector and the identifying vector can be calculated by formula (1):

$$\mathrm{Dist}(V, A) = \sqrt{\sum_{i=1}^{T} (V_i - A_i)^2} \tag{1}$$

where $i$ denotes the i-th element of the vector, and $V_i - A_i$ represents the difference between the i-th element of the identifying vector V and the i-th element of the driver vector A.
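The Euclidean distance between one driver vector and several identifying vectors can be computed in a few lines. The sketch below is illustrative only; the function name `euclidean_distances` and the toy inputs are assumptions, and the subsequent normalization into correlation indexes is not shown.

```python
import numpy as np

def euclidean_distances(driver_vector, identifying_vectors):
    """Distance from one T-dim driver vector to each of M identifying vectors.

    identifying_vectors has shape (M, T); the result has shape (M,).
    """
    diffs = identifying_vectors - driver_vector   # broadcasts (M, T) - (T,)
    return np.sqrt(np.sum(diffs ** 2, axis=1))

# Toy example with T = 2 and M = 2.
driver = np.array([0.0, 0.0])
identifying = np.array([[3.0, 4.0], [0.0, 1.0]])
print(euclidean_distances(driver, identifying))  # [5. 1.]
```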

For each driver vector A and the M identifying vectors V, M Euclidean distances can be obtained, denoted $D_1, D_2, \ldots, D_M$. These M values can be further normalized to obtain the final correlation indexes. By way of example only, the correlation index may be calculated by formula (2), where $C_i$ represents the correlation index corresponding to each driver vector, and $D_i$ represents the Euclidean distance between each driver vector and the identifying vector.

From the above derivation, it can be seen that the smaller the Euclidean distance $\mathrm{Dist}(V, A)$ between the driver vector and the identifying vector, the greater the similarity between the driver vector and that identifying vector; the normalized correlation index $C_i$ reflects this similarity.

In some embodiments, taking the cosine similarity algorithm as an example, the identifying vector and the driver vector are both T-dimensional vectors. Assuming that the identifying vector is $V = (V_1, V_2, V_3, \ldots, V_T)$ and the driver vector is $A = (A_1, A_2, A_3, \ldots, A_T)$, the cosine similarity between the driver vector and the identifying vector can be calculated by formula (3):

$$\cos\theta = \frac{V \cdot A}{\|V\| \, \|A\|} = \frac{\sum_{i=1}^{T} V_i A_i}{\sqrt{\sum_{i=1}^{T} V_i^2} \, \sqrt{\sum_{i=1}^{T} A_i^2}} \tag{3}$$

where $\theta$ is the angle between the driver vector A and the identifying vector V, $V \cdot A$ represents the dot product of the driver vector A and the identifying vector V, $\|V\|$ represents the norm of the identifying vector V, and $\|A\|$ represents the norm of the driver vector A.
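The cosine of the angle between a driver vector and each identifying vector can be sketched as below. The function name `cosine_similarities` is an assumption for illustration, and the sketch assumes that neither the driver vector nor any identifying vector is all zeros (otherwise the division is undefined).

```python
import numpy as np

def cosine_similarities(driver_vector, identifying_vectors):
    """cos(theta) between one T-dim driver vector and each of M identifying vectors.

    Assumes all vectors have non-zero norm.
    """
    dots = identifying_vectors @ driver_vector               # shape (M,)
    norms = (np.linalg.norm(identifying_vectors, axis=1)
             * np.linalg.norm(driver_vector))
    return dots / norms

# Toy example with T = 2 and M = 3.
driver = np.array([1.0, 0.0])
identifying = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cosines = cosine_similarities(driver, identifying)
# cosines is approximately [1.0, 0.0, 0.7071]
```

Applying the function to every row of the driver data matrix yields one M-dimensional vector of cosines per driver.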

For each driver vector A and the M identifying vectors V, the cosines of the M included angles can be obtained in the same way and expressed as $\cos\theta_1, \cos\theta_2, \ldots, \cos\theta_M$. After normalization, the correlation indexes are obtained. The correlation index can be calculated by formula (4), where $C_i$ represents the correlation index corresponding to each driver vector, and $\cos\theta_i$ represents the cosine of the included angle between each driver vector and the identifying vector.

From the above derivation, it can be seen that the smaller the angle between the driver vector and the identifying vector, the larger the cosine $\cos\theta_i$ of the angle, the larger the normalized correlation index $C_i$, and the greater the similarity between the driver vector and the identifying vector.

Step 108: the plurality of users are classified according to the relevance indicators of the plurality of sets of user data.

In some embodiments, step 108 may be performed by classification module 406.

In some embodiments, the plurality of users may be classified using a clustering algorithm. Typical clustering algorithms may include K-Means clustering, AP (Affinity Propagation) clustering, Mean-shift clustering, spectral clustering, hierarchical clustering (including the bottom-up agglomerative method AGNES (Agglomerative Nesting) and the top-down divisive method DIANA (Divisive Analysis)), and the like, or any combination thereof.

In some embodiments, the clustering may be performed using a K-Means clustering algorithm. The M similarity indexes corresponding to each driver form an M-dimensional similarity index vector for that driver, so N drivers have N M-dimensional similarity index vectors. Clustering these N similarity index vectors with the K-Means clustering algorithm classifies the corresponding drivers.
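A minimal sketch of clustering the index vectors is given below. It implements plain Lloyd-style K-Means in NumPy rather than any particular library routine; the function name `kmeans`, the random initialization, the iteration cap and the toy data are all assumptions made for illustration and are not specified in the application.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centers) for an (N, M) array."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every center, shape (N, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its points (keep it if empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy index vectors for N = 6 drivers with M = 2 indexes each.
index_vectors = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans(index_vectors, k=2)
# The first three drivers share one label and the last three share the other.
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.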

The method of clustering using the K-Means clustering algorithm can be referred to the description of other parts of the disclosure (e.g., relevant to step 304 in FIG. 3).

Fig. 2 is an exemplary flow diagram illustrating the selection of an identifying vector according to some embodiments of the present application.

As shown in fig. 2, the process 200 of selecting an identifying vector may include:

Step 202: the plurality of groups of user data are preprocessed to obtain a user data matrix.

In some embodiments, step 202 may be performed by a matrix generation unit in the determination module 404.

In some embodiments, taking the departure data of multiple drivers as an example, for each driver, the departure data in multiple (e.g., P) preset time ranges is obtained, and for each time range, the departure data may be divided into multiple (e.g., T) preset time periods. The driver departure data can now be represented by the following vector:

$$[a_{i1}, \ldots, a_{it}, \ldots, a_{iT}]$$

where $i$ denotes driver i, $t$ denotes the t-th preset time period, and $a_{it}$ denotes the departure duration of driver i in the t-th preset time period.

In some embodiments, the preset time range may include at least one of a day, a month, a quarter, a half year, or a year. The preset time period may include at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, half a month, or one month. For example, to obtain driver i's departure data over one week, the preset time range may be one day and the preset time period may be 10 minutes, in which case P equals 7 and T equals 144. Suppose the driver is out driving from 8 a.m. to 5 p.m. Correspondingly, the departure data for each preset time period (10 minutes) in the preset time range (one day) can be expressed as:

$$[a_{i1}, \ldots, a_{it}, \ldots, a_{i144}]$$

where $a_{i1}$ to $a_{i48}$ are 0, $a_{i49}$ to $a_{i102}$ are 10 minutes, and $a_{i103}$ to $a_{i144}$ are 0.
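The 8 a.m. to 5 p.m. example can be reproduced with a short sketch. The helper name `shift_vector` is hypothetical, and the sketch assumes the driver is out for the full 10 minutes of every period in the shift. Array indices are 0-based, so elements 48 to 101 of the array correspond to periods 49 to 102 in the text.

```python
import numpy as np

PERIOD_MINUTES = 10
T = 24 * 60 // PERIOD_MINUTES   # 144 periods per day

def shift_vector(start_hour, end_hour):
    """Daily departure vector for a driver out from start_hour to end_hour."""
    vec = np.zeros(T)
    first = start_hour * 60 // PERIOD_MINUTES   # 0-based index of first period
    last = end_hour * 60 // PERIOD_MINUTES      # exclusive end index
    vec[first:last] = PERIOD_MINUTES
    return vec

v = shift_vector(8, 17)
print(int(np.count_nonzero(v)), v.sum())  # 54 540.0
```

The 54 non-zero periods add up to 540 minutes, i.e. the nine hours between 8 a.m. and 5 p.m.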

In some embodiments, driver departure data for each preset time period in each preset time range may be obtained over P cycles. In that case, the departure duration $a_{it}$ of driver i in the t-th preset time period may be the average over the P cycles. For example, suppose the preset time range is 1 day and the preset time period is 10 minutes. If driver i is out driving for the whole preset time period from 8:00 to 8:10 in the morning from Monday to Saturday, and does not go out during that period on Sunday, then $a_{it} = \frac{6 \times 10 + 1 \times 0}{7} \approx 8.57$ minutes.
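Averaging over P cycles amounts to a column-wise mean of a P × T array of daily vectors. The sketch below assumes, for illustration, that the driver is out for the full 10 minutes of the 8:00 to 8:10 period on six days and not at all on the seventh; the function name `average_over_cycles` is hypothetical.

```python
import numpy as np

def average_over_cycles(daily_vectors):
    """Average a (P, T) array of per-cycle departure vectors into one T-dim vector."""
    return np.asarray(daily_vectors, dtype=float).mean(axis=0)

P, T = 7, 144
week = np.zeros((P, T))
week[:6, 48] = 10.0            # Monday to Saturday, period starting at 8:00
avg = average_over_cycles(week)
print(round(avg[48], 2))       # 8.57
```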

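The departure data of the N drivers are processed to obtain an N × T driver departure data matrix. The data matrix may be represented as:

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1t} & \cdots & a_{1T} \\ \vdots & & \vdots & & \vdots \\ a_{i1} & \cdots & a_{it} & \cdots & a_{iT} \\ \vdots & & \vdots & & \vdots \\ a_{N1} & \cdots & a_{Nt} & \cdots & a_{NT} \end{bmatrix}$$

where each row represents the departure durations of the same driver in the T different preset time periods, and each column represents the departure durations of different drivers in the same preset time period.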

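Further, the matrix may be normalized by dividing each value in a row of the matrix A by the sum of the elements of that row, so that each row sums to 1 and each value in a row represents the distribution of that driver's departure duration over the preset time periods. The normalized matrix can be represented as:

$$\begin{bmatrix} a_{11}/a_1 & \cdots & a_{1t}/a_1 & \cdots & a_{1T}/a_1 \\ \vdots & & \vdots & & \vdots \\ a_{N1}/a_N & \cdots & a_{Nt}/a_N & \cdots & a_{NT}/a_N \end{bmatrix}, \qquad a_n = \sum_{t=1}^{T} a_{nt}$$

where $a_n$ (for example, $a_N$ for driver N) represents the sum of the departure durations of driver n over all the preset time periods.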
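Row normalization can be sketched as follows. The function name `normalize_rows` is an assumption, and the handling of drivers with no departures at all (rows that sum to zero are left as zeros) is a choice made for this example; the application does not say how such rows are treated.

```python
import numpy as np

def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    matrix = np.asarray(matrix, dtype=float)
    row_sums = matrix.sum(axis=1, keepdims=True)
    # Assumption: rows that sum to 0 are left unchanged to avoid division by zero.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return matrix / safe_sums

A = np.array([[10.0, 10.0, 0.0],
              [0.0, 5.0, 15.0],
              [0.0, 0.0, 0.0]])
A_norm = normalize_rows(A)
print(A_norm[0])  # [0.5 0.5 0. ]
```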

Step 204: singular value decomposition is performed on the user data matrix.

In some embodiments, step 204 may be performed by a decomposition unit in determination module 404.

In some embodiments, after preprocessing the plurality of user data to obtain a user data matrix, Singular Value Decomposition (SVD) may be performed on the user data matrix. Since the user data matrix A holds the data of N drivers over T preset time periods, A is an N × T matrix. The matrix A may be split into three matrices U, S and V, i.e., A = U · S · V. Here U is the left singular matrix, an N × N square matrix whose column vectors may be referred to as left singular vectors. In some embodiments, the left singular vectors may be mutually orthogonal, i.e., the dot product of any two different left singular vectors is 0. V is the right singular matrix, a T × T square matrix, and the row vector of each row may be referred to as a right singular vector. In some embodiments, the right singular vectors may also be mutually orthogonal, and they may abstract the behavior patterns of the drivers' departure periods and durations. S is the singular value matrix, an N × T matrix whose diagonal elements are the singular values and whose off-diagonal elements are all 0; the singular values are arranged from largest to smallest from top to bottom.

Step 206: selecting at least one singular vector as the at least one identifying vector.

In some embodiments, step 206 may be performed by an identifying vector determination unit in determination module 404.

In some embodiments, the determination module 404 may select the first M right singular vectors as the identifying vectors. By way of example only, the number of identifying vectors may be determined by calculating the ratio of the sum of squares of the first M singular values to the sum of squares of all singular values, and comparing the ratio to a preset ratio threshold. For example, when the ratio for the first M-1 singular values is less than the ratio threshold and the ratio for the first M singular values is greater than the ratio threshold, the determining module 404 may determine that the number of identifying vectors is M. For example, suppose the singular value matrix S has four singular values, 4, 2, 2, and 1, and the preset ratio threshold is 90%. The sum of squares of the first 2 singular values accounts for 80% of the sum of squares of all 4 singular values, which is less than the 90% threshold. The sum of squares of the first 3 singular values accounts for 96% of the sum of squares of all 4 singular values, which is greater than the 90% threshold. In this case, the determining module 404 may determine that the number of identifying vectors is 3.
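The threshold rule can be sketched as a cumulative-ratio search. The function name `choose_num_vectors` is hypothetical. The illustrative input 4, 2, 2, 1 is chosen because it reproduces the 80% and 96% ratios stated above (squares 16, 4, 4, 1 sum to 25, so the first two give 20/25 and the first three give 24/25).

```python
import numpy as np

def choose_num_vectors(singular_values, threshold):
    """Smallest M whose first-M squared singular values exceed `threshold`
    as a fraction of the total; falls back to all values if none does."""
    energy = np.square(np.asarray(singular_values, dtype=float))
    cumulative = np.cumsum(energy) / energy.sum()
    exceeded = cumulative > threshold
    if not exceeded.any():
        return len(energy)
    return int(np.argmax(exceeded)) + 1   # argmax gives the first True

M = choose_num_vectors([4, 2, 2, 1], threshold=0.9)
print(M)  # 3
```

The first M rows of the right singular matrix would then be taken as the identifying vectors.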

In some embodiments, the number of identifying vectors to select may also be obtained by observing the distribution of each row of the right singular matrix V. Generally, the first M right singular vectors, which cover the information of the source data, exhibit periodicity, while the right singular vectors after the first M rows exhibit a random distribution of high-amplitude oscillations. For example, as shown in FIG. 5, the right singular matrix V contains 6 right singular vectors V₀, V₁, V₂, V₃, V₄, and V₅. The right singular vectors V₀, V₁, V₂, and V₃ exhibit significant periodicity, while the right singular vectors V₄ and V₅ appear as a random distribution of high-amplitude oscillations. Therefore, the number of identifying vectors is determined to be 4, and the first 4 right singular vectors are selected as the identifying vectors.

It should be noted that the above description of flow 200 is only for illustration and explanation, and does not limit the scope of application of the present application. Various modifications and changes to flow 200 may be made by those skilled in the art under the guidance of the present application. However, such modifications and changes remain within the scope of the present application. For example, step 202 and step 204 may be combined into one step. As another example, a pre-filtering step may be included before step 202, which may remove, according to certain rules, clearly unreasonable noisy user data, such as data of drivers who are online but not taking orders.

FIG. 3 is an exemplary flow chart illustrating classification using a clustering algorithm according to some embodiments of the present application.

As shown in FIG. 3, the flow 300 for classification using a clustering algorithm may include:

Step 302: determining a number of classifications based on the plurality of sets of user data.

In some embodiments, step 302 may be performed by a classification number determination unit in the classification module 406.

In some embodiments, the number of classifications K may be determined by one or more of a root mean square algorithm, an elbow method, a silhouette coefficient method, or a graphical method. For example, the number of classifications K may be determined using the root mean square algorithm. Because there are N similarity index vectors,

$$K = \sqrt{N/2}$$

For example, when N is 200, K is 10. If $\sqrt{N/2}$ is not an integer, K is the largest integer less than $\sqrt{N/2}$. For example, when N = 500, $\sqrt{N/2}$ is approximately 15.81, so K is 15.
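A small sketch of this rule, assuming it takes the form K = ⌊√(N/2)⌋, which is consistent with both numerical examples above (N = 200 gives 10, N = 500 gives 15). The function name `root_mean_square_k` is hypothetical.

```python
import math


def root_mean_square_k(n_users):
    """Number of classifications K as the integer part of sqrt(N / 2)."""
    return math.floor(math.sqrt(n_users / 2))


print(root_mean_square_k(200))  # 10
print(root_mean_square_k(500))  # 15
```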

In some embodiments, the number of classifications K may be determined using the elbow method. The criterion used by the elbow method to evaluate the number of classifications K is the sum of squared errors (SSE), which can be calculated by equation (5):

$$SSE = \sum_{i=1}^{K} \sum_{p \in B_i} |p - m_i|^2 \tag{5}$$

where $B_i$ represents the i-th cluster, p is a sample point in cluster $B_i$, and $m_i$ is the centroid of cluster $B_i$.

Starting from K = 1, as K increases, the samples are divided more finely and each cluster becomes more tightly aggregated, so the sum of squared errors SSE naturally decreases. While K is smaller than the optimal number of classifications, each increase in K greatly improves the aggregation of the clusters, so SSE drops sharply. Once K reaches the optimal number of classifications, the gain in aggregation from further increasing K diminishes quickly, so the drop in SSE shrinks abruptly and then levels off as K continues to increase. The plot of SSE against K therefore has the shape of an elbow, and the K value at the elbow (the turning point) is the optimal number of classifications for the data. The number of classifications K can be determined on this basis using the elbow method.
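To make the elbow concrete, the following sketch computes the sum of squared errors for a toy one-dimensional data set that has two obvious groups. The partitions for K = 1, 2, 3 are specified by hand rather than produced by a clustering run, and the helper name `sse` is hypothetical.

```python
def sse(clusters):
    """Sum of squared distances from every point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        for p in points:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total


# Hand-made partitions of the points 0, 1, 10, 11 for K = 1, 2, 3.
partitions = {
    1: [[(0.0,), (1.0,), (10.0,), (11.0,)]],
    2: [[(0.0,), (1.0,)], [(10.0,), (11.0,)]],
    3: [[(0.0,), (1.0,)], [(10.0,)], [(11.0,)]],
}
sse_by_k = {k: sse(clusters) for k, clusters in partitions.items()}
print(sse_by_k)  # {1: 101.0, 2: 1.0, 3: 0.5}
```

SSE falls by 100 when K goes from 1 to 2 but only by 0.5 from 2 to 3, so the turning point in this toy example is at K = 2.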

Step 304: clustering the relevance indicators of the plurality of sets of user data, thereby classifying the users.

In some embodiments, step 304 may be performed by a clustering unit in the classification module 406.

In some embodiments, the classification module 406 may cluster the relevance indicators of the plurality of sets of user data using a clustering algorithm. Typical clustering algorithms may include K-Means clustering, affinity propagation (AP) clustering, Mean-shift clustering, spectral clustering, hierarchical clustering (including the bottom-up AGNES (agglomerative nesting) method and the top-down DIANA (divisive analysis) method), and the like, or any combination thereof.

In some embodiments, the clustering may be performed using a K-Means clustering algorithm. For each driver, the corresponding M similarity indexes can be represented by an M-dimensional similarity index vector. The similarity index vector of the i-th driver can be expressed as:

$$C_i = [C_{i,1}, \ldots, C_{i,m}, \ldots, C_{i,M}]$$

where $C_{i,m}$ represents the similarity index of the i-th driver relative to the m-th identifying vector.

N drivers have N similarity index vectors $C_i$. By clustering the N similarity index vectors with the K-Means clustering algorithm, the corresponding drivers can be classified.

Specifically, after the number of classifications K is determined, K similarity index vectors may be randomly selected from the N similarity index vectors as initial centers. The distances from each of the other N−K similarity index vectors to the K centers are then calculated. The distance may be determined according to equation (6):

$$Dist(C_i, Center_K) = \sqrt{\sum_{m=1}^{M} (C_{i,m} - Center_{K,m})^2} \tag{6}$$

where $C_{i,m}$ represents the m-th element of the vector $C_i$, and $Center_{K,m}$ represents the m-th element of $Center_K$.

For each driver's similarity index vector $C_i$, K distances $Dist(C_i, Center_K)$ are calculated, representing its distances to the K different centers. The K distances are compared, and $C_i$ is assigned to the same class as the center $Center_K$ nearest to it. In this way, all the similarity index vectors may be grouped into K class clusters.

For each of the K class clusters, the center of the cluster is then updated. The updated center can be determined by equation (7):

$$Center_K = \frac{1}{|C_K|} \sum_{C_i \in C_K} C_i \tag{7}$$

where $C_K$ represents the K-th class cluster, $|C_K|$ represents the number of data objects in the K-th class cluster, and the summation is taken over all elements of class cluster $C_K$ on each attribute column. Hence $Center_K$ is also a vector containing M attributes, expressed as:

$$Center_K = (Center_{K,1}, Center_{K,2}, \ldots, Center_{K,M})$$

The above operations are iterated, and the iteration stops when the number of iterations exceeds a preset number, when the cluster centers no longer change, or when the loss function falls below a preset threshold. At this point, every similarity index vector has been assigned to one of the K class clusters, which achieves the purpose of classifying the users.
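The K-Means procedure described above (random initial centers, nearest-center assignment, center update by per-attribute mean, and iteration until the centers stop changing or an iteration cap is reached) can be sketched in plain Python. This is an illustrative sketch, not the patented implementation: it assumes Euclidean distance, the names `k_means`, `max_iter`, and `seed` are hypothetical, and the loss-function stopping criterion is omitted for brevity.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Randomly pick K of the N similarity index vectors as initial centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector joins the class of its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(v, centers[j]))
                  for v in vectors]
        # Update step: each center becomes the per-attribute mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(v[d] for v in members) / len(members) for d in range(dim)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return labels, centers


# Four made-up 2-dimensional similarity index vectors forming two groups.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = k_means(vectors, k=2)
```

In this toy run the first two vectors end up in one class cluster and the last two in the other, with centers at (0, 0.5) and (10, 10.5); which cluster receives label 0 depends on the random initial centers.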

It should be noted that the above description related to the flow 300 is only for illustration and explanation, and does not limit the applicable scope of the present application. Various modifications and changes to flow 300 will be apparent to those skilled in the art in light of this disclosure. However, such modifications and variations are intended to be within the scope of the present application. For example, the number of classifications K in step 302 may be preset empirically.

FIG. 4 is a block diagram of a user classification system according to some embodiments of the present application.

As shown in FIG. 4, the user classification system may include an acquisition module 402, a determination module 404, a classification module 406, and a storage module 408.

The acquisition module 402 may be used to acquire multiple sets of user data.

The determination module 404 may be configured to generate at least one identifying vector based on the plurality of sets of user data; wherein each identifying vector represents a data distribution type. The determining module 404 may be further configured to determine a relevance indicator for each set of user data, the relevance indicator reflecting a relevance between the set of user data and each of the at least one identifying vector.

In some embodiments, the determination module 404 may include a matrix generation unit, a decomposition unit, and an identifying vector determination unit. The matrix generation unit may be configured to preprocess the plurality of sets of user data to obtain a user data matrix. The decomposition unit may be configured to perform singular value decomposition on the user data matrix. The identifying vector determination unit may be configured to select at least one singular vector as the at least one identifying vector. The identifying vector determination unit may be configured to select, from the singular values obtained by the decomposition, a plurality of singular values for which the ratio of their sum of squares to the sum of squares of all singular values is greater than a preset threshold, and to use the right singular vectors corresponding to the selected singular values as the identifying vectors. The identifying vector determination unit may be further configured to select, from the right singular matrix obtained by the decomposition, right singular vectors whose elements are periodically distributed as the identifying vectors.

The classification module 406 may be used to classify a plurality of users according to relevance indicators for a plurality of sets of user data. The classification module 406 may also be configured to perform clustering operation on the correlation indexes of the multiple sets of user data, so as to classify the users.

In some embodiments, the classification module 406 may include a classification number determination unit and a clustering unit. The classification number determination unit may be configured to determine a classification number K based on the plurality of sets of user data. The clustering unit may be configured to perform clustering operation on the correlation indexes of the multiple sets of user data, and further classify users into K classes.

The storage module 408 may be used to store service request related data, control parameters, processed service request related data, and the like, or any combination thereof. In some embodiments, the storage module 408 may store one or more programs and/or instructions that may be executed by a processor to implement the example methods described herein. For example, the storage module 408 may store programs and/or instructions that may be executed by a processor to acquire a plurality of sets of user data and classify the users based on the plurality of sets of user data.

It should be understood that the system and its modules shown in FIG. 4 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD- or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules of the present application may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of the above hardware circuits and software (e.g., firmware).

It should be noted that the above description of the user classification system and its modules is only for convenience of description, and is not intended to limit the present application to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, having understood the principle of the system, the modules may be combined arbitrarily, or a subsystem may be formed and connected to other modules, without departing from that principle. For example, in some embodiments, the acquisition module 402, the determination module 404, the classification module 406, and the storage module 408 disclosed in FIG. 4 may be different modules in one system, or a single module may implement the functions of two or more of the modules described above. For example, the determination module 404 and the classification module 406 may be two modules, or one module may have both the determining and classifying functions. For example, the modules may share one storage module, or each module may have its own storage module. Such variations are within the scope of the present application.

The beneficial effects that may be brought by the embodiments of the present application include, but are not limited to: (1) the data are classified after being filtered, so that massive data can be classified effectively; (2) users can be flexibly stratified along different time dimensions; and (3) dimensionality reduction of the data is achieved by singular value decomposition, and, combined with a clustering algorithm, users can be classified more efficiently and directly. It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or a combination of the above advantages may be produced, or any other advantages may be obtained.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be considered merely illustrative and not restrictive of the broad application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.

Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.

Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereon. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.

The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.

Computer program code required for the operation of various portions of the present application may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages, and the like. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as software as a service (SaaS).

Additionally, the order in which elements and sequences of the processes described herein are processed, the use of alphanumeric characters, or the use of other designations, is not intended to limit the order of the processes and methods described herein, unless explicitly claimed. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

The entire contents of each patent, patent application publication, and other material cited in this application, such as articles, books, specifications, publications, and documents, are hereby incorporated by reference into this application, except for application history documents that are inconsistent with or in conflict with the content of this application, and except for documents (currently or later attached to this application) that limit the broadest scope of the claims of this application. It is noted that if the descriptions, definitions, and/or use of terms in the material attached to this application are inconsistent with or contrary to those stated in this application, the descriptions, definitions, and/or use of terms in this application shall control.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present application. Other variations are also possible within the scope of the present application. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the present application can be viewed as being consistent with the teachings of the present application. Accordingly, the embodiments of the present application are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method for classifying a user, comprising:

acquiring a plurality of groups of user data;

generating at least one identifying vector based on the plurality of sets of user data; wherein each identifying vector represents a data distribution type;

determining a relevance indicator for each set of user data, the relevance indicator reflecting a relevance between the set of user data and the at least one identifying vector;

and classifying the plurality of users according to the relevance indexes of the plurality of groups of user data.

2. The method of claim 1, wherein generating at least one identifying vector based on the plurality of sets of user data comprises:

preprocessing the multiple groups of user data to obtain a user data matrix;

performing singular value decomposition on the user data matrix;

selecting at least one singular vector as the at least one identifying vector.

3. The method of claim 2, wherein selecting at least one singular vector as the at least one identifying vector comprises:

selecting a plurality of singular values of which the ratio of the sum of squares to the sum of squares of all singular values is greater than a preset threshold value from the singular values obtained by decomposition;

and taking singular vectors corresponding to the singular values as the identification vectors.

4. The method of claim 2, wherein selecting at least one singular vector as the at least one identifying vector comprises:

and selecting singular vectors with elements in periodic distribution from the singular matrix obtained by decomposition as the identification vectors.

5. The method of claim 1, wherein the relevance indicator reflects a similarity between user data and an identifying vector.

6. The method of claim 1, wherein classifying users according to relevance indicators of multiple sets of user data comprises:

and clustering the correlation indexes of the multiple groups of user data, and further classifying the users.

7. The method of claim 6, further comprising:

determining a number of classifications K based on the plurality of sets of user data;

and clustering the correlation indexes of the multiple groups of user data, and further classifying the users into K classes.

8. The method of claim 1, wherein the user data reflects a status of the user at different preset time periods within a preset time range; wherein the preset time range includes a plurality of the preset time periods.

9. The method of claim 8, wherein the user comprises a driver; the user data reflects the departure time of the driver in different preset time periods within the preset time range.

10. The method of claim 8, wherein the preset time range comprises at least one of a day, a month, a quarter, a half year, or a year.

11. The method of claim 8, wherein the preset period of time comprises at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, one half month, or one month.

12. A user classification system, comprising an acquisition module, a determination module, and a classification module;

the acquisition module is used for acquiring a plurality of groups of user data;

the determination module is configured to generate at least one identifying vector based on the plurality of sets of user data; wherein each identifying vector represents a data distribution type; and determining a relevance indicator for each set of user data, the relevance indicator reflecting a relevance between the set of user data and the at least one identifying vector; and

the classification module is used for classifying a plurality of users according to the relevance indexes of a plurality of groups of user data.

13. The system of claim 12, wherein the determination module comprises a matrix generation unit, a decomposition unit, and an identifying vector determination unit;

the matrix generation unit is used for preprocessing the multiple groups of user data to obtain a user data matrix;

the decomposition unit is used for carrying out singular value decomposition on the user data matrix; and

the identifying vector determining unit is configured to select at least one singular vector as the at least one identifying vector.

14. The system according to claim 13, wherein the identification vector determination unit is further configured to select, among the singular values obtained by the decomposition, a plurality of singular values whose ratio of the sum of squares to the sum of squares of all singular values is greater than a preset threshold; and taking singular vectors corresponding to the singular values as the identification vectors.

15. The system according to claim 13, wherein the identifying vector determining unit is further configured to select singular vectors with periodically distributed elements as the identifying vectors in the singular matrix obtained by the decomposition.

16. The system of claim 12, wherein the relevance indicator reflects a similarity between the user data and the identifying vector.

17. The system of claim 12, wherein the classification module is further configured to perform a clustering operation on the relevance indicators of the plurality of sets of user data, so as to classify the users.

18. The system of claim 17, wherein the classification module comprises: a classification number determining unit and a clustering unit;

the classification number determining unit is used for determining a classification number K based on the plurality of groups of user data; and

the clustering unit is used for carrying out clustering operation on the correlation indexes of the multiple groups of user data so as to divide the users into K classes.

19. The system of claim 12, wherein the user data reflects a status of the user at different preset time periods within a preset time range; wherein the preset time range includes a plurality of the preset time periods.

20. The system of claim 19, wherein the user comprises a driver; the user data reflects the departure time of the driver in different preset time periods within the preset time range.

21. The system of claim 19, wherein the preset time range comprises at least one of a day, a month, a quarter, a half year, or a year.

22. The system of claim 19, wherein the preset period of time comprises at least one of ten minutes, twenty minutes, half an hour, one hour, six hours, twelve hours, one day, one week, one half month, or one month.

23. A computer-readable storage medium storing computer instructions which, when executed, implement a user classification method according to any one of claims 1 to 11.

24. A user classification apparatus comprising at least one processor and at least one storage medium;

the at least one storage medium is configured to store computer instructions;

the at least one processor is configured to execute the computer instructions to implement the user classification method of any of claims 1-11.