CN105320702B - A kind of analysis method of user behavior data, device and smart television - Google Patents
Publication number: CN105320702B
Application number: CN201410380588.8A
Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The present invention belongs to the technical field of data processing and provides a method and a device for analyzing user behavior data, and a smart television. The method includes: first establishing a user behavior data sample, then clustering the established sample so that users with similar behavior data are grouped into one cluster, forming a similar user group. Because the users in a similar user group generally share the same preferences, the videos watched, websites browsed, or articles purchased by users similar to the current user can be recommended to the current user, which provides better personalized service and improves the user experience.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method and a device for analyzing user behavior data and a smart television.
Background
At present, the market share of smart televisions is increasing year by year, the ways users watch and use smart televisions are becoming more personalized and diverse, and applications and tools based on the smart television are flourishing.
However, existing smart television applications and tools cannot analyze user behavior data accurately, promptly, and efficiently to understand how users behave, and therefore cannot obtain the similarity between users in a user group.
Disclosure of Invention
The embodiments of the invention provide a method and a device for analyzing user behavior data and a smart television, aiming to solve the problem in the prior art that the similarity between users in a user group cannot be obtained from the user behavior data of the smart television.
In one aspect, a method for analyzing user behavior data is provided, where the method includes:
step A, establishing a user behavior data sample;
step B, selecting behavior data of k users from the user behavior data sample, and taking the behavior data of the k users as respective centers of k clusters;
step C, respectively calculating the dissimilarity degree of the behavior data of other users in the user behavior data sample and the centers of the k clusters, and respectively classifying the behavior data of the other users to the cluster with the lowest dissimilarity degree to obtain a clustering result;
step D, according to the clustering result, recalculating respective centers of the k clusters to obtain respective new centers of the k clusters;
and step E, respectively calculating the dissimilarity degree between the behavior data of all users in the user behavior data sample and the new centers of the k clusters, classifying the behavior data of each user into the cluster with the lowest dissimilarity degree to obtain a clustering result, and returning to step D until the clustering result no longer changes or the number of executions of step D reaches a preset number.
Further, the step B includes:
calculating distances between behavior data of users in the user behavior data samples;
calculating the average value of the distances to obtain the average value of the distance vectors of the distances among the behavior data of the user, wherein the average value of the distance vectors is the average value of the distance vectors of the kth point;
calculating the average value of the distance vector average value to obtain a distance average value;
calculating a deviation value between the distance vector average value and the distance average value according to the distance vector average value and the distance average value;
and if the deviation value meets a preset condition, calculating the behavior data of the user corresponding to the distance vector average value of the kth point, and taking the behavior data of the user corresponding to the distance vector average value of the kth point as the behavior data of the selected kth user.
Further, calculating the dissimilarity degree of the behavior data of the user with the respective centers of the k clusters, and classifying the behavior data of the user into the cluster with the lowest dissimilarity degree includes:
calculating Euclidean distances between behavior data of the user and respective centers of the k clusters;
and classifying the behavior data of the user into a cluster with the minimum Euclidean distance with the behavior data of the user.
Further, after the step E, the method further includes:
scanning the behavior data of all users in a specified cluster in the clustering result;
generating a frequent 1 item set to a frequent N item set according to the behavior data, and calculating the support degree of each item set in the frequent item sets, wherein only one item set exists in the frequent N item sets;
and calculating to obtain an association rule between the behavior data of the user according to the support degree of each item set in the frequent N item sets and the support degree of each item set from the frequent N-1 item set to the frequent 1 item set.
Further, if the deviation value satisfies a preset condition, calculating the behavior data of the user corresponding to the average value of the distance vectors of the kth point, and taking the behavior data of the user corresponding to the average value of the distance vectors of the kth point as the behavior data of the selected kth user specifically includes:
if the value of δ calculated by the formula $\delta = \lambda \cdot \frac{|\bar{d}_k - \bar{d}|}{\bar{d}}$ meets the preset condition, taking the distance vector average value $\bar{d}_k$ of the corresponding kth point as the behavior data of the selected kth user;
wherein $\bar{d}_k$ is the distance vector average value of the kth point, $\bar{d}$ is the distance average value, λ is the correction factor, and δ is the deviation value between the distance vector average value and the distance average value.
In another aspect, an apparatus for analyzing user behavior data is provided, the apparatus including:
the behavior data sample establishing unit is used for establishing a user behavior data sample;
a first cluster center determining unit, configured to select behavior data of k users from the user behavior data samples, and use the behavior data of the k users as respective centers of the k clusters;
the first clustering result generating unit is used for respectively calculating the dissimilarity degree of the behavior data of other users in the user behavior data sample and the centers of the k clusters, and classifying the behavior data of the other users to the cluster with the lowest dissimilarity degree to obtain a clustering result;
a second cluster center determining unit, configured to recalculate respective centers of the k clusters according to the clustering result to obtain respective new centers of the k clusters;
and a second clustering result generating unit, configured to calculate the dissimilarity degree between the behavior data of all users in the user behavior data sample and the respective new centers of the k clusters, classify the behavior data of each user into the cluster with the lowest dissimilarity degree to obtain a clustering result, and return to call the second cluster center determining unit until the clustering result no longer changes or the number of executions of step D reaches a preset number.
Further, the first cluster center determining unit includes:
a distance calculation module for calculating distances between the behavior data of the users in the user behavior data samples;
the distance vector average value calculation module is used for calculating the average value of the distances to obtain the distance vector average value of the distances among the behavior data of the user, and the distance vector average value is the distance vector average value of the kth point;
the distance average value calculating module is used for calculating the average value of the distance vector average value to obtain a distance average value;
the deviation value calculating module is used for calculating a deviation value between the distance vector average value and the distance average value according to the distance vector average value and the distance average value;
and the cluster center determining module is used for calculating the behavior data of the user corresponding to the average value of the distance vectors of the kth point if the deviation value meets a preset condition, and taking the behavior data of the user corresponding to the average value of the distance vectors of the kth point as the behavior data of the selected kth user.
Further, the first clustering result generating unit and the second clustering result generating unit each include:
a Euclidean distance calculation module for calculating the Euclidean distances between the behavior data of the user and the respective centers of the k clusters;
and the user classification module is used for classifying the behavior data of the user into a cluster with the minimum Euclidean distance from the behavior data of the user.
Further, the apparatus further comprises:
the behavior data scanning unit is used for scanning the behavior data of all users in a specified cluster in the clustering result;
a frequent item set and support degree generating unit, configured to generate frequent 1 item sets to frequent N item sets according to the behavior data, and calculate support degree of each item set in the frequent item sets, where there is only one item set in the frequent N item sets;
and the association rule generating unit is used for calculating and obtaining the association rule between the behavior data of the user according to the support degree of each item set in the frequent N item sets and the support degree of each item set from the frequent N-1 item set to the frequent 1 item set.
In still another aspect, a smart television is provided, where the smart television includes the apparatus for analyzing user behavior data as described above.
In the embodiment of the invention, users with similar behavior data are classified into a cluster by clustering user behavior data samples to form a similar user group. Because the users in the similar user group generally have the same preference, the videos watched by the users similar to the current user, websites browsed by the users or articles purchased by the users can be recommended to the current user, personalized services can be better provided for the users, and the use experience of the users is improved.
Drawings
Fig. 1 is a flowchart illustrating an implementation of a method for analyzing user behavior data according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a big data storage platform according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a clustering process of user behavior data according to an embodiment of the present invention;
Fig. 4 is a flowchart of an implementation of a method for analyzing user behavior data according to a second embodiment of the present invention;
Fig. 5 is a block diagram of a specific structure of an apparatus for analyzing user behavior data according to a third embodiment of the present invention;
Fig. 6 is a block diagram of a user behavior data analysis device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the embodiment of the invention, the user behavior data samples are established firstly, then the established user behavior data samples are clustered, and users with similar behavior data are classified into a cluster to form a similar user group.
The following detailed description of the implementation of the present invention is made with reference to specific embodiments:
Example One
Fig. 1 shows an implementation flow of a method for analyzing user behavior data according to an embodiment of the present invention. In the whole process, the smart television first establishes a user behavior data sample, then clusters the established sample so that users with similar behavior data are classified into one cluster, forming a plurality of similar user groups. The details are as follows:
In step S101, a user behavior data sample is established.
In the embodiment of the invention, the smart television first acquires the original behavior data of the user, then cleans, formats and organizes the original behavior data according to a pre-established data specification to form a new user behavior data sample that conforms to the specification, and finally creates data storage labels and a classification catalogue for the complete, conforming user behavior data sample and imports the sample into a big data storage platform.
The original behavior data are unordered, varied and unstructured, and some dirty data appear while the original behavior data are being collected. A data specification therefore needs to be established in advance, and the original behavior data are structured according to this specification.
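The cleaning and formatting step can be sketched as follows. This is only an illustration under stated assumptions: the patent does not define the data specification, so the field names (`user_id`, `video_id`, `watch_seconds`) and the validation rules below are hypothetical.

```python
from typing import Optional

# Hypothetical data specification: field name -> type converter.
# The patent does not list the fields; these are assumptions for illustration.
SPEC = {"user_id": str, "video_id": str, "watch_seconds": int}


def normalize(raw: dict) -> Optional[dict]:
    """Return a record that conforms to SPEC, or None if the raw record is dirty."""
    record = {}
    for field, convert in SPEC.items():
        value = raw.get(field)
        if value is None or str(value).strip() == "":
            return None  # missing or empty field: treat as dirty data
        try:
            record[field] = convert(str(value).strip())
        except ValueError:
            return None  # value cannot be converted to the specified type
    if record["watch_seconds"] < 0:
        return None  # a negative duration is not meaningful
    return record


def build_sample(raw_records: list) -> list:
    """Keep only the records that satisfy the specification."""
    cleaned = (normalize(r) for r in raw_records)
    return [r for r in cleaned if r is not None]


raw = [
    {"user_id": " u1 ", "video_id": "T1", "watch_seconds": "120"},
    {"user_id": "u2", "video_id": "", "watch_seconds": "30"},     # empty field
    {"user_id": "u3", "video_id": "T2", "watch_seconds": "abc"},  # bad number
]
sample = build_sample(raw)
```

Only the first raw record survives here; the other two are discarded as dirty data rather than repaired, which is one possible policy among several.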
The big data storage platform is shown in fig. 2 and includes a data storage service cluster, a metadata storage service cluster and an application server cluster.
The data storage service cluster is a loosely coupled set of nodes that cooperate to provide services externally. It offers high performance, high availability, and load balancing, eliminates single points of failure and performance bottlenecks, and has scale-out (horizontal) expansion capability, so that capacity and performance can grow linearly. The high availability of the data storage service cluster improves the availability of systems and applications.
The data storage service cluster provides transparent redundant processing through the data storage servers D_1_1, D_1_2, …, D_2_n shown in fig. 2, so that applications run without interruption. These servers jointly provide a unified service to clients, and each server providing the service is called a node. When a node is unavailable or cannot process a client request, the request is promptly transferred to another available node; this process is invisible and completely transparent to the client. The data storage service cluster thus improves system availability and continues to serve clients when a single node fails.
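The transparent failover described above might look roughly like the following sketch. The `make_node` and `dispatch` helpers and the `NodeUnavailable` exception are hypothetical names introduced for illustration; the patent does not describe the actual dispatch mechanism.

```python
class NodeUnavailable(Exception):
    """Raised by a node that cannot process a request (hypothetical)."""


def make_node(name, alive):
    """Build a stand-in for a data storage server such as D_1_1."""
    def handle(request):
        if not alive:
            raise NodeUnavailable(name)
        return f"{name} served {request}"
    return handle


def dispatch(request, nodes):
    """Try each node in turn; the client never sees which node failed."""
    for node in nodes:
        try:
            return node(request)
        except NodeUnavailable:
            continue  # transfer the request to the next available node
    raise RuntimeError("no data storage node available")


nodes = [make_node("D_1_1", alive=False), make_node("D_1_2", alive=True)]
result = dispatch("read block 7", nodes)
```

The client only calls `dispatch`, so the failure of `D_1_1` stays hidden behind the call.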
Each data storage server stores a certain number of replicas of a data file. Each replica is a complete copy of the original data. Through rack awareness, the replicas in the big data storage platform are stored in different racks, which effectively improves file availability and avoids data loss or unavailability caused by unpredictable dynamic factors, such as network disconnection or machine failure, at the nodes distributed across the racks.
Replica storage with rack awareness enabled can also improve system performance. By choosing the storage nodes for replicas sensibly and cooperating with the routing protocol, data can be accessed from a nearby node, which reduces access latency. In addition, the replica mechanism allows data requests to be distributed across different nodes and network paths so that other nodes share the load, which effectively relieves data hotspots and peaks in data access. For larger files, reading several replicas in parallel further spreads and balances node load, improves file reading efficiency, and improves the I/O performance of the system.
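One way to picture rack-aware replica placement is the sketch below. The round-robin policy, the `place_replicas` function, and the rack and node names are assumptions for illustration only; the patent does not specify the placement algorithm.

```python
def place_replicas(nodes_by_rack: dict, replicas: int) -> list:
    """Choose nodes for replicas, visiting racks round-robin so that
    replicas are spread over as many racks as possible."""
    total_nodes = sum(len(nodes) for nodes in nodes_by_rack.values())
    if replicas > total_nodes:
        raise ValueError("not enough nodes for the requested replica count")
    racks = sorted(nodes_by_rack)
    chosen = []
    depth = 0  # index of the node taken from each rack in the current round
    while len(chosen) < replicas:
        for rack in racks:
            nodes = nodes_by_rack[rack]
            if depth < len(nodes) and len(chosen) < replicas:
                chosen.append((rack, nodes[depth]))
        depth += 1
    return chosen


# Illustrative layout: two racks with two data storage servers each.
cluster = {"rack1": ["D_1_1", "D_1_2"], "rack2": ["D_2_1", "D_2_2"]}
placement = place_replicas(cluster, 3)
```

With three replicas and two racks, the first two replicas land on different racks, so losing one rack still leaves at least one replica reachable.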
In step S102, behavior data of k users are selected from the user behavior data samples, and the behavior data of the k users are respectively used as respective centers of k clusters.
In the embodiment of the invention, the smart television firstly obtains the user behavior data samples from the big data storage platform, then selects the behavior data of k users from the obtained user behavior data samples, and respectively uses the behavior data of the k users as respective centers of the k clusters.
Specifically, in the embodiment of the present invention, for the selection of the behavior data of k users, an algorithm of an electronic program coordinate system based on a time axis is adopted, and k time points and program lists corresponding to the k time points are selected as the behavior data of the k users.
The behavior data of the k users are selected by the following steps:
Step 1, calculating the distances between the behavior data of the users in the user behavior data sample.
Specifically, the distance $d_k$ between the behavior data of user i and user j is calculated, where $d_k$ satisfies the following formula:
$$d_k = d(\chi_i, \chi_j)$$
where $\chi_i$ and $\chi_j$ represent the behavior data of user i and user j respectively, k is […], and n is the number of users in the user behavior data sample.
Step 2, calculating the average value of the distances to obtain the distance vector average value of the distances between the behavior data of the users.
Specifically, averaging the distances $d_k$ between pairs of behavior data gives the distance vector average value $\bar{d}_k$ of the distances between the behavior data of the users.
Step 3, calculating the average value of the distance vector average values to obtain the distance average value.
Specifically, the distance average value $\bar{d}$ is calculated by the following formula:
$$\bar{d} = \frac{1}{n}\sum_{k=1}^{n}\bar{d}_k$$
where $\bar{d}_k$ is the distance vector average value of the kth point and $\bar{d}$ is the average of the distance vector average values of the n points.
Step 4, calculating a deviation value between the distance vector average value and the distance average value according to the distance vector average value and the distance average value.
Specifically, the deviation value δ is calculated by the following formula:
$$\delta = \lambda \cdot \frac{|\bar{d}_k - \bar{d}|}{\bar{d}}$$
where λ is a correction factor.
Step 5, if the deviation value meets a preset condition, calculating the behavior data of the user corresponding to the distance vector average value of the kth point, and taking the behavior data of the user corresponding to the distance vector average value of the kth point as the behavior data of the selected kth user.
Specifically, if the value of δ calculated by the formula $\delta = \lambda \cdot \frac{|\bar{d}_k - \bar{d}|}{\bar{d}}$ meets the preset condition, the distance vector average value $\bar{d}_k$ of the corresponding kth point is taken as the behavior data of the selected kth user.
Steps 1 to 5 are illustrated by the following example:
1. The distances from point P to the other points are 10, 262, 23, …, 17, respectively;
2. The average of these distances is calculated: $\bar{d}_P = 56$;
3. Steps 1 and 2 are repeated for the other points, whose distance vector averages are 32, 22, 23, …, 96, respectively;
4. The average of the results of step 3 is calculated: $\bar{d} = 88$;
5. With λ = 1.0 and the preset condition δ > 0.2, δ = 1.0 × |56 − 88| / 88 = 0.36, so δ meets the preset condition, point P is a selected point, and the distance vector average of point P is taken as the behavior data of the selected P-th user.
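The selection steps can be sketched in Python as follows. This is a sketch under assumptions rather than the patented procedure: the text does not preserve how the per-point average is normalized, so averaging over the other n − 1 points is an assumption, as is returning every point whose deviation exceeds the threshold. The function names are hypothetical.

```python
import math


def deviation(d_k: float, d_mean: float, lam: float = 1.0) -> float:
    """delta = lambda * |d_k - d_mean| / d_mean, as in the worked example."""
    return lam * abs(d_k - d_mean) / d_mean


def candidate_centers(points, lam=1.0, threshold=0.2):
    """Return the indices of points whose deviation exceeds the threshold."""
    n = len(points)
    # Steps 1-2: average distance from each point to every other point
    # (dividing by n - 1 is an assumption).
    mean_dist = [
        sum(math.dist(points[k], points[j]) for j in range(n) if j != k) / (n - 1)
        for k in range(n)
    ]
    # Step 3: average of the per-point averages.
    overall = sum(mean_dist) / n
    # Steps 4-5: keep points whose deviation meets the preset condition.
    return [k for k in range(n) if deviation(mean_dist[k], overall, lam) > threshold]


# The numbers from the worked example: d_P = 56, overall average 88.
example_delta = deviation(56, 88)

# Five evenly spaced points on a line (illustrative data).
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
selected = candidate_centers(points)
```

For the evenly spaced points, the per-point averages are 2.5, 1.75, 1.5, 1.75 and 2.5 with an overall average of 2.0, so the two end points and the middle point deviate by 0.25 and are selected, while the remaining two deviate by only 0.125.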
Compared with the random selection method in the prior art, the method for selecting the behavior data of the k users in this embodiment makes the whole clustering algorithm less likely to waste computation, and because the behavior data of the k users are determined accurately, the clustering result converges noticeably faster in the subsequent processing of the clustering algorithm.
In step S103, the dissimilarity degree between the behavior data of the other users in the user behavior data sample and the respective centers of the k clusters is respectively calculated, and the behavior data of the other users are respectively classified into the cluster with the lowest dissimilarity degree, so as to obtain a clustering result.
In the embodiment of the present invention, the detailed step of calculating the dissimilarity degree between the behavior data of the user and the center of each of the k clusters and classifying the behavior data of the user into the cluster with the lowest dissimilarity degree by the smart television includes:
Step 11, calculating the Euclidean distance between the behavior data of the user and the center of each of the k clusters.
Specifically, as shown in fig. 3, the user behavior data sample includes behavior data of the user a, the user B, the user C, the user D, and the user E, and behavior data of 2 users selected in step S102, the behavior data of the 2 users are respectively used as respective centers of the 2 clusters, and the dissimilarity degree between the behavior data of the user a, the user B, the user C, the user D, and the user E and the respective centers of the 2 clusters can be obtained by calculating the distances between the behavior data of the user a, the user B, the user C, the user D, and the user E and the respective centers of the 2 clusters.
The Euclidean distance is used to calculate the distances between the behavior data of user A, user B, user C, user D and user E and the respective centers of the 2 clusters, with the following formula:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} \bigl(x(i) - y(i)\bigr)^2}$$
where x(i) represents the i-th coordinate of the first point and y(i) represents the i-th coordinate of the second point.
In n-dimensional Euclidean space, each point x can be represented as (x(1), x(2), …, x(n)), where each x(i) (i = 1, 2, …, n) is a real number called the i-th coordinate of x, and d(x, y) denotes the Euclidean distance between point x and point y = (y(1), y(2), …, y(n)).
Step 12, classifying the behavior data of the user into the cluster with the minimum Euclidean distance to the behavior data of the user.
Specifically, after the distances between the behavior data of user A, user B, user C, user D and user E and the respective centers of the 2 clusters are calculated, the behavior data of each user are classified into the cluster whose center has the smallest Euclidean distance to them. For example, as shown in fig. 3, if step 11 shows that the behavior data of user A and user B are closer to the center of the upper right cluster, the behavior data of user A and user B are classified into the upper right cluster; if the behavior data of user C, user D and user E are closer to the center of the lower left cluster, they are classified into the lower left cluster.
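Steps 11 and 12 can be sketched as below. The two-dimensional coordinates for users A to E and for the two centers are invented for illustration; fig. 3 gives no numeric values, and real behavior data would have many more dimensions.

```python
import math


def assign_to_clusters(users: dict, centers: list) -> dict:
    """Map each user to the index of the center with the smallest
    Euclidean distance to that user's behavior vector."""
    return {
        name: min(range(len(centers)), key=lambda c: math.dist(vector, centers[c]))
        for name, vector in users.items()
    }


# Hypothetical behavior vectors (not taken from the patent).
users = {
    "A": (8.0, 9.0),
    "B": (9.0, 8.0),
    "C": (1.0, 2.0),
    "D": (2.0, 1.0),
    "E": (2.0, 2.0),
}
centers = [(9.0, 9.0), (1.0, 1.0)]  # index 0: upper right, index 1: lower left
assignment = assign_to_clusters(users, centers)
```

With these values, users A and B fall into cluster 0 and users C, D and E into cluster 1, mirroring the grouping described for fig. 3.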
In step S104, the respective centers of the k clusters are recalculated based on the clustering result, and new respective centers of the k clusters are obtained.
In the embodiment of the present invention, as shown in fig. 3, the new center of the upper right cluster and the new center of the lower left cluster are calculated according to the clustering result. Specifically, the new center of each cluster is the arithmetic mean of the behavior data of all users in that cluster.
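A minimal sketch of the center update, assuming every cluster has at least one member; the vectors and the assignment are the same invented values used for illustration and do not come from the patent.

```python
def recompute_centers(vectors: dict, assignment: dict, k: int) -> list:
    """Return the arithmetic mean of the member vectors of each cluster.
    Assumes no cluster is empty."""
    centers = []
    for c in range(k):
        members = [vectors[u] for u, cluster in assignment.items() if cluster == c]
        dim = len(members[0])
        centers.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        )
    return centers


# Hypothetical behavior vectors and an existing clustering result.
vectors = {
    "A": (8.0, 9.0),
    "B": (9.0, 8.0),
    "C": (1.0, 2.0),
    "D": (2.0, 1.0),
    "E": (2.0, 2.0),
}
assignment = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1}
new_centers = recompute_centers(vectors, assignment, k=2)
```

Note that the new center is generally not the behavior data of any actual user: here cluster 0 moves to (8.5, 8.5), the midpoint of A and B.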
In step S105, the dissimilarity degree between the behavior data of all users in the user behavior data sample and the new center of each of the k clusters is calculated, the behavior data of all users are classified into the cluster with the lowest dissimilarity degree, so as to obtain a clustering result, and the step S104 is returned until the clustering result is not changed any more or the number of times of execution of the step S104 reaches a preset number of times.
In the embodiment of the present invention, the execution process of steps S104 and S105 is schematically shown in fig. 3, and details are not repeated. And when the clustering result is not changed any more or the number of times of execution of the step S104 reaches a preset number of times, taking the obtained clustering result as a final behavior data classification result of the user.
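The iteration of steps S104 and S105 with its two stopping conditions can be sketched as follows. This is a simplified illustration, not the claimed procedure: `max_iterations` bounds the number of passes through the loop, an empty cluster keeps its previous center (an assumption, since the text does not cover that case), and the sample points are invented.

```python
import math


def kmeans(points, initial_centers, max_iterations=100):
    """Alternate assignment and center update until the clustering result
    stops changing or the iteration limit is reached."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to the nearest center (lowest dissimilarity).
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the clustering result no longer changes
        assignment = new_assignment
        # Recompute each center as the arithmetic mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers


# Hypothetical behavior vectors; the initial centers are two of the samples.
points = [(8.0, 9.0), (9.0, 8.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
final_assignment, final_centers = kmeans(points, [(8.0, 9.0), (2.0, 2.0)])
```

On this small sample the assignment is stable after the first update, so the loop exits on the second pass well before the iteration limit.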
In this embodiment, users with similar behavior data are classified into one cluster by clustering the user behavior data sample, forming a similar user group. Because the users in a similar user group generally share the same preferences, the videos watched, websites browsed, or articles purchased by users similar to the current user can be recommended to the current user, which provides better personalized service and improves the user experience. In particular, unlike the prior art, the behavior data of the k users are not selected at random, so the whole clustering algorithm is less likely to waste computation, and because the behavior data of the k users are determined accurately, the clustering result converges noticeably faster in the subsequent processing of the clustering algorithm.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by using a program to instruct relevant hardware, and the corresponding program may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk or optical disk.
Example Two
Fig. 4 shows an implementation flow of the method for analyzing user behavior data according to the second embodiment of the present invention. In the whole process, the smart television first establishes a user behavior data sample, then clusters the established sample so that users with similar behavior data are classified into one cluster to form a similar user group, and finally mines the previously undiscovered association relationships among the behavior data of the users within the same cluster, revealing the hidden association network contained in the behavior data. The details are as follows:
In step S401, a user behavior data sample is established.
In step S402, behavior data of k users are selected from the user behavior data samples, and the behavior data of the k users are respectively used as respective centers of k clusters.
In step S403, the dissimilarity degree between the behavior data of the other users in the user behavior data sample and the respective centers of the k clusters is respectively calculated, and the behavior data of the other users are respectively classified into the cluster with the lowest dissimilarity degree, so as to obtain a clustering result.
In step S404, the respective centers of the k clusters are recalculated according to the clustering result, and new respective centers of the k clusters are obtained.
In step S405, the dissimilarity degree between the behavior data of all users in the user behavior data sample and the new center of each of the k clusters is calculated, the behavior data of all users are classified into the cluster with the lowest dissimilarity degree, so as to obtain a clustering result, and the step S404 is returned until the clustering result is not changed any more or the number of times of execution of the step S404 reaches a preset number of times.
In step S406, the behavior data of all users in a specified cluster in the clustering result is scanned.
In the embodiment of the invention, the smart television scans the behavior data of all users in a specified cluster in the clustering result. For example, the behavior data of the user included in the scanned designated cluster is shown in table 1:
User record | Viewed video IDs |
R1 | T1,T2,T5 |
R2 | T2,T3 |
R3 | T2,T4 |
R4 | T1,T2,T4 |
R5 | T1,T3 |
R6 | T2,T3 |
R7 | T1,T3 |
R8 | T1,T2,T3,T5 |
R9 | T1,T2,T3 |
Table 1
In step S407, a frequent 1 item set to a frequent N item set are generated according to the behavior data, and the support of each frequent item set is calculated, where there is only one item set in the frequent N item set.
In the embodiment of the present invention, the number of occurrences of each behavior of the users in the designated cluster can be counted from the behavior data in table 1, and the different frequent item sets and their support degrees are then generated from these counts. For example, for the behavior data in table 1, a frequent 1 item set, a frequent 2 item set, a frequent 3 item set, and a frequent 4 item set may be generated. Here, each item set in the frequent 1 item set contains one item, each item set in the frequent 2 item set contains 2 items, and so on, so that each item set in the frequent N item set contains N items.
Specifically, the generated frequent 1 item set is as follows (item set: support degree):
[T1]: 6
[T2]: 7
[T3]: 6
[T4]: 2
[T5]: 2
The frequent 2 item set is as follows:
[T1,T2]: 4
[T1,T3]: 4
[T1,T5]: 2
[T2,T3]: 4
[T2,T4]: 2
[T2,T5]: 2
The frequent 3 item set is as follows:
[T1,T2,T3]: 2
[T1,T2,T5]: 2
The frequent 4 item set is as follows:
[T1,T2,T3,T5]: 1
If the frequent k item set contains only one item set, the frequent k+1 item set is no longer generated.
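The item set counts above can be reproduced from table 1 with a level-wise (Apriori-style) sketch such as the one below. The minimum support of 2 and the subset-pruning step are assumptions: the text states no threshold, and it additionally lists the four-item set [T1,T2,T3,T5] with a support degree of 1, which this sketch does not generate because one of its three-item subsets is not frequent under the assumed threshold.

```python
from itertools import combinations

# Viewing records from table 1.
TRANSACTIONS = {
    "R1": {"T1", "T2", "T5"},
    "R2": {"T2", "T3"},
    "R3": {"T2", "T4"},
    "R4": {"T1", "T2", "T4"},
    "R5": {"T1", "T3"},
    "R6": {"T2", "T3"},
    "R7": {"T1", "T3"},
    "R8": {"T1", "T2", "T3", "T5"},
    "R9": {"T1", "T2", "T3"},
}


def support(itemset, transactions):
    """Number of records that contain every item of the item set."""
    return sum(1 for items in transactions.values() if itemset <= items)


def frequent_itemsets(transactions, min_support=2):
    """Return one dict per level: {item set: support degree}."""
    items = sorted(set().union(*transactions.values()))
    candidates = [frozenset([i]) for i in items]
    levels = []
    while candidates:
        counted = {c: support(c, transactions) for c in candidates}
        frequent = {c: n for c, n in counted.items() if n >= min_support}
        if not frequent:
            break
        levels.append(frequent)
        size = len(levels) + 1
        # Join: unions of two frequent item sets that contain exactly `size` items.
        joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
        # Prune: every subset with size - 1 items must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        ]
    return levels


levels = frequent_itemsets(TRANSACTIONS)
```

Under these assumptions `levels[0]`, `levels[1]` and `levels[2]` hold the one-, two- and three-item sets with the same support degrees as listed above.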
In step S408, the association rules between the behavior data of the users are calculated according to the support degree of each item set in the frequent N item set and the support degree of each item set in the frequent N-1 item set down to the frequent 1 item set.
The support degree of an item set is the number of user records in which the item set occurs. For example, the item set [T1] in the frequent 1 item set appears 6 times in the behavior data shown in Table 1; therefore, the support degree of the item set [T1] is 6.
In the embodiment of the present invention, taking the item set [T1, T2, T5] in the frequent 3 item set as an example, its non-empty proper subsets are [T1, T2], [T1, T5], [T2, T5], [T1], [T2] and [T5]. The confidence of each rule is the support degree of [T1, T2, T5] divided by the support degree of the subset on the left-hand side of the rule:
[T1,T2] -> [T5] 2/4=50%
[T1,T5] -> [T2] 2/2=100%
[T2,T5] -> [T1] 2/2=100%
[T1] -> [T2,T5] 2/6=33%
[T2] -> [T1,T5] 2/7=29%
[T5] -> [T1,T2] 2/2=100%
If the preset minimum confidence threshold is 60%, the generated association rules are [T1, T5] -> [T2], [T2, T5] -> [T1] and [T5] -> [T1, T2].
An association rule between two events indicates that the two events are likely to occur together. For example, the rule [T1, T5] -> [T2] in the present embodiment indicates that when [T1, T5] occurs, [T2] is likely to occur as well.
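The confidence calculation for [T1, T2, T5] can be reproduced with a short sketch. The support counts are those listed for Table 1 and the 60% threshold is the one used in this example; the function name `rules` and the data layout are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Support counts taken from the frequent item set listings for Table 1.
support = {
    frozenset({"T1"}): 6,
    frozenset({"T2"}): 7,
    frozenset({"T5"}): 2,
    frozenset({"T1", "T2"}): 4,
    frozenset({"T1", "T5"}): 2,
    frozenset({"T2", "T5"}): 2,
    frozenset({"T1", "T2", "T5"}): 2,
}

def rules(itemset, support, min_confidence):
    """Return (antecedent, consequent, confidence) for each rule that meets the threshold."""
    itemset = frozenset(itemset)
    kept = []
    # Every non-empty proper subset of the item set is tried as the antecedent.
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            a = frozenset(antecedent)
            confidence = support[itemset] / support[a]
            if confidence >= min_confidence:
                kept.append((sorted(a), sorted(itemset - a), confidence))
    return kept

kept = rules({"T1", "T2", "T5"}, support, min_confidence=0.6)
for antecedent, consequent, confidence in kept:
    print(antecedent, "->", consequent, f"{confidence:.0%}")
```

The three rules kept are those whose confidence is 100%; the rules with confidence 50%, 33% and 29% fall below the threshold and are discarded.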
According to this embodiment, previously undiscovered association relationships among the behavior data of the similar users in the same cluster can be found, revealing the hidden association network contained in the behavior data. When a certain video is recommended to a user, other videos that form association rules with that video can also be recommended, which further improves the user experience.
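One way such rules might be used for recommendation is sketched below. The matching policy (a rule fires when the user has watched every video in its left-hand side) and the names `RULES` and `recommend` are assumptions for illustration; the source does not specify how the rules are applied.

```python
# Rules kept in the worked example: antecedent -> consequent.
RULES = [
    (frozenset({"T1", "T5"}), frozenset({"T2"})),
    (frozenset({"T2", "T5"}), frozenset({"T1"})),
    (frozenset({"T5"}), frozenset({"T1", "T2"})),
]

def recommend(watched, rules):
    """Videos implied by any rule whose antecedent the user has fully watched."""
    watched = set(watched)
    suggested = set()
    for antecedent, consequent in rules:
        if antecedent <= watched:
            # Do not suggest videos the user has already watched.
            suggested |= consequent - watched
    return sorted(suggested)

print(recommend({"T5"}, RULES))
print(recommend({"T1", "T5"}, RULES))
```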
Example three
Fig. 5 is a block diagram showing a specific configuration of an apparatus for analyzing user behavior data according to a third embodiment of the present invention, and only a part related to the third embodiment of the present invention is shown for convenience of description.
The apparatus may be a software unit, a hardware unit, or a combination of software and hardware units built into the smart television. The apparatus 5 includes: a behavior data sample establishing unit 51, a first cluster center determining unit 52, a first clustering result generating unit 53, a second cluster center determining unit 54, and a second clustering result generating unit 55.
The behavior data sample establishing unit 51 is configured to establish a user behavior data sample;
a first cluster center determining unit 52, configured to select behavior data of k users from the user behavior data samples, and use the behavior data of the k users as respective centers of the k clusters;
a first clustering result generating unit 53, configured to calculate the dissimilarity degree between the behavior data of other users in the user behavior data sample and the respective centers of the k clusters, and classify the behavior data of the other users into the cluster with the lowest dissimilarity degree, respectively, to obtain a clustering result;
a second cluster center determining unit 54, configured to recalculate respective centers of the k clusters according to the clustering result, to obtain respective new centers of the k clusters;
and a second clustering result generating unit 55, configured to calculate the dissimilarity degree between the behavior data of all users in the user behavior data sample and the respective new centers of the k clusters, classify the behavior data of all users into the cluster with the lowest dissimilarity degree to obtain a clustering result, and call the second cluster center determining unit 54 again until the clustering result no longer changes or the second cluster center determining unit 54 has been called a preset number of times.
Specifically, the first cluster center determining unit 52 includes: a distance calculation module, a distance vector average value calculation module, a distance average value calculation module, a deviation value calculation module and a cluster center determination module.
The distance calculation module is used for calculating the distances between the behavior data of the users in the user behavior data sample;
the distance vector average value calculation module is used for averaging the distances to obtain the distance vector average value of the distances between the behavior data of the users, this value being the distance vector average value of the kth point;
the distance average value calculating module is used for averaging the distance vector average values to obtain a distance average value;
the deviation value calculating module is used for calculating the deviation value between the distance vector average value and the distance average value;
and the cluster center determining module is used for, if the deviation value meets a preset condition, determining the behavior data of the user corresponding to the distance vector average value of the kth point and taking that behavior data as the behavior data of the selected kth user.
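The cooperation of these modules can be sketched as follows, under loudly stated assumptions. The sketch reads the "distance vector average value" of a point as the mean distance from that point to every other point, which is one possible reading of the text. The deviation formula and the preset condition are not given in this section, so they are passed in as caller-supplied functions; the `deviation` and `accept` lambdas and the value `LAMBDA` in the usage lines are purely hypothetical placeholders, not the formula of the invention.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distances(samples):
    """For each sample, the mean distance to every other sample (assumed reading)."""
    n = len(samples)
    return [
        sum(euclidean(samples[i], samples[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def select_centers(samples, k, deviation, accept):
    """Pick up to k samples whose deviation value satisfies the supplied condition."""
    per_point = mean_distances(samples)        # distance vector average value per point
    overall = sum(per_point) / len(per_point)  # distance average value
    chosen = []
    for sample, avg in zip(samples, per_point):
        if accept(deviation(avg, overall)):
            chosen.append(sample)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical placeholders: neither the formula nor the condition comes from the source.
LAMBDA = 1.0
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = select_centers(
    samples,
    k=2,
    deviation=lambda avg, overall: LAMBDA * abs(avg - overall),
    accept=lambda d: d <= 5.0,
)
```

Keeping `deviation` and `accept` as parameters leaves the sketch independent of the actual formula, which can be substituted once it is known.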
Specifically, the first clustering result generating unit 53 and the second clustering result generating unit 55 each include:
a Euclidean distance calculation module for calculating the Euclidean distances between the behavior data of the user and the respective centers of the k clusters;
and the user classification module is used for classifying the behavior data of the user into a cluster with the minimum Euclidean distance from the behavior data of the user.
The apparatus for analyzing user behavior data provided in this embodiment of the present invention can be applied to the corresponding method of the first embodiment. For details, refer to the description of the first embodiment, which is not repeated here.
Example four
Fig. 6 is a block diagram showing a specific configuration of an apparatus for analyzing user behavior data according to a fourth embodiment of the present invention, and only a part related to the fourth embodiment of the present invention is shown for convenience of description. The apparatus may be a software unit, a hardware unit or a combination of software and hardware units built in the smart television, the apparatus 6 includes the behavior data sample establishing unit 51, the first cluster center determining unit 52, the first clustering result generating unit 53, the second cluster center determining unit 54 and the second clustering result generating unit 55 described in the third embodiment, and further includes:
the behavior data scanning unit 61 is configured to scan behavior data of all users in a specified cluster in the clustering result;
a frequent item set and support degree generating unit 62, configured to generate a frequent 1 item set to a frequent N item set according to the behavior data, and calculate a support degree of each item set in the frequent item set, where there is only one item set in the frequent N item set;
and the association rule generating unit 63 is configured to calculate an association rule between the behavior data of the user according to the support degree of each item set in the frequent N item sets and the support degree of each item set from the frequent N-1 item set to the frequent 1 item set.
The apparatus for analyzing user behavior data provided in this embodiment of the present invention can be applied to the corresponding method of the second embodiment. For details, refer to the description of the second embodiment, which is not repeated here.
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. A method for analyzing user behavior data, the method comprising:
step A, establishing a user behavior data sample;
step B, selecting behavior data of k users from the user behavior data sample, and taking the behavior data of the k users as respective centers of k clusters;
step C, respectively calculating the dissimilarity degree of the behavior data of other users in the user behavior data sample and the centers of the k clusters, and respectively classifying the behavior data of the other users to the cluster with the lowest dissimilarity degree to obtain a clustering result;
step D, according to the clustering result, recalculating respective centers of the k clusters to obtain respective new centers of the k clusters;
step E, respectively calculating the dissimilarity degree of the behavior data of all the users in the user behavior data sample and the new centers of the k clusters, classifying the behavior data of all the users to the cluster with the lowest dissimilarity degree respectively to obtain a clustering result, and returning to the step D until the clustering result is not changed or the execution frequency of the step D reaches a preset frequency;
after the step E, the method further comprises the following steps:
scanning the behavior data of all users in a specified cluster in the clustering result;
generating a frequent 1 item set to a frequent N item set according to the behavior data, and calculating the support degree of each item set in the frequent item sets, wherein only one item set exists in the frequent N item sets;
and calculating to obtain an association rule between the behavior data of the user according to the support degree of the item set in the frequent N item set and the support degree of the item set from the frequent N-1 item set to the frequent 1 item set.
2. The method of claim 1, wherein step B comprises:
calculating distances between behavior data of users in the user behavior data samples;
calculating the average value of the distances to obtain the average value of the distance vectors of the distances among the behavior data of the user, wherein the average value of the distance vectors is the average value of the distance vectors of the kth point;
calculating the average value of the distance vector average value to obtain a distance average value;
calculating a deviation value between the distance vector average value and the distance average value according to the distance vector average value and the distance average value;
and if the deviation value meets a preset condition, calculating the behavior data of the user corresponding to the distance vector average value of the kth point, and taking the behavior data of the user corresponding to the distance vector average value of the kth point as the behavior data of the selected kth user.
3. The method of claim 1, wherein calculating a degree of dissimilarity of the user's behavioral data to respective centers of the k clusters, and grouping the user's behavioral data into the lowest degree of dissimilarity cluster comprises:
calculating Euclidean distances between behavior data of the user and respective centers of the k clusters;
and classifying the behavior data of the user into a cluster with the minimum Euclidean distance with the behavior data of the user.
4. The method of claim 2, wherein the step of, if the deviation value satisfies a preset condition, calculating the behavior data of the user corresponding to the distance vector average value of the kth point and taking that behavior data as the behavior data of the selected kth user is specifically:
if the δ value calculated by the formula [formula missing in source text] meets the preset condition, taking the behavior data corresponding to the distance vector average value of the kth point as the behavior data of the kth user to be selected;
wherein [symbol missing in source text] is the distance vector average value of the kth point, [symbol missing in source text] is the distance average value, λ is the correction factor, and δ is the deviation value between the distance vector average value and the distance average value.
5. An apparatus for analyzing user behavior data, comprising:
the behavior data sample establishing unit is used for establishing a user behavior data sample;
a first cluster center determining unit, configured to select behavior data of k users from the user behavior data samples, and use the behavior data of the k users as respective centers of the k clusters;
the first clustering result generating unit is used for respectively calculating the dissimilarity degree of the behavior data of other users in the user behavior data sample and the centers of the k clusters, and classifying the behavior data of the other users to the cluster with the lowest dissimilarity degree to obtain a clustering result;
a second cluster center determining unit, configured to recalculate respective centers of the k clusters according to the clustering result to obtain respective new centers of the k clusters;
a second clustering result generating unit, configured to calculate dissimilarity degrees of behavior data of all users in the user behavior data sample and respective new centers of the k clusters, and classify the behavior data of all users into a cluster with the lowest dissimilarity degree, respectively, to obtain a clustering result, and return to call the second cluster center determining unit until the clustering result is no longer changed or the number of times of execution of step D reaches a preset number of times;
the device further comprises:
the behavior data scanning unit is used for scanning the behavior data of all users in a specified cluster in the clustering result;
a frequent item set and support degree generating unit, configured to generate frequent 1 item sets to frequent N item sets according to the behavior data, and calculate support degree of each item set in the frequent item sets, where there is only one item set in the frequent N item sets;
and the association rule generating unit is used for calculating and obtaining the association rule between the behavior data of the user according to the support degree of the item set in the frequent N item set and the support degree of the item set from the frequent N-1 item set to the frequent 1 item set.
6. The apparatus of claim 5, wherein the first cluster center determining unit comprises:
a distance calculation module for calculating distances between the behavior data of the users in the user behavior data samples;
the distance vector average value calculation module is used for calculating the average value of the distances to obtain the distance vector average value of the distances among the behavior data of the user, and the distance vector average value is the distance vector average value of the kth point;
the distance average value calculating module is used for calculating the average value of the distance vector average value to obtain a distance average value;
the deviation value calculating module is used for calculating a deviation value between the distance vector average value and the distance average value according to the distance vector average value and the distance average value;
and the cluster center determining module is used for calculating the behavior data of the user corresponding to the average value of the distance vectors of the kth point if the deviation value meets a preset condition, and taking the behavior data of the user corresponding to the average value of the distance vectors of the kth point as the behavior data of the selected kth user.
7. The apparatus of claim 5, wherein the first clustering result generating unit and the second clustering result generating unit each comprise:
a Euclidean distance calculation module for calculating the Euclidean distances between the behavior data of the user and the respective centers of the k clusters;
and the user classification module is used for classifying the behavior data of the user into a cluster with the minimum Euclidean distance from the behavior data of the user.
8. A smart television, characterized in that the smart television comprises the apparatus for analyzing user behavior data according to any one of claims 5 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410380588.8A CN105320702B (en) | 2014-08-04 | 2014-08-04 | A kind of analysis method of user behavior data, device and smart television |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105320702A (en) | 2016-02-10 |
CN105320702B (en) | 2019-02-01 |
Family
ID=55248102
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410380588.8A Active CN105320702B (en) | 2014-08-04 | 2014-08-04 | A kind of analysis method of user behavior data, device and smart television |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105320702B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526735B (en) * | 2016-06-20 | 2020-12-11 | 杭州海康威视数字技术股份有限公司 | Method and device for identifying incidence relation |
CN106412635B (en) * | 2016-09-29 | 2019-07-30 | 北京赢点科技有限公司 | A kind of intelligence advertisement placement method and system |
CN107623715B (en) * | 2017-08-08 | 2020-06-09 | 阿里巴巴集团控股有限公司 | Identity information acquisition method and device |
CN109861953B (en) | 2018-05-14 | 2020-08-21 | 新华三信息安全技术有限公司 | Abnormal user identification method and device |
CN109753994B (en) * | 2018-12-11 | 2024-05-14 | 东软集团股份有限公司 | User image drawing method, device, computer readable storage medium and electronic equipment |
CN110929145B (en) * | 2019-10-17 | 2023-07-21 | 平安科技(深圳)有限公司 | Public opinion analysis method, public opinion analysis device, computer device and storage medium |
CN112783956B (en) * | 2019-11-08 | 2024-03-05 | 北京沃东天骏信息技术有限公司 | Information processing method and device |
CN111159555A (en) * | 2019-12-30 | 2020-05-15 | 北京每日优鲜电子商务有限公司 | Commodity recommendation method, commodity recommendation device, server and storage medium |
CN113378020A (en) * | 2021-06-08 | 2021-09-10 | 深圳Tcl新技术有限公司 | Acquisition method, device and computer readable storage medium for similar film watching users |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103353880A (en) * | 2013-06-20 | 2013-10-16 | 兰州交通大学 | Data mining method adopting dissimilarity degree clustering and association |
CN103886003A (en) * | 2013-09-22 | 2014-06-25 | 天津思博科科技发展有限公司 | Collaborative filtering processor |
CN103927347A (en) * | 2014-04-01 | 2014-07-16 | 复旦大学 | Collaborative filtering recommendation algorithm based on user behavior models and ant colony clustering |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101802332B1 (en) * | 2010-11-25 | 2017-12-29 | 삼성전자주식회사 | Method for providing contents and the system thereof |
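Non-Patent Citations (1)
Title |
---|
"Cluster analysis of campus network user behavior based on the K-means algorithm"; Pan Ying et al.; Computing Technology and Automation (《计算技术与自动化》); 2007-03-31; Vol. 26 (No. 1); 66-69 |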
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||