CN104899331A - Television user behavior data clustering method and device and Spark big data platform - Google Patents
Television user behavior data clustering method and device and Spark big data platform
- Publication number
- CN104899331A (application CN201510355359.5A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- factor
- attribute
- major component
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention is applicable to the technical field of digital television and provides a clustering method and device for television user behavior data and a Spark big data platform. The method comprises: obtaining television user behavior data and storing it in a first matrix A1, which is a matrix of n rows * m columns; using principal component analysis to perform attribute reduction on the first matrix A1 to obtain a second matrix A2, which is a matrix of n rows * 15 columns; using factor analysis to perform attribute conversion on the second matrix A2 to obtain a third matrix A3, which is a matrix of n rows * 4 columns; and clustering the third matrix A3 with the K-means clustering algorithm to obtain a clustering result. Because the third matrix A3 is a low-dimensional matrix, the amount of computation does not grow geometrically.
Description
Technical field
The invention belongs to the field of digital television technology, and in particular relates to a clustering method and device for television user behavior data and a Spark big data platform.
Background technology
With the rapid development of modern communication technology and the gradual spread of multimedia television, digital television has become the main way for households to obtain information. Advances in technology make it possible to collect a large amount of television user behavior data every day. How to classify users based on high-dimensional television user behavior data, and how to carry out corresponding marketing activities based on that classification, has become a problem that urgently needs to be solved. However, traditional clustering methods for television user behavior data have the following defects when analyzing high-dimensional data:
(1) A high-dimensional data set may contain a large number of irrelevant attributes, so that the probability that clusters exist across all dimensions is almost 0;
(2) Data in a high-dimensional space is distributed more sparsely than data in a low-dimensional space, and it is common for the distances between data points to be almost equal;
(3) Traditional clustering algorithms (such as hierarchical clustering and K-means clustering) are general-purpose data clustering methods. These algorithms use a distance matrix, so their time and space complexity are both very high, and when the dimensionality of the data is high (that is, when the space complexity increases), the amount of computation grows geometrically;
(4) Classical data clustering algorithms are all designed for a single-machine environment. When the data to be processed is massive, the resource limits of a single machine prevent the data mining task from being completed well.
Summary of the invention
The embodiments of the present invention provide a clustering method and device for television user behavior data and a Spark big data platform, intended to solve the problem that, in the clustering methods for television user behavior data provided by the prior art, the data processed is high-dimensional, which causes the amount of computation to grow geometrically.
In one aspect, a clustering method for television user behavior data is provided, the method comprising:
Obtaining television user behavior data and storing the television user behavior data in a first matrix A1, where the first matrix A1 is a matrix of n rows * m columns, n represents the number of users, and m represents the number of video attributes watched by the users;
Using principal component analysis (PCA) to perform attribute reduction on the first matrix A1 to obtain a second matrix A2, where the second matrix A2 is a matrix of n rows * 15 columns;
Using factor analysis to perform attribute conversion on the second matrix A2 to obtain a third matrix A3, where the third matrix A3 is a matrix of n rows * 4 columns;
Using the K-means clustering algorithm to cluster the third matrix A3 to obtain a clustering result.
Further, the method for described use principal component analysis (PCA) carries out attribute reduction process to described first matrix A 1, obtains the second matrix A 2, comprising:
Call principal component analysis (PCA) code, described first matrix A 1 is processed, obtain the factor coefficient loading matrix C of the characteristic root λ 1 of each major component, each major component;
Based on the value of the characteristic root λ 1 of each major component, pick out front M the major component that λ 1 value is more than or equal to preset first threshold value, and calculate the accumulative variance contribution degree D2 of a front M major component;
Based on the factor coefficient loading matrix C of a front M major component, pick out coefficient in each major component and be greater than the attribute of default Second Threshold, the attribute that coefficient in each major component is greater than default Second Threshold is carried out merging yojan, obtains attribute reduction rule list;
According to described attribute reduction rule list, the video attribute in described first matrix A 1 is merged, obtain the second matrix A 2.
Further, the method for described usage factor analysis carries out attribute conversion process to described second matrix A 2, obtains the 3rd matrix A 3, comprising:
Call the code of factorial analysis, the method for usage factor analysis processes described second matrix A 2, obtains the characteristic root λ 2 of each factor, factor rubble figure, factor coefficient loading matrix E;
Based on the eigenwert root λ 2 of each factor, and in conjunction with described factor rubble figure, draw the factor coefficient loading matrix E that eigenwert is greater than the top n factor of default 3rd threshold value and described top n factor pair and answers;
The 3rd matrix A 3 is obtained according to the factor coefficient loading matrix E that described second matrix A 2 and described top n factor pair are answered.
Further, the method analyzed in described usage factor carries out attribute conversion process to described second matrix A 2, after obtaining the 3rd matrix A 3, also comprises:
Concurrent operation based on K-mean algorithm carries out clustering processing to described 3rd matrix A 3, obtains cluster result.
In another aspect, a clustering device for television user behavior data is provided, the device comprising:
A data acquisition unit, configured to obtain television user behavior data and store the television user behavior data in a first matrix A1, where the first matrix A1 is a matrix of n rows * m columns, n represents the number of users, and m represents the number of video attributes watched by the users;
A first dimensionality reduction unit, configured to use principal component analysis (PCA) to perform attribute reduction on the first matrix A1 to obtain a second matrix A2, where the second matrix A2 is a matrix of n rows * 15 columns;
A second dimensionality reduction unit, configured to use factor analysis to perform attribute conversion on the second matrix A2 to obtain a third matrix A3, where the third matrix A3 is a matrix of n rows * 4 columns;
A first clustering unit, configured to use the K-means clustering algorithm to cluster the third matrix A3 to obtain a clustering result.
Further, the first dimensionality reduction unit comprises:
A first processing module, configured to call PCA code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component;
A second processing module, configured to select, based on the eigenvalue λ1 of each principal component, the first M principal components whose λ1 value is greater than or equal to a preset first threshold, and to calculate the cumulative variance contribution D2 of the first M principal components;
A third processing module, configured to select, based on the factor loading matrix C of the first M principal components, the attributes in each principal component whose coefficient is greater than a preset second threshold, and to merge and reduce those attributes to obtain an attribute reduction rule table;
A merging module, configured to merge the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.
Further, the second dimensionality reduction unit comprises:
A third processing module, configured to call factor analysis code and use factor analysis to process the second matrix A2, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;
A fourth processing module, configured to obtain, based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;
A fifth processing module, configured to obtain the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
Further, the device also comprises:
A second clustering unit, configured to cluster the third matrix A3 based on a parallel computation of the K-means algorithm to obtain a clustering result.
In yet another aspect, a Spark big data platform is provided, the Spark big data platform comprising the clustering device for television user behavior data described above.
In the embodiments of the present invention, principal component analysis is first used to perform attribute reduction on the high-dimensional first matrix A1 to obtain the second matrix A2; factor analysis is then used to perform attribute conversion on the second matrix A2 to obtain the third matrix A3, which is a low-dimensional matrix of n rows * 4 columns; finally, the K-means clustering algorithm is used to cluster this low-dimensional matrix to obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional television user behavior data, the amount of computation does not grow geometrically during clustering. This solves the problem that, in the clustering methods for television user behavior data provided by the prior art, the data processed is high-dimensional, which causes the amount of computation to grow geometrically.
Brief description of the drawings
Fig. 1 is a flowchart of the clustering method for television user behavior data provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the parallel computation structure of the K-means algorithm in the clustering method for television user behavior data provided by Embodiment 1 of the present invention;
Fig. 3 is a structural block diagram of the clustering device for television user behavior data provided by Embodiment 2 of the present invention.
Embodiment
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
In the embodiments of the present invention, principal component analysis is first used to perform attribute reduction on the high-dimensional first matrix A1 to obtain the second matrix A2; factor analysis is then used to perform attribute conversion on the second matrix A2 to obtain the third matrix A3, which is a low-dimensional matrix of n rows * 4 columns; finally, the K-means clustering algorithm is used to cluster this low-dimensional matrix to obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional television user behavior data, the amount of computation does not grow geometrically during clustering.
The implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment one
Fig. 1 shows the flow of the clustering method for television user behavior data provided by Embodiment 1 of the present invention, described in detail as follows:
In step S101, television user behavior data is obtained and stored in a first matrix A1, where the first matrix A1 is a matrix of n rows * m columns, n represents the number of users, and m represents the number of video attributes watched by the users.
In this embodiment of the present invention, television user behavior data mainly comprises two kinds of attribute content: on the one hand, the attribute information of the videos the user watches; on the other hand, the behavior data generated while the user watches videos. The video attribute information in particular can have a hundred or more dimensions. The television user behavior data is cleaned and converted on the efficient Spark big data platform to obtain the duration matrix A1, which records how long each user watched live programs of each attribute within a period of time. The structure of the duration matrix A1 is as follows:
A1 = (a_ij), i = 1, ..., n; j = 1, ..., m, where a_ij is the duration for which user i watched live programs with attribute j.
Here, the number of rows n represents the number of users, and the number of columns m represents the number of video attributes watched by the users.
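As an illustration of how such a duration matrix could be assembled, the following is a minimal sketch in plain Python. The record layout `(user_id, attribute, duration)`, the function name `build_duration_matrix`, and the sample values are assumptions made for this example; the patent does not describe the cleaning and conversion code that runs on the Spark platform.

```python
from collections import defaultdict

def build_duration_matrix(records):
    """Aggregate (user_id, attribute, duration) records into an n x m matrix.

    Returns (users, attributes, matrix), where matrix[i][j] is the total
    duration user i spent on programs with attribute j.
    """
    totals = defaultdict(float)
    for user_id, attribute, duration in records:
        totals[(user_id, attribute)] += duration
    users = sorted({u for u, _ in totals})
    attributes = sorted({a for _, a in totals})
    # Missing (user, attribute) pairs become 0.0, i.e. no viewing time.
    matrix = [[totals.get((u, a), 0.0) for a in attributes] for u in users]
    return users, attributes, matrix

# Hypothetical viewing records: (user ID, video attribute, viewing duration).
records = [
    ("u1", "entertainment", 30.0),
    ("u1", "sports", 3.5),
    ("u2", "entertainment", 12.0),
    ("u1", "entertainment", 26.5),
    ("u2", "tourism", 0.5),
]
users, attributes, A1 = build_duration_matrix(records)
```

Each row of `A1` corresponds to one user and each column to one video attribute, which matches the n rows * m columns layout described above.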
In step S102, principal component analysis (PCA) is used to perform attribute reduction on the first matrix A1 to obtain a second matrix A2, where the second matrix A2 is a matrix of n rows * 15 columns.
In this embodiment of the present invention, the live-program viewing matrix of 58,333 users over 86 attribute dimensions is used as the first matrix A1 by way of example.
Step 1: import the first matrix A1 into the R statistical analysis software and call the PCA code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component.
Input: the first matrix A1. Output: the eigenvalue λ1 of each principal component, the variance contribution D and cumulative variance contribution D1 of each principal component, and the factor loading matrix C of each principal component.
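The inputs and outputs listed above can be sketched with NumPy as follows. This is not the R code used in the embodiment, which the patent does not reproduce; it is a minimal sketch that assumes PCA is run on the correlation matrix (an assumption suggested by the eigenvalue threshold of 1 used in step 2) and uses synthetic data in place of the real first matrix A1. The function name `pca_eigen` is hypothetical.

```python
import numpy as np

def pca_eigen(A1):
    """PCA on the correlation matrix of A1 (rows = users, columns = attributes).

    Returns the eigenvalues in descending order, the loading matrix, the
    variance contribution of each component, and the cumulative contribution.
    """
    corr = np.corrcoef(A1, rowvar=False)       # m x m correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative round-off
    loadings = eigvecs * np.sqrt(eigvals)      # column j holds the loadings of component j
    contribution = eigvals / eigvals.sum()
    return eigvals, loadings, contribution, np.cumsum(contribution)

# Synthetic stand-in for the first matrix A1 (200 users, 6 attributes).
rng = np.random.default_rng(0)
A1 = rng.random((200, 6))
lam1, C, D, D1 = pca_eigen(A1)
```

Here `lam1`, `C`, `D`, and `D1` play the roles of the eigenvalues λ1, the factor loading matrix C, the variance contribution D, and the cumulative variance contribution D1 named in the text.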
Step 2: based on the eigenvalue λ1 of each principal component, select the first M principal components whose λ1 value is greater than or equal to the preset first threshold, and calculate the cumulative variance contribution D2 of the first M principal components.
The first threshold is 1. The PCA code goes on to process the eigenvalues λ1 obtained in step 1, selects the first M principal components with λ1 >= 1, and calculates the cumulative variance contribution D2 of the first M principal components. If the value of D1 for the first M principal components is >= 80%, the number of principal components selected is considered appropriate.
After the first matrix A1 shown in Table 1 is processed by steps 1 and 2, the first 15 principal components are obtained; their eigenvalues λ1, variance contributions, and cumulative variance contributions are shown in Table 2 below.
User ID | Entertainment | Sports | Information | Tourism |
85374799a028ac2d74c33505cee9904c02bb4188 | 56.73 | 3.31 | 1.92 | 0.46 |
5ef937c5b39659d55fe8a1d22fe466cb4042f393 | 54.72 | 3.1 | 2.3 | 0.34 |
e6a6412a9c457edaf40bec4a1d80e0c63a6cfda9 | 53.81 | 0 | 1.25 | 0.44 |
fb46307c0fdae3bd948e4a8ab5c47e08a21b1fc7 | 53.45 | 8.45 | 1.44 | 0.23 |
6de40a065be1417c83befd4f019b72462175902b | 52.95 | 15.19 | 2.72 | 0.02 |
4a0c7f3717f4d749e933b63605a75deca936f304 | 52.7 | 27.56 | 6.64 | 0.11 |
34f6636f483ed611a2b834947a015906efde69f8 | 52.56 | 0.18 | 1.88 | 2.14 |
4fce63266b4b1fc2a5b97ae506653d37be952491 | 51.68 | 0.43 | 5 | 1.48 |
3b3602111c19b54cdcac0172872750a223e100a1 | 50.81 | 4.5 | 24.71 | 0.5 |
ad7c3b868d3fcf59a9076068dc9774dc0e6f511e | 50.21 | 0.75 | 0 | 0 |
d7e8b3251c0b8ada566221e5f85e532d638150d6 | 50.17 | 0 | 0 | 0.59 |
cc306687816e0d22e2e22b75590fcbffeb46864e | 49.97 | 0 | 2.09 | 0 |
917670faccb41b97b6b47758030b8b9641dc8eab | 48.24 | 1.61 | 1.32 | 1.64 |
2da1bd15d324a6ad28ef6b99732a70689681c415 | 47.57 | 0.05 | 0.04 | 1.23 |
961b1e281bb2c49f1cafc4a47a46d41e941ed6db | 47.33 | 3.83 | 1.84 | 2.15 |
76a66116b0bde1235acfcf5c8a12939ed1355990 | 46.12 | 0.15 | 9.15 | 0 |
02750b735181f1949c5225e322cf835e105224c8 | 45.27 | 4.28 | 4.04 | 4.6 |
6f963460c487470f215ae9a7879d6849e31c52ca | 44.67 | 0 | 0.34 | 0.05 |
0e2aa3f1dbe72e15de4398cc655cdcd461bc5723 | 44.57 | 0.22 | 0.23 | 0.73 |
Table 1
Principal component | Eigenvalue (λ) | Variance (%) | Cumulative variance (%) |
1 | 24.7 | 28.74 | 28.74 |
2 | 4.47 | 6.2 | 34.94 |
3 | 3.6 | 5.19 | 40.13 |
4 | 3.33 | 4.87 | 45 |
5 | 3 | 4.46 | 49.46 |
6 | 2.5 | 3.91 | 53.37 |
7 | 2.33 | 3.71 | 57.08 |
8 | 1.9 | 3.21 | 60.29 |
9 | 1.82 | 3.12 | 63.41 |
10 | 1.56 | 2.81 | 66.22 |
11 | 1.52 | 2.77 | 68.99 |
12 | 1.44 | 2.67 | 71.66 |
13 | 1.33 | 2.54 | 74.2 |
14 | 1.28 | 2.49 | 76.69 |
15 | 1.13 | 2.31 | 79 |
Table 2
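The selection rule of step 2 can be checked against the figures in Table 2. The sketch below copies the eigenvalues and variance contributions from the table and applies the first threshold of 1; the function name `select_components` is an assumption made for this example. Table 2 lists only the first 15 components, so all of them pass the threshold.

```python
from itertools import accumulate

# Eigenvalues and variance contributions (%) copied from Table 2.
eigenvalues = [24.7, 4.47, 3.6, 3.33, 3, 2.5, 2.33, 1.9,
               1.82, 1.56, 1.52, 1.44, 1.33, 1.28, 1.13]
variance_pct = [28.74, 6.2, 5.19, 4.87, 4.46, 3.91, 3.71, 3.21,
                3.12, 2.81, 2.77, 2.67, 2.54, 2.49, 2.31]

FIRST_THRESHOLD = 1.0

def select_components(eigenvalues, variance_pct, threshold):
    """Keep the leading components whose eigenvalue is >= threshold.

    Returns (M, D2): the number of components kept and their cumulative
    variance contribution in percent, rounded to two decimals.
    """
    m = 0
    for value in eigenvalues:
        if value < threshold:
            break
        m += 1
    cumulative = list(accumulate(variance_pct[:m]))
    return m, (round(cumulative[-1], 2) if cumulative else 0.0)

M, D2 = select_components(eigenvalues, variance_pct, FIRST_THRESHOLD)
print(M, D2)
```

The running sum of the variance column reproduces the cumulative variance column of Table 2, ending at 79% for the 15 selected components.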
Step 3: based on the factor loading matrix C of the first M principal components, select, in each principal component, the attributes whose coefficient is greater than the preset second threshold, and merge and reduce those attributes to obtain the attribute reduction rule table.
Here, the preset second threshold is 0.5. An excerpt of the factor loading matrix C calculated in step 1 is shown in Table 3; in Table 3, only the coefficients of principal component 3 (science, education and humanities) are sorted. The attributes are screened according to the magnitude of the coefficients in each principal component, and the resulting attribute reduction rule table is shown in Table 4.
Attribute | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
General | .916 | -.127 | .049 | -.106 | .064 | -.291 | -.022 |
Focus | .855 | -.060 | -.170 | -.135 | -.109 | -.224 | .025 |
Life | .815 | -.025 | .035 | -.014 | .165 | -.054 | -.073 |
Domestic | .773 | -.299 | -.137 | -.045 | .178 | -.198 | .028 |
TV drama | .769 | .316 | -.044 | -.241 | .307 | .063 | -.099 |
News | .739 | -.195 | -.073 | -.135 | .130 | -.388 | .112 |
Interview | .737 | -.178 | -.027 | -.068 | -.050 | -.184 | .001 |
International | .732 | -.305 | -.164 | -.042 | .177 | -.181 | .060 |
Entertainment | .731 | .113 | -.353 | -.174 | -.433 | .100 | -.202 |
Plot | .722 | .299 | -.068 | -.211 | .276 | .049 | -.035 |
Humanities | .721 | -.177 | .325 | -.014 | -.156 | -.033 | .010 |
Variety | .710 | .094 | -.341 | -.160 | -.408 | .117 | -.168 |
Celebrity | .693 | .175 | -.318 | -.205 | -.388 | .145 | -.112 |
Science and education | .688 | -.185 | .547 | .107 | -.188 | .095 | -.040 |
People's livelihood | .686 | -.136 | .029 | -.083 | .095 | -.444 | -.088 |
Romance | .666 | .289 | -.082 | -.208 | .251 | .019 | -.094 |
Interactive | .662 | .004 | -.341 | .065 | -.351 | .035 | -.152 |
Table 3
Table 4
Step 4: merge the video attributes in the first matrix A1 according to the attribute reduction rule table obtained in step 3, to obtain the second matrix A2.
The second matrix A2 is the new television user behavior matrix, containing the user video attribute information, that is obtained after the video attributes in the first matrix A1 are processed. It is a matrix of n rows * 15 columns, where n represents the number of users. The second matrix A2 obtained by processing the first matrix A1 shown in Table 1 is shown in Table 5.
Table 5
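Steps 3 and 4 can be sketched as follows. The patent does not state how merged attributes are combined or how an attribute that exceeds the threshold in several components is handled, so this sketch makes two assumptions: merged columns are summed, and an attribute is assigned to the first component that claims it. The function names, attribute names, and loading values are hypothetical.

```python
import numpy as np

def build_reduction_rules(loadings, attribute_names, threshold=0.5):
    """For each component (column), group the attributes whose loading exceeds
    the threshold. An attribute joins only the first group that claims it
    (an assumption; the patent does not specify tie handling)."""
    rules, assigned = [], set()
    for j in range(loadings.shape[1]):
        group = [attribute_names[i] for i in range(loadings.shape[0])
                 if loadings[i, j] > threshold and attribute_names[i] not in assigned]
        if group:
            rules.append(group)
            assigned.update(group)
    return rules

def merge_attributes(A1, attribute_names, rules):
    """Build A2 by summing the columns of A1 inside each rule group
    (summation is an assumption, not taken from the patent)."""
    index = {name: i for i, name in enumerate(attribute_names)}
    return np.column_stack(
        [A1[:, [index[a] for a in group]].sum(axis=1) for group in rules]
    )

# Hypothetical loadings for 4 attributes on 2 components.
names = ["science", "humanities", "variety", "celebrity"]
C = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.6]])
rules = build_reduction_rules(C, names, threshold=0.5)

A1 = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
A2 = merge_attributes(A1, names, rules)
```

With these sample values, the four original attributes collapse into two merged attributes, one per principal component.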
In step S103, factor analysis is used to perform attribute conversion on the second matrix A2 to obtain a third matrix A3, where the third matrix A3 is a matrix of n rows * 4 columns.
In this embodiment of the present invention, step 1: import the second matrix A2 into the R statistical analysis software, call the factor analysis code, and use factor analysis to process the second matrix A2, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E.
Input: the second matrix A2. Output: the eigenvalue λ2 of each factor, the factor scree plot, and the factor loading matrix E.
Step 2: based on the eigenvalue λ2 of each factor, in combination with the factor scree plot output in step 1, obtain the first N factors whose eigenvalue is greater than the preset third threshold and the factor loading matrix E corresponding to the first N factors.
Here, the preset third threshold is 1. Based on the eigenvalue λ2 of each factor, in combination with the factor scree plot that was output, 4 factors are selected from the second matrix A2, namely the factors whose eigenvalue (SS loadings) is greater than 1, as shown in Table 6. The factor loading matrix E of these 4 factors is shown in Table 7:
| Factor1 | Factor2 | Factor3 | Factor4 |
SS loadings | 2.688 | 2.191 | 1.86 | 1.796 |
Proportion Var | 0.179 | 0.146 | 0.124 | 0.12 |
Cumulative Var | 0.179 | 0.325 | 0.449 | 0.569 |
Table 6
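The figures in Table 6 can be reproduced with a short calculation. The sketch below copies the SS loadings from the table, applies the third threshold of 1, and assumes that the proportion of variance is the SS loadings divided by the number of attributes in the second matrix A2 (15), which is the convention that matches the table values. The variable names are chosen for this example.

```python
# SS loadings copied from Table 6.
ss_loadings = {"Factor1": 2.688, "Factor2": 2.191, "Factor3": 1.86, "Factor4": 1.796}

N_ATTRIBUTES = 15       # columns of the second matrix A2
THIRD_THRESHOLD = 1.0   # preset third threshold

# Keep the factors whose SS loadings exceed the threshold.
selected = [name for name, ss in ss_loadings.items() if ss > THIRD_THRESHOLD]

# Proportion of variance explained by each selected factor.
proportion = {name: ss_loadings[name] / N_ATTRIBUTES for name in selected}
cumulative = sum(proportion.values())

for name in selected:
    print(name, round(proportion[name], 3))
print("cumulative", round(cumulative, 3))
```

All four factors pass the threshold, and together they account for roughly 56.9% of the variance, in line with the last row of Table 6.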
Attribute | Factor1 | Factor2 | Factor3 | Factor4 |
Science, education and humanities | 0.362 | 0.425 | 0.236 | 0.493 |
Entertainment and variety | 0.319 | 0.234 | 0.9 | 0.167 |
Domestic news | 0.349 | 0.8 | 0.288 | 0.274 |
Finance and wealth management | 0.132 | 0.438 | 0.186 | 0.209 |
Children's education | 0.284 | 0.16 | 0.187 | 0.15 |
Film | 0.447 | 0.156 | 0.237 | 0.258 |
Healthy living | 0.453 | 0.489 | 0.205 | 0.376 |
Agriculture and military | 0.182 | 0.234 | 0.106 | 0.78 |
Tourism and cuisine | 0.234 | 0.316 | 0.232 | 0.627 |
Society and legal system | 0.261 | 0.599 | 0.16 | 0.209 |
Sports | 0.121 | 0.251 | 0.226 | 0.124 |
Family ethics | 0.515 | 0.252 | 0.385 | 0.188 |
TV drama | 0.899 | 0.238 | 0.196 | 0.204 |
Youth idol | 0.736 | 0.283 | 0.192 | 0.132 |
Fashion and music | 0.208 | 0.241 | 0.615 | 0.145 |
Table 7
The attributes with the larger coefficients in the matrix E shown in Table 7 are these 4 attributes: TV drama, youth idol, healthy living, and film.
Factor 1 can be named the idol drama and domestic drama category (the film and television factor); factor 2, the news and social health category (the information factor); factor 3, the entertainment and fashion category (the trend factor); and factor 4, the life and science-education category (the leisure and popular science factor).
Step 3: obtain the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
The second matrix A2 is a matrix of n rows * 15 columns. Multiplying it by the factor loading matrix E corresponding to the first N factors calculated in step 2, where E is a matrix of 15 rows * 4 columns, gives the third matrix A3, which is a matrix of n rows * 4 columns, where n represents the number of users.
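The matrix product described in step 3 can be sketched with NumPy. The data here is synthetic and the function name `convert_attributes` is an assumption; the sketch only illustrates how an n * 15 matrix multiplied by a 15 * 4 matrix yields an n * 4 matrix.

```python
import numpy as np

def convert_attributes(A2, E):
    """Compute A3 = A2 . E, mapping 15 merged attributes onto 4 factors."""
    A2 = np.asarray(A2, dtype=float)
    E = np.asarray(E, dtype=float)
    if A2.shape[1] != E.shape[0]:
        raise ValueError("columns of A2 must match rows of E")
    return A2 @ E

# Synthetic stand-ins: 5 users, 15 merged attributes, 4 factors.
rng = np.random.default_rng(0)
A2 = rng.random((5, 15))
E = rng.random((15, 4))
A3 = convert_attributes(A2, E)
print(A3.shape)
```

Each row of `A3` gives one user's scores on the 4 factors, so the later clustering step works in 4 dimensions regardless of how many attributes the original data had.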
In step S104, the K-means algorithm is used to cluster the third matrix A3 to obtain a clustering result.
In this embodiment of the present invention, the third matrix A3 is normalized and standardized to eliminate the influence of the units of measurement, and the fourth matrix A4 is obtained. First, K objects are randomly selected as the initial cluster centers. Then the distance between each object and each seed cluster center is calculated (Euclidean distance is generally used), and each object is assigned to the cluster center nearest to it. Once all objects have been assigned, the center of each cluster is recalculated from the objects currently in that cluster. This process is iterated until the termination condition set for the clustering is met (the cluster centers no longer change between consecutive iterations, or the change between the two sets of centers is less than a threshold). The clustering then ends and K clustering results are obtained, including the coordinates of each cluster center, the number of cases in each cluster, and so on.
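The procedure described above can be sketched with NumPy as follows. The patent does not say which normalization is used, so z-score standardization is assumed here; the function names, the tolerance, and the synthetic data are also assumptions made for this example.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance (assumed z-score)."""
    std = X.std(axis=0)
    std[std == 0] = 1.0   # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means: random initial centers, Euclidean distance, and
    termination when the centers move less than tol."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every object to every center, then nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < tol:
            break
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    counts = np.bincount(labels, minlength=k)
    return centers, labels, counts

# Synthetic stand-in for the third matrix A3: two separated groups of 20 users.
rng = np.random.default_rng(42)
A3 = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
A4 = standardize(A3)
centers, labels, counts = kmeans(A4, k=2)
```

The returned `centers` and `counts` correspond to the cluster center coordinates and the number of cases per cluster mentioned in the text.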
Preferably, since classical data clustering algorithms are all designed for a single-machine environment, they cannot complete data mining tasks well when the data to be processed is massive. Data mining therefore needs to be combined with other technologies to parallelize the mining algorithm, use the resources of multiple machines, and improve the efficiency of the mining task. The structure of the parallel computation based on the K-means algorithm is shown in Fig. 2, and the detailed steps are as follows:
Step 11: generate a resilient distributed dataset (Resilient Distributed Dataset, RDD) from the data in the fourth matrix A4.
Step 12: apply a Map operation to the RDD to calculate the distance between the data objects in the fourth matrix A4 and the K initial cluster centers, then apply a Reduce operation to the resulting MapRDD to generate K new cluster centers. Compare the change in the cluster centers against the threshold (the termination condition being that the cluster center positions no longer change between two consecutive iterations, or that the number of iterations reaches the preset limit). If the change is greater than the threshold, replace the initial cluster centers with the new cluster centers and repeat the Map and Reduce operations, until the iteration yields K stable cluster centers.
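The Map and Reduce pattern of steps 11 and 12 can be illustrated without a Spark cluster. The sketch below is plain Python and does not use the Spark API: a list of partitions stands in for the RDD, `map_partition` plays the role of the Map operation, and `merge` plays the role of the Reduce operation. All function names and the sample data are assumptions made for this example.

```python
from functools import reduce
from math import sqrt

def distance(p, q):
    """Euclidean distance between two points of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def map_partition(partition, centers):
    """Map: for one partition, accumulate per-cluster (sum vector, count)."""
    acc = {}
    for p in partition:
        j = min(range(len(centers)), key=lambda c: distance(p, centers[c]))
        s, n = acc.get(j, ([0.0] * len(p), 0))
        acc[j] = ([a + b for a, b in zip(s, p)], n + 1)
    return acc

def merge(a, b):
    """Reduce: combine the partial sums and counts of two partitions."""
    out = dict(a)
    for j, (s, n) in b.items():
        if j in out:
            s0, n0 = out[j]
            out[j] = ([x + y for x, y in zip(s0, s)], n0 + n)
        else:
            out[j] = (s, n)
    return out

def parallel_kmeans(partitions, centers, tol=1e-6, max_iter=50):
    """Repeat Map and Reduce until the centers move by at most tol."""
    for _ in range(max_iter):
        totals = reduce(merge, [map_partition(p, centers) for p in partitions])
        new_centers = [
            [x / totals[j][1] for x in totals[j][0]] if j in totals else list(centers[j])
            for j in range(len(centers))
        ]
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers

# Two partitions standing in for an RDD built from the fourth matrix A4.
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
final_centers = parallel_kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
```

Because each partition only emits per-cluster sums and counts, the Reduce step exchanges K small records per partition rather than the raw data, which is what makes the iteration suitable for multiple machines.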
In this embodiment, principal component analysis is first used to perform attribute reduction on the high-dimensional first matrix A1 to obtain the second matrix A2; factor analysis is then used to perform attribute conversion on the second matrix A2 to obtain the third matrix A3, which is a low-dimensional matrix of n rows * 4 columns; finally, the K-means clustering algorithm is used to cluster this low-dimensional matrix to obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional television user behavior data, the amount of computation does not grow geometrically during clustering. This solves the problem that, in the clustering methods for television user behavior data provided by the prior art, the data processed is high-dimensional, which causes the amount of computation to grow geometrically.
In addition, the data in the fourth matrix A4 can also be classified based on the parallel computation of the K-means algorithm, which makes full use of multi-machine resources and improves the efficiency of the mining task.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disc.
Embodiment two
Fig. 3 shows a structural block diagram of the clustering device for television user behavior data provided by Embodiment 2 of the present invention; for ease of description, only the parts relevant to this embodiment are shown. The clustering device may be a software unit, a hardware unit, or a combined software and hardware unit built into the Spark big data platform. The clustering device 11 comprises: a data acquisition unit 111, a first dimensionality reduction unit 112, a second dimensionality reduction unit 113, and a first clustering unit 114.
The data acquisition unit 111 is configured to obtain television user behavior data and store the television user behavior data in a first matrix A1, where the first matrix A1 is a matrix of n rows * m columns, n represents the number of users, and m represents the number of video attributes watched by the users;
The first dimensionality reduction unit 112 is configured to use principal component analysis (PCA) to perform attribute reduction on the first matrix A1 to obtain a second matrix A2, where the second matrix A2 is a matrix of n rows * 15 columns;
The second dimensionality reduction unit 113 is configured to use factor analysis to perform attribute conversion on the second matrix A2 to obtain a third matrix A3, where the third matrix A3 is a matrix of n rows * 4 columns;
The first clustering unit 114 is configured to use the K-means clustering algorithm to cluster the third matrix A3 to obtain a clustering result.
Further, the first dimensionality reduction unit 112 comprises:
A first processing module, configured to call PCA code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component;
A second processing module, configured to select, based on the eigenvalue λ1 of each principal component, the first M principal components whose λ1 value is greater than or equal to a preset first threshold, and to calculate the cumulative variance contribution D2 of the first M principal components;
A third processing module, configured to select, based on the factor loading matrix C of the first M principal components, the attributes in each principal component whose coefficient is greater than a preset second threshold, and to merge and reduce those attributes to obtain an attribute reduction rule table;
A merging module, configured to merge the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.
Further, the second dimensionality reduction unit 113 comprises:
A third processing module, configured to call factor analysis code and process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor coefficient loading matrix E;
A fourth processing module, configured to determine, based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, the first N factors whose eigenvalues are greater than a preset third threshold, together with the factor coefficient loading matrix E corresponding to the first N factors;
A fifth processing module, configured to obtain the third matrix A3 according to the second matrix A2 and the factor coefficient loading matrix E corresponding to the first N factors.
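A minimal sketch of these three modules is given below. It assumes the principal-component method of factor extraction on the correlation matrix and assumes that A3 is formed as the product of the standardized A2 with the retained loading columns; the source only says that A3 is obtained "according to" A2 and E. The function name `factor_transform`, the threshold of 1.0, and the synthetic data are hypothetical.

```python
import numpy as np

def factor_transform(A2, eig_threshold=1.0):
    # Standardize columns so the analysis works on the correlation structure.
    Z = (A2 - A2.mean(axis=0)) / A2.std(axis=0)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    # The sorted eigenvalues are the values a scree plot would display.
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # First N factors whose eigenvalue exceeds the third threshold.
    N = int((eigvals > eig_threshold).sum())
    # Loading matrix E restricted to those N factors.
    E = eigvecs[:, :N] * np.sqrt(eigvals[:N])
    # Assumed combination rule: project the standardized data onto E.
    A3 = Z @ E
    return eigvals, E, A3

# Synthetic A2: 400 users, 15 columns driven by 4 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(400, 4))
membership = np.repeat(np.arange(4), [6, 4, 3, 2])   # which factor drives each column
A2 = latent[:, membership] + 0.1 * rng.normal(size=(400, 15))
eigvals, E, A3 = factor_transform(A2)
```

With this data four eigenvalues exceed the threshold, so E has 15 rows and 4 columns and A3 has the n rows × 4 columns shape described in the text.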
Further, the device 11 also comprises:
A second clustering unit, configured to cluster the third matrix A3 using a parallel operation based on the K-means algorithm, obtaining a clustering result.
The device provided by this embodiment of the present invention can be applied in the corresponding method embodiment one described above; for details, see the description of embodiment one, which is not repeated here.
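One common way to parallelize K-means, which may or may not match the second clustering unit's actual implementation, is to split the rows of A3 into partitions, compute per-cluster sums and counts on each partition independently, and then combine those partial results into new centroids. The sketch below uses a thread pool as a stand-in for Spark executors; the function names and the toy data are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def partial_stats(part, centroids):
    """Map step: per-cluster coordinate sums and member counts for one partition."""
    k, d = centroids.shape
    dists = ((part[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=int)
    np.add.at(sums, labels, part)     # accumulate rows into their cluster's sum
    np.add.at(counts, labels, 1)      # count members per cluster
    return sums, counts

def parallel_kmeans_step(partitions, centroids):
    """Reduce step: merge partial statistics and recompute the centroids."""
    with ThreadPoolExecutor() as pool:
        stats = list(pool.map(lambda p: partial_stats(p, centroids), partitions))
    total_sums = sum(s for s, _ in stats)
    total_counts = sum(c for _, c in stats)
    new_centroids = centroids.copy()
    nonempty = total_counts > 0       # leave empty clusters where they were
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids

# Toy A3 with 4 columns, split into 3 partitions.
A3 = np.array([[0.0, 0, 0, 0], [2, 0, 0, 0], [10, 0, 0, 0], [12, 0, 0, 0]])
start = np.array([[0.0, 0, 0, 0], [12, 0, 0, 0]])
new_centroids = parallel_kmeans_step(np.array_split(A3, 3), start)
```

Only the small per-cluster sums and counts need to be exchanged between workers, so one iteration gives the same centroids however the rows are partitioned.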
It should be noted that, in the above system embodiment, the included units are divided only according to functional logic, but the division is not limited to the one described, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. A clustering method for television user behavior data, characterized in that the method comprises:
Obtaining television user behavior data and storing it in a first matrix A1, where the first matrix A1 is a matrix of n rows × m columns, n is the number of users, and m is the number of video attributes watched by the users;
Using principal component analysis (PCA) to perform attribute reduction on the first matrix A1, obtaining a second matrix A2, where the second matrix A2 is a matrix of n rows × 15 columns;
Using factor analysis to perform attribute conversion on the second matrix A2, obtaining a third matrix A3, where the third matrix A3 is a matrix of n rows × 4 columns;
Clustering the third matrix A3 using the K-means clustering algorithm, obtaining a clustering result.
2. The method of claim 1, characterized in that using principal component analysis (PCA) to perform attribute reduction on the first matrix A1, obtaining the second matrix A2, comprises:
Calling principal component analysis (PCA) code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor coefficient loading matrix C of each principal component;
Based on the eigenvalue λ1 of each principal component, selecting the first M principal components whose λ1 value is greater than or equal to a preset first threshold, and calculating the cumulative variance contribution rate D2 of the first M principal components;
Based on the factor coefficient loading matrix C of the first M principal components, selecting the attributes in each principal component whose coefficients are greater than a preset second threshold, and merging and reducing those attributes, obtaining an attribute reduction rule table;
Merging the video attributes in the first matrix A1 according to the attribute reduction rule table, obtaining the second matrix A2.
3. The method of claim 1 or 2, characterized in that using factor analysis to perform attribute conversion on the second matrix A2, obtaining the third matrix A3, comprises:
Calling factor analysis code and processing the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor coefficient loading matrix E;
Based on the eigenvalue λ2 of each factor, and in combination with the factor scree plot, determining the first N factors whose eigenvalues are greater than a preset third threshold, together with the factor coefficient loading matrix E corresponding to the first N factors;
Obtaining the third matrix A3 according to the second matrix A2 and the factor coefficient loading matrix E corresponding to the first N factors.
4. The method of claim 1, characterized in that, after using factor analysis to perform attribute conversion on the second matrix A2 to obtain the third matrix A3, the method further comprises:
Clustering the third matrix A3 using a parallel operation based on the K-means algorithm, obtaining a clustering result.
5. A clustering device for television user behavior data, characterized in that the device comprises:
A data acquisition unit, configured to obtain television user behavior data and store it in a first matrix A1, where the first matrix A1 is a matrix of n rows × m columns, n is the number of users, and m is the number of video attributes watched by the users;
A first dimensionality reduction unit, configured to use principal component analysis (PCA) to perform attribute reduction on the first matrix A1, obtaining a second matrix A2, where the second matrix A2 is a matrix of n rows × 15 columns;
A second dimensionality reduction unit, configured to use factor analysis to perform attribute conversion on the second matrix A2, obtaining a third matrix A3, where the third matrix A3 is a matrix of n rows × 4 columns;
A first clustering unit, configured to cluster the third matrix A3 using the K-means clustering algorithm, obtaining a clustering result.
6. The device of claim 5, characterized in that the first dimensionality reduction unit comprises:
A first processing module, configured to call principal component analysis (PCA) code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor coefficient loading matrix C of each principal component;
A second processing module, configured to select, based on the eigenvalue λ1 of each principal component, the first M principal components whose λ1 value is greater than or equal to a preset first threshold, and to calculate the cumulative variance contribution rate D2 of the first M principal components;
A third processing module, configured to select, based on the factor coefficient loading matrix C of the first M principal components, the attributes in each principal component whose coefficients are greater than a preset second threshold, and to merge and reduce those attributes, obtaining an attribute reduction rule table;
A merging module, configured to merge the video attributes in the first matrix A1 according to the attribute reduction rule table, obtaining the second matrix A2.
7. The device of claim 5 or 6, characterized in that the second dimensionality reduction unit comprises:
A third processing module, configured to call factor analysis code and process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor coefficient loading matrix E;
A fourth processing module, configured to determine, based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, the first N factors whose eigenvalues are greater than a preset third threshold, together with the factor coefficient loading matrix E corresponding to the first N factors;
A fifth processing module, configured to obtain the third matrix A3 according to the second matrix A2 and the factor coefficient loading matrix E corresponding to the first N factors.
8. The device of claim 5, characterized in that the device further comprises:
A second clustering unit, configured to cluster the third matrix A3 using a parallel operation based on the K-means algorithm, obtaining a clustering result.
9. A Spark big data platform, characterized in that the Spark big data platform comprises the clustering device for television user behavior data of any one of claims 5 to 8.
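For orientation, the matrix shapes recited in the claims can be written compactly as below. The formula for D2 is the standard definition of cumulative variance contribution and the product form of A3 is one plausible reading; neither is stated explicitly in the claims. The symbols p (total number of principal components), \lambda_{1,i} (the i-th largest eigenvalue λ1), and E_N (the loading matrix restricted to the first N = 4 factors) are notation introduced here.

```latex
A_1 \in \mathbb{R}^{n \times m}
\;\xrightarrow{\text{PCA attribute reduction}}\;
A_2 \in \mathbb{R}^{n \times 15}
\;\xrightarrow{\text{factor analysis}}\;
A_3 \in \mathbb{R}^{n \times 4}
\;\xrightarrow{K\text{-means}}\;
\text{clustering result}

D_2 = \frac{\sum_{i=1}^{M} \lambda_{1,i}}{\sum_{j=1}^{p} \lambda_{1,j}},
\qquad
A_3 = A_2 E_N, \quad E_N \in \mathbb{R}^{15 \times 4}
```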
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510355359.5A CN104899331A (en) | 2015-06-24 | 2015-06-24 | Television used behavior data clustering method and device and Spark big data platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510355359.5A CN104899331A (en) | 2015-06-24 | 2015-06-24 | Television used behavior data clustering method and device and Spark big data platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104899331A (en) | 2015-09-09 |
Family
ID=54031993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510355359.5A Pending CN104899331A (en) | 2015-06-24 | 2015-06-24 | Television used behavior data clustering method and device and Spark big data platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104899331A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294882A (en) * | 2016-08-30 | 2017-01-04 | 北京京东尚科信息技术有限公司 | Data digging method and device |
CN107203772A (en) * | 2016-03-16 | 2017-09-26 | 阿里巴巴集团控股有限公司 | A kind of user type recognition methods and device |
CN107480187A (en) * | 2017-07-10 | 2017-12-15 | 北京京东尚科信息技术有限公司 | User's value category method and apparatus based on cluster analysis |
CN108899930A (en) * | 2018-07-09 | 2018-11-27 | 西华大学 | Wind-powered electricity generation station equivalent modeling method based on Principal Component Analysis Method and hierarchical clustering algorithm |
CN110598963A (en) * | 2018-06-13 | 2019-12-20 | 顺丰科技有限公司 | Method, device, equipment and storage medium for matching human posts |
CN111339294A (en) * | 2020-02-11 | 2020-06-26 | 普信恒业科技发展(北京)有限公司 | Client data classification method and device and electronic equipment |
CN111651755A (en) * | 2020-05-08 | 2020-09-11 | 中国联合网络通信集团有限公司 | Intrusion detection method and device |
CN111866001A (en) * | 2020-07-27 | 2020-10-30 | 周蓉 | Intelligent equipment data processing method based on big data and cloud computing and cloud server |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101702653A (en) * | 2009-10-27 | 2010-05-05 | 中国科学院声学研究所 | Message announcing system based on locating user behavior and method thereof |
CN103377242A (en) * | 2012-04-25 | 2013-10-30 | Tcl集团股份有限公司 | User behavior analysis method, user behavior analytical prediction method and television program push system |
CN103106615A (en) * | 2013-01-28 | 2013-05-15 | 上海交通大学 | Excavated user behavior analysis method based on television watching log |
Non-Patent Citations (2)
Title |
---|
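Li Xinrui (李新蕊): "Comparison and Application of Principal Component Analysis, Factor Analysis, and Cluster Analysis" (主成分分析_因子分析_聚类分析的比较与应用), Journal of Shandong Education Institute (《山东教育学院学报》) * |
Zhong Yan (钟燕) et al.: "An Empirical Study of the Operating Performance of China's Listed Agricultural Companies: Based on Principal Component Analysis, Factor Analysis, and Cluster Analysis" (我国农业上市公司经营绩效的实证研究——基于主成分分析、因子分析与聚类分析), Technoeconomics & Management Research (《技术经济与管理研究》) * |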
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107203772A (en) * | 2016-03-16 | 2017-09-26 | 阿里巴巴集团控股有限公司 | A kind of user type recognition methods and device |
CN107203772B (en) * | 2016-03-16 | 2020-11-06 | 创新先进技术有限公司 | User type identification method and device |
CN106294882A (en) * | 2016-08-30 | 2017-01-04 | 北京京东尚科信息技术有限公司 | Data digging method and device |
CN107480187A (en) * | 2017-07-10 | 2017-12-15 | 北京京东尚科信息技术有限公司 | User's value category method and apparatus based on cluster analysis |
CN110598963A (en) * | 2018-06-13 | 2019-12-20 | 顺丰科技有限公司 | Method, device, equipment and storage medium for matching human posts |
CN108899930A (en) * | 2018-07-09 | 2018-11-27 | 西华大学 | Wind-powered electricity generation station equivalent modeling method based on Principal Component Analysis Method and hierarchical clustering algorithm |
CN111339294A (en) * | 2020-02-11 | 2020-06-26 | 普信恒业科技发展(北京)有限公司 | Client data classification method and device and electronic equipment |
CN111651755A (en) * | 2020-05-08 | 2020-09-11 | 中国联合网络通信集团有限公司 | Intrusion detection method and device |
CN111651755B (en) * | 2020-05-08 | 2023-04-18 | 中国联合网络通信集团有限公司 | Intrusion detection method and device |
CN111866001A (en) * | 2020-07-27 | 2020-10-30 | 周蓉 | Intelligent equipment data processing method based on big data and cloud computing and cloud server |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104899331A (en) | Television used behavior data clustering method and device and Spark big data platform | |
US10650245B2 (en) | Generating digital video summaries utilizing aesthetics, relevancy, and generative neural networks | |
CN106326391B (en) | Multimedia resource recommendation method and device | |
CN108737856B (en) | Social relation perception IPTV user behavior modeling and program recommendation method | |
US20190045273A1 (en) | Enhanced program guide | |
US8639636B2 (en) | System and method for user behavior modeling | |
CN109511015B (en) | Multimedia resource recommendation method, device, storage medium and equipment | |
US8849798B2 (en) | Sampling analysis of search queries | |
Fontanini et al. | Web video popularity prediction using sentiment and content visual features | |
WO2015081915A1 (en) | File recommendation method and device | |
CN105808581B (en) | Data clustering method and device and Spark big data platform | |
CN111259195A (en) | Video recommendation method and device, electronic equipment and readable storage medium | |
US10601953B2 (en) | Decomposing media content accounts for persona-based experience individualization | |
CN106294794A (en) | A kind of content recommendation method and device | |
US11051070B2 (en) | Clustering television programs based on viewing behavior | |
CN107635143B (en) | Method for predicting user's drama chase on television based on watching behavior | |
CN107592572B (en) | Video recommendation method, device and equipment | |
CN104144181A (en) | Terminal aggregation method and system for network videos | |
CN109784365A (en) | A kind of feature selection approach, terminal, readable medium and computer program | |
WO2023087914A1 (en) | Method and apparatus for selecting recommended content, and device, storage medium and program product | |
CN105681910A (en) | Video recommending method and device based on multiple users | |
CN104967690A (en) | Information push method and device | |
Bulysheva et al. | Segmentation modeling algorithm: a novel algorithm in data mining | |
CN113283351A (en) | Video plagiarism detection method using CNN to optimize similarity matrix | |
CN104217016B (en) | Webpage search keyword statistical method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150909 |