CN104899331A - Television user behavior data clustering method and device and Spark big data platform

Info

Publication number: CN104899331A
Application number: CN201510355359.5A
Authority: CN
Inventors: 冯研
Original assignee: TCL Corp
Current assignee: TCL Corp
Priority date: 2015-06-24
Filing date: 2015-06-24
Publication date: 2015-09-09

Abstract

The invention belongs to the technical field of digital television and provides a clustering method and device for television user behavior data, and a Spark big data platform. The method comprises: obtaining television user behavior data and storing it in a first matrix A1, which is a matrix of n rows by m columns; applying principal component analysis to perform attribute reduction on the first matrix A1, obtaining a second matrix A2 of n rows by 15 columns; applying factor analysis to perform attribute conversion on the second matrix A2, obtaining a third matrix A3 of n rows by 4 columns; and clustering the third matrix A3 with the K-means clustering algorithm to obtain a clustering result. Because the third matrix A3 is low-dimensional, the amount of computation does not grow geometrically.

Description

Clustering method and device for TV user behavior data, and Spark big data platform

Technical field

The invention belongs to the technical field of digital television, and in particular relates to a clustering method and device for TV user behavior data, and a Spark big data platform.

Background art

With the rapid development of modern communication technology and the gradual spread of multimedia television, digital television has become the main way for most households to obtain information. Advances in technology make it possible to collect a large amount of TV user behavior data every day. How to classify users based on high-dimensional TV user behavior data, and how to carry out corresponding marketing and promotional activities based on that classification, has become a problem that urgently needs solving. However, traditional clustering methods for TV user behavior data have the following defects when analyzing high-dimensional data:

(1) A high-dimensional data set may contain a large number of irrelevant attributes, so the probability that clusters exist across all dimensions is almost zero;

(2) Data in a high-dimensional space is distributed more sparsely than data in a low-dimensional space, and it is common for the distances between data points to be almost equal;

(3) Traditional clustering algorithms (such as hierarchical clustering and K-means clustering) are general-purpose data clustering methods. These algorithms use distance matrices, so their time and space complexity are both high; when the dimensionality of the data is high (that is, when the space complexity increases), the amount of computation grows geometrically;

(4) Classical data clustering algorithms are all designed for a stand-alone environment; when the data to be processed is massive, the limited resources of a single machine cannot complete the data mining task well.

Summary of the invention

Embodiments of the present invention provide a clustering method and device for TV user behavior data, and a Spark big data platform, intended to solve the problem that, in the clustering methods provided by the prior art, the TV user behavior data being processed is high-dimensional, which causes the amount of computation to grow geometrically.

In one aspect, a clustering method for TV user behavior data is provided, the method comprising:

obtaining TV user behavior data and storing the TV user behavior data in a first matrix A1, the first matrix A1 being a matrix of n rows by m columns, where n represents the number of users and m represents the number of video attributes watched by users;

performing attribute reduction on the first matrix A1 using principal component analysis to obtain a second matrix A2, the second matrix A2 being a matrix of n rows by 15 columns;

performing attribute conversion on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows by 4 columns;

clustering the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.

Further, the method for described use principal component analysis (PCA) carries out attribute reduction process to described first matrix A 1, obtains the second matrix A 2, comprising:

Call principal component analysis (PCA) code, described first matrix A 1 is processed, obtain the factor coefficient loading matrix C of the characteristic root λ 1 of each major component, each major component;

Based on the value of the characteristic root λ 1 of each major component, pick out front M the major component that λ 1 value is more than or equal to preset first threshold value, and calculate the accumulative variance contribution degree D2 of a front M major component;

Based on the factor coefficient loading matrix C of a front M major component, pick out coefficient in each major component and be greater than the attribute of default Second Threshold, the attribute that coefficient in each major component is greater than default Second Threshold is carried out merging yojan, obtains attribute reduction rule list;

According to described attribute reduction rule list, the video attribute in described first matrix A 1 is merged, obtain the second matrix A 2.

Further, the method for described usage factor analysis carries out attribute conversion process to described second matrix A 2, obtains the 3rd matrix A 3, comprising:

Call the code of factorial analysis, the method for usage factor analysis processes described second matrix A 2, obtains the characteristic root λ 2 of each factor, factor rubble figure, factor coefficient loading matrix E;

Based on the eigenwert root λ 2 of each factor, and in conjunction with described factor rubble figure, draw the factor coefficient loading matrix E that eigenwert is greater than the top n factor of default 3rd threshold value and described top n factor pair and answers;

The 3rd matrix A 3 is obtained according to the factor coefficient loading matrix E that described second matrix A 2 and described top n factor pair are answered.

Further, the method analyzed in described usage factor carries out attribute conversion process to described second matrix A 2, after obtaining the 3rd matrix A 3, also comprises:

Concurrent operation based on K-mean algorithm carries out clustering processing to described 3rd matrix A 3, obtains cluster result.

In another aspect, a clustering device for TV user behavior data is provided, the device comprising:

a data acquisition unit, configured to obtain TV user behavior data and store the TV user behavior data in a first matrix A1, the first matrix A1 being a matrix of n rows by m columns, where n represents the number of users and m represents the number of video attributes watched by users;

a first dimensionality reduction unit, configured to perform attribute reduction on the first matrix A1 using principal component analysis to obtain a second matrix A2, the second matrix A2 being a matrix of n rows by 15 columns;

a second dimensionality reduction unit, configured to perform attribute conversion on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows by 4 columns;

a first clustering unit, configured to cluster the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.

Further, the first dimensionality reduction unit comprises:

a first processing module, configured to call principal component analysis code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component;

a second processing module, configured to select, based on the eigenvalue λ1 of each principal component, the first M principal components whose λ1 is greater than or equal to a preset first threshold, and to calculate the cumulative variance contribution D2 of the first M principal components;

a third processing module, configured to select, based on the factor loading matrix C of the first M principal components, the attributes in each principal component whose coefficient is greater than a preset second threshold, and to merge and reduce those attributes to obtain an attribute reduction rule table;

a merging module, configured to merge the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.

Further, the second dimensionality reduction unit comprises:

a third processing module, configured to call factor analysis code to process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;

a fourth processing module, configured to determine, based on the eigenvalue λ2 of each factor and combined with the factor scree plot, the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;

a fifth processing module, configured to obtain the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.

Further, the device also comprises:

a second clustering unit, configured to cluster the third matrix A3 based on a parallel K-means computation to obtain a clustering result.

In yet another aspect, a Spark big data platform is provided, the Spark big data platform comprising the clustering device for TV user behavior data described above.

In the embodiments of the present invention, attribute reduction is performed on the high-dimensional first matrix A1 by principal component analysis to obtain the second matrix A2; attribute conversion is then performed on the second matrix A2 by factor analysis to obtain the third matrix A3, which is a low-dimensional matrix of n rows by 4 columns; finally, the K-means clustering algorithm is used to cluster this low-dimensional matrix and obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional TV user behavior data, the amount of computation does not grow geometrically during clustering. This solves the problem that, in the clustering methods provided by the prior art, the TV user behavior data being processed is high-dimensional, which causes the amount of computation to grow geometrically.

Brief description of the drawings

Fig. 1 is a flowchart of the clustering method for TV user behavior data provided by embodiment one of the present invention;

Fig. 2 is a schematic diagram of the structure of the parallel K-means computation in the clustering method for TV user behavior data provided by embodiment one of the present invention;

Fig. 3 is a structural block diagram of the clustering device for TV user behavior data provided by embodiment two of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only serve to explain the present invention and are not intended to limit it.

In the embodiments of the present invention, attribute reduction is performed on the high-dimensional first matrix A1 by principal component analysis to obtain the second matrix A2; attribute conversion is then performed on the second matrix A2 by factor analysis to obtain the third matrix A3, which is a low-dimensional matrix of n rows by 4 columns; finally, the K-means clustering algorithm is used to cluster this low-dimensional matrix and obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional TV user behavior data, the amount of computation does not grow geometrically during clustering.

The implementation of the present invention is described in detail below with reference to specific embodiments:

Embodiment one

Fig. 1 shows the implementation flow of the clustering method for TV user behavior data provided by embodiment one of the present invention, detailed as follows:

In step S101, TV user behavior data is obtained and stored in a first matrix A1, the first matrix A1 being a matrix of n rows by m columns, where n represents the number of users and m represents the number of video attributes watched by users.

In this embodiment of the present invention, TV user behavior data mainly comprises two kinds of attribute content: the attribute information of the videos the user watches, and the behavior data produced while the user watches videos. The video attribute information in particular can reach a hundred dimensions or more. The TV user behavior data is cleaned and converted on the efficient Spark big data platform to obtain a duration matrix A1 recording, for each user, the viewing duration of each live-program attribute over a period of time. The structure of the duration matrix A1 is as follows:

where the number of rows n represents the number of users, and the number of columns m represents the number of video attributes watched by users.
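As an illustration of the cleaning and conversion step, the following Python sketch aggregates raw viewing records into a duration matrix. The record layout (user ID, attribute, duration) and the function name build_duration_matrix are assumptions made for this example; the source does not describe the platform's actual data format.

```python
def build_duration_matrix(records):
    """Aggregate (user_id, attribute, duration) records into an n x m matrix.

    Rows follow the sorted user IDs, columns follow the sorted attribute names.
    The record layout is hypothetical and only serves this illustration.
    """
    users = sorted({user for user, _, _ in records})
    attrs = sorted({attr for _, attr, _ in records})
    row = {user: i for i, user in enumerate(users)}
    col = {attr: j for j, attr in enumerate(attrs)}
    matrix = [[0.0] * len(attrs) for _ in users]
    for user, attr, duration in records:
        # Several records for the same user and attribute are summed.
        matrix[row[user]][col[attr]] += duration
    return users, attrs, matrix


# Hypothetical viewing records.
records = [
    ("u1", "Amusement", 30.0),
    ("u1", "Sports", 5.0),
    ("u2", "Amusement", 12.5),
    ("u1", "Amusement", 10.0),
]
users, attrs, A1 = build_duration_matrix(records)
print(users, attrs, A1)
```

Each row of the resulting matrix corresponds to one user and each column to one video attribute, matching the n-by-m layout described above.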

In step S102, attribute reduction is performed on the first matrix A1 using principal component analysis to obtain a second matrix A2, the second matrix A2 being a matrix of n rows by 15 columns.

In this embodiment of the present invention, the description takes as an example a live-program viewing matrix A covering 58,333 users and 86 attribute dimensions, used as the first matrix A1.

First step: import the first matrix A1 into the R statistical analysis software and call the principal component analysis code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component.

Input: the first matrix A1. Output: the eigenvalue λ1 of each principal component, the variance contribution D and cumulative variance contribution D1 of each principal component, and the factor loading matrix C of each principal component.

Second step: based on the eigenvalue λ1 of each principal component, select the first M principal components whose λ1 is greater than or equal to the preset first threshold, and calculate the cumulative variance contribution D2 of the first M principal components.

The first threshold is 1. The principal component analysis code continues by processing the eigenvalues λ1 obtained in the first step: it selects the first M principal components with λ1 >= 1 and calculates their cumulative variance contribution D2. If the cumulative variance contribution D1 of the first M principal components is >= 80%, the number of principal components selected is considered suitable.

After the first and second steps are applied to the first matrix A1 shown in Table 1, the first 15 principal components are obtained; their eigenvalues λ1, variance contributions and cumulative variance contributions are shown in Table 2 below.

User ID	Amusement	Sports	Information	Tourism
85374799a028ac2d74c33505cee9904c02bb4188	56.73	3.31	1.92	0.46
5ef937c5b39659d55fe8a1d22fe466cb4042f393	54.72	3.1	2.3	0.34
e6a6412a9c457edaf40bec4a1d80e0c63a6cfda9	53.81	0	1.25	0.44
fb46307c0fdae3bd948e4a8ab5c47e08a21b1fc7	53.45	8.45	1.44	0.23
6de40a065be1417c83befd4f019b72462175902b	52.95	15.19	2.72	0.02
4a0c7f3717f4d749e933b63605a75deca936f304	52.7	27.56	6.64	0.11
34f6636f483ed611a2b834947a015906efde69f8	52.56	0.18	1.88	2.14
4fce63266b4b1fc2a5b97ae506653d37be952491	51.68	0.43	5	1.48
3b3602111c19b54cdcac0172872750a223e100a1	50.81	4.5	24.71	0.5
ad7c3b868d3fcf59a9076068dc9774dc0e6f511e	50.21	0.75	0	0
d7e8b3251c0b8ada566221e5f85e532d638150d6	50.17	0	0	0.59
cc306687816e0d22e2e22b75590fcbffeb46864e	49.97	0	2.09	0
917670faccb41b97b6b47758030b8b9641dc8eab	48.24	1.61	1.32	1.64
2da1bd15d324a6ad28ef6b99732a70689681c415	47.57	0.05	0.04	1.23
961b1e281bb2c49f1cafc4a47a46d41e941ed6db	47.33	3.83	1.84	2.15
76a66116b0bde1235acfcf5c8a12939ed1355990	46.12	0.15	9.15	0
02750b735181f1949c5225e322cf835e105224c8	45.27	4.28	4.04	4.6
6f963460c487470f215ae9a7879d6849e31c52ca	44.67	0	0.34	0.05
0e2aa3f1dbe72e15de4398cc655cdcd461bc5723	44.57	0.22	0.23	0.73

Table 1

Principal component	Eigenvalue (λ)	Variance (%)	Cumulative variance (%)
1	24.7	28.74	28.74
2	4.47	6.2	34.94
3	3.6	5.19	40.13
4	3.33	4.87	45
5	3	4.46	49.46
6	2.5	3.91	53.37
7	2.33	3.71	57.08
8	1.9	3.21	60.29
9	1.82	3.12	63.41
10	1.56	2.81	66.22
11	1.52	2.77	68.99
12	1.44	2.67	71.66
13	1.33	2.54	74.2
14	1.28	2.49	76.69
15	1.13	2.31	79

Table 2
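The selection rule of the second step can be checked against the values in Table 2. The sketch below is only an illustration in Python (the source uses R); the function name select_components is hypothetical, and the eigenvalues are assumed to be sorted in descending order.

```python
# Eigenvalues and variance contributions (%) of the 15 components in Table 2.
eigenvalues = [24.7, 4.47, 3.6, 3.33, 3, 2.5, 2.33, 1.9,
               1.82, 1.56, 1.52, 1.44, 1.33, 1.28, 1.13]
variance_pct = [28.74, 6.2, 5.19, 4.87, 4.46, 3.91, 3.71, 3.21,
                3.12, 2.81, 2.77, 2.67, 2.54, 2.49, 2.31]


def select_components(eigenvalues, variance_pct, first_threshold=1.0):
    """Return (M, D2): the number of leading components whose eigenvalue is
    >= first_threshold, and their cumulative variance contribution in %."""
    m = 0
    for lam in eigenvalues:
        if lam < first_threshold:
            break  # eigenvalues are assumed sorted in descending order
        m += 1
    d2 = round(sum(variance_pct[:m]), 2)
    return m, d2


M, D2 = select_components(eigenvalues, variance_pct)
print(M, D2)
```

With the Table 2 values this gives M = 15 and D2 = 79.0, in agreement with the last row of the table.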

Third step: based on the factor loading matrix C of the first M principal components, select in each principal component the attributes whose coefficient is greater than the preset second threshold, and merge and reduce those attributes to obtain the attribute reduction rule table.

The preset second threshold is 0.5. An excerpt of the factor loading matrix C computed in the first step is shown in Table 3; in Table 3 only the coefficients of principal component 3 (science, education and humanities) are sorted. The attributes are screened according to the magnitude of their coefficients in each principal component, and the resulting attribute reduction rule table is shown in Table 4.

Attribute	1	2	3	4	5	6	7
Comprehensive	.916	-.127	.049	-.106	.064	-.291	-.022
Focus	.855	-.060	-.170	-.135	-.109	-.224	.025
Life	.815	-.025	.035	-.014	.165	-.054	-.073
Domestic	.773	-.299	-.137	-.045	.178	-.198	.028
TV play	.769	.316	-.044	-.241	.307	.063	-.099
News	.739	-.195	-.073	-.135	.130	-.388	.112
Interview	.737	-.178	-.027	-.068	-.050	-.184	.001
International	.732	-.305	-.164	-.042	.177	-.181	.060
Amusement	.731	.113	-.353	-.174	-.433	.100	-.202
Drama	.722	.299	-.068	-.211	.276	.049	-.035
Humanities	.721	-.177	.325	-.014	-.156	-.033	.010
Variety	.710	.094	-.341	-.160	-.408	.117	-.168
Star	.693	.175	-.318	-.205	-.388	.145	-.112
Science and education	.688	-.185	.547	.107	-.188	.095	-.040
People's livelihood	.686	-.136	.029	-.083	.095	-.444	-.088
Love	.666	.289	-.082	-.208	.251	.019	-.094
Interactive	.662	.004	-.341	.065	-.351	.035	-.152

Table 3

Table 4
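The screening of the third step can be sketched as follows, using four rows and the first three principal components from Table 3. This is an illustrative Python sketch, not the code used in the source; the function name is hypothetical, and the comparison is done on the signed coefficient as the text literally states (the source does not say whether absolute values are meant).

```python
SECOND_THRESHOLD = 0.5

# Loadings on principal components 1-3 for four attributes, copied from Table 3.
loadings = {
    "Focus": [0.855, -0.060, -0.170],
    "Amusement": [0.731, 0.113, -0.353],
    "Variety": [0.710, 0.094, -0.341],
    "Science and education": [0.688, -0.185, 0.547],
}


def attributes_above_threshold(loadings, threshold):
    """For each principal component (numbered from 1), list the attributes
    whose loading on that component is greater than the threshold."""
    n_components = len(next(iter(loadings.values())))
    selected = {pc: [] for pc in range(1, n_components + 1)}
    for attr, coeffs in loadings.items():
        for pc, coeff in enumerate(coeffs, start=1):
            if coeff > threshold:
                selected[pc].append(attr)
    return selected


rule_candidates = attributes_above_threshold(loadings, SECOND_THRESHOLD)
for pc, attrs in rule_candidates.items():
    print(pc, attrs)
```

For these four rows, all attributes pass the threshold on component 1, none on component 2, and only "Science and education" on component 3. How the selected attributes are then grouped into the rule table of Table 4 is not reproduced in the source and is not modelled here.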

Fourth step: merge the video attributes in the first matrix A1 according to the attribute reduction rule table obtained in the third step to obtain the second matrix A2.

The second matrix A2 is the new TV user behavior matrix, containing the users' video attribute information, that is obtained after processing the video attributes in the first matrix A1. It is a matrix of n rows by 15 columns, where n is the number of users. The second matrix A2 obtained after processing the first matrix A1 shown in Table 1 is shown in Table 5.

Table 5
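The merging of the fourth step might look like the sketch below. Because Table 4 is not reproduced, the rule table here is hypothetical, as are the sample durations; summing the merged columns is also an assumption, since the source does not state how merged attributes are combined.

```python
def merge_attributes(matrix, columns, rules):
    """Merge columns of `matrix` according to `rules`.

    `rules` maps each new attribute name to the list of original column
    names it replaces. Merged values are summed (an assumption).
    """
    index = {name: j for j, name in enumerate(columns)}
    merged_names = list(rules)
    merged = [
        [sum(row[index[src]] for src in rules[name]) for name in merged_names]
        for row in matrix
    ]
    return merged_names, merged


# Hypothetical excerpt of A1 (viewing durations) and a hypothetical rule table.
columns = ["Amusement", "Variety", "Domestic", "International"]
A1 = [
    [10.0, 2.5, 4.0, 1.0],
    [0.0, 3.0, 6.5, 0.5],
]
rules = {
    "Amusement variety": ["Amusement", "Variety"],
    "Domestic and international news": ["Domestic", "International"],
}
new_columns, A2 = merge_attributes(A1, columns, rules)
print(new_columns, A2)
```

The number of rows is unchanged, while the number of columns drops to the number of rules, which is how the 86 original attributes would shrink to 15 in the example of the source.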

In step S103, attribute conversion is performed on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows by 4 columns.

In this embodiment of the present invention, first step: import the second matrix A2 into the R statistical analysis software and call the factor analysis code to process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E.

Input: the second matrix A2. Output: the eigenvalue λ2 of each factor, the factor scree plot, and the factor loading matrix E.

Second step: based on the eigenvalue λ2 of each factor, combined with the factor scree plot output by the first step, determine the first N factors whose eigenvalue is greater than the preset third threshold and the factor loading matrix E corresponding to the first N factors.

The preset third threshold is 1. Based on the eigenvalue λ2 of each factor, combined with the output scree plot, 4 factors are picked out from the second matrix A2, namely the factors whose eigenvalue (SS loadings) is greater than 1, as shown in Table 6. The factor loading matrix E of these 4 factors is shown in Table 7:

Statistic	Factor1	Factor2	Factor3	Factor4
SS loadings	2.688	2.191	1.86	1.796
Proportion Var	0.179	0.146	0.124	0.12
Cumulative Var	0.179	0.325	0.449	0.569

Table 6
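The rows of Table 6 are related in a simple way, assuming the table follows the usual convention of R factor analysis output in which the proportion of variance of a factor equals its SS loadings divided by the number of variables (here the 15 attributes of the second matrix A2). The Python sketch below applies the third threshold and recomputes the two variance rows; the variable names are chosen for this example only.

```python
from itertools import accumulate

N_ATTRIBUTES = 15       # number of columns of the second matrix A2
THIRD_THRESHOLD = 1.0   # preset third threshold on the eigenvalue (SS loadings)

# SS loadings of the four factors, copied from Table 6.
ss_loadings = {"Factor1": 2.688, "Factor2": 2.191, "Factor3": 1.86, "Factor4": 1.796}

# Keep the factors whose SS loadings exceed the threshold.
retained = [name for name, ss in ss_loadings.items() if ss > THIRD_THRESHOLD]

# Proportion of variance per retained factor and its running total.
proportion_var = [ss_loadings[name] / N_ATTRIBUTES for name in retained]
cumulative_var = list(accumulate(proportion_var))

print(retained)
print([round(p, 3) for p in proportion_var])
print([round(c, 3) for c in cumulative_var])
```

Rounded to three decimals, the recomputed values match the Proportion Var and Cumulative Var rows of Table 6, so the four retained factors together account for about 56.9% of the variance.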

Attribute	Factor1	Factor2	Factor3	Factor4
Science, education and humanities	0.362	0.425	0.236	0.493
Amusement variety	0.319	0.234	0.9	0.167
Domestic News	0.349	0.8	0.288	0.274
Finance and money management	0.132	0.438	0.186	0.209
Juvenile education	0.284	0.16	0.187	0.15
Film	0.447	0.156	0.237	0.258
Healthy living	0.453	0.489	0.205	0.376
Agriculture and military	0.182	0.234	0.106	0.78
Tourism and cuisine	0.234	0.316	0.232	0.627
Society and legal system	0.261	0.599	0.16	0.209
Sports	0.121	0.251	0.226	0.124
Family ethics	0.515	0.252	0.385	0.188
TV play	0.899	0.238	0.196	0.204
Teen idol	0.736	0.283	0.192	0.132
Fashion music	0.208	0.241	0.615	0.145

Table 7

The attributes with larger coefficients in the matrix E shown in Table 7 are: TV play, teen idol, healthy living and film.

Factor 1 can be named the idol drama and domestic drama category (the film and television factor); factor 2, the news and social health category (the information factor); factor 3, the entertainment and fashion category (the trend factor); and factor 4, the lifestyle and science education category (the leisure and popular science factor).

Third step: obtain the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.

The second matrix A2 is a matrix of n rows by 15 columns. Multiplying it by the factor loading matrix E of the first N factors computed in the second step yields the third matrix A3. E is a matrix of 15 rows by 4 columns, and the third matrix A3 is a matrix of n rows by 4 columns, where n is the number of users.
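The multiplication of the third step is an ordinary matrix product, A3 = A2 × E, in which the inner dimensions must agree (15 in the source). The Python sketch below shows the shape logic on a small hypothetical example (2 × 3 times 3 × 2); the values are made up for illustration.

```python
def matmul(a, b):
    """Multiply matrix a (n x p) by matrix b (p x q), returning an n x q matrix."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical stand-ins: A2 is n x p (here 2 x 3), E is p x q (here 3 x 2).
A2 = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
]
E = [
    [0.5, 0.0],
    [0.0, 1.0],
    [1.0, 0.5],
]
A3 = matmul(A2, E)
print(A3)
```

In the setting of the source, A2 would have 15 columns and E would have 15 rows and 4 columns, so every user row of A3 holds 4 factor values.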

In step S104, the K-means algorithm is used to cluster the third matrix A3, obtaining a clustering result.

In this embodiment of the present invention, the third matrix A3 is normalized and standardized to eliminate the influence of differing units and scales, giving a fourth matrix A4. K objects are first selected at random as the initial cluster centers. The distance between each object and each seed cluster center is then calculated (Euclidean distance is generally chosen), and each object is assigned to the cluster center nearest to it. Once all objects have been assigned, the center of each cluster is recalculated from the objects currently in that cluster. These steps are iterated until the termination condition set for the clustering is met (the cluster centers no longer change between iterations, or the change between two successive centers is less than a threshold). The clustering then ends, yielding the K clusters that were set. The result includes the coordinates of each cluster center, the number of cases in each cluster, and so on.
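The procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the source: z-score standardization is assumed for the "normalized and standardized" step, the function names and the sample data are hypothetical, and an optional `init` argument is added so that the initial centers can be fixed instead of drawn at random.

```python
import math
import random


def standardize(rows):
    """Z-score every column so that no attribute dominates because of its scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has standard deviation 0; use 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest(point, centers):
    """Index of the center with the smallest Euclidean distance to the point."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0, init=None):
    """Return (centers, labels, sizes) after iterating assignment and update."""
    centers = [list(c) for c in (init or random.Random(seed).sample(points, k))]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the previous center if a cluster ends up empty.
            new_centers.append(
                [sum(c) / len(members) for c in zip(*members)] if members else centers[j]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers no longer move: stop iterating
            break
    sizes = [labels.count(j) for j in range(k)]
    return centers, labels, sizes


# Hypothetical stand-in for the third matrix A3 (two columns instead of four).
A3 = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]]
A4 = standardize(A3)
centers, labels, sizes = kmeans(A4, k=2)
print(labels, sizes)
```

The returned centers and sizes correspond to the cluster center coordinates and the number of cases per cluster mentioned above. Random initialization can lead to different local optima, which is why a seed or explicit initial centers are exposed here.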

Preferably, because classical data clustering algorithms all run in a stand-alone environment and cannot complete data mining tasks well when the data to be processed is massive, data mining needs to be combined with other technologies to parallelize the mining algorithm, using the resources of multiple machines to improve the efficiency of the mining task. The structure of the parallel K-means computation is shown in Fig. 2, and the detailed steps are as follows:

Step 11: generate a resilient distributed dataset (Resilient Distributed Datasets, RDD) from the data in the fourth matrix A4.

Step 12: apply a Map operation to the RDD to calculate the distance between each data object in the fourth matrix A4 and the K initial cluster centers, then apply a Reduce operation to the generated MapRDD to produce K new cluster centers. The change in the cluster centers is compared with a threshold (iteration stops when the cluster center positions no longer change between two successive iterations, or when the number of iterations reaches the preset limit). If the change is greater than the threshold, the initial cluster centers are replaced with the new cluster centers and the Map and Reduce operations are repeated, until the iteration yields K stable cluster centers.
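The Map and Reduce pattern of steps 11 and 12 can be imitated on a single machine with Python's built-in map and functools.reduce. The sketch below does not use the Spark API; it only mirrors the data flow, and all function names and sample points are hypothetical. On an actual Spark platform the same roles would presumably be played by RDD operations such as map and reduceByKey.

```python
import math
from functools import reduce


def map_point(point, centers):
    """Map: emit (index of nearest center, (coordinate vector, count of 1))."""
    j = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return j, (list(point), 1)


def reduce_pairs(acc, pair):
    """Reduce: add up coordinate sums and counts per center index."""
    j, (vec, cnt) = pair
    if j in acc:
        total, count = acc[j]
        acc[j] = ([a + b for a, b in zip(total, vec)], count + cnt)
    else:
        acc[j] = (vec, cnt)
    return acc


def map_reduce_kmeans(points, centers, tol=1e-9, max_iter=50):
    """Repeat Map and Reduce until the centers move by at most tol."""
    for _ in range(max_iter):
        mapped = map(lambda p: map_point(p, centers), points)
        totals = reduce(reduce_pairs, mapped, {})
        new_centers = [
            [v / totals[j][1] for v in totals[j][0]] if j in totals else centers[j]
            for j in range(len(centers))
        ]
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= tol:
            break
    return centers


# Hypothetical rows of the fourth matrix A4 and two initial centers.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
final_centers = map_reduce_kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(final_centers)
```

The design point is that the Map stage needs only the current centers and one data object at a time, so it can be spread across machines, while the Reduce stage combines partial sums and counts, which is an associative operation.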

In this embodiment, attribute reduction is performed on the high-dimensional first matrix A1 by principal component analysis to obtain the second matrix A2; attribute conversion is then performed on the second matrix A2 by factor analysis to obtain the third matrix A3, which is a low-dimensional matrix of n rows by 4 columns; finally, the K-means clustering algorithm is used to cluster this low-dimensional matrix and obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional TV user behavior data, the amount of computation does not grow geometrically during clustering. This solves the problem that, in the clustering methods provided by the prior art, the TV user behavior data being processed is high-dimensional, which causes the amount of computation to grow geometrically.

In addition, the data in the fourth matrix A4 can also be classified based on the parallel K-means computation, which makes full use of multi-machine resources and improves the efficiency of the mining task.

Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.

Embodiment two

Fig. 3 shows a structural block diagram of the clustering device for TV user behavior data provided by embodiment two of the present invention; for ease of description, only the parts relevant to this embodiment are shown. The clustering device may be a software unit, a hardware unit, or a combined software and hardware unit built into the Spark big data platform. The clustering device 11 for TV user behavior data comprises: a data acquisition unit 111, a first dimensionality reduction unit 112, a second dimensionality reduction unit 113 and a first clustering unit 114.

The data acquisition unit 111 is configured to obtain TV user behavior data and store the TV user behavior data in a first matrix A1, the first matrix A1 being a matrix of n rows by m columns, where n represents the number of users and m represents the number of video attributes watched by users;

the first dimensionality reduction unit 112 is configured to perform attribute reduction on the first matrix A1 using principal component analysis to obtain a second matrix A2, the second matrix A2 being a matrix of n rows by 15 columns;

the second dimensionality reduction unit 113 is configured to perform attribute conversion on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows by 4 columns;

the first clustering unit 114 is configured to cluster the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.

Further, the first dimensionality reduction unit 112 comprises:

Further, the second dimensionality reduction unit 113 comprises:

Further, the device 11 also comprises:

a second clustering unit, configured to cluster the third matrix A3 based on a parallel K-means computation to obtain a clustering result. The device provided by this embodiment of the present invention can be applied in the corresponding method embodiment one above; for details, see the description of embodiment one, which is not repeated here.

It should be noted that in the above system embodiment the units are divided according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of distinguishing them from one another and do not limit the protection scope of the present invention.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A clustering method for TV user behavior data, characterized in that the method comprises:

2. The method of claim 1, characterized in that performing attribute reduction on the first matrix A1 using principal component analysis to obtain the second matrix A2 comprises:

3. The method of claim 1 or 2, characterized in that performing attribute conversion on the second matrix A2 using factor analysis to obtain the third matrix A3 comprises:

4. The method of claim 1, characterized in that, after performing attribute conversion on the second matrix A2 using factor analysis to obtain the third matrix A3, the method further comprises:

5. A clustering device for TV user behavior data, characterized in that the device comprises:

6. The device of claim 5, characterized in that the first dimensionality reduction unit comprises:

7. The device of claim 5 or 6, characterized in that the second dimensionality reduction unit comprises:

8. The device of claim 5, characterized in that the device further comprises:

9. A Spark big data platform, characterized in that the Spark big data platform comprises the clustering device for TV user behavior data of any one of claims 5 to 8.