CN104899331A - Television user behavior data clustering method and device and Spark big data platform - Google Patents

Info

Publication number
CN104899331A
Authority
CN
China
Prior art keywords
matrix
factor
attribute
major component
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510355359.5A
Other languages
Chinese (zh)
Inventor
冯研
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Corp
Original Assignee
TCL Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Corp filed Critical TCL Corp
Priority to CN201510355359.5A priority Critical patent/CN104899331A/en
Publication of CN104899331A publication Critical patent/CN104899331A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of digital television and provides a clustering method and device for television user behavior data, and a Spark big data platform. The method comprises: obtaining television user behavior data and storing it in a first matrix A1, which is a matrix of n rows * m columns; performing attribute reduction on the first matrix A1 using principal component analysis to obtain a second matrix A2, which is a matrix of n rows * 15 columns; performing attribute conversion on the second matrix A2 using factor analysis to obtain a third matrix A3, which is a matrix of n rows * 4 columns; and clustering the third matrix A3 using the K-means clustering algorithm to obtain a clustering result. Because the third matrix A3 is a low-dimensional matrix, the amount of computation does not grow geometrically.

Description

Clustering method and device for TV user behavior data, and Spark big data platform
Technical field
The invention belongs to the field of digital television technology, and in particular relates to a clustering method and device for TV user behavior data, and a Spark big data platform.
Background
With the rapid development of modern communication technology and the growing popularity of multimedia television, digital television has become a main channel through which households obtain information. Advances in technology make it possible to collect a large volume of TV user behavior data every day. How to classify users on the basis of this high-dimensional data, and how to carry out marketing activities based on that classification, has become an urgent problem. However, traditional clustering methods for TV user behavior data have the following defects when analyzing high-dimensional data:
(1) A high-dimensional data set may contain a large number of irrelevant attributes, so the probability that clusters exist across all dimensions is almost zero;
(2) Data in a high-dimensional space are distributed more sparsely than in a low-dimensional space, and it is common for the distances between data points to be almost equal;
(3) Traditional clustering algorithms (such as hierarchical clustering and K-means clustering) are general-purpose data clustering methods. These algorithms use a distance matrix, so their time and space complexity are both high; when the dimensionality of the data is high (that is, when the complexity of the space increases), the amount of computation grows geometrically;
(4) Classical data clustering algorithms are all designed for a stand-alone environment; when the data to be processed are massive, the resource limits of a single machine prevent the data mining task from being completed well.
Summary of the invention
The embodiments of the invention provide a clustering method and device for TV user behavior data, and a Spark big data platform, intended to solve the problem that, in the clustering methods provided by the prior art, the TV user behavior data being processed are high-dimensional, which causes the amount of computation to grow geometrically.
In one aspect, a clustering method for TV user behavior data is provided, the method comprising:
obtaining TV user behavior data and storing the TV user behavior data in a first matrix A1, the first matrix A1 being a matrix of n rows * m columns, where n is the number of users and m is the number of attributes of the videos watched by the users;
performing attribute reduction on the first matrix A1 using principal component analysis (PCA) to obtain a second matrix A2, the second matrix A2 being a matrix of n rows * 15 columns;
performing attribute conversion on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows * 4 columns;
clustering the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.
Further, performing attribute reduction on the first matrix A1 using principal component analysis to obtain the second matrix A2 comprises:
calling principal component analysis code to process the first matrix A1, obtaining the eigenvalue (characteristic root) λ1 of each principal component and the factor loading matrix C of the principal components;
based on the value of λ1 of each principal component, selecting the first M principal components whose λ1 is greater than or equal to a preset first threshold, and calculating the cumulative variance contribution D2 of the first M principal components;
based on the factor loading matrix C of the first M principal components, selecting, in each principal component, the attributes whose coefficient is greater than a preset second threshold, and merging (reducing) those attributes to obtain an attribute reduction rule table;
merging the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.
Further, performing attribute conversion on the second matrix A2 using factor analysis to obtain the third matrix A3 comprises:
calling factor analysis code to process the second matrix A2, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;
based on the eigenvalue λ2 of each factor, in combination with the factor scree plot, determining the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;
obtaining the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
Further, after performing attribute conversion on the second matrix A2 using factor analysis to obtain the third matrix A3, the method also comprises:
clustering the third matrix A3 using a parallel implementation of the K-means algorithm to obtain a clustering result.
In another aspect, a clustering device for TV user behavior data is provided, the device comprising:
a data acquisition unit, configured to obtain TV user behavior data and store the TV user behavior data in a first matrix A1, the first matrix A1 being a matrix of n rows * m columns, where n is the number of users and m is the number of attributes of the videos watched by the users;
a first dimensionality reduction unit, configured to perform attribute reduction on the first matrix A1 using principal component analysis (PCA) to obtain a second matrix A2, the second matrix A2 being a matrix of n rows * 15 columns;
a second dimensionality reduction unit, configured to perform attribute conversion on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows * 4 columns;
a first clustering unit, configured to cluster the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.
Further, the first dimensionality reduction unit comprises:
a first processing module, configured to call principal component analysis code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of the principal components;
a second processing module, configured to select, based on the value of λ1 of each principal component, the first M principal components whose λ1 is greater than or equal to a preset first threshold, and to calculate the cumulative variance contribution D2 of the first M principal components;
a third processing module, configured to select, based on the factor loading matrix C of the first M principal components, the attributes in each principal component whose coefficient is greater than a preset second threshold, and to merge (reduce) those attributes to obtain an attribute reduction rule table;
a merging module, configured to merge the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.
Further, the second dimensionality reduction unit comprises:
a third processing module, configured to call factor analysis code to process the second matrix A2, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;
a fourth processing module, configured to determine, based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;
a fifth processing module, configured to obtain the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
Further, the device also comprises:
a second clustering unit, configured to cluster the third matrix A3 using a parallel implementation of the K-means algorithm to obtain a clustering result.
In yet another aspect, a Spark big data platform is provided, which comprises the clustering device for TV user behavior data described above.
In the embodiments of the invention, attribute reduction is first performed on the high-dimensional first matrix A1 by principal component analysis to obtain the second matrix A2; attribute conversion is then performed on the second matrix A2 by factor analysis to obtain the third matrix A3, a low-dimensional matrix of n rows * 4 columns; finally, the K-means clustering algorithm is applied to this low-dimensional matrix to obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional TV user behavior data, the amount of computation does not grow geometrically during clustering. This solves the problem that, in the clustering methods provided by the prior art, the TV user behavior data being processed are high-dimensional, which causes the amount of computation to grow geometrically.
Brief description of the drawings
Fig. 1 is a flowchart of the clustering method for TV user behavior data provided by Embodiment 1 of the invention;
Fig. 2 is a schematic diagram of the parallel computation structure of the K-means algorithm in the clustering method for TV user behavior data provided by Embodiment 1 of the invention;
Fig. 3 is a structural block diagram of the clustering device for TV user behavior data provided by Embodiment 2 of the invention.
Detailed description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
In the embodiments of the invention, attribute reduction is first performed on the high-dimensional first matrix A1 by principal component analysis to obtain the second matrix A2; attribute conversion is then performed on the second matrix A2 by factor analysis to obtain the third matrix A3, a low-dimensional matrix of n rows * 4 columns; finally, the K-means clustering algorithm is applied to this low-dimensional matrix to obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional TV user behavior data, the amount of computation does not grow geometrically during clustering.
The implementation of the invention is described in detail below with reference to specific embodiments:
Embodiment 1
Fig. 1 shows the flow of the clustering method for TV user behavior data provided by Embodiment 1 of the invention, detailed as follows:
In step S101, TV user behavior data are obtained and stored in a first matrix A1, the first matrix A1 being a matrix of n rows * m columns, where n is the number of users and m is the number of attributes of the videos watched by the users.
In the embodiments of the invention, TV user behavior data mainly comprise two kinds of attribute content: the attribute information of the videos watched by the user, and the behavior data generated while the user watches the videos. The video attribute information in particular can have a hundred or more dimensions. The TV user behavior data are cleaned and converted on the efficient Spark big data platform to obtain the duration matrix A1, which records how long each user watched each attribute of live programs over a period of time. The structure of the duration matrix A1 is as follows:
A1 = (a_ij), i = 1, ..., n; j = 1, ..., m
where the number of rows n is the number of users and the number of columns m is the number of attributes of the videos watched by the users.
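The construction of the duration matrix can be sketched as follows. This is a minimal illustration and not the platform's actual cleaning code: the record layout (user ID, video attribute, duration), the function name build_duration_matrix and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def build_duration_matrix(records):
    """Aggregate (user_id, attribute, duration) records into an n x m matrix.

    Returns (users, attributes, matrix), where matrix[i][j] is the total
    viewing duration of users[i] for attributes[j].
    """
    totals = defaultdict(float)
    for user_id, attribute, duration in records:
        totals[(user_id, attribute)] += duration
    users = sorted({u for u, _ in totals})
    attributes = sorted({a for _, a in totals})
    # Attributes a user never watched get a duration of 0.
    matrix = [[totals.get((u, a), 0.0) for a in attributes] for u in users]
    return users, attributes, matrix


# Hypothetical cleaned viewing records: (user ID, video attribute, duration).
records = [
    ("u1", "entertainment", 30.0),
    ("u1", "sports", 5.0),
    ("u2", "entertainment", 12.5),
    ("u1", "entertainment", 26.0),
    ("u2", "tourism", 2.0),
]
users, attributes, A1 = build_duration_matrix(records)
print(attributes)
for user, row in zip(users, A1):
    print(user, row)
```

Each row of A1 corresponds to one user and each column to one video attribute, which matches the n rows * m columns layout described above.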
In step S102, attribute reduction is performed on the first matrix A1 using principal component analysis to obtain a second matrix A2, the second matrix A2 being a matrix of n rows * 15 columns.
In the embodiments of the invention, the live-program viewing matrix A of 58333 users over 86 attribute dimensions is used as the first matrix A1 for illustration.
Step 1: import the first matrix A1 into the R statistical analysis software and call the principal component analysis code to process it, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of the principal components.
Input: the first matrix A1. Output: the eigenvalue λ1 of each principal component, the variance contribution D and cumulative variance contribution D1 of each principal component, and the factor loading matrix C of the principal components.
Step 2: based on the value of λ1 of each principal component, select the first M principal components whose λ1 is greater than or equal to the preset first threshold, and calculate the cumulative variance contribution D2 of the first M principal components.
The first threshold is 1. The principal component analysis code goes on to process the λ1 values obtained in step 1, selects the first M principal components with λ1 >= 1, and calculates their cumulative variance contribution D2. If the D1 value of the first M principal components is >= 80%, the number of selected principal components is considered suitable.
A sample of the first matrix A1 is shown in Table 1. After steps 1 and 2 are applied to it, the eigenvalue λ1, variance contribution and cumulative variance contribution of the first 15 principal components are as shown in Table 2.
User ID                                   Entertainment  Sports  Information  Tourism
85374799a028ac2d74c33505cee9904c02bb4188  56.73          3.31    1.92         0.46
5ef937c5b39659d55fe8a1d22fe466cb4042f393  54.72          3.1     2.3          0.34
e6a6412a9c457edaf40bec4a1d80e0c63a6cfda9  53.81          0       1.25         0.44
fb46307c0fdae3bd948e4a8ab5c47e08a21b1fc7  53.45          8.45    1.44         0.23
6de40a065be1417c83befd4f019b72462175902b  52.95          15.19   2.72         0.02
4a0c7f3717f4d749e933b63605a75deca936f304  52.7           27.56   6.64         0.11
34f6636f483ed611a2b834947a015906efde69f8  52.56          0.18    1.88         2.14
4fce63266b4b1fc2a5b97ae506653d37be952491  51.68          0.43    5            1.48
3b3602111c19b54cdcac0172872750a223e100a1  50.81          4.5     24.71        0.5
ad7c3b868d3fcf59a9076068dc9774dc0e6f511e  50.21          0.75    0            0
d7e8b3251c0b8ada566221e5f85e532d638150d6  50.17          0       0            0.59
cc306687816e0d22e2e22b75590fcbffeb46864e  49.97          0       2.09         0
917670faccb41b97b6b47758030b8b9641dc8eab  48.24          1.61    1.32         1.64
2da1bd15d324a6ad28ef6b99732a70689681c415  47.57          0.05    0.04         1.23
961b1e281bb2c49f1cafc4a47a46d41e941ed6db  47.33          3.83    1.84         2.15
76a66116b0bde1235acfcf5c8a12939ed1355990  46.12          0.15    9.15         0
02750b735181f1949c5225e322cf835e105224c8  45.27          4.28    4.04         4.6
6f963460c487470f215ae9a7879d6849e31c52ca  44.67          0       0.34         0.05
0e2aa3f1dbe72e15de4398cc655cdcd461bc5723  44.57          0.22    0.23         0.73
Table 1
Principal component  Eigenvalue (λ)  Variance (%)  Cumulative variance (%)
1                    24.7            28.74         28.74
2                    4.47            6.2           34.94
3                    3.6             5.19          40.13
4                    3.33            4.87          45
5                    3               4.46          49.46
6                    2.5             3.91          53.37
7                    2.33            3.71          57.08
8                    1.9             3.21          60.29
9                    1.82            3.12          63.41
10                   1.56            2.81          66.22
11                   1.52            2.77          68.99
12                   1.44            2.67          71.66
13                   1.33            2.54          74.2
14                   1.28            2.49          76.69
15                   1.13            2.31          79
Table 2
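The selection rule of step 2 can be sketched as follows, assuming the full list of eigenvalues is available in descending order. The function name select_components and the toy eigenvalues are illustrative assumptions; they are not the R code used in the embodiment, and the share is simply the selected eigenvalues divided by the sum of all eigenvalues.

```python
def select_components(eigenvalues, threshold=1.0):
    """Return (M, cumulative_share) for components with eigenvalue >= threshold.

    Assumes `eigenvalues` lists ALL components in descending order, so the
    selected components form a prefix and the share is measured against the
    total variance.
    """
    total = sum(eigenvalues)
    selected = [ev for ev in eigenvalues if ev >= threshold]
    return len(selected), sum(selected) / total


# Hypothetical eigenvalues of a 5-attribute data set (they sum to 5.0).
toy_eigenvalues = [2.5, 1.2, 0.8, 0.3, 0.2]
M, share = select_components(toy_eigenvalues)
print(M, round(share, 2))  # 2 0.74
```

In Table 2 every one of the 15 listed eigenvalues is at least 1 (the smallest is 1.13), and the cumulative variance of those 15 components is 79%.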
Step 3: based on the factor loading matrix C of the first M principal components, select, in each principal component, the attributes whose coefficient is greater than the preset second threshold, and merge (reduce) those attributes to obtain an attribute reduction rule table.
The preset second threshold is 0.5. An excerpt of the factor loading matrix C calculated in step 1 is shown in Table 3; in Table 3 only the coefficients of principal component 3 (science, education and humanities) are sorted. After screening according to the size of the coefficients in each principal component, the resulting attribute reduction rule table is as shown in Table 4.
Attribute              1      2      3      4      5      6      7
General                .916   -.127  .049   -.106  .064   -.291  -.022
Focus                  .855   -.060  -.170  -.135  -.109  -.224  .025
Life                   .815   -.025  .035   -.014  .165   -.054  -.073
Domestic               .773   -.299  -.137  -.045  .178   -.198  .028
TV drama               .769   .316   -.044  -.241  .307   .063   -.099
News                   .739   -.195  -.073  -.135  .130   -.388  .112
Interview              .737   -.178  -.027  -.068  -.050  -.184  .001
International          .732   -.305  -.164  -.042  .177   -.181  .060
Entertainment          .731   .113   -.353  -.174  -.433  .100   -.202
Drama                  .722   .299   -.068  -.211  .276   .049   -.035
Humanities             .721   -.177  .325   -.014  -.156  -.033  .010
Variety                .710   .094   -.341  -.160  -.408  .117   -.168
Celebrity              .693   .175   -.318  -.205  -.388  .145   -.112
Science and education  .688   -.185  .547   .107   -.188  .095   -.040
People's livelihood    .686   -.136  .029   -.083  .095   -.444  -.088
Romance                .666   .289   -.082  -.208  .251   .019   -.094
Interactive            .662   .004   -.341  .065   -.351  .035   -.152
Table 3
Table 4
Step 4: merge the video attributes in the first matrix A1 according to the attribute reduction rule table obtained in step 3, obtaining the second matrix A2.
The second matrix A2 is the new TV user behavior matrix, containing the users' video attribute information, obtained by processing the video attributes in the first matrix A1. It is a matrix of n rows * 15 columns, where n is the number of users. The second matrix A2 obtained by processing the first matrix A1 shown in Table 1 is as shown in Table 5.
Table 5
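Steps 3 and 4 can be sketched as follows. The text does not state how an attribute that exceeds the threshold on several components is assigned, nor how merged attributes are combined, so this sketch assumes two things: each attribute goes to the component on which its coefficient is largest, and merged durations are summed. The function names and the toy loadings are hypothetical.

```python
def build_reduction_rules(loadings, threshold=0.5):
    """Group attributes by the component on which they load most strongly.

    `loadings` maps attribute name -> list of coefficients, one per component.
    Attributes whose largest coefficient does not exceed `threshold` are
    left out. Returns {component_index: [attribute, ...]}.
    """
    rules = {}
    for attribute, coeffs in loadings.items():
        best = max(range(len(coeffs)), key=lambda j: coeffs[j])
        if coeffs[best] > threshold:
            rules.setdefault(best, []).append(attribute)
    return rules


def merge_attributes(row, rules):
    """Sum the durations of the attributes grouped under each component."""
    return [sum(row[a] for a in rules[c]) for c in sorted(rules)]


# Hypothetical loadings of four attributes on two principal components.
loadings = {
    "news":          [0.74, 0.10],
    "international": [0.73, 0.05],
    "variety":       [0.20, 0.71],
    "celebrity":     [0.15, 0.69],
}
rules = build_reduction_rules(loadings)
user_row = {"news": 3.0, "international": 1.5, "variety": 10.0, "celebrity": 4.0}
print(rules)
print(merge_attributes(user_row, rules))
```

With these toy values the four original attributes collapse into two merged columns, one per component, in the same way that the 86 attributes of A1 are reduced to the 15 columns of A2.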
In step S103, attribute conversion is performed on the second matrix A2 using factor analysis to obtain a third matrix A3, the third matrix A3 being a matrix of n rows * 4 columns.
In the embodiments of the invention, step 1: import the second matrix A2 into the R statistical analysis software and call the factor analysis code to process it, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E.
Input: the second matrix A2. Output: the eigenvalue λ2 of each factor, the factor scree plot, and the factor loading matrix E.
Step 2: based on the eigenvalue λ2 of each factor, in combination with the factor scree plot output by step 1, determine the first N factors whose eigenvalue is greater than the preset third threshold and the factor loading matrix E corresponding to the first N factors.
The preset third threshold is 1. Based on the eigenvalue λ2 of each factor, in combination with the factor scree plot, 4 factors are selected from the second matrix A2, namely the factors whose eigenvalue (SS loadings) is greater than 1, as shown in Table 6. The factor loading matrix E of these 4 factors is shown in Table 7:
                Factor1  Factor2  Factor3  Factor4
SS loadings     2.688    2.191    1.86     1.796
Proportion Var  0.179    0.146    0.124    0.12
Cumulative Var  0.179    0.325    0.449    0.569
Table 6
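The rows of Table 6 are related by simple arithmetic, which the following sketch reproduces. It assumes the usual convention for standardized variables that the proportion of variance of a factor is its SS loadings divided by the number of attributes (15, the column count of A2); the variable names are illustrative.

```python
ss_loadings = [2.688, 2.191, 1.86, 1.796]  # Table 6, Factor1..Factor4
n_attributes = 15                          # columns of the second matrix A2

# Proportion of variance explained by each factor.
proportion = [ss / n_attributes for ss in ss_loadings]

# Running total of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

# Factors kept under the threshold of 1 on SS loadings.
kept = [i + 1 for i, ss in enumerate(ss_loadings) if ss > 1.0]

print([round(p, 3) for p in proportion])
print([round(c, 3) for c in cumulative])
print(kept)
```

The computed values agree with the Proportion Var and Cumulative Var rows of Table 6 to three decimal places, and all four factors pass the threshold; together they account for about 56.9% of the variance.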
Attribute                           Factor1  Factor2  Factor3  Factor4
Science, education and humanities   0.362    0.425    0.236    0.493
Entertainment and variety           0.319    0.234    0.9      0.167
Domestic news                       0.349    0.8      0.288    0.274
Finance and wealth management       0.132    0.438    0.186    0.209
Children's education                0.284    0.16     0.187    0.15
Film                                0.447    0.156    0.237    0.258
Healthy living                      0.453    0.489    0.205    0.376
Agriculture and military            0.182    0.234    0.106    0.78
Travel and food                     0.234    0.316    0.232    0.627
Society and law                     0.261    0.599    0.16     0.209
Sports                              0.121    0.251    0.226    0.124
Family ethics                       0.515    0.252    0.385    0.188
TV drama                            0.899    0.238    0.196    0.204
Teen idol                           0.736    0.283    0.192    0.132
Fashion and music                   0.208    0.241    0.615    0.145
Table 7
The attributes with larger coefficients in the matrix E shown in Table 7 are TV drama, teen idol, healthy living and film, 4 attributes in total.
Factor 1 can be named the idol drama and domestic drama class (film and television factor); factor 2, the news and social health class (information factor); factor 3, the entertainment and fashion class (trend factor); factor 4, the life and science-education class (leisure and popular-science factor).
Step 3: obtain the third matrix A3 from the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
The second matrix A2 is a matrix of n rows * 15 columns. Multiplying it by the factor loading matrix E of the first N factors calculated in step 2, which is a matrix of 15 rows * 4 columns, gives the third matrix A3, a matrix of n rows * 4 columns, where n is the number of users.
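The product A3 = A2 * E can be sketched in plain Python as follows. The matrix E is transcribed from Table 7 (rows in table order); the two-user matrix A2 and the helper matmul are hypothetical and only illustrate the shapes involved. As a side check, the column sums of squared loadings are compared with the SS loadings row of Table 6.

```python
# Factor loading matrix E from Table 7 (15 attributes x 4 factors).
E = [
    [0.362, 0.425, 0.236, 0.493],  # science, education and humanities
    [0.319, 0.234, 0.900, 0.167],  # entertainment and variety
    [0.349, 0.800, 0.288, 0.274],  # domestic news
    [0.132, 0.438, 0.186, 0.209],  # finance and wealth management
    [0.284, 0.160, 0.187, 0.150],  # children's education
    [0.447, 0.156, 0.237, 0.258],  # film
    [0.453, 0.489, 0.205, 0.376],  # healthy living
    [0.182, 0.234, 0.106, 0.780],  # agriculture and military
    [0.234, 0.316, 0.232, 0.627],  # travel and food
    [0.261, 0.599, 0.160, 0.209],  # society and law
    [0.121, 0.251, 0.226, 0.124],  # sports
    [0.515, 0.252, 0.385, 0.188],  # family ethics
    [0.899, 0.238, 0.196, 0.204],  # TV drama
    [0.736, 0.283, 0.192, 0.132],  # teen idol
    [0.208, 0.241, 0.615, 0.145],  # fashion and music
]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical A2 with two users: the first only watches TV drama (row 13 of
# Table 7), the second splits time between domestic news and finance.
A2 = [[0.0] * 15 for _ in range(2)]
A2[0][12] = 1.0
A2[1][2] = 2.0
A2[1][3] = 1.0

A3 = matmul(A2, E)  # 2 rows * 4 columns
for row in A3:
    print([round(v, 3) for v in row])

# Sum of squared loadings per factor, for comparison with Table 6.
ss = [sum(v * v for v in col) for col in zip(*E)]
print([round(s, 3) for s in ss])
```

The first user, who only watches TV drama, gets a score vector equal to the TV drama row of E, so the user is dominated by factor 1. The sums of squared loadings come out within about 0.002 of the SS loadings in Table 6.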
In step S104, the K-means algorithm is used to cluster the third matrix A3, obtaining a clustering result.
In the embodiments of the invention, the third matrix A3 is normalized and standardized to eliminate the influence of differing scales, yielding a fourth matrix A4. K objects are first selected at random as initial cluster centers. The distance between each object and each seed cluster center is then calculated (Euclidean distance is generally used), and each object is assigned to the nearest cluster center. Once all objects have been assigned, the center of each cluster is recalculated from the objects currently in that cluster. These steps are iterated until the termination condition set for the clustering is met (the cluster centers no longer change between iterations, or the change between two successive centers is less than a threshold). Clustering then ends and the K cluster result sets are obtained; the result includes the coordinates of the cluster centers, the number of cases in each cluster, and so on.
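The procedure described above can be sketched as a small stand-alone K-means. This is an illustrative sketch under assumptions: the text does not specify the exact normalization, so z-score standardization is used here, and the function names, the tolerance and the toy data are hypothetical. Initial centers are drawn at random unless they are passed in explicitly, which the example does to stay reproducible.

```python
import math
import random


def standardize(rows):
    """Z-score each column so that no attribute dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by 0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, initial=None, tol=1e-9, max_iter=100, seed=0):
    """Return (labels, centers) after iterating assignment and update steps."""
    centers = [list(c) for c in (initial or random.Random(seed).sample(rows, k))]
    labels = []
    for _ in range(max_iter):
        # Assignment: each object goes to the nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(r, centers[j])) for r in rows]
        # Update: each center becomes the mean of the objects assigned to it.
        new_centers = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            new_centers.append([sum(c) / len(members) for c in zip(*members)]
                               if members else centers[j])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers no longer move: stop
            break
    return labels, centers


# Hypothetical low-dimensional data with two obvious groups.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
A4 = standardize(data)
labels, centers = kmeans(A4, k=2, initial=[A4[0], A4[3]])
print(labels)
print([labels.count(j) for j in range(2)])
```

The returned labels give the cluster of each user, and counting them gives the number of cases per cluster mentioned in the text.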
Preferably, since classical data clustering algorithms are all designed for a stand-alone environment and cannot complete the data mining task well when the data to be processed are massive, data mining needs to be combined with other technologies to parallelize the mining algorithm, using the resources of multiple machines to improve the efficiency of the mining task. The structure of the parallel K-means computation is shown in Fig. 2, and the detailed steps are as follows:
Step 11: generate a Resilient Distributed Dataset (RDD) from the data in the fourth matrix A4.
Step 12: apply a Map operation to the RDD to calculate the distance between each data object in the fourth matrix A4 and the K initial cluster centers, then apply a Reduce operation to the resulting MapRDD to generate K new cluster centers. Compare the change in the cluster centers with the threshold (the stopping condition being that the cluster centers no longer change between two successive iterations, or that the number of iterations reaches the preset limit). If the change is greater than the threshold, replace the initial cluster centers with the new ones and repeat the Map and Reduce operations until the iteration yields K stable cluster centers.
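The Map and Reduce operations of step 12 can be imitated in a single process as follows. This sketch does not use the Spark API: the partitions are ordinary Python lists standing in for the partitions of an RDD, and the function names, the toy data and the exact-equality stopping test are assumptions made for the illustration.

```python
from functools import reduce


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))


def map_partition(partition, centers):
    """Map step: per-partition {center_index: (coordinate_sums, count)}."""
    acc = {}
    for point in partition:
        j = nearest(point, centers)
        sums, count = acc.get(j, ([0.0] * len(point), 0))
        acc[j] = ([s + p for s, p in zip(sums, point)], count + 1)
    return acc


def merge(a, b):
    """Reduce step: combine the partial sums and counts of two partitions."""
    out = dict(a)
    for j, (sums, count) in b.items():
        if j in out:
            old_sums, old_count = out[j]
            out[j] = ([x + y for x, y in zip(old_sums, sums)], old_count + count)
        else:
            out[j] = (sums, count)
    return out


def kmeans_step(partitions, centers):
    """One Map + Reduce round; returns the K new cluster centers."""
    partials = map(lambda part: map_partition(part, centers), partitions)
    totals = reduce(merge, partials, {})
    return [[s / totals[j][1] for s in totals[j][0]] if j in totals else centers[j]
            for j in range(len(centers))]


# Hypothetical rows of A4 split across two partitions.
partitions = [
    [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]],
    [[2.0, 0.0], [2.0, 2.0], [10.0, 12.0]],
]
centers = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(10):  # preset iteration limit
    new_centers = kmeans_step(partitions, centers)
    if new_centers == centers:  # centers stable: stop iterating
        break
    centers = new_centers
print(centers)  # [[1.0, 1.0], [10.0, 11.0]]
```

Because each partition only emits per-center sums and counts, the Reduce step moves very little data regardless of how many rows each partition holds, which is what makes the iteration suitable for a multi-machine setting.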
In this embodiment, attribute reduction is first performed on the high-dimensional first matrix A1 by principal component analysis to obtain the second matrix A2; attribute conversion is then performed on the second matrix A2 by factor analysis to obtain the third matrix A3, a low-dimensional matrix of n rows * 4 columns; finally, the K-means clustering algorithm is applied to this low-dimensional matrix to obtain a clustering result. Because the K-means clustering algorithm processes low-dimensional TV user behavior data, the amount of computation does not grow geometrically during clustering. This solves the problem that, in the clustering methods provided by the prior art, the TV user behavior data being processed are high-dimensional, which causes the amount of computation to grow geometrically.
In addition, the data in the fourth matrix A4 can be clustered using the parallel K-means computation, which makes full use of multi-machine resources and improves the efficiency of the mining task.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.
Embodiment two
Fig. 3 shows the concrete structure block diagram of the clustering apparatus of the TV user behavioral data that the embodiment of the present invention two provides, and for convenience of explanation, illustrate only the part relevant to the embodiment of the present invention.The clustering apparatus of this TV user behavioral data can be the unit of software unit, hardware cell or the software and hardware combining be built in the large data platform of Spark, and the clustering apparatus 11 of this TV user behavioral data comprises: data capture unit 111, first dimensionality reduction unit 112, second dimensionality reduction unit 113 and the first cluster cell 114.
Wherein, data capture unit 111, for obtaining TV user behavioral data and storing in described TV user behavioral data to the first matrix A 1, described first matrix A 1 is the matrix that a capable * m of n arranges, n representative of consumer quantity, the quantity of the video attribute of m representative of consumer viewing;
First dimensionality reduction unit 112, for using the method for principal component analysis (PCA) to carry out attribute reduction process to described first matrix A 1, obtains the second matrix A 2, and described second matrix A 2 is matrixes of capable * 15 row of n;
Second dimensionality reduction unit 113, the method for usage factor analysis carries out attribute conversion process to described second matrix A 2, obtains the 3rd matrix A 3, and described 3rd matrix A 3 is matrixes of capable * 4 row of n;
First cluster cell 114, for adopting K-means clustering algorithm to carry out clustering processing to described 3rd matrix A 3, obtains cluster result.
Further, described first dimensionality reduction unit 112, comprising:
First processing module, for calling principal component analysis (PCA) code, processes described first matrix A 1, obtains the factor coefficient loading matrix C of the characteristic root λ 1 of each major component, each major component;
Second processing module, for the value of the characteristic root λ 1 based on each major component, picks out front M the major component that λ 1 value is more than or equal to preset first threshold value, and calculates the accumulative variance contribution degree D2 of a front M major component;
3rd processing module, for the factor coefficient loading matrix C based on a front M major component, pick out coefficient in each major component and be greater than the attribute of default Second Threshold, the attribute that coefficient in each major component is greater than default Second Threshold is carried out merging yojan, obtains attribute reduction rule list;
Merging module, for merging the video attribute in described first matrix A 1 according to described attribute reduction rule list, obtaining the second matrix A 2.
Further, the second dimensionality reduction unit 113 comprises:
A third processing module, configured to call factor analysis code to process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;
A fourth processing module, configured to determine, based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;
A fifth processing module, configured to obtain the third matrix A3 according to the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
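A rough sketch of the factor selection and the construction of A3 follows. The text does not state how A3 is derived from A2 and E; the example assumes a matrix product of A2 with the columns of E belonging to the retained factors. The visual inspection of the scree plot is not reproduced, and only the eigenvalue threshold is applied. Function names and numbers are hypothetical.

```python
def top_factors(eigenvalues, third_threshold):
    """Indices of the factors whose eigenvalue exceeds the threshold."""
    return [i for i, lam in enumerate(eigenvalues) if lam > third_threshold]


def project(A2, E, kept):
    """Assumed construction of A3: A3[u][f] = sum_j A2[u][j] * E[j][f],
    computed only for the retained factors f."""
    return [
        [sum(value * E[j][f] for j, value in enumerate(row)) for f in kept]
        for row in A2
    ]


# Hypothetical factor-analysis output for three attributes.
factor_eigenvalues = [1.8, 1.1, 0.1]
E = [
    [0.5, 0.0, 0.1],
    [0.5, 0.0, 0.2],
    [0.0, 1.0, 0.3],
]
kept = top_factors(factor_eigenvalues, third_threshold=1.0)

# Hypothetical A2: two users, three merged attributes.
A2 = [[1, 2, 3], [4, 5, 6]]
A3 = project(A2, E, kept)  # one column per retained factor
```

In the patent, A2 has 15 columns and four factors are retained, so A3 has n rows × 4 columns.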
Further, the device 11 also comprises:
A second clustering unit, configured to cluster the third matrix A3 by parallel computation based on the K-means algorithm to obtain a clustering result.
The device provided by this embodiment of the present invention can be applied in the corresponding method Embodiment 1 described above; for details, see the description of Embodiment 1, which is not repeated here.
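The parallel K-means computation of the second clustering unit can be pictured with the following sketch, which splits the rows of A3 into partitions, computes per-cluster sums and counts for each partition (a map step), and adds them together to update the centroids (a reduce step). It is written in plain Python and does not use any Spark API; on a Spark platform the partitions would be processed by separate executors. The function names, partitioning, and data are assumptions for illustration.

```python
from functools import reduce


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def partial_stats(partition, centroids):
    """Map step: per-cluster coordinate sums and counts for one partition."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for row in partition:
        c = min(range(k), key=lambda i: squared_distance(row, centroids[i]))
        counts[c] += 1
        for d, value in enumerate(row):
            sums[c][d] += value
    return sums, counts


def combine(a, b):
    """Reduce step: add two partitions' sums and counts element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def partitioned_kmeans(partitions, centroids, max_iter=50):
    """Iterate map and reduce until the centroids stop changing."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        stats = (partial_stats(p, centroids) for p in partitions)
        sums, counts = reduce(combine, stats)
        updated = [
            [s / n for s in total] if n else old  # keep empty clusters as-is
            for total, n, old in zip(sums, counts, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical A3 rows split into two partitions (4 columns each).
partitions = [
    [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]],
    [[2.0, 2.0, 2.0, 2.0], [12.0, 12.0, 12.0, 12.0]],
]
initial = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]
final_centroids = partitioned_kmeans(partitions, initial)
```

Only the small per-cluster sums and counts cross partition boundaries, which is why this formulation of K-means distributes well.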
It should be noted that, in the above system embodiment, the units are divided according to functional logic, but the division is not limited to the one described, as long as the corresponding functions can be realized; in addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the present invention.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A clustering method for TV user behavior data, characterized in that the method comprises:
Obtaining TV user behavior data and storing it in a first matrix A1, where the first matrix A1 is a matrix of n rows × m columns, n is the number of users, and m is the number of video attributes watched by the users;
Performing attribute reduction on the first matrix A1 using principal component analysis (PCA) to obtain a second matrix A2, where the second matrix A2 is a matrix of n rows × 15 columns;
Performing attribute transformation on the second matrix A2 using factor analysis to obtain a third matrix A3, where the third matrix A3 is a matrix of n rows × 4 columns;
Clustering the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.
2. The method of claim 1, characterized in that performing attribute reduction on the first matrix A1 using principal component analysis (PCA) to obtain the second matrix A2 comprises:
Calling principal component analysis (PCA) code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component;
Based on the eigenvalue λ1 of each principal component, selecting the first M principal components whose λ1 value is greater than or equal to a preset first threshold, and calculating the cumulative variance contribution D2 of the first M principal components;
Based on the factor loading matrix C of the first M principal components, selecting the attributes whose coefficient in each principal component is greater than a preset second threshold, and merging and reducing those attributes to obtain an attribute reduction rule table;
Merging the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.
3. The method of claim 1 or 2, characterized in that performing attribute transformation on the second matrix A2 using factor analysis to obtain the third matrix A3 comprises:
Calling factor analysis code to process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;
Based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, determining the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;
Obtaining the third matrix A3 according to the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
4. The method of claim 1, characterized in that, after performing attribute transformation on the second matrix A2 using factor analysis to obtain the third matrix A3, the method further comprises:
Clustering the third matrix A3 by parallel computation based on the K-means algorithm to obtain a clustering result.
5. A clustering device for TV user behavior data, characterized in that the device comprises:
A data acquisition unit, configured to obtain TV user behavior data and store it in a first matrix A1, where the first matrix A1 is a matrix of n rows × m columns, n is the number of users, and m is the number of video attributes watched by the users;
A first dimensionality reduction unit, configured to perform attribute reduction on the first matrix A1 using principal component analysis (PCA) to obtain a second matrix A2, where the second matrix A2 is a matrix of n rows × 15 columns;
A second dimensionality reduction unit, configured to perform attribute transformation on the second matrix A2 using factor analysis to obtain a third matrix A3, where the third matrix A3 is a matrix of n rows × 4 columns;
A first clustering unit, configured to cluster the third matrix A3 using the K-means clustering algorithm to obtain a clustering result.
6. The device of claim 5, characterized in that the first dimensionality reduction unit comprises:
A first processing module, configured to call principal component analysis (PCA) code to process the first matrix A1, obtaining the eigenvalue λ1 of each principal component and the factor loading matrix C of each principal component;
A second processing module, configured to select, based on the eigenvalue λ1 of each principal component, the first M principal components whose λ1 value is greater than or equal to a preset first threshold, and to calculate the cumulative variance contribution D2 of the first M principal components;
A third processing module, configured to select, based on the factor loading matrix C of the first M principal components, the attributes whose coefficient in each principal component is greater than a preset second threshold, and to merge and reduce those attributes to obtain an attribute reduction rule table;
A merging module, configured to merge the video attributes in the first matrix A1 according to the attribute reduction rule table to obtain the second matrix A2.
7. The device of claim 5 or 6, characterized in that the second dimensionality reduction unit comprises:
A third processing module, configured to call factor analysis code to process the second matrix A2 using factor analysis, obtaining the eigenvalue λ2 of each factor, a factor scree plot, and a factor loading matrix E;
A fourth processing module, configured to determine, based on the eigenvalue λ2 of each factor and in combination with the factor scree plot, the first N factors whose eigenvalue is greater than a preset third threshold and the factor loading matrix E corresponding to the first N factors;
A fifth processing module, configured to obtain the third matrix A3 according to the second matrix A2 and the factor loading matrix E corresponding to the first N factors.
8. The device of claim 5, characterized in that the device further comprises:
A second clustering unit, configured to cluster the third matrix A3 by parallel computation based on the K-means algorithm to obtain a clustering result.
9. A Spark big data platform, characterized in that the Spark big data platform comprises the clustering device for TV user behavior data according to any one of claims 5 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510355359.5A CN104899331A (en) 2015-06-24 2015-06-24 Television used behavior data clustering method and device and Spark big data platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510355359.5A CN104899331A (en) 2015-06-24 2015-06-24 Television used behavior data clustering method and device and Spark big data platform

Publications (1)

Publication Number Publication Date
CN104899331A true CN104899331A (en) 2015-09-09

Family

ID=54031993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510355359.5A Pending CN104899331A (en) 2015-06-24 2015-06-24 Television used behavior data clustering method and device and Spark big data platform

Country Status (1)

Country Link
CN (1) CN104899331A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294882A (en) * 2016-08-30 2017-01-04 北京京东尚科信息技术有限公司 Data digging method and device
CN107203772A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 A kind of user type recognition methods and device
CN107480187A (en) * 2017-07-10 2017-12-15 北京京东尚科信息技术有限公司 User's value category method and apparatus based on cluster analysis
CN108899930A (en) * 2018-07-09 2018-11-27 西华大学 Wind-powered electricity generation station equivalent modeling method based on Principal Component Analysis Method and hierarchical clustering algorithm
CN110598963A (en) * 2018-06-13 2019-12-20 顺丰科技有限公司 Method, device, equipment and storage medium for matching human posts
CN111339294A (en) * 2020-02-11 2020-06-26 普信恒业科技发展(北京)有限公司 Client data classification method and device and electronic equipment
CN111651755A (en) * 2020-05-08 2020-09-11 中国联合网络通信集团有限公司 Intrusion detection method and device
CN111866001A (en) * 2020-07-27 2020-10-30 周蓉 Intelligent equipment data processing method based on big data and cloud computing and cloud server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702653A (en) * 2009-10-27 2010-05-05 中国科学院声学研究所 Message announcing system based on locating user behavior and method thereof
CN103106615A (en) * 2013-01-28 2013-05-15 上海交通大学 Excavated user behavior analysis method based on television watching log
CN103377242A (en) * 2012-04-25 2013-10-30 Tcl集团股份有限公司 User behavior analysis method, user behavior analytical prediction method and television program push system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
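Li Xinrui (李新蕊): "Comparison and Application of Principal Component Analysis, Factor Analysis and Cluster Analysis" (主成分分析_因子分析_聚类分析的比较与应用), Journal of Shandong Education Institute (《山东教育学院学报》) *
Zhong Yan et al. (钟燕等): "An Empirical Study of the Operating Performance of China's Listed Agricultural Companies: Based on Principal Component Analysis, Factor Analysis and Cluster Analysis" (我国农业上市公司经营绩效的实证研究——基于主成分分析、因子分析与聚类分析), Technoeconomics & Management Research (《技术经济与管理研究》) *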

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203772A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 A kind of user type recognition methods and device
CN107203772B (en) * 2016-03-16 2020-11-06 创新先进技术有限公司 User type identification method and device
CN106294882A (en) * 2016-08-30 2017-01-04 北京京东尚科信息技术有限公司 Data digging method and device
CN107480187A (en) * 2017-07-10 2017-12-15 北京京东尚科信息技术有限公司 User's value category method and apparatus based on cluster analysis
CN110598963A (en) * 2018-06-13 2019-12-20 顺丰科技有限公司 Method, device, equipment and storage medium for matching human posts
CN108899930A (en) * 2018-07-09 2018-11-27 西华大学 Wind-powered electricity generation station equivalent modeling method based on Principal Component Analysis Method and hierarchical clustering algorithm
CN111339294A (en) * 2020-02-11 2020-06-26 普信恒业科技发展(北京)有限公司 Client data classification method and device and electronic equipment
CN111651755A (en) * 2020-05-08 2020-09-11 中国联合网络通信集团有限公司 Intrusion detection method and device
CN111651755B (en) * 2020-05-08 2023-04-18 中国联合网络通信集团有限公司 Intrusion detection method and device
CN111866001A (en) * 2020-07-27 2020-10-30 周蓉 Intelligent equipment data processing method based on big data and cloud computing and cloud server

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150909
