Embodiment
A method for optimizing and analyzing a website's classification catalog based on users' mental models comprises the following steps:
Step 1: preprocess the web log file data, specifically:
Step 1-1: clean the web log file data by deleting records that are irrelevant to the purpose of the analysis or that contain errors. Irrelevant data include records containing concepts that already exist in the classification catalog and records in which products are represented by codes. Erroneous data include misspellings and incorrect product descriptions. Then select the attributes needed for data processing: user name, user region, user cognition concept, and product category. A user cognition concept is a concept for improving the website catalog that a user submits based on his or her own understanding of the catalog; that is, when a user browsing by the website catalog cannot find a suitable concept, the user submits a concept he or she considers more appropriate through the website's interactive interface. For example, a user searching Joyo.com for books on data mining by means of the classification catalog finds that such books are filed under the "database" category, considers this inappropriate, and thinks that data mining should appear directly as a category in the classification catalog. In this case, "data mining" is the user cognition concept.
Step 1-2: convert the format of the data cleaned in step 1-1 so that the three extracted attributes (user cognition concept, region, and name) have a uniform format; specifically, remove numbering, unify letter case, and unify singular and plural forms;
Step 1-3: determine the frequency with which each user cognition concept occurs, and then set a threshold. The threshold is determined by the actual data volume and the number of user cognition concepts extracted; for example, when both are small, a lower threshold can be set in order to retain a sufficient amount of data. Select the user cognition concepts whose frequency exceeds the threshold, and record their frequencies;
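Steps 1-2 and 1-3 can be sketched in Python as follows. This is only an illustrative sketch: the function names `normalize_concept` and `frequent_concepts` are hypothetical, the numbering pattern is assumed to be a leading number followed by a separator, and the plural handling is a deliberately naive rule rather than a full stemmer.

```python
import re
from collections import Counter


def normalize_concept(raw: str) -> str:
    """Normalize one submitted concept: drop numbering, lowercase, singularize."""
    text = raw.strip().lower()  # unify letter case
    # Assumed numbering style: a leading number followed by ".", ")", "-", ":" or a space.
    text = re.sub(r"^\d+[.)\-:\s]+", "", text)
    words = text.split()
    if words:
        last = words[-1]
        # Naive singular/plural unification on the final word (an assumption).
        if last.endswith("ies") and len(last) > 4:
            last = last[:-3] + "y"
        elif last.endswith("s") and not last.endswith("ss") and len(last) > 3:
            last = last[:-1]
        words[-1] = last
    return " ".join(words)


def frequent_concepts(raw_concepts, threshold):
    """Count normalized concepts and keep those whose frequency exceeds the threshold."""
    normalized = (normalize_concept(c) for c in raw_concepts)
    counts = Counter(c for c in normalized if c)  # skip entries that normalize to ""
    return {concept: n for concept, n in counts.items() if n > threshold}
```

For instance, calling `frequent_concepts(names, 4)` on a list of submitted names would keep only the concepts that occur more than four times after normalization.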
Step 2: determine whether each user cognition concept co-occurs with concepts in the website catalog. Specifically, following the mental model category theory, use the user cognition concept as a search keyword to search the website, and count the classification catalog concepts that appear in the search results together with their frequencies;
According to the mental model category theory, a user who gathers information on a website through the classification catalog mainly clicks in one of three ways: horizontally, vertically, or with horizontal and vertical clicks balanced. While clicking, the user judges the correlation between concepts in the classification catalog and clicks the highly correlated ones. Applying this principle, the user cognition concept is used as a search keyword to search the website, and the classification catalog concepts that appear in the search results are counted together with their frequencies, in order to analyze the correlation between the user cognition concept and the concepts in the website catalog.
The mental model category theory comes from Charles Cole, Yang Lin, and others, who found experimentally that the three most common types of mental model are the vertical type (26%), the horizontal type (31%), and the equal type (21%), which together account for the mental models of 78% of people. A mental model is classified according to the number of levels and the number of regions in its mental model diagram. The features of the three common types are as follows:
● Vertical: a mental model with more levels in the vertical dimension than in the horizontal dimension
● Horizontal: a mental model with more levels in the horizontal dimension than in the vertical dimension
● Equal: a mental model with the same number of levels in the vertical and horizontal dimensions
Extending this theory to the process of obtaining information through a classification catalog, it is assumed that a user who gathers information on a website through the classification catalog also clicks in the vertical, horizontal, or balanced manner.
Step 3: generate the co-occurrence matrix. The co-occurrence matrix is a symmetric matrix whose first row and first column hold the concepts, comprising both the user cognition concepts and the concepts in the website catalog. Each remaining cell holds the co-occurrence frequency between the concept in its row header and the concept in its column header;
The co-occurrence frequencies between concepts in the co-occurrence matrix are determined as follows:
Step 3-1: determine the co-occurrence frequency between a user cognition concept and a concept in the classification catalog. There are two cases. In the first, the co-occurrence frequency between a user cognition concept and a concept in the second-level classification catalog, denoted $F_1$, is
$$F_1 = p \cdot x$$
where $p$ is the frequency with which the second-level classification catalog concept appears in the search results and $x$ is the frequency with which the user cognition concept occurs.
In the second case, the co-occurrence frequency between a user cognition concept and a concept in the third-level classification catalog is simply the frequency with which the user cognition concept occurs;
Step 3-2: determine the co-occurrence frequency between two concepts in the classification catalog. For each user cognition concept, take the smaller of its co-occurrence frequencies with the two catalog concepts, then sum these minima over all user cognition concepts. The result is denoted $F_2$. With $m$ and $n$ denoting the co-occurrence frequencies of catalog concepts A and B, respectively, with a user cognition concept, the formula is
$$F_2 = \sum \min(m, n)$$
that is, SUM(MIN(m, n)), where the sum runs over all user cognition concepts.
Step 3-3: determine the co-occurrence frequency between user cognition concepts, which is 0.
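Steps 3-1 to 3-3 can be sketched in Python as follows. The function name `build_cooccurrence` and the input layout (a dictionary of user concept frequencies $x$ and a nested dictionary of search-result frequencies $p$) are assumptions made for illustration. Diagonal cells are left at 0 here because the method does not define them; a third-level catalog concept can be passed with $p = 1$, which reduces $F_1$ to $x$.

```python
def build_cooccurrence(user_freq, hits):
    """Build a symmetric co-occurrence matrix.

    user_freq: {user_concept: x}, frequency of each user cognition concept.
    hits:      {user_concept: {catalog_concept: p}}, frequency of each catalog
               concept in the search results for that user concept.
    Returns (labels, matrix) with catalog concepts first, then user concepts.
    """
    users = sorted(user_freq)
    catalog = sorted({c for row in hits.values() for c in row})
    labels = catalog + users
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]

    # Step 3-1: F1 = p * x for every (user concept, catalog concept) pair.
    f1 = {
        u: {c: hits.get(u, {}).get(c, 0) * user_freq[u] for c in catalog}
        for u in users
    }
    for u in users:
        for c in catalog:
            matrix[index[u]][index[c]] = matrix[index[c]][index[u]] = f1[u][c]

    # Step 3-2: F2 = SUM(MIN(m, n)) over all user concepts.
    for i, a in enumerate(catalog):
        for b in catalog[i + 1:]:
            f2 = sum(min(f1[u][a], f1[u][b]) for u in users)
            matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = f2

    # Step 3-3: user-to-user cells keep their initial value of 0.
    return labels, matrix
```

Keeping the $F_1$ values in a separate dictionary mirrors the spreadsheet layout used later in the embodiment, where each catalog concept is a column and each user concept is a row.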
Step 4: generate the similarity matrix from the co-occurrence matrix obtained in step 3;
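The similarity matrix is generated using the Pearson correlation coefficient as the similarity measure, with the formula
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where $r$ measures the strength of the linear correlation between two variables and satisfies $0 \le |r| \le 1$, $n$ is the sample size, and $x_i$, $y_i$ and $\bar{x}$, $\bar{y}$ are the observed values and the means of the two variables, respectively.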
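A minimal Python sketch of this step is given below. It assumes that each row of the co-occurrence matrix is treated as the vector of observations for one concept, which is one plausible reading of the method; the function names are hypothetical, and returning 0.0 for a constant row is a convention chosen here, not something the method specifies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant row is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def similarity_matrix(cooccurrence):
    """Pairwise Pearson correlations between the rows of a co-occurrence matrix."""
    size = len(cooccurrence)
    return [
        [1.0 if i == j else pearson(cooccurrence[i], cooccurrence[j])
         for j in range(size)]
        for i in range(size)
    ]
```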
Step 5: perform cluster analysis on the result of step 4. Specifically, cluster the similarity matrix with the hierarchical clustering method, and then determine the clustering result of the concepts from the clustering statistics. The concepts comprise the extracted user cognition concepts and the concepts in the website catalog;
Clustering the similarity matrix with the hierarchical clustering method and then determining the clustering result of the concepts from the clustering statistics specifically comprises the following steps:
Step 5-1: determine the distances between samples to form a symmetric distance matrix. Let $T_i$ and $T_j$ denote samples $i$ and $j$, and let $d(T_i, T_j)$, abbreviated $d_{ij}$, denote the distance between them, computed with the variance-weighted distance formula. Treat the $N$ samples as $N$ classes. Let $M_p$ and $M_q$ denote two classes containing $N_p$ and $N_q$ samples, respectively, and let $D_{pq}$ denote the distance between $M_p$ and $M_q$. Compute the distance between every pair of samples to form a symmetric distance matrix $D(0)$;
Step 5-2: merge classes and generate a new distance matrix. Specifically, select the smallest off-diagonal element of $D(0)$. If this element is $D_{pq}$, where at this point $M_p = \{X_p\}$ and $M_q = \{X_q\}$, merge $M_p$ and $M_q$ into a new class $M_r = \{X_p, X_q\}$. Delete the rows and columns corresponding to $M_p$ and $M_q$ from $D(0)$, and add one row and one column holding the distances between the new class $M_r$ and each of the remaining classes not yet merged. This yields a new distance matrix $D(1)$, which is a square matrix of order $N-1$;
Step 5-3: repeat step 5-2 until the $N$ samples are merged into a single class;
Step 5-4: determine the clustering result of the concepts from the statistics of the hierarchical clustering method, which comprise the $R^2$ statistic, the semi-partial $R^2$ statistic, the pseudo-F statistic, and the pseudo-$t^2$ statistic.
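Steps 5-1 to 5-3 can be sketched in Python as follows. Two assumptions are made and should be read as such: the variance-weighted distance is taken in its common textbook form, $d_{ij} = \sqrt{\sum_k (x_{ik} - x_{jk})^2 / s_k^2}$ with $s_k^2$ the sample variance of feature $k$, and the distance from a merged class to the remaining classes is updated by complete linkage (the maximum of the two old distances), since step 5-2 does not fix an update rule. The function names are hypothetical, and the statistics of step 5-4 are not computed here.

```python
from math import sqrt


def variance_weighted_distance(samples):
    """Step 5-1 (assumed form): pairwise distances weighted by feature variance."""
    n, dims = len(samples), len(samples[0])
    means = [sum(row[k] for row in samples) / n for k in range(dims)]
    variances = [
        sum((row[k] - means[k]) ** 2 for row in samples) / (n - 1)
        for k in range(dims)
    ]

    def dist(a, b):
        # Features with zero variance carry no information and are skipped.
        return sqrt(sum((a[k] - b[k]) ** 2 / variances[k]
                        for k in range(dims) if variances[k] > 0))

    return [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def agglomerate(distance, linkage=max):
    """Steps 5-2 and 5-3: merge the two closest classes until one class remains.

    linkage=max gives complete linkage and linkage=min gives single linkage.
    Returns a list of (members_of_new_class, merge_distance) tuples.
    """
    clusters = {i: [i] for i in range(len(distance))}
    dist = {(i, j): distance[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(distance)
    while len(clusters) > 1:
        # Smallest off-diagonal element D_pq of the current distance matrix.
        (p, q), d_pq = min(dist.items(), key=lambda item: item[1])
        members = clusters.pop(p) + clusters.pop(q)
        # New row/column: distance from the merged class to every remaining class.
        new_dist = {}
        for r in clusters:
            d_pr = dist[(min(p, r), max(p, r))]
            d_qr = dist[(min(q, r), max(q, r))]
            new_dist[(r, next_id)] = linkage(d_pr, d_qr)
        # Delete the rows and columns of the two merged classes.
        dist = {k: v for k, v in dist.items() if p not in k and q not in k}
        dist.update(new_dist)
        clusters[next_id] = members
        merges.append((sorted(members), d_pq))
        next_id += 1
    return merges
```

The returned merge history plays the role of the sequence $D(0), D(1), \dots$ described above; cutting it at a chosen number of classes gives the final grouping.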
Step 6: analyze the similarity matrix from step 4 with multidimensional scaling to obtain a multidimensional scaling space diagram of the corresponding dimensionality, thereby completing the optimization analysis of the website classification catalog. Analyzing the similarity matrix with multidimensional scaling and generating the space diagram specifically comprises the following steps:
Step 6-1: generate the observation matrix. Specifically, describe the space as a Euclidean stimulus space and compute distances with the Minkowski distance function. Assume that the subjects' perception of the relationships between concepts in the website catalog serves as the basic input data. With $n$ objects, there are $n(n-1)/2$ object pairs with distances $S_{ij}$. The distance between points $i$ and $j$ is denoted $d_{ij}$ and, in the Euclidean case of the Minkowski distance, is given by
$$d_{ij} = \sqrt{\sum_{a=1}^{v}(X_{ia}-X_{ja})^2}$$
where $v$ is the number of dimensions, $X_{ia}$ is the coordinate of point $i$ on dimension $a$, and $X_{ja}$ is the coordinate of point $j$ on dimension $a$;
Step 6-2: homomorphic mapping. Specifically, seek a reduced q-dimensional space and apply a homomorphic mapping so that the distances $d_{ij}$ between object pairs in the q-dimensional space match the original distances $S_{ij}$ in the p-dimensional space. If $d_{ij}$ matches $S_{ij}$ completely, the distances between the paired objects satisfy $d_{i1} > d_{i2} > \dots > d_{im}$; that is, this descending order of distances is consistent with the ascending order of the original similarities;
Step 6-3: check reliability and validity and determine the optimal number of dimensions. Specifically, compute the degree of difference $K$, called the Kruskal coefficient, which is used to check whether the resulting space diagram is validly representative, and the stress index (Stress), a goodness-of-fit value defined as the deviation between the theoretical distances represented by the similarity assessment data and the computed distances. Stress is given by
$$\text{Stress} = \sqrt{\frac{\sum_{i<j}(d_{ij}-\hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}}$$
where $\hat{d}_{ij}$ is the reference value that both preserves the order relation of the subjects' original input concept distances and minimizes the stress index. The larger the $K$ value the better, and values above 0.60 are generally acceptable. Stress values within 0.20 are generally acceptable. The relationship between stress and goodness of fit is detailed in Table 1.
Table 1: Relationship between stress and goodness of fit
| Stress | Goodness of fit |
| 0.200 | Poor |
| 0.100 | Fair |
| 0.050 | Good |
| 0.025 | Very good |
| 0.000 | Perfect fit |
Step 6-4: generate the multidimensional scaling space diagram using the optimal number of dimensions determined in step 6-3.
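The stress check of step 6-3 can be sketched in Python as follows, assuming the standard Kruskal stress definition and Euclidean distances between points of a candidate configuration. The function names are hypothetical, and the disparities (the reference values that preserve the original order relation) are taken as given input rather than computed, since the method leaves their computation to the scaling software.

```python
from math import sqrt


def euclidean_distances(points):
    """Pairwise Euclidean distances between points of a v-dimensional configuration."""
    n = len(points)
    return [
        [sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
         for j in range(n)]
        for i in range(n)
    ]


def kruskal_stress(distances, disparities):
    """Stress over all pairs i < j, given configuration distances and disparities."""
    n = len(distances)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    numerator = sum((distances[i][j] - disparities[i][j]) ** 2 for i, j in pairs)
    denominator = sum(distances[i][j] ** 2 for i, j in pairs)
    return sqrt(numerator / denominator)  # assumes the points do not all coincide


def is_acceptable(stress, limit=0.20):
    """Apply the acceptance bound on stress stated in step 6-3."""
    return stress <= limit
```

Computing the stress for configurations of increasing dimensionality and choosing the lowest dimensionality with acceptable stress is one way to carry out the selection described in step 6-3.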
The present invention is described in further detail below with reference to an embodiment:
Research goal: optimization analysis of the lighting product classification catalog of Made-in-China.com.
Data description: custom group name data (6872 records) from users in four provinces and municipalities (Zhejiang, Shanghai, Jiangsu, and Guangdong) under the Lights & Lighting category of the product classification catalog of Made-in-China.com (international site, http://www.made-in-china.com/). Made-in-China.com refers to user cognition concepts as custom group names.
Step 1: preprocess the web log file data, specifically:
1) After the web log file data are cleaned, select the attributes needed for data processing, namely company name, province, city, and custom group name. The resulting format is shown in Table 2:
Table 2: Data format after data cleaning
2) First remove the numbering contained in the custom group names, then convert the names to lowercase, remove plural forms, and sort them alphabetically by first letter;
3) Because a very large number of the custom group names occur with low frequency, the threshold is set to 4 and the custom group names with a frequency greater than 4 are selected. In the end, 114 custom group names are selected and their frequencies recorded. The selected custom group names are shown in Table 3:
Table 3: Selected custom group names
Step 2: determine whether the custom group names co-occur with concepts in the website catalog. The specific operations are as follows:
1) Log in to the international site of Made-in-China.com, http://www.made-in-china.com/;
2) Enter the custom group name to be searched in the search box, select "Lights & Lighting" in the "all categories" drop-down menu, and then click Search;
3) Count the second-level classification catalog concepts that appear under "catalog" in the search results;
4) Click each concept that appears under "catalog" in turn; the concepts then shown under "catalog" are the concepts in the corresponding third-level classification catalog;
Record the second-level and third-level classification catalog concepts that appear under "catalog" by entering 1 in the corresponding cells. This yields the original co-occurrence statistics table, part of which is shown in Table 4:
Table 4: Partial co-occurrence statistics
In the subsequent processing, custom group names are handled in the same way with respect to second-level and third-level classification catalog concepts, so the co-occurrence between second-level classification catalog concepts and custom group names is used as the example below.
Step 3: generate the co-occurrence matrix, specifically:
1) Determine the co-occurrence frequency between each second-level classification catalog concept and each custom group name. Specifically, multiply the frequency of the custom group name by the frequency with which the concept appears in the second-level classification catalog. Partial results are shown in Table 5:
Table 5: Partial co-occurrence frequencies between second-level classification catalog concepts and custom group names
2) Determine the co-occurrence frequency between concepts in the second-level classification catalog;
This is computed from the result of the previous step. For example, to obtain the co-occurrence frequency of Interior lighting and LED lighting, whose columns in Excel are named B and C, the formula is SUM(MIN(B, C)); that is, first take the smaller of the two column values in each row, and then sum the results;
3) Fill in 0 for every co-occurrence frequency between custom group names. The resulting co-occurrence matrix is shown in Table 6:
Table 6: Partial co-occurrence matrix of second-level classification catalog concepts and custom group names
| | Interior_lighting | led_lighting | lighting_fixtures | bulb_lamp | lighting_decoration |
| Interior_lighting | | 14441 | 6587 | 11403 | 10697 |
| led_lighting | 14441 | | 6643 | 12204 | 11108 |
| lighting_fixtures | 6587 | 6643 | | 6467 | 5836 |
| bulb_lamp | 11403 | 12204 | 6467 | | 9433 |
| lighting_decoration | 10697 | 11108 | 5836 | 9433 | |
| outdoor_lighting | 14640 | 17255 | 6620 | 12498 | 11189 |
| camping_light | 1116 | 1116 | 995 | 1100 | 1110 |
| emergency_indicator_light | 2245 | 2240 | 2129 | 2226 | 2205 |
| torch | 653 | 653 | 582 | 645 | 648 |
| portable_lighting | 1364 | 1388 | 1289 | 1356 | 1353 |
Step 4: generate the similarity matrix. Using SAS software, compute Pearson correlation coefficients to obtain the similarity matrix. Partial results are shown in Table 7:
Table 7: Partial similarity matrix
Step 5: cluster analysis. Using SAS software, perform cluster analysis with the hierarchical clustering method. The ward, complete, and single methods, among others, were tried as the between-class distance method; comparison showed that method=ward gave the best result. Two classes are merged at each step, and the results of the last 15 merges are shown in Table 8:
Table 8: SAS clustering process results with method=ward
According to three statistics of the hierarchical clustering method, namely the semi-partial $R^2$ statistic (SPRSQ), the pseudo-F statistic (PSF), and the pseudo-$t^2$ statistic (PST2), the optimal number of classes is 4. The clustering result contains 127 concepts in total, of which 114 are custom group names and 13 are second-level classification catalog concepts. The 13 second-level catalog concepts fall within two of the four classes. The clustering result is shown in Table 9 (bold entries are second-level classification catalog concepts; entries not in bold are custom group names).
Table 9: Clustering result of custom group names and second-level catalog concepts
The four classes marked in the clustering tree in Fig. 2 correspond to the clustering result in Table 9. Concepts clustered into the same class are the most strongly correlated with one another. For example, in the fourth class, led_plug_light, induction_lamp, led_module, led_rigid_bar, led_moving_head, led_rope_light, led_dance_floor, and led_recessed_light are clustered into one class, which indicates that, among all concepts, these eight are the most strongly correlated and can be placed in the same category.
Step 6: multidimensional scaling. In this example, the analysis is performed directly with the multidimensional scaling function in SAS software. Multidimensional scaling can verify the accuracy of the clustering result and present it visually.
To make the multidimensional scaling result clearer, the concepts are replaced with variables X1 to X127, numbered consistently with the concept sequence numbers in the clustering result. The multidimensional scaling space diagram shows that the 127 concepts are divided into four classes, which directly verifies the clustering result and presents it intuitively, thereby completing the optimization analysis of the classification catalog for lighting products on Made-in-China.com.
As the above example shows, the method of the present invention conducts user research directly from web log file data, which saves the cost of user surveys and allows user information to be obtained comprehensively.