Detailed description of the invention
A method for optimization analysis of a website classified catalogue based on users' mental models comprises the following steps:
Step 1, pre-process the web log file data, specifically:
Step 1-1, clean the web log file data by deleting the records in the log file that are unrelated to the purpose of the analysis or that contain errors. The data unrelated to the purpose of the analysis include data containing concepts already present in the classified catalogue and data expressed as product codes; the data containing errors include misspellings and incorrect product descriptions. Then select the attributes needed for data processing, namely user name, user region, user cognition concept, and product category. A user cognition concept is a concept for optimizing the website classified catalogue that a user submits based on his or her own understanding of that catalogue: when a user browsing the website through the classified catalogue cannot find a suitable concept, the user submits through the website's interactive interface a concept that he or she considers more suitable. For example, a user who looks for data mining books on Joyo.com through the classified catalogue finds that such books are filed under the "database" category, considers this inappropriate, and thinks that data mining should appear directly as a category in the classified catalogue; "data mining" is then the user cognition concept.
Step 1-2, convert the format of the data cleaned in step 1-1, unifying the format of the three extracted attributes (user cognition concept, region, and name); specifically, remove numbering, unify letter case, and unify singular and plural forms;
Step 1-3, determine the frequency with which each user cognition concept occurs and then set a threshold. The threshold is determined by the actual data volume and the number of extracted user cognition concepts; for example, when both the actual data volume and the number of user cognition concepts are small, a lower threshold can be set in order to retain a sufficient amount of data. Select the user cognition concepts whose frequency is greater than the threshold and record their frequencies;
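Steps 1-2 and 1-3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the sample strings, the numbering pattern, and the naive rule of stripping a single trailing "s" are all assumptions made for illustration.

```python
import re
from collections import Counter


def normalize_concept(raw):
    """Normalize one user cognition concept (Step 1-2)."""
    text = raw.strip().lower()                    # unify letter case
    text = re.sub(r"^\d+[\s.)_:-]*", "", text)    # drop a leading numbering such as "01." or "2)"
    words = text.split()
    # Naive singularization (assumption): strip one trailing "s" from the last word.
    if words and words[-1].endswith("s") and not words[-1].endswith("ss"):
        words[-1] = words[-1][:-1]
    return " ".join(words)


def frequent_concepts(raw_concepts, threshold):
    """Count normalized concepts and keep those whose frequency exceeds the threshold (Step 1-3)."""
    normalized = (normalize_concept(c) for c in raw_concepts)
    counts = Counter(c for c in normalized if c)
    return {concept: n for concept, n in counts.items() if n > threshold}


# Hypothetical log entries, for illustration only.
sample = ["01. LED Bulbs", "led bulb", "LED Bulb", "2) Torch", "torch", "Camping Light"]
print(frequent_concepts(sample, threshold=1))
```

A real data set would need a proper plural-to-singular rule (for example for "torches"); the single-letter rule here only keeps the sketch short.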
Step 2, determine whether the user cognition concepts co-occur with the concepts in the website classified catalogue. Specifically, applying the mental model classification theory, submit each user cognition concept to the website as a search keyword, and count the classified catalogue concepts that appear in the retrieval results together with their frequencies;
According to the mental model classification theory, a user who obtains information through the classified catalogue of a website mainly clicks in a horizontal, vertical, or equal horizontal-vertical manner, and when clicking chooses the concepts that are most strongly related to one another in the classified catalogue. Following this principle, each user cognition concept is submitted to the website as a search keyword, and the classified catalogue concepts appearing in the retrieval results and their frequencies are counted, in order to analyze the correlation between the user cognition concept and the concepts in the website classified catalogue.
The mental model classification theory derives from experiments by Charles Cole, Yang Lin et al., who found that the three most common types of human mental model are the vertical type (26%), the horizontal type (31%), and the equal type (21%), which together account for the mental model types of 78% of people. A mental model is classified according to the number of levels and the number of regions in its mental model diagram. The three common types are characterized as follows:
● Vertical: a mental model with more levels in the vertical dimension than in the horizontal dimension
● Horizontal: a mental model with more levels in the horizontal dimension than in the vertical dimension
● Equal: a mental model with the same number of levels in the vertical and horizontal dimensions
This theory is extended to the process in which users obtain information through a classified catalogue: it is assumed that a user who obtains information through the classified catalogue of a website likewise clicks in a vertical, horizontal, or equal horizontal-vertical manner.
Step 3, generate a co-occurrence matrix. The co-occurrence matrix is a symmetric matrix whose first row and first column list the concepts, including the user cognition concepts and the concepts in the website classified catalogue; each remaining cell holds the co-occurrence frequency between the concept of its row and the concept of its column;
The co-occurrence frequencies between the concepts in the co-occurrence matrix are determined by the following steps:
Step 3-1, determine the co-occurrence frequency between a user cognition concept and a concept in the classified catalogue, which falls into two cases. The first is the co-occurrence frequency between a user cognition concept and a concept in the second-level classified catalogue, denoted F1:
F1 = p * x
where p is the frequency with which the second-level classified catalogue concept occurs in the retrieval results, and x is the frequency with which the user cognition concept occurs.
The second is the co-occurrence frequency between a user cognition concept and a concept in the third-level classified catalogue, which is the frequency with which the user cognition concept occurs;
Step 3-2, determine the co-occurrence frequency between two concepts in the classified catalogue, denoted F2: for each user cognition concept, take the smaller of the co-occurrence frequencies of the two catalogue concepts with that user cognition concept, and then sum these minima over all user cognition concepts. With m and n denoting the co-occurrence frequencies of catalogue concepts A and B, respectively, with a given user cognition concept, the formula is:
F2 = SUM(MIN(m, n))
Step 3-3, determine the co-occurrence frequency between user cognition concepts; the co-occurrence frequency between any two user cognition concepts is 0.
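As an illustration of steps 3-1 to 3-3 for the second-level catalogue, the following Python sketch assembles a small co-occurrence matrix. The function name, the dictionary layout, and the sample figures are assumptions for illustration; diagonal cells are simply left at 0 here.

```python
def build_cooccurrence(user_freq, hits):
    """Build the symmetric co-occurrence matrix of Step 3.

    user_freq: {user concept: frequency x of the concept in the log}
    hits:      {user concept: {catalogue concept: frequency p in the retrieval results}}
    """
    users = sorted(user_freq)
    cats = sorted({c for row in hits.values() for c in row})
    labels = cats + users

    # Step 3-1: F1 = p * x for each (user concept, catalogue concept) pair.
    f1 = {u: {c: hits.get(u, {}).get(c, 0) * user_freq[u] for c in cats} for u in users}

    matrix = {a: {b: 0 for b in labels} for a in labels}
    for u in users:
        for c in cats:
            matrix[u][c] = matrix[c][u] = f1[u][c]

    # Step 3-2: F2 = SUM(MIN(m, n)) over all user concepts, for each catalogue pair.
    for i, a in enumerate(cats):
        for b in cats[i + 1:]:
            f2 = sum(min(f1[u][a], f1[u][b]) for u in users)
            matrix[a][b] = matrix[b][a] = f2

    # Step 3-3: pairs of user concepts keep their initial value 0.
    return labels, matrix


# Hypothetical frequencies, for illustration only.
user_freq = {"led bulb": 5, "torch": 4}
hits = {"led bulb": {"led_lighting": 1, "bulb_lamp": 1}, "torch": {"led_lighting": 1}}
labels, matrix = build_cooccurrence(user_freq, hits)
for name in labels:
    print(name, [matrix[name][other] for other in labels])
```

In this toy input, bulb_lamp and led_lighting share only the "led bulb" concept with a non-zero minimum, so their F2 value is min(5, 5) + min(0, 4) = 5.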
Step 4, generate a similarity matrix on the basis of the co-occurrence matrix of step 3;
The similarity matrix is generated using the Pearson correlation coefficient as the similarity measure, with the formula
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where r measures the strength of the linear correlation between the two variables and satisfies 0 ≤ |r| ≤ 1, n is the sample size, and x_i, y_i and \bar{x}, \bar{y} are respectively the observations and the means of the two variables.
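A plain Python sketch of the Pearson calculation is given below, assuming that each concept is represented by its row of co-occurrence frequencies. The function names and the choice to return 0.0 for a constant row are assumptions, not part of the method as described.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def similarity_matrix(rows):
    """rows: one co-occurrence profile (list of numbers) per concept."""
    return [[pearson(a, b) for b in rows] for a in rows]


# Hypothetical profiles, for illustration only.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for row in similarity_matrix(profiles):
    print([round(value, 3) for value in row])
```

The first two profiles are proportional and therefore correlate at r = 1, while the third runs in the opposite direction and correlates at r = -1 with both.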
Step 5, perform cluster analysis on the basis of step 4. Specifically, cluster the similarity matrix by hierarchical (pedigree) clustering, and then determine the clustering result of the concepts from the clustering statistics; the concepts include the extracted user cognition concepts and the concepts in the website classified catalogue;
Clustering the similarity matrix by hierarchical clustering and determining the clustering result of the concepts from the clustering statistics specifically comprises the following steps:
Step 5-1, determine the distances between samples and form a symmetric distance matrix. Let Ti and Tj denote samples i and j, and let d(Ti, Tj), abbreviated dij, denote the distance between them. The variance-weighted distance is used, which in its standard form is
$$d_{ij} = \sqrt{\sum_{k}\frac{(x_{ik}-x_{jk})^2}{s_k^2}}$$
where x_ik is the value of sample i on the k-th variable and s_k^2 is the variance of the k-th variable.
Treat the N samples as N classes. Let Mp and Mq denote two classes containing Np and Nq samples respectively, and let Dpq denote the distance between Mp and Mq. Compute the distance between every pair of samples to form a symmetric distance matrix D(0);
Step 5-2, merge classes and generate a new distance matrix. Specifically, select the smallest off-diagonal element of D(0) and let it be Dpq, where Mp = {Xp} and Mq = {Xq}. Merge Mp and Mq into a new class Mr = {Xp, Xq}. Delete from D(0) the rows and columns corresponding to Mp and Mq, and add one row and one column consisting of the distances between the new class Mr and each of the remaining classes not yet merged, obtaining a new distance matrix D(1), which is a square matrix of order N-1;
Step 5-3, repeat step 5-2 until the N samples are merged into a single class;
Step 5-4, determine the clustering result of the concepts from the statistics of the hierarchical clustering method; the statistics include the R² statistic, the semi-partial R² statistic, the pseudo F statistic, and the pseudo t² statistic.
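The merge loop of steps 5-1 to 5-3 can be sketched in plain Python as follows. This is only an illustration: it updates between-class distances with single or complete linkage, which are simpler stand-ins chosen for brevity, and it does not compute the statistics of step 5-4. The function name and the sample distance matrix are assumptions.

```python
def agglomerate(dist, linkage="complete"):
    """Merge classes step by step, starting from a symmetric distance matrix D(0).

    Returns the merge history as (members_p, members_q, distance) tuples.
    """
    clusters = {i: (i,) for i in range(len(dist))}   # each sample starts as its own class
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    pick = min if linkage == "single" else max       # linkage rule for the new class
    history = []
    next_id = len(dist)
    while len(clusters) > 1:
        # Step 5-2: find the smallest off-diagonal element D_pq.
        (p, q), d_pq = min(d.items(), key=lambda item: item[1])
        history.append((clusters[p], clusters[q], d_pq))
        merged = clusters.pop(p) + clusters.pop(q)
        # Distances from the new class M_r to every remaining class.
        new_d = {}
        for k in clusters:
            d_kp = d[(min(k, p), max(k, p))]
            d_kq = d[(min(k, q), max(k, q))]
            new_d[(k, next_id)] = pick(d_kp, d_kq)
        # Remove the rows and columns of M_p and M_q, then add those of M_r.
        d = {pair: v for pair, v in d.items() if p not in pair and q not in pair}
        d.update(new_d)
        clusters[next_id] = merged
        next_id += 1                                  # Step 5-3: repeat until one class remains
    return history


# Hypothetical distances between four samples, for illustration only.
D0 = [[0, 1, 4, 5],
      [1, 0, 3, 6],
      [4, 3, 0, 2],
      [5, 6, 2, 0]]
for step in agglomerate(D0, linkage="complete"):
    print(step)
```

Ward's method would replace the `pick` rule with a variance-based update of the between-class distance; that update is omitted here.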
Step 6, analyze the similarity matrix of step 4 by multidimensional scaling to obtain a multidimensional scaling space map of the corresponding dimensionality, thereby completing the optimization analysis of the website classified catalogue. Analyzing the similarity matrix by multidimensional scaling and generating the multidimensional scaling space map specifically comprises the following steps:
Step 6-1, generate the observation matrix. Specifically, a Euclidean stimulus space is used for the spatial description, and distances are calculated with the Minkowski distance function. The subjects' perception of the relations between concepts in the website classified catalogue serves as the basic input data: given n objects, the distances Sij between the n(n-1)/2 object pairs are obtained. The distance between points i and j is denoted dij, and for the Euclidean stimulus space the formula is:
$$d_{ij} = \sqrt{\sum_{a=1}^{v}(X_{ia}-X_{ja})^2}$$
where v is the number of dimensions, Xia is the coordinate of point i in dimension a, and Xja is the coordinate of point j in dimension a;
Step 6-2, homomorphic mapping. Specifically, find a reduced q-dimensional space and apply a homomorphic mapping so that the distances dij between object pairs in the q-dimensional space match the original distances Sij. If dij and Sij match completely, the distances between paired objects satisfy di1 > di2 > ... > dim, that is, the descending order of these distances is consistent with the ascending order of the original similarities;
Step 6-3, test reliability and validity and determine the optimal number of dimensions. Specifically, calculate the degree of difference K, referred to as the Kruskal coefficient, which tests whether the obtained space map is representative, and the stress index (Stress), a goodness-of-fit value defined as the deviation between the distances implied by the similarity judgment data and the calculated distances. Stress is calculated as:
$$\mathrm{Stress} = \sqrt{\frac{\sum_{i<j}(d_{ij}-\hat{d}_{ij})^2}{\sum_{i<j}d_{ij}^2}}$$
where \hat{d}_{ij} is the reference value that satisfies the order relation of the concept distances originally input by the subjects while minimizing the stress index. The larger the K value the better, and a value above 0.60 is generally acceptable; a Stress value within 0.20 is generally acceptable. The detailed relation between the size of the stress index and the goodness of fit is shown in Table 1.
Table 1 Relation between stress index size and goodness of fit
Stress | Goodness of fit
0.200 | Poor
0.100 | Fair
0.050 | Good
0.025 | Excellent
0.000 | Perfect fit
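The stress check of step 6-3 can be illustrated with the short Python sketch below, which evaluates Kruskal's stress formula for a given point configuration. The coordinates and reference distances are made-up values, the function names are assumptions, and the fitting of the configuration itself is not shown.

```python
from math import sqrt


def euclidean(p, q):
    """Euclidean distance d_ij between two points given as coordinate tuples."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress(coords, reference):
    """Stress of a configuration.

    coords:    list of coordinate tuples, one per concept
    reference: {(i, j): reference distance d-hat_ij} for the pairs i < j
    """
    num = den = 0.0
    for (i, j), d_hat in reference.items():
        d_ij = euclidean(coords[i], coords[j])
        num += (d_ij - d_hat) ** 2
        den += d_ij ** 2
    if den == 0:
        raise ValueError("all points coincide; stress is undefined")
    return sqrt(num / den)


# Hypothetical two-dimensional configuration and reference distances.
coords = [(0, 0), (3, 4), (6, 8)]
reference = {(0, 1): 5, (0, 2): 10, (1, 2): 6}
value = stress(coords, reference)
print(round(value, 4), "acceptable" if value <= 0.20 else "not acceptable")
```

Here only the pair (1, 2) deviates from its reference distance (5 against 6), so the stress is sqrt(1/150), roughly 0.08, which falls between the 0.100 and 0.050 rows of Table 1.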
Step 6-4, generate the multidimensional scaling space map according to the optimal number of dimensions determined in step 6-3.
The present invention is described in further detail below with reference to an embodiment:
Research objective: optimization analysis of the lighting product classified catalogue of Made-in-China.com.
Data description: user-defined group name data (6872 records) from the four provinces and municipalities of Zhejiang, Shanghai, Jiangsu, and Guangdong under the Lights&Lighting category of the product classified catalogue of Made-in-China.com (international site, http://www.made-in-china.com/). Made-in-China.com refers to a user cognition concept as a self-defined group name.
Step 1, pre-process the web log file data, specifically:
1) After cleaning the web log file data, filter out the attributes needed for data processing, including company name, province, city, and self-defined group name; the format is shown in Table 2:
Table 2 Data format after data cleaning
2) First remove the numbering contained in the self-defined group names, then convert the self-defined group names to lower case, remove plural forms, and sort them alphabetically by first letter;
3) Because the number of low-frequency self-defined group names is very large, the threshold is set to 4 and the user-defined group names with a frequency greater than 4 are selected; in the end 114 self-defined group names are selected and their frequencies recorded. The selected self-defined group names are shown in Table 3:
Table 3 Self-defined group name selection results
Step 2, determine whether the self-defined group names co-occur with the concepts in the website classified catalogue. The specific operations are as follows:
1) Log in to the international site of Made-in-China.com, http://www.made-in-china.com/;
2) Enter the self-defined group name to be retrieved in the search box, select "Lights&Lighting" in the "all categories" drop-down menu, and then click search;
3) Count the second-level classified catalogue concepts that appear under "catalog" in the retrieval results;
4) Click in turn on each concept that appears under "catalog"; the concepts that then appear under "catalog" are the concepts of the corresponding third-level classified catalogue;
Record the second-level and third-level classified catalogue concepts that appear under "catalog" by entering 1 in the corresponding cells, which gives the original co-occurrence relation table; partial results are shown in Table 4:
Table 4 Partial co-occurrence result statistics
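The recording of the original co-occurrence relation can be sketched in Python as follows, assuming the retrieval results have already been collected by hand into a dictionary. The function name and the sample entries are hypothetical and do not reproduce the data of Table 4.

```python
def cooccurrence_table(results):
    """Turn retrieval results into a 0/1 co-occurrence table.

    results: {self-defined group name: set of catalogue concepts seen under "catalog"}
    """
    concepts = sorted({c for found in results.values() for c in found})
    table = {name: {c: int(c in found) for c in concepts}
             for name, found in results.items()}
    return concepts, table


# Hypothetical retrieval results, for illustration only.
results = {
    "led bulb": {"led_lighting", "bulb_lamp"},
    "torch": {"portable_lighting"},
}
concepts, table = cooccurrence_table(results)
print(concepts)
for name, row in table.items():
    print(name, [row[c] for c in concepts])
```

A cell is 1 when the catalogue concept appeared for that group name and 0 otherwise, which matches the "fill in 1" rule described above.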
In the subsequent processing, the self-defined group names are processed in the same way with the concepts in the second-level classified catalogue as with those in the third-level classified catalogue; the following takes the co-occurrence relation between the second-level classified catalogue concepts and the self-defined group names as an example.
Step 3, generate the co-occurrence matrix, specifically:
1) Determine the co-occurrence frequency between each concept in the second-level classified catalogue and each self-defined group name; specifically, multiply the frequency of the self-defined group name by the frequency with which the second-level classified catalogue concept occurs, giving the partial results shown in Table 5:
Table 5 Partial co-occurrence frequencies between second-level classified catalogue concepts and self-defined group names
2) Determine the co-occurrence frequency between the concepts in the second-level classified catalogue;
This is calculated from the result of the previous step. For example, to obtain the co-occurrence frequency of Interior lighting and LED lighting, whose columns in Excel are B and C, the formula is SUM(MIN(B, C)): first take the smaller of the two values in each row of the two columns, and then sum these values;
3) Fill in 0 for all co-occurrence frequencies between self-defined group names; the resulting co-occurrence matrix is shown in Table 6:
Table 6 Partial co-occurrence matrix of second-level classified catalogue concepts and self-defined group names
 | Interior_lighting | led_lighting | lighting_fixtures | bulb_lamp | lighting_decoration
Interior_lighting | | 14441 | 6587 | 11403 | 10697
led_lighting | 14441 | | 6643 | 12204 | 11108
lighting_fixtures | 6587 | 6643 | | 6467 | 5836
bulb_lamp | 11403 | 12204 | 6467 | | 9433
lighting_decoration | 10697 | 11108 | 5836 | 9433 |
outdoor_lighting | 14640 | 17255 | 6620 | 12498 | 11189
camping_light | 1116 | 1116 | 995 | 1100 | 1110
emergency_indicator_light | 2245 | 2240 | 2129 | 2226 | 2205
torch | 653 | 653 | 582 | 645 | 648
portable_lighting | 1364 | 1388 | 1289 | 1356 | 1353
Step 4, generate the similarity matrix. Using SAS software, the Pearson correlation coefficient is selected for the calculation to obtain the similarity matrix; partial results are shown in Table 7:
Table 7 Partial similarity matrix results
Step 5, cluster analysis. Using SAS software, the hierarchical clustering method is chosen for the cluster analysis. The between-class distance methods ward, complete, and single, among others, are tried; comparison shows that method=ward gives the best result. Samples are merged two classes at a time, and the results of the last 15 merges are shown in Table 8:
Table 8 SAS clustering process results (method=ward)
Based on three statistics of the hierarchical clustering method, the semi-partial R² statistic (SPRSQ), the pseudo F statistic (PSF), and the pseudo t² statistic (PST2), the optimal number of classes is chosen as 4. The clustering result contains 127 concepts in total, of which 114 are self-defined group names and 13 are second-level classified catalogue concepts; the 13 second-level catalogue concepts fall into two of the four classes. The clustering result is shown in Table 9 (concepts in bold are second-level classified catalogue concepts; concepts not in bold are self-defined group names).
Table 9 Clustering result of self-defined group names and second-level catalogue concepts
The four classes marked in the clustering tree of Fig. 2 correspond to the clustering result in Table 9. The clustering result indicates that the concepts gathered into one class are the most strongly correlated with one another. For example, in the fourth class the eight concepts led_plug_light, induction_lamp, led_module, led_rigid_bar, led_moving_head, led_rope_light, led_dance_floor, and led_recessed_light are gathered into one class, which shows that among all the concepts these eight are the most strongly correlated with one another and should be placed in the same category.
Step 6, multidimensional scaling. This example uses the multidimensional scaling function of SAS software directly for the analysis; multidimensional scaling verifies the accuracy of the clustering result and presents it visually.
To make the multidimensional scaling result clearer, the concepts are replaced by the variables X1 to X127, with the variable numbers matching the concept sequence numbers in the clustering result. The multidimensional scaling space map shows that the 127 concepts are divided into four classes; this confirms the clustering result and presents the clustering of the concepts more intuitively, thereby completing the optimization analysis of the lighting product classified catalogue of Made-in-China.com.
As the above example shows, the method of the present invention carries out user research directly on web log file data, which saves the cost of user surveys and allows user information to be obtained comprehensively.