CN103425661B - Website data analysis method and analysis system - Google Patents

Website data analysis method and analysis system

Info

Publication number: CN103425661B
Authority: CN (China)
Prior art keywords: data stream, data, access, page, stream
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN201210151293.4A
Other languages: Chinese (zh)
Other versions: CN103425661A (en)
Inventor: 殷霞
Current Assignee: Alibaba Group Holding Ltd (the listed assignees may be inaccurate)
Original Assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201210151293.4A
Publication of CN103425661A
Application granted
Publication of CN103425661B

Abstract

The present application provides a website data analysis method and analysis system that can analyze the data of a whole site from the perspective of data streams. The method includes: obtaining access data streams by analyzing website log data, where each access data stream records the order in which web pages were accessed; rejecting the access data streams that contain no important page, where an important page is a page that meets a predefined attribute; performing frequent mining on the remaining access data streams, which contain important pages, to obtain the m access data streams with the highest occurrence frequency and the occurrence frequency of each; for the m access data streams, calculating the number of times important pages occur in each data stream and the length of each data stream; and calculating the quality score of each of the m access data streams from its occurrence frequency, the number of times important pages occur in it, and its length. The application uses the analysis of website data streams to guide website UI design.

Description

Website data analysis method and analysis system
Technical field
The present application relates to network technology, and in particular to a website data analysis method and analysis system.
Background
A website is a three-dimensional system: viewed from different angles, it yields different results, and behind every click lies a behavior. Website user behavior analysis uses network data to dissect people's specific online behavior, revealing their underlying needs, the rise and fall of website traffic, the segmentation of the visiting population, and the access intentions of customer groups.
For example, consider two users. After logging in to a website, one first clicks the "Technology" channel and then clicks "Internet". The other first clicks the site's "Technology" channel, then clicks "Digital", but stays on "Digital" only briefly before clicking "Internet". To some extent, the operating habits of these two users are consistent, and from the content they are interested in, one can judge with some probability that they work in the IT industry. By grouping many such similar cases and analyzing the data, the general types of website users can be obtained.
In addition, a user's mouse clicks reveal, to some extent, the user's visual track on a web page. By the general rules of human behavior, a user first clicks the page element he or she notices first, whether it is a button or something else. Summarizing and analyzing users' mouse clicks therefore shows roughly how users visually browse a page, from which one can tell whether a page design is reasonable, that is, whether users actually notice and click the positions the website needs them to click. This ultimately affects the information architecture, and even the overall structure, of the whole website.
However, existing website behavior data analysis focuses on the behavior of a single user or a customer group. Such methods are not suitable for other applications such as guiding website UI design, which need to be considered from the perspective of the website as a whole.
Therefore, the technical problem to be solved is to provide a website behavior data analysis method that analyzes user behavior data from the perspective of the whole site's data.
Summary of the invention
The present application provides a website data analysis method and analysis system that can analyze the data of a whole site from the perspective of data streams.
To solve the above problem, the present application discloses a website data analysis method, comprising:
obtaining access data streams by analyzing website log data, wherein each access data stream records the order in which web pages were accessed;
rejecting the access data streams that contain no important page, wherein an important page is a page that meets a predefined attribute;
performing frequent mining on the remaining access data streams, which contain important pages, to obtain the m access data streams with the highest occurrence frequency and the occurrence frequency of each access data stream, m being a positive integer;
for the m access data streams, calculating the number of times important pages occur in each data stream and the length of each data stream;
calculating the quality score of each of the m access data streams from the occurrence frequency of the access data stream, the number of times important pages occur in it, and its length.
Preferably, the method further comprises: ranking the m access data streams by quality score; and analyzing the design of each block of a page according to the quality score ranking.
Preferably, obtaining access data streams by analyzing website log data comprises: extracting access paths from the website log data by analyzing it; converting the access paths into a tree to obtain an access path tree; and traversing the access path tree depth-first to obtain the access data streams.
Preferably, the pages that meet the predefined attribute include: pages that produce feedback behavior; and/or operated campaign pages.
Preferably, the quality score of a data stream is directly proportional to the occurrence frequency of the data stream, directly proportional to the number of times important pages occur in the data stream, and inversely proportional to the length of the data stream.
Preferably, calculating the quality score of each of the m access data streams comprises:
calculating according to the following formula:
S = a0 + (α·frequency(g) + β·quality(g)) / (γ·lenth(g));
where g denotes a data stream;
frequency(g) denotes the occurrence frequency of the data stream, and α is the influence factor parameter of frequency(g);
quality(g) denotes the number of times important pages occur in the data stream, and β is the influence factor parameter of quality(g);
lenth(g) denotes the length of the data stream, and γ is the influence factor parameter of lenth(g);
a0 denotes the data availability parameter.
Preferably, the quality scores of the data streams are first calculated from the website log data of a certain time period; thereafter, at every preset time interval, the quality scores are calculated from each batch of incremental log data.
The present application also provides a website data analysis system, comprising:
a log analysis module, configured to obtain access data streams by analyzing website log data, wherein each access data stream records the order in which web pages were accessed;
a data rejection module, configured to reject the access data streams that contain no important page, wherein an important page is a page that meets a predefined attribute;
a frequent mining module, configured to perform frequent mining on the remaining access data streams, which contain important pages, to obtain the m access data streams with the highest occurrence frequency and the occurrence frequency of each access data stream, m being a positive integer;
an additional metric calculation module, configured to calculate, for the m access data streams, the number of times important pages occur in each data stream and the length of each data stream;
a quality score calculation module, configured to calculate the quality score of each of the m access data streams from the occurrence frequency of the access data stream, the number of times important pages occur in it, and its length.
Preferably, the system further comprises: a ranking module, configured to rank the m access data streams by quality score and to analyze the design of each block of a page according to the quality score ranking.
Preferably, the log analysis module comprises:
an extraction submodule, configured to extract access paths from the website log data by analyzing it;
a conversion submodule, configured to convert the access paths into a tree to obtain an access path tree;
a traversal submodule, configured to traverse the access path tree depth-first to obtain the access data streams.
Preferably, the quality score calculation module calculates according to the following formula:
S = a0 + (α·frequency(g) + β·quality(g)) / (γ·lenth(g));
where g denotes a data stream;
frequency(g) denotes the occurrence frequency of the data stream, and α is the influence factor parameter of frequency(g);
quality(g) denotes the number of times important pages occur in the data stream, and β is the influence factor parameter of quality(g);
lenth(g) denotes the length of the data stream, and γ is the influence factor parameter of lenth(g);
a0 denotes the data availability parameter.
Compared with the prior art, the present application has the following advantages:
First, the application analyzes the behavior of all users of the whole site from the perspective of data streams, rather than analyzing the particular behavior of a few individual users. Further, the analysis of website data streams guides website UI design and the work of website operators.
Second, the application labels and classifies users' website access behavior data and rejects the meaningless part, which reduces the target data set by at least one order of magnitude and lightens the computation.
Third, the application adds two metrics to the calculation of the data stream quality score: the length of the data stream and the number of times important pages occur in it. This allows accurate data streams to be found relatively quickly and avoids the user loss caused by overly long data streams.
Finally, because rejecting a batch of data reduces the data volume by orders of magnitude, and some applications care mostly about incremental data, the incremental data does not affect the earlier data set: new data is simply added on top of the earlier result set. The full data set need not be recomputed, so the amount of data to compute is small, which makes real-time data analysis possible.
Of course, any product implementing the application does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
Fig. 1 is a flow chart of a website data analysis method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of the access path tree in an embodiment of the present application;
Fig. 3 is a flow chart of a website data analysis method according to another embodiment of the present application;
Fig. 4 is a structural diagram of a website data analysis system according to an embodiment of the present application.
Detailed description
To make the above objects, features, and advantages of the application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
The application introduces the concept of a data stream and analyzes the behavior of all users of the whole site from the perspective of data streams. Moreover, by optimizing the frequent subtree calculation, accurate data streams can be found relatively quickly.
The implementation flow of the method is described in detail below through embodiments.
Referring to Fig. 1, a flow chart of a website data analysis method according to an embodiment of the present application is shown.
Step 101: obtain access data streams by analyzing website log data, where each access data stream records the order in which web pages were accessed.
An access data stream is the sequence in which a user accesses web pages. For example, if a user's access data stream is A → C → F, the user first accesses page A, then jumps from page A to page C, and then jumps from page C to page F, forming one access data stream (data stream for short). A, C, and F are called the page nodes of the access data stream, or nodes for short.
Access data streams can be obtained by analyzing website log data. This embodiment gives one way of obtaining them, but the protection scope of the application should not be limited to it.
It specifically includes the following sub-steps:
Sub-step 1: extract access paths from the website log data by analyzing it.
The website log records the behavior data of users' visits to the website, so by analyzing the website log data one can learn which pages a user accessed within a period of time, which forms that user's access path. Of course, one user may have multiple access paths.
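Sub-step 1 might be sketched as follows. The log record layout (user_id, timestamp, page) and the function name are assumptions made for illustration; the source does not specify a log format. Each user's visits are ordered by time and turned into pairs ordered as (source page, current page), an ordering chosen here for the sketch.

```python
from collections import defaultdict


def extract_access_pairs(log_records):
    """Turn (user_id, timestamp, page) records into per-user page pairs.

    The record layout is a hypothetical one used only for this sketch.
    Each returned pair is (source page, current page).
    """
    visits_by_user = defaultdict(list)
    for user_id, timestamp, page in log_records:
        visits_by_user[user_id].append((timestamp, page))

    pairs_by_user = {}
    for user_id, visits in visits_by_user.items():
        visits.sort()  # order each user's visits by timestamp
        ordered_pages = [page for _, page in visits]
        # Pair every page with the page visited immediately after it.
        pairs_by_user[user_id] = list(zip(ordered_pages, ordered_pages[1:]))
    return pairs_by_user


records = [
    ("u1", 3, "F"), ("u1", 1, "A"), ("u1", 2, "C"),
    ("u2", 1, "A"), ("u2", 2, "B"),
]
pairs = extract_access_pairs(records)
```

Here user u1 visited A, C, F in that order, so the sketch yields the pairs (A, C) and (C, F) for u1.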
Sub-step 2: convert the access paths into a tree to obtain the access path tree.
The tree is a data structure with a tree-shaped relationship among root nodes, child nodes, and leaf nodes. One way of conversion is given below, but the protection scope of the application should not be limited to it.
For example, for each step of a user's access, record the current step and the source step (that is, the previous step). Each record is a pair (a, b), where a is the source page (that is, the previous page) and b is the currently accessed page. Recording the user's accesses this way yields records such as (a, b), (a, c), (c, d), (d, e), (a, e), (h, g), each pair being one access path. From these access paths, the tree shown in Fig. 2 can be drawn.
Sub-step 3: traverse the access path tree depth-first to obtain the access data streams.
In computer science, tree traversal is the process of visiting the nodes of a tree in some order. For a binary tree there are generally four traversals: pre-order, in-order, post-order, and breadth-first. The first three are collectively called depth-first traversal. For a multiway tree there are generally two: depth-first and breadth-first.
In this embodiment, a depth-first traversal of a tree yields all access data streams of that tree. Every access data stream is a complete stream, that is, a stream that starts from a root node. Moreover, one data stream is produced by one user, and one user may produce many data streams.
For example, a depth-first traversal of Fig. 2 yields four data streams:
a→b;
a→c→d→e;
a→e;
h→g.
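Sub-steps 2 and 3 can be illustrated with a small sketch. It assumes that each record is ordered as (source page, current page), which is the reading consistent with the four streams listed above, and that the records contain no cycles; the function name is hypothetical. Pages that never appear as a current page are treated as roots, and every root-to-leaf path is one access data stream.

```python
def access_streams(pairs):
    """Enumerate root-to-leaf paths of the forest described by
    (source page, current page) pairs. Assumes the pairs are acyclic."""
    children = {}
    has_parent = set()
    for source, current in pairs:
        children.setdefault(source, []).append(current)
        children.setdefault(current, [])
        has_parent.add(current)

    # A root is a page that is never reached from another page.
    roots = [node for node in children if node not in has_parent]
    streams = []

    def dfs(node, path):
        path = path + [node]
        if not children[node]:  # leaf reached: one complete stream
            streams.append(path)
            return
        for child in children[node]:
            dfs(child, path)

    for root in roots:
        dfs(root, [])
    return streams


pairs = [("a", "b"), ("a", "c"), ("c", "d"), ("d", "e"), ("a", "e"), ("h", "g")]
streams = access_streams(pairs)
```

With the example records, the sketch returns the same four streams as the text: a→b, a→c→d→e, a→e, and h→g.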
Step 102: reject the access data streams that contain no important page, where an important page is a page that meets a predefined attribute.
The pages that meet the predefined attribute include:
pages that produce feedback behavior, that is, pages with feedback;
and/or operated campaign pages.
The feedback behavior produced on a page with feedback may specifically include: placing an order, leaving an online message for the seller, clicking the seller's contact information, clicking the seller's business number, clicking the seller's TradeManager, and clicking to sign and file a contract.
Compared with the prior art, the improvement made by this embodiment is that the data streams produced by users are divided into two classes: streams with feedback and streams without feedback. Because the data analysis is concerned with the feedback that users produce while visiting the website, this kind of data is the focus. During the depth-first traversal of the tree, the operation streams (data streams) that contain a feedback step are marked, so the target data of the analysis is mainly the streams with feedback.
Specifically, during the depth-first traversal of the tree, the pages with feedback or the operated campaign pages can be tagged to identify them as important pages that meet the predefined attribute and to show that they are special. A rejection operation is then applied to the access data streams obtained by the traversal: if a data stream contains no tagged important page, it is rejected.
This rejection removes the data that is meaningless for the analysis, reducing the target data set by one or more orders of magnitude.
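The rejection in step 102 amounts to a simple filter. In the sketch below, the set of important pages is a hypothetical input (in practice it would come from the tagging described above), and the function name is an assumption.

```python
def reject_streams_without_important_pages(streams, important_pages):
    """Keep only the streams that contain at least one important page."""
    return [
        stream for stream in streams
        if any(page in important_pages for page in stream)
    ]


streams = [["a", "b"], ["a", "c", "d", "e"], ["a", "e"], ["h", "g"]]
important_pages = {"e"}  # hypothetical: page e produces feedback behavior
remaining = reject_streams_without_important_pages(streams, important_pages)
```

Only the two streams that pass through page e survive the filter in this example.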
Step 103: perform frequent mining on the remaining access data streams, which contain important pages, to obtain the m access data streams with the highest occurrence frequency and the occurrence frequency of each, m being a positive integer.
The occurrence frequency of a data stream is the number of times the data stream occurs. If user A produces a data stream A → C → F and user B produces the same data stream A → C → F, the occurrence frequency of this data stream is 2.
This step uses frequent subtree mining. For the remaining access data streams that contain important pages, any frequent mining algorithm in the prior art is used to recursively compute the m frequent subtrees (that is, data streams) that rank highest by occurrence frequency, together with the occurrence frequency of each data stream. The value of m can be chosen according to actual needs. Data streams with a lower occurrence frequency are not considered in the subsequent steps, which reduces the amount of data to compute.
Note that the data streams obtained by frequent mining differ from those obtained by depth-first traversal. For example, suppose the depth-first traversal yields the data stream A → B → C → D → E and this stream contains an important page that meets the predefined attribute. The data streams obtained by recursive frequent mining on this stream then include B → C → D → E, C → D → E, and D → E.
Based on this, one way of computing data streams through frequent subtrees is as follows:
Take the four streams A → B → C → D → E, A1 → B → C → D → E1, A → B → C → D → E2, and A2 → B → C1 → D → E1, and assume each contains a page with the predefined attribute. In the recursively computed data streams, B → C → D has occurrence frequency 3, A → B → C → D has occurrence frequency 2, D → E1 has occurrence frequency 2, and all the rest have occurrence frequency 1. (B → C and C → D need not be considered in this result: each has occurrence frequency 3, and both are already contained in B → C → D, which has the same frequency of 3.) If the three subtrees with the highest occurrence frequency are required, they are B → C → D, A → B → C → D, and D → E1.
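The worked example can be reproduced with a brute-force sketch. This is not a frequent subtree mining algorithm from the prior art, only a stand-in under stated assumptions: candidate patterns are contiguous runs of at least two pages, each stream counts a pattern at most once, and a pattern is dropped when a longer pattern containing it has the same frequency (as with B → C and C → D above). All function names are hypothetical.

```python
from collections import Counter


def contiguous_runs(stream, min_len=2):
    """All contiguous runs of at least min_len pages, as a set of tuples."""
    n = len(stream)
    return {tuple(stream[i:j]) for i in range(n) for j in range(i + min_len, n + 1)}


def is_contiguous_part(short, long):
    """True if the tuple `short` occurs as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def top_frequent_streams(streams, m, min_len=2):
    counts = Counter()
    for stream in streams:
        counts.update(contiguous_runs(stream, min_len))  # once per stream

    # Drop patterns already covered by a longer pattern of equal frequency.
    kept = {
        seq: freq for seq, freq in counts.items()
        if not any(
            len(other) > len(seq)
            and counts[other] == freq
            and is_contiguous_part(seq, other)
            for other in counts
        )
    }
    # Rank by frequency, breaking ties in favor of longer patterns.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], -len(item[0]), item[0]))
    return ranked[:m]


streams = [
    ["A", "B", "C", "D", "E"],
    ["A1", "B", "C", "D", "E1"],
    ["A", "B", "C", "D", "E2"],
    ["A2", "B", "C1", "D", "E1"],
]
top3 = top_frequent_streams(streams, m=3)
```

For the four example streams, the sketch returns B → C → D with frequency 3, followed by A → B → C → D and D → E1 with frequency 2 each. The pairwise comparison is quadratic in the number of patterns, so a real deployment would substitute a proper mining algorithm.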
Step 104: for the m access data streams, calculate the number of times important pages occur in each data stream and the length of each data stream.
The number of times important pages occur is the number of important pages a data stream contains. For example, in the data stream A → B → C → D → E, if pages B and D are important pages, the number of times important pages occur in this stream is 2.
The length of a data stream is the number of page nodes it contains. For example, the length of the data stream A → B → C → D → E is 5.
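The two metrics of step 104 are straightforward to compute. The sketch below uses the example stream from the text; the function name is hypothetical.

```python
def important_page_count(stream, important_pages):
    """Number of positions in the stream occupied by an important page."""
    return sum(1 for page in stream if page in important_pages)


stream = ["A", "B", "C", "D", "E"]
quality = important_page_count(stream, {"B", "D"})  # pages B and D are important
length = len(stream)  # number of page nodes in the stream
```

A page that appears more than once in a stream is counted each time it appears.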
Step 105: calculate the quality score of each of the m access data streams from its occurrence frequency, the number of times important pages occur in it, and its length.
Existing frequent subtree algorithms mostly consider only whether the occurrence frequency of a frequent subtree (data stream) exceeds some threshold and ignore other factors. In the analysis of this embodiment, occurrence frequency alone does not meet the need, so two more metrics that affect the final quality score are added: the scale of the frequent subtree (the number of nodes in the subtree) and the quality of the subtree (the number of important nodes in the subtree). Together, the three metrics determine the quality score S of the top N data streams (the N data streams that rank highest). This is also one of the improvements this embodiment makes to existing frequent subtree calculation.
The metrics are described as follows:
1) The frequency of the subtree, also called the occurrence frequency of the data stream. Its influence factor parameter is α, and S is directly proportional to α;
2) The quality of the subtree, also called the number of times important pages occur in the data stream. Its influence factor parameter is β, and S is directly proportional to β;
3) The scale of the subtree, also called the length of the data stream. Its influence factor parameter is γ, and S is inversely proportional to γ.
The influence factor parameters α, β, and γ can also be understood as the weights of the three metrics and can be set according to the actual application scenario.
It follows that the quality score S of a data stream is directly proportional not only to the occurrence frequency of the data stream but also to the number of times important pages occur in it, and is inversely proportional to its length.
In other words, the data stream with the highest occurrence frequency, which the prior art treats as optimal, is not necessarily optimal: if a data stream is too long, its quality score suffers as well.
Based on these direct and inverse proportionalities, this embodiment gives one calculation formula, but the protection scope of the application should not be limited to it:
S = a0 + (α·frequency(g) + β·quality(g)) / (γ·lenth(g));
where g denotes a data stream;
frequency(g) denotes the occurrence frequency of the data stream, and α is the influence factor parameter of frequency(g);
quality(g) denotes the number of times important pages occur in the data stream, and β is the influence factor parameter of quality(g);
lenth(g) denotes the length of the data stream, and γ is the influence factor parameter of lenth(g);
a0 denotes the data availability parameter. For example, when calculating the quality scores of data streams on a shopping website whose weekday data is more usable than its weekend data, different values of a0 can be set for weekday data and weekend data.
For example:
The following data streams are to be analyzed, where H and D are pages that produce feedback behavior:
1) A → B → D → F → H
2) A → C → H
3) A → B → D → F → H → A → B → H
The analysis results are as follows:
Stream 1: occurrence count 2, feedback node count 2, length 5;
Stream 2: occurrence count 1, feedback node count 1, length 3;
Stream 3: occurrence count 1, feedback node count 3, length 8.
Based on preliminary data research, the weights are given as α = 2.5, β = 4, γ = 1 (the specific values can be set according to preliminary data research and analysis, or according to specific needs).
This gives (the a0 term is omitted here):
S1 = (2.5*2 + 4*2) / (1*5) = 2.6
S2 = (2.5*1 + 4*1) / (1*3) ≈ 2.2
S3 = (2.5*1 + 4*3) / (1*8) = 1.8125
Stream 1 is therefore the optimal data stream among the three.
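The arithmetic of this example can be checked with a short sketch. It assumes the denominator of the formula is the product of γ and the stream length, which is the reading that reproduces the three scores above, and it takes a0 as 0 because the worked example leaves that term out. The function and variable names are hypothetical.

```python
def quality_score(frequency, quality, length, alpha, beta, gamma, a0=0.0):
    """S = a0 + (alpha*frequency + beta*quality) / (gamma*length)."""
    return a0 + (alpha * frequency + beta * quality) / (gamma * length)


# (occurrence count, feedback node count, length) for streams 1 to 3
metrics = {1: (2, 2, 5), 2: (1, 1, 3), 3: (1, 3, 8)}
scores = {
    stream_id: quality_score(f, q, n, alpha=2.5, beta=4, gamma=1)
    for stream_id, (f, q, n) in metrics.items()
}
best = max(scores, key=scores.get)
```

The sketch gives 2.6 for stream 1, about 2.17 for stream 2 (2.2 after rounding to one decimal place), and 1.8125 for stream 3, so stream 1 ranks first.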
In summary, the website data analysis method provided by the above embodiment improves on the prior art in the following ways:
First, existing website data analysis methods seldom use the global concept of a data stream to analyze user behavior, and the prior art analyzes a particular user. This embodiment analyzes website data from the perspective of data streams and analyzes the behavior of all users of the whole site, rather than the particular behavior of a few individual users.
Second, existing frequent subtree algorithms do not perform a first round of screening and rejection on the initial result set. This embodiment labels and classifies users' website access behavior data and rejects the meaningless part, which reduces the target data set by at least one order of magnitude and lightens the computation.
Third, existing frequent subtree algorithms mostly consider only the occurrence count of a subtree and ignore other factors. This embodiment adds two metrics to the calculation of the data stream quality score, namely the scale of the subtree (the length of the data stream) and the quality of the subtree (the number of times important pages occur in the data stream), so that accurate data streams can be found relatively quickly and the user loss caused by overly long data streams is avoided.
Building on the embodiment of Fig. 1, the embodiment of Fig. 3 is described in more detail below in connection with website UI design. In the embodiment of Fig. 3, the access data streams can be ranked by quality score, and the design of each block of a page can be analyzed according to the quality score ranking.
Referring to Fig. 3, a flow chart of a website data analysis method according to another embodiment of the present application is shown.
Steps 201a and 201b can be executed in parallel or one after the other, and their order is interchangeable. Fig. 3 shows the case where step 201a is executed first and step 201b second.
Step 201a: define some special pages.
Step 201b: process the log data and convert the access paths into a tree.
The special pages defined in step 201a may specifically include operated campaign pages and pages with feedback.
Step 202: traverse the access path tree depth-first to obtain all access data streams.
Step 203: apply a rejection operation to these data streams by judging whether each data stream contains a special page.
If a stream does not involve any of the special pages defined earlier, the data stream is rejected and its processing ends. In practice, this operation can remove about 70% of the data as useless, shrinking the data set to be analyzed.
Step 204: for the remaining data streams, use a frequent mining algorithm to recursively compute the top M frequent subtrees (that is, data streams; M can be chosen as needed), obtaining the data streams and their occurrence frequencies.
Here, the data streams can be sorted and the M with the highest occurrence frequency selected as the top M frequent subtrees.
Step 205: calculate the number of times important pages occur in each data stream.
Step 206: calculate the length of each data stream.
Step 207: based on preliminary data research and analysis, determine the weight of each metric and derive a formula for calculating the data stream quality score.
Step 208: substitute the metrics and weights into the formula to obtain the quality score of each stream.
Step 209: analyze the design of each block of a page according to the quality score ranking.
Specifically, the analysis can be used to place precise and effective link content in each block of a page, guiding every visitor along the optimal access path, reducing the user loss rate, and raising the feedback rate.
For example, suppose an optimal data stream A → B → C is selected. A link to page A is placed in a prominent block of the website home page, a link to page B is placed in a prominent block of page A, and a link to page C is placed in a prominent block of page B. Each time a user opens a new page, the user can then find and click the desired link in the most prominent area.
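Step 209 and the link placement example can be sketched as follows. The scores, the entry page name "home", and both function names are hypothetical values chosen for illustration; the source does not prescribe a data structure for the guidance.

```python
def rank_streams(scored_streams):
    """Sort (stream, quality score) pairs from highest to lowest score."""
    return sorted(scored_streams, key=lambda item: item[1], reverse=True)


def featured_links(stream, entry_page="home"):
    """Map each host page to the page whose link should be featured on it."""
    hosts = [entry_page] + list(stream[:-1])
    return dict(zip(hosts, stream))


scored = [(["A", "C"], 1.4), (["A", "B", "C"], 2.6), (["D", "E"], 0.9)]
ranking = rank_streams(scored)
best_stream = ranking[0][0]
links = featured_links(best_stream)
```

For the best stream A → B → C, the sketch suggests featuring page A on the home page, page B on page A, and page C on page B, which mirrors the example in the text.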
In addition, in the above process, the quality scores of the data streams can first be calculated from the website log data of a certain time period, and then, at every preset time interval, from each batch of incremental log data.
For example, the first time, the full data of a certain time period is selected for analysis and calculation; subsequently, steps 202 to 209 can be applied to the visitor behavior data newly collected by the website each day. Further, steps 202 to 209 can also be run on newly added data every 5 minutes; the specific interval can be set as needed. The rejection in step 203 reduces the incremental data set by orders of magnitude, which makes real-time calculation feasible: each time, the same analysis is applied to the incremental data and the result is merged into the final result set, achieving real-time analysis.
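The incremental scheme can be illustrated with a small sketch in which each new batch is merged into the earlier result set without recomputing the full data. The class name is hypothetical, the batches are assumed to have already passed the rejection of step 203, and whole-stream counts stand in for the frequent mining result to keep the example short.

```python
from collections import Counter


class IncrementalStreamCounter:
    """Accumulates stream occurrence counts across successive batches."""

    def __init__(self):
        self.counts = Counter()

    def add_batch(self, streams):
        # Only the new batch is processed; earlier counts are kept as-is.
        self.counts.update(tuple(stream) for stream in streams)

    def top(self, m):
        return self.counts.most_common(m)


counter = IncrementalStreamCounter()
counter.add_batch([["A", "C", "F"], ["A", "B"]])  # initial full data set
counter.add_batch([["A", "C", "F"]])              # later increment
top1 = counter.top(1)
```

After the second batch, A → C → F has been seen twice, and the earlier batch did not need to be read again.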
In summary, the embodiment of Fig. 3 not only has the advantages of the embodiment of Fig. 1 but also improves on the prior art in the following ways:
First, existing website behavior data analysis all analyzes the preferences of a single user in order to recommend one or more specific products to a particular user; the mining algorithms take the perspective of products and seldom consider the overall design of page blocks. The work of website operators is likewise limited to operating a given page, seldom considering the flow from other associated pages to the operated page. This embodiment applies the analysis of website data streams to website UI design: it can find accurate data streams relatively quickly, guide website UI design, and guide the work of website operators.
Second, because of the large data volume, the prior art seldom involves real-time analysis. This embodiment classifies all of the website's raw data and rejects the data of no interest, so the usable website data is only a small part of all website data. New incremental data can then be added to the data set in real time (for example, every 5 minutes) and analyzed. Because preprocessing a batch of data reduces the data volume by orders of magnitude, and some applications care mostly about incremental data, the incremental data does not affect the earlier data set: new data is simply added on top of the earlier result set. The full data set need not be recomputed, so the amount of data to compute is small, which makes real-time data analysis possible.
Above-described embodiment is to illustrate as a example by website UI design, but also can be by net in concrete application The analysis of data of standing stream is applied to other aspects, and it is similar to the aforementioned embodiment that it implements principle, therefore no longer superfluous State.
It should be noted that for aforesaid embodiment of the method, in order to be briefly described, therefore it is all stated For a series of combination of actions, but those skilled in the art should know, the application is not by described The restriction of sequence of movement because according to the application, some step can use other orders or simultaneously Carry out.Secondly, those skilled in the art also should know, embodiment described in this description belongs to Preferred embodiment, necessary to involved action not necessarily the application.
Based on the description of the above method embodiments, the application also provides a corresponding embodiment of a website data analysis system.
Referring to Fig. 4, a structural diagram of a website data analysis system according to an embodiment of the application is shown.
The website data analysis system may specifically include the following modules:
Log analysis module 10, configured to obtain access data streams by analyzing website log data, wherein each access data stream records the order in which web pages were accessed;
Data culling module 20, configured to cull the access data streams that contain no important page, wherein an important page is a page that meets a predefined attribute;
Frequent mining module 30, configured to perform a frequent mining calculation on the remaining access data streams that contain important pages, obtaining the top m access data streams by occurrence frequency and the occurrence frequency of each of them, where m is a positive integer;
Added-index calculation module 40, configured to calculate, for the m access data streams, the number of times important pages occur in each data stream and the length of each data stream;
Quality score calculation module 50, configured to calculate the quality score of each of the m access data streams from its occurrence frequency, the number of times important pages occur in it, and its length.
The pages with the predefined attribute include: pages that generate feedback behavior; and/or activity pages that are currently being operated.
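The work of modules 20 to 40 might be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: counting identical page sequences with `collections.Counter` stands in for the frequent mining calculation, whose algorithm is not fixed in this passage, and the function name `top_m_streams`, the dictionary keys, and the sample page names are hypothetical.

```python
from collections import Counter


def top_m_streams(streams, important_pages, m):
    """Cull, count, and describe the m most frequent access data streams."""
    # Module 20: drop streams that contain no important page.
    kept = [tuple(s) for s in streams
            if any(page in important_pages for page in s)]
    # Module 30 (simplified): frequency of each identical page sequence.
    counts = Counter(kept)
    rows = []
    for stream, freq in counts.most_common(m):
        rows.append({
            "stream": stream,
            "frequency": freq,
            # Module 40: occurrences of important pages, and stream length.
            "quality": sum(page in important_pages for page in stream),
            "lenth": len(stream),  # spelled as in the formula of the text
        })
    return rows


logs = [["A", "B", "pay"], ["A", "B", "pay"], ["A", "C"], ["home", "pay"]]
print(top_m_streams(logs, {"pay"}, m=1))
```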
Preferably, the log analysis module 10 may specifically include the following submodules:
Extraction submodule, configured to extract access paths from the website log data by analyzing the website log data;
Conversion submodule, configured to convert the access paths into a tree structure, obtaining a tree of access paths;
Traversal submodule, configured to traverse the tree of access paths depth-first, obtaining the access data streams.
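A minimal sketch of the conversion and traversal submodules is given below, assuming that the access paths have already been extracted from the log as lists of page identifiers. The nested-dictionary tree and the function names `build_tree` and `depth_first_streams` are illustrative assumptions. In this sketch each root-to-leaf branch is treated as one access data stream, so a path that is a prefix of a longer path is absorbed into the longer branch.

```python
def build_tree(paths):
    """Merge access paths into a tree stored as nested dictionaries."""
    root = {}
    for path in paths:
        node = root
        for page in path:
            node = node.setdefault(page, {})
    return root


def depth_first_streams(tree):
    """Traverse the tree depth-first and return every root-to-leaf branch."""
    streams = []

    def visit(node, prefix):
        if not node:            # leaf reached: the prefix is a full branch
            if prefix:
                streams.append(prefix)
            return
        for page, child in node.items():
            visit(child, prefix + [page])

    visit(tree, [])
    return streams


paths = [["A", "B", "C"], ["A", "B", "D"], ["A", "E"]]
print(depth_first_streams(build_tree(paths)))
```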
The quality score of a data stream is directly proportional to the occurrence frequency of the data stream, directly proportional to the number of times important pages occur in the data stream, and inversely proportional to the length of the data stream.
In one embodiment, based on these direct and inverse proportionality relations, the quality score calculation module 50 may calculate according to the following formula:
S = a0 + (α·frequency(g) + β·quality(g)) / (γ·lenth(g));
where g represents a data stream;
frequency(g) represents the occurrence frequency of the data stream, and α is the influence factor parameter of frequency(g);
quality(g) represents the number of times important pages occur in the data stream, and β is the influence factor parameter of quality(g);
lenth(g) represents the length of the data stream, and γ is the influence factor parameter of lenth(g);
a0 represents the data availability parameter.
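The formula can be transcribed directly into code. In the sketch below the function name `quality_score` and the default parameter values (a0 = 0, α = β = γ = 1) are assumptions chosen only to make the example concrete; the text does not state how these parameters are set.

```python
def quality_score(frequency, quality, lenth,
                  a0=0.0, alpha=1.0, beta=1.0, gamma=1.0):
    """S = a0 + (alpha*frequency + beta*quality) / (gamma*lenth)."""
    if lenth <= 0 or gamma == 0:
        raise ValueError("lenth must be positive and gamma non-zero")
    return a0 + (alpha * frequency + beta * quality) / (gamma * lenth)


# A stream seen 10 times, containing 2 important pages, 3 pages long.
print(quality_score(frequency=10, quality=2, lenth=3))  # 4.0
```

With the other inputs held fixed, doubling `lenth` halves the fraction, which reflects the inverse relation between stream length and quality score stated above.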
Preferably, the above website data analysis system can analyze data in real time, specifically as follows:
the website log data within a certain time period is taken the first time for the data stream quality score calculation;
at each preset time interval thereafter, the incremental log data is taken for the data stream quality score calculation.
Preferably, in one embodiment, the website data analysis system can be applied to UI design, and the system may therefore also include the following module:
Sorting module 60, configured to rank the m access data streams by quality score and to analyze the design of each page block according to the quality score ranking.
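The ranking step can be sketched as follows, assuming the quality scores have already been calculated and are held in a dictionary keyed by data stream. The function name `rank_streams` and the sample scores are hypothetical.

```python
def rank_streams(scores):
    """Return (stream, score) pairs ordered from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


scores = {("A", "B", "C"): 4.0, ("A", "D"): 2.5, ("E", "C"): 3.1}
for position, (stream, score) in enumerate(rank_streams(scores), start=1):
    print(position, " → ".join(stream), score)
```

The highest-ranked streams are then the natural candidates for the most prominent page blocks.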
Because the above website data analysis system embodiment is basically similar to the method embodiments, its description is relatively brief; for related details, refer to the description of the method embodiments shown in Fig. 1 and Fig. 3.
In summary, the website data analysis system has the following advantages:
First, the application analyzes website data from the perspective of data streams, examining the behavior of all users of the whole website rather than the particular behavior of a few individuals. Further, the analysis of website data streams guides the design of the website UI and the work of website operators.
Second, the application labels and classifies the behavior data of users visiting the website and weeds out the meaningless part, which reduces the target data set by at least one order of magnitude and lightens the computation.
Third, the application adds two indices to the calculation of the data stream quality score: the length of the data stream and the number of times important pages occur in it. Relatively fast and accurate data streams can thus be found, which avoids the user loss caused by overly long data streams.
Finally, because the data volume drops by orders of magnitude after a batch of data is weeded out, and because some applications care more about incremental data, which has no effect on the previous data set and is merely added on top of the previous result set, the full data need not be recomputed. The amount of data to compute is therefore small, which enables real-time data analysis.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The term "and/or" above covers both an "and" relation and an "or" relation: if option A and option B are in an "and" relation, an embodiment may include both option A and option B; if option A and option B are in an "or" relation, an embodiment may include option A alone or option B alone.
Those skilled in the art should understand that embodiments of the application may be provided as a method, a system, or a computer program product. Therefore, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the application. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, as well as combinations of flows and/or blocks in the flowcharts and/or block diagrams. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The website data analysis method and analysis system provided by the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is intended only to help in understanding the method of the application and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims (11)

1. A website data analysis method, characterized by comprising:
obtaining access data streams by analyzing website log data, wherein each access data stream records the order in which web pages were accessed;
culling the access data streams that contain no important page, wherein an important page is a page that meets a predefined attribute;
performing a frequent mining calculation on the remaining access data streams that contain important pages, obtaining the top m access data streams by occurrence frequency and the occurrence frequency of each of them, where m is a positive integer;
for the m access data streams, calculating the number of times important pages occur in each data stream and the length of each data stream;
calculating the quality score of each of the m access data streams from its occurrence frequency, the number of times important pages occur in it, and its length.
2. The method according to claim 1, characterized by further comprising:
ranking the m access data streams by quality score; and
analyzing the design of each page block according to the quality score ranking.
3. The method according to claim 1, characterized in that obtaining access data streams by analyzing website log data comprises:
extracting access paths from the website log data by analyzing the website log data;
converting the access paths into a tree structure, obtaining a tree of access paths;
traversing the tree of access paths depth-first, obtaining the access data streams.
4. The method according to claim 1, characterized in that the pages with the predefined attribute include:
pages that generate feedback behavior;
and/or activity pages that are currently being operated.
5. The method according to claim 1, characterized in that:
the quality score of a data stream is directly proportional to the occurrence frequency of the data stream, directly proportional to the number of times important pages occur in the data stream, and inversely proportional to the length of the data stream.
6. The method according to claim 5, characterized in that calculating the quality score of each of the m access data streams comprises:
calculating according to the following formula:
S = a0 + (α·frequency(g) + β·quality(g)) / (γ·lenth(g));
where g represents a data stream;
frequency(g) represents the occurrence frequency of the data stream, and α is the influence factor parameter of frequency(g);
quality(g) represents the number of times important pages occur in the data stream, and β is the influence factor parameter of quality(g);
lenth(g) represents the length of the data stream, and γ is the influence factor parameter of lenth(g);
a0 represents the data availability parameter.
7. The method according to claim 1, characterized in that:
the website log data within a certain time period is taken the first time for the data stream quality score calculation;
at each preset time interval thereafter, the incremental log data is taken for the data stream quality score calculation.
8. A website data analysis system, characterized by comprising:
a log analysis module, configured to obtain access data streams by analyzing website log data, wherein each access data stream records the order in which web pages were accessed;
a data culling module, configured to cull the access data streams that contain no important page, wherein an important page is a page that meets a predefined attribute;
a frequent mining module, configured to perform a frequent mining calculation on the remaining access data streams that contain important pages, obtaining the top m access data streams by occurrence frequency and the occurrence frequency of each of them, where m is a positive integer;
an added-index calculation module, configured to calculate, for the m access data streams, the number of times important pages occur in each data stream and the length of each data stream;
a quality score calculation module, configured to calculate the quality score of each of the m access data streams from its occurrence frequency, the number of times important pages occur in it, and its length.
9. The system according to claim 8, characterized by further comprising:
a sorting module, configured to rank the m access data streams by quality score and to analyze the design of each page block according to the quality score ranking.
10. The system according to claim 8, characterized in that the log analysis module comprises:
an extraction submodule, configured to extract access paths from the website log data by analyzing the website log data;
a conversion submodule, configured to convert the access paths into a tree structure, obtaining a tree of access paths;
a traversal submodule, configured to traverse the tree of access paths depth-first, obtaining the access data streams.
11. The system according to claim 8, characterized in that
the quality score calculation module calculates according to the following formula:
S = a0 + (α·frequency(g) + β·quality(g)) / (γ·lenth(g));
where g represents a data stream;
frequency(g) represents the occurrence frequency of the data stream, and α is the influence factor parameter of frequency(g);
quality(g) represents the number of times important pages occur in the data stream, and β is the influence factor parameter of quality(g);
lenth(g) represents the length of the data stream, and γ is the influence factor parameter of lenth(g);
a0 represents the data availability parameter.
CN201210151293.4A 2012-05-15 2012-05-15 Website data analysis method and analysis system Active CN103425661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210151293.4A CN103425661B (en) 2012-05-15 2012-05-15 Website data analysis method and analysis system

Publications (2)

Publication Number Publication Date
CN103425661A (en) 2013-12-04
CN103425661B (en) 2016-10-05

Family

ID=49650419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210151293.4A Active CN103425661B (en) 2012-05-15 2012-05-15 Website data analysis method and analysis system

Country Status (1)

Country Link
CN (1) CN103425661B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484717B (en) * 2015-08-27 2019-12-10 北京国双科技有限公司 Data profiling method and device for path navigation
CN108121749A (en) * 2016-11-30 2018-06-05 北京国双科技有限公司 Website user's behavior analysis method and device
CN108241704B (en) * 2016-12-26 2021-09-17 北京国双科技有限公司 Data processing method and device
CN110020074B (en) * 2017-10-13 2021-04-23 北京国双科技有限公司 Method and device for determining webpage loss rate
CN108900520B (en) * 2018-07-11 2021-04-20 广州虎牙信息科技有限公司 Live broadcast card pause factor determination method and device, server and storage medium
CN111611508B (en) * 2020-05-28 2020-12-15 江苏易安联网络技术有限公司 Identification method and device for actual website access of user
CN113692014B (en) * 2021-08-30 2023-10-27 中国平安人寿保险股份有限公司 APP flow analysis method, apparatus, computer device and storage medium
CN116775148B (en) * 2023-06-19 2024-02-09 深圳市秦丝科技有限公司 Small program optimization management system and method based on data analysis technology

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184230A (en) * 2011-05-11 2011-09-14 北京百度网讯科技有限公司 Method and device for displaying search results
CN102306171A (en) * 2011-08-22 2012-01-04 百度在线网络技术(北京)有限公司 Method and equipment for providing network access suggestions and network search suggestions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2579691A1 (en) * 2004-09-16 2006-03-30 Telenor Asa A method, system, and computer program product for searching for, navigating among, and ranking of documents in a personal web

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant