Summary of the Invention
The present invention aims to solve at least one of the technical problems present in the prior art or related art.
To this end, the invention provides a data analysis method comprising the following steps:
Step 1, read the data sets stored in a database;
Step 2, calculate a first threshold value S1 using an item set probability method;
Step 3, calculate a second threshold value S2 from the first threshold value S1 and an item attribute;
Step 4, extract the data sets in the database, calculate the specific value of each item data set, and judge whether that specific value is greater than or equal to the first threshold value S1; if it is, proceed to step 5; if it is less than the first threshold value S1, proceed to step 6;
Step 5, calculate association rules using a preset first rule algorithm;
Step 6, judge whether the specific value of the item data set is greater than or equal to the second threshold value S2; if it is, proceed to step 7; if it is less than the second threshold value S2, proceed to step 8;
Step 7, calculate association rules using a preset second rule algorithm;
Step 8, delete the item data sets whose specific value is less than the second threshold value S2;
Repeat steps 4-8 until all item data sets in the database have been processed;
Display the calculated association rule results.
More specifically, step 2 comprises:
Step 21, traverse the database and count the probability of occurrence of each item set in the database;
Step 22, according to a first percentage value preset by the user, find the probability values of the two item sets closest to the first percentage value;
Step 23, calculate the difference between the probabilities of the two item sets;
Step 24, add the first percentage value preset by the user to the probability difference of the two item sets; the result is the first threshold value S1.
More specifically, step 3 comprises:
The second threshold value S2 is calculated by the following formula,
where β is the permitted range of reduction set by the user and q(t) is the item attribute, which is calculated by the following formula,
where t is the profit of each item, b and c are values preset by the user, and b^2 + c^2 ≠ 0.
Preferably, the value of β is 10%-50%, and the values of b and c satisfy 5 < b < 50 and 10 < c < 80.
Preferably, the first rule algorithm is the Apriori algorithm.
More specifically, step 7 comprises:
Step 71, calculate the relative support of all data sets in the database;
Step 72, for each item data set whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2, judge whether its specific value is greater than or equal to the relative support; if it is, proceed to step 74; if it is less than the relative support, proceed to step 73;
Step 73, delete the item data set whose specific value is less than the relative support;
Step 74, repeat steps 72-73 until all item data sets in the database whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2 have been processed.
More specifically, the relative support is calculated by the following formula:
RSup(d_1, d_2, ..., d_k) = max( Sup(d_1, d_2, ..., d_k) / Sup(d_1),
                                Sup(d_1, d_2, ..., d_k) / Sup(d_2), ...,
                                Sup(d_1, d_2, ..., d_k) / Sup(d_k) )
where RSup(d_1, d_2, ..., d_k) is the relative support and Sup(d_k) is the support of d_k in the data set D = (d_1, d_2, ..., d_k).
By designing a new data analysis method, the present invention improves how threshold values are set. It avoids the drawback that, during data analysis, candidate sets formed by a certain class of factors are filtered out by an excessively high threshold value, so that the resulting association rules are biased toward a certain class of factors. The overall analysis result is therefore more comprehensive and complete, and the data analysis is more efficient and objective.
Detailed Description of the Embodiments
Association rule algorithms have been widely studied, and most of them either improve on the Apriori algorithm or propose new ideas. However, these algorithms require a threshold value, which at present the designer determines from personal experience. To improve how the threshold value is set, and to avoid candidate sets formed by a certain class of factors being filtered out by an excessively high threshold value, which biases the resulting association rules toward a certain class of factors, a new data analysis method is proposed that performs the analysis with two associated threshold values.
Because data on the Internet today is sparse and distributed, the frequency with which some data occur in a database does not meet the minimum support given by the analysing user; yet among their occurrences, a very high proportion coincide with some specific data. Such data are valuable low-volume data. Existing data analysis methods rely on a threshold value that the user sets from experience, and analysis with such a threshold value may filter out some valuable low-volume data. The method of the present invention avoids filtering out valuable low-volume data, which makes the analysis comprehensive and objective and improves its accuracy.
To make the above objects, features and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided they do not conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. However, the invention may also be implemented in ways other than those described here, and its scope of protection is therefore not limited by the specific embodiments disclosed below.
Let D = {d_1, d_2, ..., d_m} be the set of data items, let DB be the database relevant to the task, and let T denote a subset of the data items, i.e. T ⊆ D. Let A be a set of data items; T is said to contain A if and only if A ⊆ T. U_p is the data set specified by the user, or specific-value data set; for u ∈ U_p, u is a specific value.
An association rule is defined as A ⇒ B, where A ⊂ D, B ⊂ D and A ∩ B = ∅.
If a proportion s of the records in the database contain A ∪ B, the support of the association rule A ⇒ B is defined as Sup = s, where Sup denotes support.
Valuable low-volume data is denoted RDvalue and defined as follows: for d ∈ DB with Sup(d) < Min_Sup, where Min_Sup is the minimum support and P(d) is the probability of occurrence of d, if P(d) ≈ P(u), then d is called valuable low-volume data.
The specific value used when analysing majority item sets is the first threshold value S1, and the specific value used when analysing low-volume item sets is the second threshold value S2, where S1 > S2.
Fig. 1 shows a flow chart of the data analysis method according to an embodiment of the invention.
As shown in Fig. 1, a data analysis method according to the present invention includes: step 1, read the data sets stored in a database; step 2, calculate a first threshold value S1 using an item set probability method; step 3, calculate a second threshold value S2 from the first threshold value S1 and an item attribute; step 4, extract the data sets in the database, calculate the specific value of each item data set, and judge whether that specific value is greater than or equal to the first threshold value S1; if it is, proceed to step 5; if it is less than the first threshold value S1, proceed to step 6; step 5, calculate association rules using a preset first rule algorithm; step 6, judge whether the specific value of the item data set is greater than or equal to the second threshold value S2; if it is, proceed to step 7; if it is less than the second threshold value S2, proceed to step 8; step 7, calculate association rules using a preset second rule algorithm; step 8, delete the item data sets whose specific value is less than the second threshold value S2; repeat steps 4-8 until all item data sets in the database have been processed; display the calculated association rule results.
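The branching in steps 4 to 8 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the function name `route_item_sets`, the dictionary of precomputed specific values, and the example names are all hypothetical, and the first and second rule algorithms themselves are not shown.

```python
def route_item_sets(specific_values, s1, s2):
    """Split item data sets by the two thresholds (steps 4, 6 and 8).

    specific_values: hypothetical mapping {item set name: specific value}.
    Returns three lists: names for the first rule algorithm (step 5),
    names for the second rule algorithm (step 7), and names to delete (step 8).
    """
    if not s1 > s2:
        raise ValueError("the method requires S1 > S2")
    first_rule, second_rule, deleted = [], [], []
    for name, value in specific_values.items():
        if value >= s1:        # step 4: at or above S1
            first_rule.append(name)
        elif value >= s2:      # step 6: below S1 but at or above S2
            second_rule.append(name)
        else:                  # step 8: below S2
            deleted.append(name)
    return first_rule, second_rule, deleted


# Made-up specific values, for illustration only.
example = {"A": 0.70, "B": 0.45, "C": 0.10}
first, second, dropped = route_item_sets(example, s1=0.63, s2=0.40)
```

With these made-up values, "A" is routed to the first rule algorithm, "B" to the second rule algorithm, and "C" is deleted.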
The database stores data sets of various items, and the form of the sets is defined by the user. Preferably, the data sets are stored as lists or as arrays, and the dimensions of the sets are all preset by the user.
The specific value of a data set is calculated with conventional algorithms in the art, which are not described again here.
The first threshold value in step 2 is calculated by the following steps.
Fig. 2 shows a schematic flow chart for calculating the first threshold value.
As shown in Fig. 2, the first threshold value is calculated as follows:
Step 21, traverse the database and count the probability of occurrence of each item set in the database;
Step 22, according to a first percentage value preset by the user, find the probability values of the two item sets closest to the first percentage value;
Step 23, calculate the difference between the probabilities of the two item sets;
Step 24, add the first percentage value preset by the user to the probability difference of the two item sets; the result is the first threshold value S1.
The probability of occurrence of each item set in the database is counted in the manner commonly used in the art; since this belongs to the prior art, it is not described again here. Calculating the first threshold value in this way increases the influence of valuable rare data: such data are not filtered out by a fixed, preset threshold value, which would otherwise distort the data analysis result.
For example, suppose the user sets the first percentage value to 60%, and the probability values of the two closest item sets in the data are found to be 58% and 61%. The difference between the two probabilities is 3%, so the resulting first threshold value S1 is 60% + 3% = 63%.
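The arithmetic of steps 22 to 24 can be sketched in Python as follows. This is a minimal sketch, not the implementation of the invention: the function name `first_threshold` is hypothetical, probabilities are assumed to be fractions between 0 and 1, and "closest" is assumed to mean the smallest absolute distance from the preset percentage value.

```python
def first_threshold(probabilities, first_percent):
    """Compute S1 from item set probabilities (steps 22-24).

    probabilities: occurrence probabilities of the item sets (step 21 output).
    first_percent: the first percentage value preset by the user.
    """
    if len(probabilities) < 2:
        raise ValueError("at least two item set probabilities are needed")
    # Step 22: the two probabilities nearest the preset value
    # (assumption: nearest by absolute distance).
    nearest = sorted(probabilities, key=lambda p: abs(p - first_percent))[:2]
    # Step 23: difference between the two probabilities.
    difference = abs(nearest[0] - nearest[1])
    # Step 24: preset value plus the difference gives S1.
    return first_percent + difference


# The example from the text: preset 60%, nearest probabilities 58% and 61%.
# The other two probabilities are made up to show the selection.
s1 = first_threshold([0.20, 0.58, 0.61, 0.90], 0.60)
```

Here `s1` comes out at approximately 0.63, matching the 60% + 3% = 63% example up to floating-point rounding.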
The second threshold value is calculated as follows:
The second threshold value S2 is calculated by the following formula,
where β is the permitted range of reduction set by the user and q(t) is the item attribute, which is calculated by the following formula,
where t is the profit of each item, b and c are values preset by the user, and b^2 + c^2 ≠ 0. Here S1 > S2.
When the profit is high, the weight is higher and the calculated S2 is lower; when the profit is low, the weight is lower and the calculated S2 is higher. A lower S2 gives an item set a better chance of yielding association rules, so association rules involving higher-profit items are much more likely to be found.
More preferably, the value of β is 10%-50%, and the values of b and c satisfy 5 < b < 50 and 10 < c < 80. These ranges for b and c are empirical values obtained by analysing a large amount of data. With values in these ranges, the resulting S2 better meets the needs of data analysis, and valuable rare data are not lost because the value is too large.
The first rule algorithm is the Apriori algorithm.
The Apriori algorithm is one of the most influential algorithms for mining the frequent item sets of Boolean association rules. Its core is a recursive algorithm based on the idea of two-phase frequent item sets. In terms of classification, the resulting association rules are single-dimensional, single-level Boolean association rules. The Apriori algorithm is a conventional algorithm in the art and is not described again here.
The second rule algorithm in step 7 is carried out by the following steps.
Fig. 3 shows a schematic flow chart of the second rule algorithm.
As shown in Fig. 3, the steps of the second rule algorithm are as follows:
Step 71, calculate the relative support of all data sets in the database;
Step 72, for each item data set whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2, judge whether its specific value is greater than or equal to the relative support; if it is, proceed to step 74; if it is less than the relative support, proceed to step 73;
Step 73, delete the item data set whose specific value is less than the relative support;
Step 74, repeat steps 72-73 until all item data sets in the database whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2 have been processed.
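The filtering loop of steps 72 to 74 can be sketched in Python as follows. This is a sketch under assumptions rather than the implementation of the invention: the function name `second_rule_filter` and both input mappings are hypothetical, and the relative supports of step 71 are assumed to have been computed beforehand.

```python
def second_rule_filter(specific_values, relative_supports, s1, s2):
    """Keep or delete the item data sets that fall between S2 and S1.

    specific_values:   hypothetical mapping {item set name: specific value}.
    relative_supports: hypothetical mapping {item set name: relative support},
                       assumed to be the output of step 71.
    Returns (kept, removed) lists of item set names.
    """
    kept, removed = [], []
    for name, value in specific_values.items():
        # Only item sets with S2 <= value < S1 are handled here.
        if not (s2 <= value < s1):
            continue
        if value >= relative_supports[name]:
            kept.append(name)      # step 72: passes the comparison
        else:
            removed.append(name)   # step 73: below the relative support
    return kept, removed


# Made-up numbers, for illustration only.
values = {"A": 0.70, "B": 0.50, "C": 0.45, "D": 0.10}
rsups = {"A": 0.90, "B": 0.40, "C": 0.80, "D": 0.20}
kept, removed = second_rule_filter(values, rsups, s1=0.63, s2=0.40)
```

In this made-up example only "B" and "C" lie between the two thresholds; "B" is kept and "C" is deleted, while "A" and "D" are left to the other branches of the method.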
This algorithm is mainly used to find association rules that involve items with low support but high profit.
The relative support is calculated by the following formula:
RSup(d_1, d_2, ..., d_k) = max( Sup(d_1, d_2, ..., d_k) / Sup(d_1),
                                Sup(d_1, d_2, ..., d_k) / Sup(d_2), ...,
                                Sup(d_1, d_2, ..., d_k) / Sup(d_k) )
where RSup(d_1, d_2, ..., d_k) is the relative support and Sup(d_k) is the support of d_k in the data set D = (d_1, d_2, ..., d_k). Here d_k denotes an item set.
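The relative support formula can be illustrated with a short Python sketch. It rests on assumptions that the text does not state: the function name `relative_support` is hypothetical, each record is modelled as a Python set of items, each d_i is treated as a single item, and support is taken to be the fraction of records containing the given items.

```python
def relative_support(records, items):
    """RSup(d_1..d_k) = max_i Sup(d_1..d_k) / Sup(d_i).

    records: list of sets, one per database record (assumed representation).
    items:   the set {d_1, ..., d_k} whose relative support is wanted.
    """
    total = len(records)
    if total == 0 or not items:
        raise ValueError("records and items must be non-empty")

    def sup(subset):
        # Fraction of records that contain every element of `subset`.
        return sum(1 for record in records if subset <= record) / total

    joint = sup(set(items))
    if joint == 0:
        return 0.0  # the items never occur together
    # joint > 0 guarantees that every single-item support is positive.
    return max(joint / sup({d}) for d in items)


# Made-up records: "b" is rare but always appears together with "a".
records = [{"a", "b"}, {"a"}, {"a"}, {"a", "b"}, {"c"}]
rsup = relative_support(records, {"a", "b"})
```

In this made-up data Sup(a, b) = 0.4, Sup(a) = 0.8 and Sup(b) = 0.4, so the ratios are 0.5 and 1.0 and `rsup` is 1.0. A rare item that nearly always co-occurs with another item therefore receives a high relative support even though its plain support is low.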
The database stores in advance the data sets of N items, where N is greater than or equal to 2.
The method of the present invention avoids filtering out valuable low-volume data and so makes the analysis comprehensive and objective. Because the association rules found by this method include the higher-profit items, the analysis as a whole is objective and comprehensive. For example, the back-end storage of a shopping website holds sets of goods in various categories, recording information such as the profit and sales volume of each category. Some categories have a large profit, but because the user has set a threshold value, such a category may well end up being filtered out, leaving the data analysis incomplete. The method of the present invention avoids the incompleteness caused by the threshold setting and makes it easier to find the relationships within the data sets.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.