CN107092668A

CN107092668A - A data analysis method

Info

Publication number: CN107092668A
Application number: CN201710227549.8A
Authority: CN
Inventors: 潘碧涛; 曾刚
Original assignee: Guangzhou Petroleum Internet Financial Information Service Co Ltd
Current assignee: Hongmai information technology (Guangzhou) Co., Ltd
Priority date: 2017-04-10
Filing date: 2017-04-10
Publication date: 2017-08-25
Anticipated expiration: 2037-04-10
Also published as: CN107092668B

Abstract

The invention provides a data analysis method that uses two associated threshold values and applies a different data analysis algorithm under each threshold condition. Existing data analysis methods rely on a threshold that the user sets from experience, and analysis with such a threshold may filter out valuable low-volume data. The method of the present invention avoids filtering out such data, which makes the analysis more comprehensive and objective and improves its accuracy.

Description

A data analysis method

Technical field

The present invention relates to the field of data processing and analysis, and in particular to a data analysis method.

Background art

The term big data is often used to describe the massive volume of information generated in the age of information explosion. The significance of studying big data lies in discovering and understanding the content of information and the relationships between pieces of information.

Most Internet companies currently store user data in their back ends as user big data. By analyzing this data they learn about user behavior, so that they can recommend web pages or products of interest, adjust page layouts to match user preferences, or predict user behavior and provide better service. This brings more users to the company and benefits its development.

Internet companies compete on personalized service: by analyzing user data appropriately, a company can effectively retain users and prevent churn. Many data analysis algorithms exist, but most are incremental improvements on mature algorithms and do not suit every scenario. How to design an analysis algorithm that improves analysis efficiency and handles valuable rare data is a problem that urgently needs to be solved.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art or the related art.

To this end, the invention provides a data analysis method comprising the following steps:

Step 1: read the data sets stored in a database;

Step 2: calculate a first threshold value S1 using an item-set probability method;

Step 3: calculate a second threshold value S2 from the first threshold value S1 and an item attribute;

Step 4: extract a data set from the database, calculate the specific value of each item data set in it, and judge whether that specific value is greater than or equal to the first threshold value S1; if it is, go to step 5; if it is less than S1, go to step 6;

Step 5: calculate association rules using a preset first rule algorithm;

Step 6: judge whether the specific value of the item data set is greater than or equal to the second threshold value S2; if it is, go to step 7; if it is less than S2, go to step 8;

Step 7: calculate association rules using a preset second rule algorithm;

Step 8: delete the item data sets whose specific value is less than the second threshold value;

Repeat steps 4-8 until all item data sets in the database have been processed;

Display the calculated association rules.

More specifically, step 2 comprises:

Step 21: traverse the database and count the probability of occurrence of each item set in the database;

Step 22: given a first percentage value preset by the user, find the probability values of the two item sets closest to that percentage value;

Step 23: calculate the difference between the probabilities of the two item sets;

Step 24: add the user-preset first percentage value and the probability difference of the two item sets; the result is the first threshold value S1.

More specifically, step 3 comprises:

The second threshold value is calculated by the following formula,

S2 = S1 - β × q(t),

where β is the user-set range of allowed reduction and q(t) is the item attribute, which is calculated by the following formula,

where t is the profit of each item, b and c are user-preset values, and b² + c² ≠ 0.

Preferably, β takes a value of 10%-50%, and b and c satisfy 5 < b < 50 and 10 < c < 80.

Preferably, the first rule algorithm is the Apriori algorithm.

More specifically, step 7 comprises:

Step 71: calculate the related support of all data sets in the database;

Step 72: for each item data set whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2, judge whether its specific value is greater than or equal to the related support; if it is, go to step 74; if it is less than the related support, go to step 73;

Step 73: delete the item data sets whose specific value is less than the related support;

Step 74: repeat steps 72-73 until all item data sets in the database that are less than the first threshold value S1 and greater than or equal to the second threshold value S2 have been processed.

More specifically, the related support is calculated as:

RSup(d₁, d₂, ..., d_k) = max(Sup(d₁, d₂, ..., d_k)/Sup(d₁), Sup(d₁, d₂, ..., d_k)/Sup(d₂), ..., Sup(d₁, d₂, ..., d_k)/Sup(d_k))

where RSup(d₁, d₂, ..., d_k) is the related support and Sup(d_k) is the support of d_k in the data set D = (d₁, d₂, ..., d_k).

By designing a new data analysis method, the present invention improves how thresholds are used. It avoids the drawback that candidate sets formed by a certain class of factors are filtered out by an overly high threshold during data analysis, which biases the resulting association rules toward certain factors. The overall analysis results become more comprehensive and complete, and the data analysis becomes more efficient and objective.

Brief description of the drawings

Fig. 1 shows a flow chart of a data analysis method according to the present invention;

Fig. 2 shows a schematic flow chart of calculating the first threshold value;

Fig. 3 shows a schematic flow chart of the second rule algorithm.

Embodiment

Association rule algorithms have been studied extensively, and most of them are based on the Apriori algorithm, either improving it or proposing new ideas. However, the thresholds these algorithms require are currently determined by the designer from personal experience. To improve how thresholds are set, and to avoid candidate sets formed by a certain class of factors being filtered out by an overly high threshold so that the resulting association rules are biased toward certain factors, a new data analysis method is proposed that uses two associated threshold values for the analysis.

Because data on the Internet is sparse and distributed, some data appears in the database with a frequency below the minimum support given by the analyst, yet a very high proportion of its occurrences coincide with some particular data item. Such data is valuable low-volume data. Existing data analysis methods rely on a threshold that the user sets from experience, and analysis with such a threshold may filter out valuable low-volume data. The method of the present invention avoids filtering out such data, which makes the analysis more comprehensive and objective and improves its accuracy.

So that the above objects, features and advantages of the present invention can be understood more clearly, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.

Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. However, the invention may also be implemented in other ways different from those described here, and the scope of protection of the invention is therefore not limited by the specific embodiments disclosed below.

Let D = {d₁, d₂, ..., d_m} be the set of data items, let DB be the database relevant to the task, and let T denote a subset of data items, i.e. T ⊆ D. If A is a set of data items, T is said to contain A if and only if A ⊆ T. U_p is a user-specified data set or specific-value data set, u ∈ U_p, and u is a specific value.

An association rule is defined as A ⇒ B, where A ⊂ D, B ⊂ D and A ∩ B = ∅.

If a fraction s of the records in the database contain A ∪ B, the support of the association rule A ⇒ B is defined as Sup = s, where Sup denotes support.

Valuable low-volume data is denoted RD_value. For d ∈ DB with Sup(d) < Min_Sup, where Min_Sup is the minimum support and P(d) is the probability of occurrence of d, d is called valuable low-volume data if P(d) ≈ P(u).

The specific value specified by the user for analyzing the majority item sets is the first threshold value S1, and the specific value specified by the user for analyzing the low-volume item sets is the second threshold value S2, where S1 > S2.

Fig. 1 shows a flow chart of a data analysis method according to an embodiment of the invention.

As shown in Fig. 1, a data analysis method according to the present invention comprises: step 1, reading the data sets stored in a database; step 2, calculating a first threshold value S1 using an item-set probability method; step 3, calculating a second threshold value S2 from the first threshold value S1 and an item attribute; step 4, extracting a data set from the database, calculating the specific value of each item data set in it, and judging whether that specific value is greater than or equal to S1, proceeding to step 5 if it is and to step 6 if it is less than S1; step 5, calculating association rules using a preset first rule algorithm; step 6, judging whether the specific value of the item data set is greater than or equal to S2, proceeding to step 7 if it is and to step 8 if it is less than S2; step 7, calculating association rules using a preset second rule algorithm; step 8, deleting the item data sets whose specific value is less than the second threshold value; repeating steps 4-8 until all item data sets in the database have been processed; and displaying the calculated association rules.
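The branching in steps 4 to 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `route_item_sets`, the dictionary of precomputed specific values, and the example numbers are all introduced here for illustration, and the rule algorithms of steps 5 and 7 are deliberately left out.

```python
from typing import Dict, FrozenSet, List, Tuple

ItemSet = FrozenSet[str]


def route_item_sets(
    values: Dict[ItemSet, float], s1: float, s2: float
) -> Tuple[List[ItemSet], List[ItemSet], List[ItemSet]]:
    """Split item sets into the three branches described in steps 4-8.

    `values` maps each item set to its already-computed specific value
    (hypothetical input; the text does not say how it is stored).
    """
    if not s1 > s2:
        raise ValueError("the method assumes S1 > S2")
    first_rule: List[ItemSet] = []   # value >= S1: handled in step 5
    second_rule: List[ItemSet] = []  # S2 <= value < S1: handled in step 7
    deleted: List[ItemSet] = []      # value < S2: removed in step 8
    for item_set, value in values.items():
        if value >= s1:
            first_rule.append(item_set)
        elif value >= s2:
            second_rule.append(item_set)
        else:
            deleted.append(item_set)
    return first_rule, second_rule, deleted


# Illustrative values only.
example = {
    frozenset({"a"}): 0.70,
    frozenset({"b"}): 0.50,
    frozenset({"c"}): 0.10,
}
first, second, removed = route_item_sets(example, s1=0.63, s2=0.40)
```

With these assumed numbers, item set `a` goes to the first rule algorithm, `b` to the second, and `c` is deleted.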

The database stores data sets of various items, in a form set by the user. Preferably, the data sets are stored as lists or as arrays, and the dimensions of the sets are preset by the user.

Calculating the specific value of a data set is a common algorithm in the art and is not described in detail here.

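The first threshold value in step 2 is calculated by the following steps.

Fig. 2 shows a schematic flow chart of calculating the first threshold value.

As shown in Fig. 2, the first threshold value is calculated as follows:

Step 21: traverse the database and count the probability of occurrence of each item set in the database;

Step 22: given a first percentage value preset by the user, find the probability values of the two item sets closest to that percentage value;

Step 23: calculate the difference between the probabilities of the two item sets;

Step 24: add the user-preset first percentage value and the probability difference of the two item sets; the result is the first threshold value S1.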

The probability of occurrence of each item set in the database is computed in the manner commonly used in the art; as this belongs to the prior art, it is not described further here. Calculating the first threshold value in this way increases the influence of valuable rare data: such data is not filtered out by a fixed, preset threshold, which would otherwise distort the analysis results.

For example, suppose the user sets the first percentage value to 60%, and the two item sets in the data set whose probabilities are closest to it have probability values of 58% and 61%. The difference between the two probabilities is 3%, so the resulting first threshold value S1 is 60% + 3% = 63%.
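The arithmetic of this example can be reproduced with a short sketch. It assumes that "closest" means the two probabilities with the smallest absolute distance to the preset percentage; the function name `first_threshold` and the extra probabilities 30% and 90% are hypothetical. Exact fractions are used so that the result is not blurred by floating-point rounding.

```python
from fractions import Fraction
from typing import Iterable


def first_threshold(probabilities: Iterable[Fraction], preset: Fraction) -> Fraction:
    """S1 = preset percentage + |difference of the two closest probabilities|."""
    ranked = sorted(probabilities, key=lambda p: abs(p - preset))
    if len(ranked) < 2:
        raise ValueError("at least two item-set probabilities are needed")
    closest, second_closest = ranked[0], ranked[1]
    return preset + abs(closest - second_closest)


# Item-set probabilities; 58% and 61% come from the example in the text,
# the other two values are made up to show that they are ignored.
probs = [Fraction(58, 100), Fraction(61, 100), Fraction(30, 100), Fraction(90, 100)]
s1 = first_threshold(probs, Fraction(60, 100))
print(s1)  # prints 63/100
```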

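The second threshold value is calculated as follows:

The second threshold value is calculated by the following formula,

S2 = S1 - β × q(t),

where β is the user-set range of allowed reduction and q(t) is the item attribute, which is calculated by the following formula,

where t is the profit of each item, b and c are user-preset values, and b² + c² ≠ 0. Here S1 > S2.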

When the profit of an item is high, its weight is higher and the calculated S2 is lower; when the profit is low, its weight is lower and the calculated S2 is higher. Association rules involving high-profit items therefore have a greater chance of being found, which greatly improves the analysis of association rules for higher-profit items.

More preferably, β takes a value of 10%-50%, and b and c satisfy 5 < b < 50 and 10 < c < 80. These values of b and c are empirical values obtained by analyzing a large amount of data. With them, the resulting S2 better meets the needs of data analysis, and valuable rare data is not lost because the value is too large.
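The effect described here can be illustrated with the relation S2 = S1 - β × q(t) stated in claim 3. The sketch below takes q(t) as an already-computed number, because the formula for q(t) is not reproduced in this text; the function name `second_threshold` and the q(t) values 0.25 and 0.90 are purely hypothetical.

```python
def second_threshold(s1: float, beta: float, q_t: float) -> float:
    """Return S2 = S1 - beta * q(t), checking that S1 > S2 still holds.

    q_t is passed in as a plain number; how it is derived from the
    profit t and the presets b and c is not shown in this sketch.
    """
    s2 = s1 - beta * q_t
    if not s2 < s1:
        raise ValueError("S2 must be below S1; check beta and q(t)")
    return s2


# Same S1 and beta, two assumed attribute values.
s2_low_profit = second_threshold(s1=0.63, beta=0.20, q_t=0.25)   # about 0.58
s2_high_profit = second_threshold(s1=0.63, beta=0.20, q_t=0.90)  # about 0.45
```

A larger q(t) lowers S2, so fewer item sets involving the high-profit item fall below the second threshold and get deleted.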

The first rule algorithm is the Apriori algorithm.

The Apriori algorithm is one of the most influential algorithms for mining frequent item sets for Boolean association rules. Its core is a recursive algorithm based on the two-stage frequent item set idea. In terms of classification, the association rules it produces are single-dimensional, single-level, Boolean association rules. The Apriori algorithm is a conventional algorithm in the art and is not described in detail here.
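For readers unfamiliar with it, the frequent item set stage of Apriori can be sketched in a few lines. This is a textbook-style illustration, not code from the patent: the function name `apriori`, the transaction data, and the minimum support of 0.5 are assumptions, and rule generation from the frequent item sets is omitted.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(
    transactions: Iterable[Iterable[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return every frequent item set with its support (fraction of transactions)."""
    tx: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    if not tx:
        return {}
    n = len(tx)

    def support(item_set: FrozenSet[str]) -> float:
        return sum(1 for t in tx if item_set <= t) / n

    # Level 1 candidates: every single item that occurs somewhere.
    candidates: Set[FrozenSet[str]] = {frozenset([i]) for t in tx for i in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while candidates:
        # Keep only the candidates that reach the minimum support.
        level = {}
        for cand in candidates:
            s = support(cand)
            if s >= min_support:
                level[cand] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-sets into k-sets.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=0.5)
```

With these assumed baskets, `{milk, butter}` occurs in only one of four transactions and is dropped, which is the single-threshold filtering behaviour that the two-threshold method is meant to soften.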

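The second rule algorithm in step 7 is carried out by the following steps.

Fig. 3 shows a schematic flow chart of the second rule algorithm.

As shown in Fig. 3, the steps of the second rule algorithm are as follows:

Step 71: calculate the related support of all data sets in the database;

Step 72: for each item data set whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2, judge whether its specific value is greater than or equal to the related support; if it is, go to step 74; if it is less than the related support, go to step 73;

Step 73: delete the item data sets whose specific value is less than the related support;

Step 74: repeat steps 72-73 until all item data sets in the database that are less than the first threshold value S1 and greater than or equal to the second threshold value S2 have been processed.

This algorithm is mainly used to analyze association rules whose support is low but whose item profit is high.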

The related support is calculated as:

RSup(d₁, d₂, ..., d_k) = max(Sup(d₁, d₂, ..., d_k)/Sup(d₁), Sup(d₁, d₂, ..., d_k)/Sup(d₂), ..., Sup(d₁, d₂, ..., d_k)/Sup(d_k))

where RSup(d₁, d₂, ..., d_k) is the related support, Sup(d_k) is the support of d_k in the data set D = (d₁, d₂, ..., d_k), and d_k denotes an item set.
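A small sketch shows how the related support behaves on low-volume data. It treats each d_i as a single item and computes the ratios from occurrence counts, which is equivalent to dividing supports because the number of transactions cancels. The function name `related_support` and the shopping data are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List


def related_support(item_set: Iterable[str], transactions: Iterable[Iterable[str]]) -> float:
    """RSup = max over items d_i of Sup(item_set) / Sup(d_i)."""
    tx: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    items = frozenset(item_set)
    if not items:
        raise ValueError("item_set must not be empty")
    joint = sum(1 for t in tx if items <= t)
    if joint == 0:
        return 0.0  # the items never occur together
    # joint > 0 guarantees every single-item count below is also > 0.
    return max(joint / sum(1 for t in tx if item in t) for item in items)


# Ten assumed transactions: caviar is rare but always bought with champagne.
orders = (
    [{"caviar", "champagne"}] * 2
    + [{"champagne", "bread"}] * 2
    + [{"bread"}] * 6
)
rsup_rare = related_support({"caviar", "champagne"}, orders)    # 1.0
rsup_common = related_support({"champagne", "bread"}, orders)   # 0.5
```

In this assumed data the pair `{caviar, champagne}` has an ordinary support of only 0.2, yet its related support is 1.0 because every caviar purchase includes champagne. This is the kind of low-frequency but strongly associated data the second rule algorithm is meant to keep.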

The database pre-stores the data sets of N items, where N is greater than or equal to 2.

The method of the present invention avoids filtering out valuable low-volume data and so makes the analysis comprehensive and objective. The association rules obtained with this method include higher-profit items. For example, the back-end storage of a shopping website holds sets of goods of various categories, recording information such as the profit and sales volume of each category. Some categories have a large profit, but with a single user-set threshold such a category may well be filtered out, making the data analysis incomplete. The method of the present invention avoids this incompleteness caused by the threshold setting and makes it easier to discover the relationships in the data sets.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A data analysis method, characterized by comprising the following steps:

Step 1: read the data sets stored in a database;

Step 2: calculate a first threshold value S1 using an item-set probability method;

Step 3: calculate a second threshold value S2 from the first threshold value S1 and an item attribute;

Step 4: extract a data set from the database, calculate the specific value of each item data set in it, and judge whether that specific value is greater than or equal to the first threshold value S1; if it is, go to step 5; if it is less than S1, go to step 6;

Step 5: calculate association rules using a preset first rule algorithm;

Step 6: judge whether the specific value of the item data set is greater than or equal to the second threshold value S2; if it is, go to step 7; if it is less than S2, go to step 8;

Step 7: calculate association rules using a preset second rule algorithm;

Step 8: delete the item data sets whose specific value is less than the second threshold value;

Repeat steps 4-8 until all item data sets in the database have been processed;

Display the calculated association rules.

2. The data analysis method according to claim 1, characterized in that step 2 comprises:

Step 21: traverse the database and count the probability of occurrence of each item set in the database;

Step 22: given a first percentage value preset by the user, find the probability values of the two item sets closest to that percentage value;

Step 23: calculate the difference between the probabilities of the two item sets;

Step 24: add the user-preset first percentage value and the probability difference of the two item sets; the result is the first threshold value S1.

3. The data analysis method according to claim 1, characterized in that step 3 comprises:

The second threshold value is calculated by the following formula, S2 = S1 - β × q(t),

where β is the user-set range of allowed reduction and q(t) is the item attribute, which is calculated by the following formula,

where t is the profit of each item, b and c are user-preset values, and b² + c² ≠ 0.

4. The data analysis method according to claim 3, characterized in that β takes a value of 10%-50%, and b and c satisfy 5 < b < 50 and 10 < c < 80.

5. The data analysis method according to claim 1, characterized in that the first rule algorithm is the Apriori algorithm.

6. The data analysis method according to claim 1, characterized in that step 7 comprises:

Step 71: calculate the related support of all data sets in the database;

Step 72: for each item data set whose specific value is less than the first threshold value S1 and greater than or equal to the second threshold value S2, judge whether its specific value is greater than or equal to the related support; if it is, go to step 74; if it is less than the related support, go to step 73;

Step 73: delete the item data sets whose specific value is less than the related support;

Step 74: repeat steps 72-73 until all item data sets in the database that are less than the first threshold value S1 and greater than or equal to the second threshold value S2 have been processed.

7. The data analysis method according to claim 1, characterized in that the related support is calculated as:

RSup(d₁, d₂, ..., d_k) = max(Sup(d₁, d₂, ..., d_k)/Sup(d₁), Sup(d₁, d₂, ..., d_k)/Sup(d₂), ..., Sup(d₁, d₂, ..., d_k)/Sup(d_k))

where RSup(d₁, d₂, ..., d_k) is the related support and Sup(d_k) is the support of d_k in the data set D = (d₁, d₂, ..., d_k).

8. The data analysis method according to claim 1, characterized in that the database pre-stores the data sets of N items, where N is greater than or equal to 2.