CN109657016A

CN109657016A - A method for mining attributes that satisfy the homophily requirement in an attributed graph model

Info

Publication number: CN109657016A
Application number: CN201811648180.9A
Authority: CN
Inventors: 赵子豪; 杨汉玮
Original assignee: Nupt Institute Of Big Data Research At Yancheng Co Ltd
Current assignee: Nupt Institute Of Big Data Research At Yancheng Co Ltd
Priority date: 2018-12-30
Filing date: 2018-12-30
Publication date: 2019-04-19

Abstract

The present invention relates to a method for mining attributes that satisfy the homophily requirement in an attributed graph model, and specifically to technical fields such as graph data and data mining algorithms. The method partitions the graph into subgraphs and evaluates the subgraphs quantitatively: a weighted average is computed on each subgraph, the standard deviation of these subgraph averages is computed, and from the differences among the resulting standard deviations the degree to which the corresponding attribute satisfies homophily is obtained. The method provided by the invention simplifies the processing of large-scale graph data. It can simplify subsequent mining problems on large-scale graph data with many attribute categories and reduce computation and storage overhead.

Description

A method for mining attributes that satisfy the homophily requirement in an attributed graph model

Technical field

The present invention relates to a method for mining attributes that satisfy the homophily requirement in an attributed graph model, and specifically to technical fields such as graph data and data mining algorithms.

Background technique

The traditional relational data model has long held the dominant position in data modeling. However, as related technologies have advanced and the relational model has been applied ever more widely, scenarios have emerged to which it is poorly suited, such as social networks and transportation networks, where operations on the relationships between entities are frequent. A traditional relational database needs a large number of association tables to record a series of complex relationships when handling such problems. As more entities are introduced, more and more association tables are needed, which makes solutions based on relational databases cumbersome and error-prone. This shortcoming of the data model leaves it ill-suited to today's rapid data growth and makes such data hard to scale.

Against this background, graph databases emerged. Graph databases originate from Euler and graph theory and are also called graph-oriented databases. Their basic idea is to store and query data using the graph data structure. Their data model is expressed mainly in terms of nodes and edges, and their advantage is that they can quickly solve complex relationship problems.

As graph databases have flourished, engineers and researchers can model data with graph models in many practical scenarios, so research on algorithms for graph data has also become increasingly active. Among these algorithms, mining algorithms on attributed graphs are studied the most, for example community detection, clustering, graph partitioning, and outlier detection. These mining algorithms essentially all rest on the same assumption: homophily must hold on every attribute under study. From the perspective of node attributes, homophily means that a node is more likely to be connected to nodes that are more similar to itself. In 2003 Newman published an article entitled "Mixing patterns in networks" in Physical Review, in which he defined mixing patterns on networks for the first time and raised the problem of disassortative mixing, that is, the situation in which nodes in a network tend to connect to nodes that are less similar to themselves.

A prerequisite for mining algorithms on attributed graphs is therefore to find the attributes that satisfy homophily; only then can mining be performed on those attributes. Some scholars have proposed methods that can find, among the multiple attributes of an attributed graph model, those that satisfy the homophily requirement, but the existing methods are all based on numerical attributes. It is therefore necessary to provide a method that applies to multiple attribute types and can find the attributes that satisfy the homophily requirement in an attributed graph model.

Summary of the invention

In view of the above deficiencies, the present invention provides a method for mining attributes that satisfy the homophily requirement in an attributed graph model.

The present invention adopts the following technical scheme:

The method of the present invention for mining attributes that satisfy the homophily requirement in an attributed graph model is as follows:

1) The graph is partitioned into subgraphs based on its network structure and attribute set; modularity is used as the parameter that controls the degree of partitioning, and the graph is partitioned into subgraph 1, subgraph 2, ..., subgraph n;

2) Ordered categorical attributes and unordered categorical attributes are distinguished in each subgraph; each ordered categorical attribute is evaluated quantitatively;

3) For the ordered categorical attributes quantified in step 2), a weighted average is computed on each of subgraph 1, subgraph 2, ..., subgraph n;

4) The standard deviation is computed over the subgraph averages obtained in step 3); from the differences among the resulting standard deviations, the degree to which the attribute corresponding to each value satisfies homophily is obtained.

In the method of the present invention for mining attributes that satisfy the homophily requirement in an attributed graph model, the "modularity" in step 1) evaluates the ratio of edges within the subgraphs obtained by partitioning to edges between subgraphs.

In the method of the present invention, in step 2) the number of occurrences of each value of an unordered categorical attribute is counted in subgraph 1, subgraph 2, ..., subgraph n; the proportion and distribution of each value within the subgraph are computed; and from the differences among the distributions in the subgraphs, the nodes whose attribute satisfies the homophily requirement are obtained.

In the method of the present invention, in step 2), after the weighted average of an ordered categorical attribute has been computed, the values are compared only between different subgraphs.

In the method of the present invention, in step 3) whether the attribute in each subgraph satisfies homophily is judged from the weighted averages; if the weighted averages differ greatly between subgraphs, the attribute is considered more likely to satisfy homophily.

Beneficial effect

The method provided by the invention for mining attributes that satisfy the homophily requirement in an attributed graph model simplifies the processing of large-scale graph data. It can simplify subsequent mining problems on large-scale graph data with many attribute categories and reduce computation and storage overhead.

Brief description of the drawings

Fig. 1 is a schematic diagram of the processing flow of the invention.

Detailed description of the embodiments

To make the purpose and technical scheme of the embodiments of the invention clearer, the technical scheme is described clearly and completely below with reference to the accompanying drawing. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the described embodiments without creative work fall within the protection scope of the invention.

As shown in the figure, the problem of mining homophilous attributes on a large-scale graph dataset can be converted into the problem of whether a given attribute satisfies the homophily requirement. The invention proposes a method, implemented as a computer program, for mining attributes that satisfy the homophily requirement in an attributed graph model, as follows:

The graph is partitioned into subgraphs based on its network structure and attribute set; modularity is used as the parameter that controls the degree of partitioning, and the graph is partitioned into subgraph 1, subgraph 2, ..., subgraph n;

Ordered categorical attributes and unordered categorical attributes are distinguished in each subgraph; each ordered categorical attribute is evaluated quantitatively;

For the ordered categorical attributes quantified in step 2), a weighted average is computed on each of subgraph 1, subgraph 2, ..., subgraph n;

The standard deviation is computed over the subgraph averages obtained in step 3); from the differences among the resulting standard deviations, the degree to which the attribute corresponding to each value satisfies homophily is obtained.

1) For an unordered categorical attribute in the attributed graph model (such as mode of travel, including walking, bicycle, subway, bus, driving, etc.), to evaluate quantitatively whether it satisfies the homophily requirement in a given network, and then to use such homophilous attributes for further applications such as data mining, proceed as follows.

First, attributes are set aside and the whole graph is partitioned according to its network structure alone (that is, how the nodes in the network are connected by edges). For the partitioning, this method designs and uses a modularity-based graph partitioning method, in which modularity is the parameter that controls the degree of partitioning. Modularity evaluates the ratio of edges within the resulting subgraphs to edges between subgraphs; a good partition has dense edges within subgraphs and sparse edges between them. The method is similar to community detection: every node is first assigned a label; then, for the node with the largest degree in the graph, its label is replaced by the label that occurs most often among its neighbors. Nodes with the same label are considered to belong to the same subgraph. This process is iterated until the result converges or a preset modularity value is reached. In practice, the granularity of the partition can be adjusted to the data at hand to achieve a better result.
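The partitioning step described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `modularity` and `partition`, the adjacency-dictionary input, the descending-degree visiting order, and the random tie-breaking are all assumptions, and the modularity formula used is Newman's standard definition, which the text does not spell out.

```python
import random
from collections import Counter


def modularity(adj, labels):
    """Newman modularity Q of a labelling of an undirected, unweighted graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # every edge is seen twice
    if two_m == 0:
        return 0.0
    inside = Counter()   # per label: intra-group edge endpoints (2 per edge)
    degree = Counter()   # per label: sum of node degrees
    for u, nbrs in adj.items():
        degree[labels[u]] += len(nbrs)
        inside[labels[u]] += sum(1 for v in nbrs if labels[v] == labels[u])
    return sum(inside[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def partition(adj, target_q=None, max_rounds=100, seed=0):
    """Label propagation; returns {node: subgraph label}."""
    rng = random.Random(seed)
    labels = {u: u for u in adj}                     # one label per node
    order = sorted(adj, key=lambda u: -len(adj[u]))  # assumed: highest degree first
    for _ in range(max_rounds):
        changed = False
        for u in order:
            counts = Counter(labels[v] for v in adj[u])
            if not counts:
                continue                             # isolated node keeps its label
            best = max(counts.values())
            winners = [lab for lab, c in counts.items() if c == best]
            if labels[u] not in winners:             # assumed: ties broken at random
                labels[u] = rng.choice(winners)
                changed = True
        if not changed:
            break                                    # converged
        if target_q is not None and modularity(adj, labels) >= target_q:
            break                                    # preset modularity reached
    return labels
```

Here `adj` maps each node to the set of its neighbors, and `target_q` plays the role of the preset modularity value; raising or lowering it is one way to adjust the granularity of the partition.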

Within each subgraph obtained from the partition, count the number of occurrences of each value of the unordered categorical attribute and compute the proportion of each value in the whole subgraph, which gives the distribution of the attribute's values on that subgraph. Then compute the distribution of the attribute's values on the whole graph. If the distributions differ greatly between subgraphs, and the distributions in some subgraphs differ markedly from the distribution on the entire dataset, the attribute is considered to satisfy the homophily requirement; that is, with respect to this attribute, nodes tend to connect to nodes similar to themselves.
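A minimal Python sketch of the counting step for an unordered attribute is given below. The text does not name a measure for how much two distributions differ, so the total variation distance used here is an assumption, as are the function names and the dictionary-based inputs.

```python
from collections import Counter, defaultdict


def value_distribution(values):
    """Proportion of each attribute value among the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def subgraph_distributions(labels, attr):
    """Per-subgraph and whole-graph distributions of one attribute.

    labels: {node: subgraph label}; attr: {node: attribute value}.
    """
    groups = defaultdict(list)
    for node, label in labels.items():
        groups[label].append(attr[node])
    per_subgraph = {lab: value_distribution(vals) for lab, vals in groups.items()}
    overall = value_distribution(attr[node] for node in labels)
    return per_subgraph, overall


def total_variation(p, q):
    """Assumed difference measure: 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Comparing `total_variation` between pairs of subgraphs, and between each subgraph and `overall`, gives one concrete way to express "differ greatly"; what counts as large is left to the user.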

2) For an ordered categorical attribute in the attributed graph model (such as student grade levels, which from best to worst are A, B, C, D, E), the aim is to evaluate quantitatively whether it satisfies the homophily requirement in a given network.

Following the same idea as above, attribute values are set aside at first and the dataset is partitioned according to its network structure using the modularity-based partitioning method. The occurrences of the attribute's different values are then analyzed statistically within each resulting subgraph, which reveals the distribution of the attribute's values.

Unlike 1), which counts the occurrences of each value in a subgraph and derives proportions, an ordered categorical attribute allows each category to be assigned an integer, in order, to represent it. Taking student grade levels as an example, 1 represents A, 2 represents B, 3 represents C, 4 represents D, and 5 represents E. The values need not start at 1 or increase in steps of 1; where categories differ markedly, the step can be enlarged, for example 2, 3, 4, 9, but the values must increase or decrease monotonically in category order. The number of occurrences of each value is counted in the subgraph, and the attribute's weighted average on that subgraph is computed from these counts.
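The encoding and the weighted average can be sketched in Python as follows. The names `GRADE_CODES`, `is_monotonic`, and `weighted_average` are illustrative assumptions; the mapping itself is the 1 to 5 example from the text.

```python
from collections import Counter

# Example encoding from the text: A is best, E is worst.
GRADE_CODES = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}


def is_monotonic(order, codes):
    """True if the codes strictly increase or strictly decrease along `order`."""
    nums = [codes[c] for c in order]
    pairs = list(zip(nums, nums[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def weighted_average(values, codes):
    """Average code on one subgraph, weighting each code by its count."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(codes[v] * n for v, n in counts.items()) / total
```

For instance, a subgraph whose nodes carry the grades A, A, B, E has counts 2, 1, 1 and a weighted average of (1*2 + 2*1 + 5*1) / 4 = 2.25.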

The weighted average is computed in this way on each subgraph obtained from the partition, and whether the attribute satisfies homophily is judged from these weighted averages. If the weighted averages differ greatly between subgraphs, the attribute is considered more likely to satisfy homophily.

Note that when the above method is used to judge whether an ordered categorical attribute satisfies homophily, that is, when the ordered categories are first converted to numbers and homophily is judged from the weighted averages on different subgraphs, the final comparison of weighted averages differs from the comparison of category distributions in 1): only the values on different subgraphs need to be compared with one another, with no comparison against the weighted average on the whole graph.

Ordered categorical data can also be judged by comparing the distributions of the attribute's values across subgraphs as in 1); the procedure is the same as in 1).

3) To quantify, within an attributed graph, the degree to which attributes satisfy homophily, the invention proposes a method for quantifying an attribute's degree of satisfaction of homophily.

As in the methods of 1) and 2), the distribution of each attribute's values across the subgraphs is obtained. The averages obtained on the subgraphs are treated as a sequence, and the standard deviation of this sequence is computed (a common statistical operation, carried out by the program). Different attributes yield different standard deviations. These standard deviations are then normalized (normalization is a common statistical operation that scales data of different magnitudes into the interval from 0 to 1 by suitable means). The resulting value is the degree to which the corresponding attribute satisfies homophily; it is an index for quantitatively evaluating that degree, to be judged manually according to actual needs.
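A minimal Python sketch of this scoring step follows. The text leaves two choices open, so both are assumptions here: the population standard deviation (`statistics.pstdev`) is used, and normalization divides by the largest standard deviation so that all scores fall between 0 and 1. The function name `attribute_scores` is also illustrative.

```python
from statistics import pstdev


def attribute_scores(subgraph_means):
    """Normalized spread of the per-subgraph averages of each attribute.

    subgraph_means: {attribute: [average on subgraph 1, ..., on subgraph n]}
    """
    spreads = {name: pstdev(means) for name, means in subgraph_means.items()}
    top = max(spreads.values())
    if top == 0:
        return {name: 0.0 for name in spreads}    # no attribute varies at all
    # Assumed normalization: divide by the maximum, mapping scores into [0, 1].
    return {name: s / top for name, s in spreads.items()}
```

Because the scores are scaled relative to the other attributes of the same graph, they are only meaningful for ranking attributes within that graph.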

Note that this method applies only to comparing different attributes within a single attributed graph; it is not suitable for comparison across graphs.

4) Given the particular nature of this method, a traditional distributed sharding scheme for graph data, which first splits the data and distributes the shards to different nodes for computation, would impair the method's effectiveness. For the proposed method of mining homophilous attributes in an attributed graph model, the invention therefore designs a new two-stage distributed scheme.

In the first stage, the partitioning method is executed on the cluster to partition the graph, and the partition result is stored on the master node.

In the second stage, copies of the partitioned graph are distributed to the slave nodes. Each slave node is responsible for judging one or more attributes. When distributing a copy to a particular slave node, the master node removes the attributes that the slave node does not need to process. After receiving its copy, each node locally runs the proposed method for mining homophilous attributes.
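The second stage can be illustrated with the following Python sketch of what the master node prepares. It is only a single-process illustration under stated assumptions: the round-robin assignment, the function names, and the dictionary layout of a copy are hypothetical, and network transfer and the edge list are omitted for brevity.

```python
def assign_attributes(attributes, n_workers):
    """Assumed policy: deal attribute names out to the slave nodes round-robin."""
    plan = [[] for _ in range(n_workers)]
    for i, name in enumerate(attributes):
        plan[i % n_workers].append(name)
    return plan


def make_copy(labels, node_attrs, keep):
    """Copy of the partition result carrying only the attributes in `keep`.

    labels: {node: subgraph label}; node_attrs: {node: {attribute: value}}.
    """
    keep = set(keep)
    return {
        "labels": dict(labels),
        "node_attrs": {
            node: {k: v for k, v in attrs.items() if k in keep}
            for node, attrs in node_attrs.items()
        },
    }
```

Each slave node would then run the per-attribute evaluation on its own copy, so no attribute data that it does not need is shipped to it.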

An example based on the above description follows:

Take student grade levels, which from best to worst are A, B, C, D, E; the procedure is illustrated by applying the method to this data.

Take as an example the social network formed by the third-grade pupils of a certain primary school. Each node in this social network represents a single pupil and each edge represents a friendship. Nodes carry many attributes, one of which is the final examination grade for the current term, which from best to worst is A, B, C, D, E. We now analyze whether the academic-grade attribute satisfies homophily on this social network. The steps are as follows:

1. Set the modularity and partition the graph into subgraphs using the partitioning method described above (assume 10 subgraphs are obtained).

2. Set the numerical value for each grade level: A-1, B-2, C-3, D-7, E-10.

3. Compute the average grade on each subgraph (assume the averages on the 10 subgraphs are a1, a2, ..., a10);

4. Compute the standard deviation of the sequence a1 to a10;

5. Scale this standard deviation, together with the standard deviations computed for the other attributes (obtained in the same way as described in steps 1-4), into the interval 0-1. The larger the final value, the more fully the corresponding attribute satisfies homophily.
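Steps 2 to 4 of this example can be traced with a short Python sketch. The grade mapping is the one given in step 2; the grade lists are hypothetical data invented for illustration, and only three subgraphs are shown instead of ten to keep the example small. The population standard deviation is again an assumption.

```python
from statistics import mean, pstdev

# Step 2: mapping from the example.
CODES = {"A": 1, "B": 2, "C": 3, "D": 7, "E": 10}

# Hypothetical grades of the pupils in each subgraph (not from the source).
subgraph_grades = {
    "subgraph 1": ["A", "A", "B"],
    "subgraph 2": ["C", "C", "B"],
    "subgraph 3": ["D", "E", "E"],
}

# Step 3: average encoded grade on each subgraph (a1, a2, a3).
averages = [mean(CODES[g] for g in grades) for grades in subgraph_grades.values()]

# Step 4: standard deviation of the sequence of averages.
spread = pstdev(averages)

print(averages, spread)
```

Step 5 would then scale `spread` together with the standard deviations of the other attributes into the interval 0-1.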

The foregoing is only a preferred embodiment of the invention, and the protection scope of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the protection scope of the claims.

Claims

1. A method for mining attributes that satisfy the homophily requirement in an attributed graph model, characterized in that the method is as follows:

3) for the ordered categorical attributes quantified in step 2), a weighted average is computed on each of subgraph 1, subgraph 2, ..., subgraph n;

2. The method for mining attributes that satisfy the homophily requirement in an attributed graph model according to claim 1, characterized in that the "modularity" in step 1) evaluates the ratio of edges within the subgraphs obtained by partitioning to edges between subgraphs.

3. The method for mining attributes that satisfy the homophily requirement in an attributed graph model according to claim 1, characterized in that in step 2) the number of occurrences of each value of an unordered categorical attribute is counted in subgraph 1, subgraph 2, ..., subgraph n; the proportion and distribution of each value within the subgraph are computed; and from the differences among the distributions in the subgraphs, the nodes whose attribute satisfies the homophily requirement are obtained.

4. The method for mining attributes that satisfy the homophily requirement in an attributed graph model according to claim 1, characterized in that in step 2), after the weighted average of an ordered categorical attribute has been computed, the values are compared only between different subgraphs.

5. The method for mining attributes that satisfy the homophily requirement in an attributed graph model according to claim 1, characterized in that in step 3) whether the attribute in each subgraph satisfies homophily is judged from the weighted averages; if the weighted averages differ greatly between subgraphs, the attribute is considered more likely to satisfy homophily.