CN109993313A - Sample label processing method and processing device, community partitioning method and device

Info

Publication number: CN109993313A
Application number: CN201811612712.3A
Authority: CN
Inventors: 司书强
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2019-07-09

Abstract

The embodiments of this specification provide a sample label processing method. Diffusion is achieved by adding a preset label to the samples in a target group that do not have the preset label, and purification is achieved by deleting the preset label from the samples in a non-target group that have the preset label. Diffusion and purification are repeated over multiple rounds of iterative processing, which improves the accuracy and recall of the samples.

Description

Sample label processing method and processing device, community partitioning method and device

Technical field

The embodiments of this specification relate to the technical field of data processing, and in particular to a sample label processing method and apparatus and a community division method and apparatus.

Background

Machine learning is the discipline that studies how to use machines to simulate human learning activity. It trains on large amounts of sample data to obtain data models of various forms for solving practical problems. Machine learning can generally be divided into four categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. The main difference among supervised, unsupervised, and semi-supervised learning is whether the training samples have labels. Supervised learning is the machine learning task of inferring a function from a labeled sample set; unsupervised learning is the machine learning task of inferring a function from a sample set whose classes are unknown (unlabeled); in semi-supervised learning only a small portion of the samples have labels, so it is a learning method that combines supervised and unsupervised learning. In practical applications, one frequently encounters problems that need to be solved with supervised or semi-supervised learning but for which the sample labels are inaccurate.

Summary of the invention

The embodiments of this specification provide a sample label processing method and apparatus and a community division method and apparatus.

In a first aspect, the embodiments of this specification provide a sample label processing method, comprising:

Obtaining a sample set, wherein some of the samples in the sample set have a preset label;

Dividing the sample set into H groups according to the association relationships between the samples in the sample set, H being a positive integer;

Performing L rounds of iterative processing on the H groups until a convergence condition is met, and taking the label information of each sample after the last round of iterative processing as the processing result, wherein the label information of each sample indicates whether that sample has the preset label, and L is a positive integer;

Wherein each round of iterative processing comprises:

Determining the group features of each group according to the current label information of each sample;

Determining target groups and non-target groups according to the group features, wherein a target group is a group in which samples having the preset label are aggregated, and the non-target groups are the groups among the one or more groups other than the target groups;

Adding the preset label to the samples in the target groups that do not have the preset label;

Deleting the preset label from the samples in the non-target groups that have the preset label.

In a second aspect, the embodiments of this specification provide a community division method, comprising:

Generating, according to the association relationships between the samples in a sample set, a relationship network graph in which each individual sample is a node;

Calculating the degree of each node of the relationship network graph;

Visiting the nodes of the relationship network graph one by one in descending order of node degree;

Wherein visiting each node of the relationship network graph comprises:

Determining whether the current node has already joined any group;

If the current node has not joined any group, generating a new group centered on the current node;

Determining one or more extension nodes according to the current node, wherein an extension node is a node associated with the current node through N edges, N being a positive integer;

Adding the one or more extension nodes to the new group.

In a third aspect, the embodiments of this specification provide another community division method, comprising:

Calculating the degree of each node of the relationship network graph;

Wherein visiting each node of the relationship network graph comprises:

Determining whether the current node has already joined any group;

Performing group-joining processing on each extension node;

Wherein the group-joining processing comprises:

Determining whether the number of groups the extension node has already joined is less than a first preset threshold;

If the number of groups the extension node has already joined is less than the first preset threshold, adding the extension node to the new group.

In a fourth aspect, the embodiments of this specification provide a sample label processing apparatus, comprising:

A sample set obtaining module, configured to obtain a sample set, wherein some of the samples in the sample set have a preset label;

A sample set division module, configured to divide the sample set into H groups according to the association relationships between the samples in the sample set, H being a positive integer;

An iterative processing module, configured to perform L rounds of iterative processing on the H groups until a convergence condition is met, and to take the label information of each sample after the last round of iterative processing as the processing result, wherein the label information of each sample indicates whether that sample has the preset label, and L is a positive integer;

Wherein the iterative processing module comprises:

A feature determination module, configured to determine the group features of each group according to the current label information of each sample;

A group determination module, configured to determine target groups and non-target groups according to the group features, wherein a target group is a group in which samples having the preset label are aggregated, and the non-target groups are the groups among the one or more groups other than the target groups;

A label adding module, configured to add the preset label to the samples in the target groups that do not have the preset label;

A label deleting module, configured to delete the preset label from the samples in the non-target groups that have the preset label.

In a fifth aspect, the embodiments of this specification provide a community division apparatus, comprising:

A network generation module, configured to generate, according to the association relationships between the samples in a sample set, a relationship network graph in which each individual sample is a node;

A node degree calculation module, configured to calculate the degree of each node of the relationship network graph;

An access module, configured to visit the nodes of the relationship network graph one by one in descending order of node degree;

Wherein the access module comprises:

A first judgment module, configured to determine whether the current node has already joined any group;

A new group generation module, configured to generate a new group centered on the current node when the current node has not joined any group;

An extension node determination module, configured to determine one or more extension nodes according to the current node, wherein an extension node is a node associated with the current node through N edges, N being a positive integer;

A first joining module, configured to add the one or more extension nodes to the new group.

In a sixth aspect, the embodiments of this specification provide another community division apparatus, comprising:

Wherein the access module comprises:

A group-joining processing module, configured to perform group-joining processing on each extension node;

Wherein the group-joining processing module comprises:

A second judgment module, configured to determine whether the number of groups the extension node has already joined is less than a first preset threshold;

A second joining module, configured to add the extension node to the new group when the number of groups the extension node has already joined is less than the first preset threshold.

In a seventh aspect, the embodiments of this specification provide a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above sample label processing method and community division method when executing the computer program.

In an eighth aspect, the embodiments of this specification provide a computer-readable storage medium storing a computer program, wherein the computer program implements the above sample label processing method and community division method when executed by a processor.

The beneficial effects of the embodiments of this specification are as follows:

In the embodiments of this specification, a sample set in which some samples have a preset label is divided into H groups according to the association relationships between the samples. Target groups, in which samples having the preset label are aggregated, and non-target groups, which are the groups other than the target groups, are obtained according to the group features of each group. Diffusion is achieved by adding the preset label to the samples in the target groups that do not have it, and purification is achieved by deleting the preset label from the samples in the non-target groups that have it. Diffusion and purification are repeated over multiple rounds of iterative processing, which improves the accuracy and recall of the samples. The sample label processing method provided by the embodiments of this specification adjusts the labeling of individuals by first characterizing groups directly; all calculations are performed within groups, which reduces computational complexity. Each round of iterative processing only needs to obtain the group features of each group and does not need to compute an update for each sample, so the computational cost is very low. The convergence condition is easily reached: the processing can generally exit after four to five rounds. Compared with the existing LPA (Label Propagation Algorithm), the accuracy and recall of the samples are higher. The method supports starting from a single label or a few labels, and the diffusion and purification of a single label, a few labels, and multiple labels can proceed synchronously.

Brief description of the drawings

Fig. 1 is a schematic diagram of an application scenario of sample label processing according to the embodiments of this specification;

Fig. 2 is a flow chart of the sample label processing method according to the embodiments of this specification;

Fig. 3 is a flow chart of dividing the sample set into H groups according to the embodiments of this specification;

Fig. 4 is a flow chart of the community division method according to one embodiment of this specification;

Fig. 5 is a schematic diagram of the relationship network graph according to the embodiments of this specification;

Fig. 6 is a flow chart of the community division method according to another embodiment of this specification;

Fig. 7 and Fig. 8 are schematic diagrams of the groups according to the embodiments of this specification;

Fig. 9 is a structural schematic diagram of the sample label processing apparatus according to the embodiments of this specification;

Fig. 10 is a structural schematic diagram of the server according to the embodiments of this specification.

Detailed description

For a better understanding of the above technical solutions, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of this specification and the specific features in the embodiments are a detailed description of the technical solutions of this specification, not a limitation on them. Where there is no conflict, the embodiments of this specification and the technical features in the embodiments can be combined with one another.

Refer to Fig. 1, which is a schematic diagram of an application scenario of sample label processing according to the embodiments of this specification. The sample label processing apparatus 100 uses a specific iterative algorithm to perform label processing on samples whose accuracy and recall are too low to support policy application and model training, obtains output samples with high accuracy and recall, and provides the output samples to the model training apparatus 200. The model training apparatus 200 performs model training on the output samples to obtain a supervised learning model or semi-supervised learning model capable of solving a particular problem.

In a first aspect, the embodiments of this specification provide a sample label processing method. Fig. 2 is a flow chart of the sample label processing method, which includes steps S201 to S207.

S201: Obtain a sample set, wherein some of the samples in the sample set have a preset label.

In the field of machine learning, a sample is a particular instance of data, and a collection of samples is a sample set. The form a sample takes depends on the specific problem that machine learning is solving. For example, if machine learning is used to identify risky users in online transactions, each sample corresponds to a user; if machine learning is used to classify text, each sample corresponds to a text. Samples can be divided into labeled samples and unlabeled samples. A label is the data to be predicted, for example the future price of a commodity, the type of item shown in a picture, or the meaning of an audio clip. In the embodiments of this specification, the samples in the sample set have accuracy and recall too low to support policy application and model training. Taking the identification of risky users in online transactions as an example, low accuracy and recall manifest as risk-free users having been given a risk label and risky users not having been given one.

Some samples in the sample set have the preset label, and some do not. The preset label can be a specific label, or a label class to which several specific labels belong. For example, risk behaviors in online transactions include, but are not limited to, fraud-type, baseline-type, operation-type, and finance-type risk behaviors. If it is only necessary to identify users who engage in risk behavior, without regard to which specific kind of risk behavior they engage in, the preset label is a risk label. If it is necessary to identify users who engage in a specific kind of risk behavior, the preset label is that specific risk label, such as a fraud risk label. It should be noted that the total number of samples in the sample set and the number of samples having the preset label are determined by the actual application and are not limited by the embodiments of this specification. The samples in the sample set can be obtained by web crawling, extracted from a database, or obtained from other systems or channels; the embodiments of this specification do not limit this.

S202: Divide the sample set into H groups according to the association relationships between the samples in the sample set, H being a positive integer.

The association relationship can be a device relationship, a network relationship, a social relationship, a community relationship, and so on. If natural groups exist in the sample set, such as chat groups, particular community groups, particular network groups, or particular device-environment groups, each natural group is taken as one group; otherwise, group identification is performed with a community segmentation or community discovery algorithm. Referring to Fig. 3, the embodiments of this specification provide a specific way of dividing the sample set into H groups, including steps S301 and S302.

S301: Generate, according to the association relationships between the samples in the sample set, a relationship network graph in which each individual sample is a node.

The relationship network graph is a graph structure made up of a number of nodes. Each node represents one sample, and a relationship between two samples is represented by a connecting line (edge). For example, if sample A and sample B have an association relationship, the node for sample A and the node for sample B are connected by a line; if sample A and sample B have no association relationship, there is no line between their nodes. It should be noted that the relationship network graph can be an undirected graph or a directed graph, depending on actual requirements.
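As an illustration of step S301, the following minimal sketch builds an undirected relationship network graph from pairwise association relationships. The function name `build_graph` and the sample identifiers are hypothetical and do not come from this specification; an adjacency-set dictionary is only one possible representation.

```python
from collections import defaultdict

def build_graph(samples, associations):
    """Build an undirected adjacency map: one node per sample, one edge per association."""
    graph = defaultdict(set)
    for s in samples:
        graph[s]  # touch the key so samples without associations still appear as nodes
    for u, v in associations:
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)

# Hypothetical data: A and B are associated, C has no association.
graph = build_graph(["A", "B", "C"], [("A", "B")])
```

For a directed graph, one would record each association in one direction only; that variant is not shown here.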

S302: Perform community division on the relationship network graph to obtain the H groups.

Referring to Fig. 4, the embodiments of this specification provide a fast community division method, including steps S401 to S405.

S401: Calculate the degree of each node of the relationship network graph.

The degree of a node is the number of edges associated with that node. If the relationship network graph is a directed graph, the degree of each node is the sum of its in-degree and out-degree, where the in-degree of a node is the number of edges pointing to the node and the out-degree is the number of edges pointing away from it. After the degree of each node of the relationship network graph is obtained, the nodes are visited one by one in descending order of degree, where visiting each node includes steps S402 to S405. It should be noted that two or more nodes with the same degree can be visited in any order.
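The degree calculation for a directed graph and the resulting visiting order can be sketched as follows. The function names and the example edges are assumptions made for illustration; ties are broken by node name here only to make the output reproducible, since the text allows any order among nodes of equal degree.

```python
from collections import Counter

def directed_degrees(edges):
    """Degree of each node in a directed graph = in-degree + out-degree."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # the edge points away from src
        in_deg[dst] += 1    # the edge points to dst
    return {n: in_deg[n] + out_deg[n] for n in set(in_deg) | set(out_deg)}

def visit_order(degrees):
    """Nodes in descending order of degree (ties broken by name)."""
    return sorted(degrees, key=lambda n: (-degrees[n], n))

# Hypothetical directed edges: x -> y, y -> z, z -> y
degrees = directed_degrees([("x", "y"), ("y", "z"), ("z", "y")])
order = visit_order(degrees)
```

In this example node y has in-degree 2 and out-degree 1, so it is visited first.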

S402: Determine whether the current node has already joined any group.

If the current node has not joined any group, step S403 is executed to generate a new group centered on the current node; otherwise, the next node is visited. Generating a new group centered on the current node means creating a group and adding the current node to it.

S404: Determine one or more extension nodes according to the current node, wherein an extension node is a node associated with the current node through N edges, N being a positive integer.

According to the classical theory of complex networks, any two nodes can be connected in only N steps. The value of N can be set according to actual needs: the smaller N is, the more likely important connections are missed; the larger N is, the greater the amount of computation. In one optional implementation, N is 3. Further, if the relationship network graph is a directed graph, the N edges point in sequence from the extension node to the current node, or in sequence from the current node to the extension node.

S405: Add the one or more extension nodes to the new group.

The community division method provided by the embodiments of this specification, based on the classical theory of complex networks, performs community segmentation quickly with low time complexity. Taking as an example the case where the relationship network graph is the undirected graph shown in Fig. 5 and N is 1, the community division method of Fig. 4 is described in detail below:

The relationship network graph shown in Fig. 5 has seven nodes, a, b, c, d, e, f, and g, whose degrees are calculated as 3, 3, 4, 3, 3, 2, and 2 respectively. Visiting in descending order of degree, node c is visited first. Since node c has not joined any group, a new group centered on node c is generated. Since nodes a, b, d, and e are associated with node c through 1 edge, they are determined to be extension nodes and are added to the new group centered on node c. The remaining nodes are then visited in turn in the same way. Since nodes a, b, d, and e have already joined the group centered on node c, the next node to generate a group is node f or node g: either nodes e and g are added to a new group centered on node f, or nodes e and f are added to a new group centered on node g. Two groups are finally obtained: a group consisting of nodes a, b, c, d, and e, and a group consisting of nodes e, f, and g.
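The walk-through above can be reproduced with a short sketch of steps S401 to S405. Fig. 5 is not reproduced in this text, so the edge list below is an assumption chosen to match the stated degrees (3, 3, 4, 3, 3, 2, 2) and neighbor relationships. The helper names are hypothetical, and "associated through N edges" is interpreted here as "reachable through at most N edges", which is also an assumption.

```python
from collections import deque

# Assumed edge list for Fig. 5 (consistent with the degrees given in the text).
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("c", "e"), ("e", "f"), ("e", "g"), ("f", "g")]

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def within_n_edges(adj, start, n):
    """Nodes reachable from start through at most n edges (start excluded)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == n:
            continue
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    seen.discard(start)
    return seen

def divide(adj, n=1):
    groups, joined = [], set()
    # S401: descending degree; ties broken by name (any order is allowed).
    for node in sorted(adj, key=lambda x: (-len(adj[x]), x)):
        if node in joined:                              # S402: already in a group
            continue
        group = {node} | within_n_edges(adj, node, n)   # S403 to S405
        groups.append(group)
        joined |= group
    return groups

groups = divide(adjacency(EDGES), n=1)
```

With this tie-breaking, node f is visited before node g, so the second group is centered on f; node e ends up in both groups, as in the text.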

With the community division method shown in Fig. 4, some nodes in core positions may participate in the calculation of many groups, which slows down the subsequent steps. Referring to Fig. 6, the embodiments of this specification provide another fast community division method, including steps S601 to S606.

S601: Calculate the degree of each node of the relationship network graph.

The degree of a node is the number of edges associated with that node. If the relationship network graph is a directed graph, the degree of each node is the sum of its in-degree and out-degree, where the in-degree of a node is the number of edges pointing to the node and the out-degree is the number of edges pointing away from it. After the degree of each node of the relationship network graph is obtained, the nodes are visited one by one in descending order of degree, where visiting each node includes steps S602 to S606. It should be noted that two or more nodes with the same degree can be visited in any order.

S602: Determine whether the current node has already joined any group.

If the current node has not joined any group, step S603 is executed to generate a new group centered on the current node; otherwise, the next node is visited. Generating a new group centered on the current node means creating a group and adding the current node to it.

S604: Determine one or more extension nodes according to the current node, wherein an extension node is a node associated with the current node through N edges, N being a positive integer.

According to the classical theory of complex networks, any two nodes can be connected in only N steps. The value of N can be set according to actual needs: the smaller N is, the more likely important connections are missed; the larger N is, the greater the amount of computation. In one optional implementation, N is 3. Further, if the relationship network graph is a directed graph, the N edges point in sequence from the extension node to the current node, or in sequence from the current node to the extension node. After the one or more extension nodes are obtained, group-joining processing is performed on each extension node, where the group-joining processing includes steps S605 and S606. It should be noted that the group-joining processing can be performed on all extension nodes at the same time or on each extension node in turn; the embodiments of this specification do not limit this.

S605: Determine whether the number of groups the extension node has already joined is less than a first preset threshold.

The value of the first preset threshold can be set according to the actual application: the smaller the first preset threshold is, the more likely some groups are missed; the larger it is, the greater the amount of computation.

If the number of groups the extension node has already joined is less than the first preset threshold, step S606 is executed to add the extension node to the new group.

Again taking as an example the case where the relationship network graph is the undirected graph shown in Fig. 5, N is 1, and the first preset threshold is also 1, the community division method of Fig. 6 is described in detail below:

The relationship network graph shown in Fig. 5 has seven nodes, a, b, c, d, e, f, and g, whose degrees are calculated as 3, 3, 4, 3, 3, 2, and 2 respectively. Visiting in descending order of degree, node c is visited first. Since node c has not joined any group, a new group centered on node c is generated. Since nodes a, b, d, and e are associated with node c through 1 edge, they are determined to be extension nodes. Since nodes a, b, d, and e have not joined any group, the number of groups each has joined is 0, which is less than the first preset threshold, so they are added to the new group centered on node c. The remaining nodes are then visited in turn in the same way. Since nodes a, b, d, and e have already joined the group centered on node c, the next node to generate a group is node f or node g. If node f is visited, the extension nodes are nodes e and g; if node g is visited, the extension nodes are nodes e and f. Since node e has already joined the group centered on node c, the number of groups it has joined is 1, which is not less than the first preset threshold, so node e cannot be added to the new group centered on node f or node g. Only node g is added to the new group centered on node f, or only node f is added to the new group centered on node g. Two groups are finally obtained: a group consisting of nodes a, b, c, d, and e, and a group consisting of nodes f and g.
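A minimal sketch of the thresholded variant (steps S601 to S606) follows, restricted to N = 1 so that the extension nodes are simply the direct neighbors. As before, the edge list is an assumed reconstruction of Fig. 5, and the names `divide_with_cap` and `max_groups` (standing for the first preset threshold) are hypothetical.

```python
# Assumed edge list for Fig. 5 (consistent with the degrees given in the text).
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("c", "e"), ("e", "f"), ("e", "g"), ("f", "g")]

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def divide_with_cap(adj, max_groups=1):
    """Community division with N = 1 and a cap on groups joined per extension node."""
    groups = []
    joined_count = {node: 0 for node in adj}
    # S601: descending degree; ties broken by name (any order is allowed).
    for node in sorted(adj, key=lambda x: (-len(adj[x]), x)):
        if joined_count[node] > 0:               # S602: already in a group
            continue
        group = {node}                           # S603: new group centered on node
        joined_count[node] += 1
        for ext in sorted(adj[node]):            # S604: extension nodes for N = 1
            if joined_count[ext] < max_groups:   # S605: below the first threshold?
                group.add(ext)                   # S606: join the new group
                joined_count[ext] += 1
        groups.append(group)
    return groups

capped = divide_with_cap(adjacency(EDGES), max_groups=1)
looser = divide_with_cap(adjacency(EDGES), max_groups=2)
```

With a threshold of 1, node e stays only in the group centered on c; raising the threshold to 2 lets e also join the group centered on f, which reproduces the result of the method of Fig. 4 on this graph.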

In practical applications, because the number of samples in the sample set is large, dividing the sample set according to the association relationships between its samples usually yields two or more groups, that is, H is a positive integer not less than 2; in extreme cases the sample set may be divided into a single group. Continuing to refer to Fig. 1, after the sample set is divided into H groups, L rounds of iterative processing are performed on the H groups until the convergence condition is met, L being a positive integer. Each round of iterative processing includes steps S203 to S206.

S203: Determine the group features of each group according to the current label information of each sample.

The label information of each sample indicates whether that sample has the preset label. In the first round of iterative processing, the current label information of each sample is its label information in the sample set; in the second and later rounds, it is the label information of each sample after the previous round. The group features serve as the basis for determining the target groups. Their specific content can depend on the actual situation, as long as the target groups can be determined from them, a target group being a group in which samples having the preset label are aggregated. In one optional implementation, the group features include the group label concentration; in another optional implementation, the group features include the group size in addition to the group label concentration. The group size is the number of all samples in the group to which the group features correspond, and the group label concentration is the ratio of the number of samples having the preset label to the number of all samples in that group.
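The two group features defined above can be computed as in the following sketch. The function name, the dictionary keys, and the sample data are hypothetical; groups are assumed to be non-empty, since each group is created around at least one node.

```python
def group_features(members, has_label):
    """members: list of sample ids in one group; has_label: {sample id: bool}."""
    size = len(members)                                  # group size
    labeled = sum(1 for s in members if has_label[s])    # samples with the preset label
    return {"size": size, "concentration": labeled / size}

# Hypothetical group of four samples, three of which have the preset label.
has_label = {"s1": True, "s2": True, "s3": False, "s4": True}
features = group_features(["s1", "s2", "s3", "s4"], has_label)
```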

S204: Determine target groups and non-target groups according to the group features, wherein a target group is a group in which samples having the preset label are aggregated, and the non-target groups are the groups among the one or more groups other than the target groups.

Specifically, it is determined whether the group features of each group meet a preset condition. If they do, the corresponding group is determined to be a target group; otherwise, it is determined to be a non-target group. The group features can be checked one after another or all at the same time; the embodiments of this specification do not limit this. The preset condition is determined by the group features and by the specific problem that machine learning is solving. For example, if the problem is identifying group risk behavior and the group features include the group size and the group label concentration, the preset condition can be that the group size is greater than a preset number and the group label concentration is greater than a preset percentage.
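For the example preset condition just described (group size above a preset number and group label concentration above a preset percentage), the split into target and non-target groups might look like the sketch below. The group names, feature values, and thresholds are invented for illustration only.

```python
def split_groups(features, min_size, min_concentration):
    """features: {group name: {"size": int, "concentration": float}}.
    A group is a target group only if both features exceed their thresholds."""
    targets, non_targets = [], []
    for name, f in features.items():
        if f["size"] > min_size and f["concentration"] > min_concentration:
            targets.append(name)
        else:
            non_targets.append(name)
    return targets, non_targets

# Hypothetical features for three groups.
features = {
    "G1": {"size": 12, "concentration": 0.75},  # large and concentrated
    "G2": {"size": 3, "concentration": 0.9},    # concentrated but too small
    "G3": {"size": 20, "concentration": 0.1},   # large but sparse
}
targets, non_targets = split_groups(features, min_size=5, min_concentration=0.5)
```

Both comparisons are strict, matching the wording "greater than"; a group whose size equals the preset number is therefore a non-target group.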

S205: Add the preset label to the samples in the target groups that do not have the preset label.

Taking a target group of 10 samples as an example, if 7 of the samples have the preset label, the preset label is added to the remaining 3 samples. Label diffusion is achieved by adding the preset label to the samples in the target groups that do not have it.

S206: Delete the preset label from the samples in the non-target groups that have the preset label.

Taking a non-target group of 10 samples as an example, if 3 of the samples have the preset label, the preset label is deleted from those 3 samples. Label purification is achieved by deleting the preset label from the samples in the non-target groups that have it. It should be noted that the embodiments of this specification do not limit the execution order of steps S205 and S206: step S205 can be executed first and then step S206, or step S206 can be executed first and then step S205.
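Steps S205 and S206 can be sketched together using the two ten-sample examples above. The function and variable names are hypothetical, and the sketch assumes that groups do not overlap; if a sample belonged to both a target and a non-target group, the result would depend on processing order, which this sketch does not address.

```python
def diffuse_and_purify(groups, target_names, has_label):
    """Return (new labels, number of labels added, number of labels deleted)."""
    labels = dict(has_label)  # copy so the caller's labels are not modified
    added = deleted = 0
    for name, members in groups.items():
        want = name in target_names  # target group: all labeled; otherwise none
        for s in members:
            if labels[s] != want:
                labels[s] = want
                if want:
                    added += 1       # S205: diffusion
                else:
                    deleted += 1     # S206: purification
    return labels, added, deleted

# Target group: 10 samples, 7 labeled. Non-target group: 10 samples, 3 labeled.
groups = {"target": [f"t{i}" for i in range(10)],
          "other": [f"o{i}" for i in range(10)]}
has_label = {f"t{i}": i < 7 for i in range(10)}
has_label.update({f"o{i}": i < 3 for i in range(10)})

labels, added, deleted = diffuse_and_purify(groups, {"target"}, has_label)
```

The counts of added and deleted labels are returned because they are the quantities a and b used by the ratio-based convergence condition.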

After each round of iterative processing is completed, it is determined whether the convergence condition is met. If the convergence condition is met, step S207 is executed: the label information of each sample after the last round of iterative processing is taken as the processing result; otherwise, the next round of iterative processing is performed. The label information of each sample indicates whether that sample has the preset label. The convergence condition can be that the number of iterations L reaches a preset number, which can be set based on practical experience. The convergence condition can also be that the following is satisfied: [formula not reproduced in the source text], where a is the number of preset labels added in the current round of iterative processing, b is the number of preset labels deleted in the current round, M is the sum, over the groups, of the number of samples having the preset label before the current round, and ε is a second preset threshold. Of course, the convergence condition can also be another condition; the embodiments of this specification do not limit this.
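The overall loop of steps S203 to S207 can be sketched as follows. Because the ratio-based formula is not available in this text, the sketch implements only the iteration-count condition directly and accepts any other test as a caller-supplied function of a, b, and M; the "no label changed" test used in the example is an assumption of this sketch, not the condition of this specification. All names are hypothetical, and groups are assumed to be non-overlapping and non-empty.

```python
def iterate_labels(groups, has_label, min_size, min_conc, max_rounds, converged=None):
    """groups: {name: [sample ids]}; has_label: {sample id: bool}.
    Returns (final labels, number of rounds performed)."""
    labels = dict(has_label)
    rounds = 0
    while rounds < max_rounds:            # convergence by preset number of rounds
        rounds += 1
        # M: labeled samples summed over the groups before this round
        m = sum(labels[s] for members in groups.values() for s in members)
        # S203/S204: group features and target decision from the current labels
        targets = {name for name, members in groups.items()
                   if len(members) > min_size
                   and sum(labels[s] for s in members) / len(members) > min_conc}
        added = deleted = 0
        for name, members in groups.items():
            want = name in targets
            for s in members:
                if labels[s] != want:
                    labels[s] = want
                    if want:
                        added += 1        # S205: diffusion
                    else:
                        deleted += 1      # S206: purification
        if converged is not None and converged(added, deleted, m):
            break                         # S207: current labels are the result
    return labels, rounds

# Hypothetical data: g1 has 6 of 8 samples labeled, g2 has 2 of 8.
groups = {"g1": [f"s{i}" for i in range(8)], "g2": [f"u{i}" for i in range(8)]}
has_label = {f"s{i}": i < 6 for i in range(8)}
has_label.update({f"u{i}": i < 2 for i in range(8)})

labels, rounds = iterate_labels(
    groups, has_label, min_size=5, min_conc=0.5, max_rounds=5,
    converged=lambda a, b, m: a + b == 0,  # assumed stand-in: stop when nothing changed
)
```

In this example the first round adds two labels in g1 and deletes two in g2, and the second round changes nothing, so the loop stops after two rounds.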

Suppose that the group characteristics include the group size and the cluster label concentration, the preset condition is that the group size is greater than 6 and the cluster label concentration is greater than 0.5, the convergence condition is the relation among a, b, M and ε (formula not reproduced in this text), and the second preset threshold ε is 0.2. Taking group R, group S and group T shown in Fig. 7 as the one or more groups, where black dots indicate samples with the default label and white dots indicate samples without the default label, the iterative processing is described in detail below:

For the groups shown in Fig. 7, the group size of group R is calculated to be 11 and its cluster label concentration 8/11; the group size of group S is 6 and its cluster label concentration 2/6; the group size of group T is 7 and its cluster label concentration 5/7. Therefore, group R and group T are determined as target groups, and group S is determined as a non-targeted group. The default label is added to the samples in group R and group T that do not have it, and deleted from the samples in group S that have it; the groups output by the current iterative processing are shown in Fig. 8. The number of default labels added by the current iterative processing is 2, the number of default labels deleted by the current iterative processing is 1, and the sum of the numbers of samples having the default label in the groups output by the previous iterative processing is 11 (the evaluated expression is not reproduced in this text). The convergence condition is thus met after the current iterative processing, so the iterative processing is stopped, and the label information of each sample in group R, group S and group T shown in Fig. 8 is taken as the processing result.
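The determination of target groups in this example can be checked with a short sketch. The sample identifiers are hypothetical, since Fig. 7 is not reproduced here; only the group sizes, the labeled counts and the preset condition (size greater than 6, concentration greater than 0.5) come from the text.

```python
from fractions import Fraction


def group_features(members, labeled):
    """Return (group size, cluster label concentration) for one group."""
    size = len(members)
    with_label = sum(1 for s in members if s in labeled)
    return size, Fraction(with_label, size)


def meets_preset_condition(size, concentration):
    # Preset condition from the example: size > 6 and concentration > 0.5
    return size > 6 and concentration > Fraction(1, 2)


# Hypothetical sample ids reproducing the counts given for groups R, S and T
groups = {
    "R": [f"R{i}" for i in range(11)],
    "S": [f"S{i}" for i in range(6)],
    "T": [f"T{i}" for i in range(7)],
}
labeled = (
    {f"R{i}" for i in range(8)}
    | {f"S{i}" for i in range(2)}
    | {f"T{i}" for i in range(5)}
)

features = {gid: group_features(members, labeled) for gid, members in groups.items()}
targets = {gid for gid, (size, conc) in features.items() if meets_preset_condition(size, conc)}
```

Group S fails on both criteria: its size of 6 is not greater than 6, and its concentration of 2/6 is below 0.5.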

In this specification embodiment, groups are characterized directly and the individual samples are then adjusted accordingly, so all calculation is performed at the group level, which reduces the computational complexity. Each iterative processing only needs to obtain the group characteristics of each group and does not need to calculate and update each sample, so the calculation cost is very low. The convergence condition is easily reached, and the processing can generally exit after four to five iterative processings. Compared with the existing LPA (Label Propagation Algorithm), the accuracy and recall rate of the obtained samples are higher. Starting from a single label or a few labels is supported, and the diffusion and purification of single labels, few labels and multiple labels can all be carried out synchronously.

In a second aspect, based on the same inventive concept, this specification embodiment provides a community division method, comprising:

Calculating the degree of each node of the relational network graph;

Wherein accessing each node of the relational network graph comprises:

Judging whether the current node has joined any group;

Adding the one or more expanding nodes to the new group.

The community division method provided by the second aspect of this specification embodiment is based on the classical theory of complex networks and performs community division quickly at low time complexity. For details, refer to the description of steps S401 to S405, which is not repeated here.
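A minimal sketch of this style of community division is given below. It follows the steps recited in this document (compute node degrees, access nodes in descending order of degree, create a new group centered on a node that has not joined any group, add its expanding nodes), but the function name, the undirected adjacency-dictionary input, the tie-breaking by node name, and the reading of "associated through N edges" as "reachable within at most N edges" are all assumptions.

```python
from collections import deque


def divide_communities(adjacency, n_edges=1):
    """Divide a graph into possibly overlapping groups.

    adjacency: dict mapping every node to the set of its neighbours (undirected)
    n_edges:   N, the maximum path length from a centre to its expanding nodes
    """
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    # Access nodes from the largest degree to the smallest (ties broken by name)
    order = sorted(adjacency, key=lambda n: (-degree[n], str(n)))
    groups, joined = [], set()
    for node in order:
        if node in joined:
            continue  # the current node already belongs to some group
        members = {node}  # new group centred on the current node
        frontier = deque([(node, 0)])
        while frontier:
            current, depth = frontier.popleft()
            if depth == n_edges:
                continue
            for neighbour in adjacency[current]:
                if neighbour not in members:
                    members.add(neighbour)  # expanding node joins the new group
                    frontier.append((neighbour, depth + 1))
        groups.append(members)
        joined |= members
    return groups


adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}
groups = divide_communities(adjacency)
```

In this example node "d" ends up in both groups, because nothing in this variant limits how many groups an expanding node may join.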

In a third aspect, based on the same inventive concept, this specification embodiment provides another community division method, comprising:

Calculating the degree of each node of the relational network graph;

Wherein accessing each node of the relational network graph comprises:

Judging whether the current node has joined any group;

Performing enter-group processing on each expanding node;

Wherein the enter-group processing comprises:

The community division method provided by the third aspect of this specification embodiment not only performs community division quickly at low time complexity based on the classical theory of complex networks, but also limits the number of groups that nodes in core positions can repeatedly join, so that such nodes do not frequently take part in the calculation of every group and the processing of subsequent steps becomes faster. For details, refer to the description of steps S601 to S606, which is not repeated here.
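The effect of limiting how many groups a node may join can be sketched as follows. The rule that an expanding node joins the new group only if the number of groups it has already joined is less than the first preset threshold comes from this document; the function name, the choice of N = 1 (direct neighbours only), the tie-breaking by node name, and counting a centre node's own group toward its total are assumptions.

```python
def divide_communities_limited(adjacency, first_threshold=2):
    """Community division in which each node joins fewer than a preset number of groups.

    adjacency:       dict mapping every node to the set of its neighbours
    first_threshold: the first preset threshold on groups joined per node
    """
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    order = sorted(adjacency, key=lambda n: (-degree[n], str(n)))
    joined_count = {node: 0 for node in adjacency}
    groups = []
    for node in order:
        if joined_count[node] > 0:
            continue  # the current node has already joined a group
        members = {node}
        joined_count[node] += 1
        for neighbour in sorted(adjacency[node], key=str):
            # Enter-group processing: admit the expanding node only while it
            # has joined fewer groups than the first preset threshold
            if joined_count[neighbour] < first_threshold:
                members.add(neighbour)
                joined_count[neighbour] += 1
        groups.append(members)
    return groups


adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}
strict = divide_communities_limited(adjacency, first_threshold=1)
loose = divide_communities_limited(adjacency, first_threshold=2)
```

With a threshold of 1 the groups are disjoint; with a threshold of 2 node "d" is allowed into a second group.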

In a fourth aspect, based on the same inventive concept, this specification embodiment provides a sample label processing device. Fig. 9 is a schematic structural diagram of the sample label processing device, which includes:

Sample set acquisition module 901, for obtaining a sample set, wherein some of the samples in the sample set have a default label;

Sample set division module 902, for dividing the sample set into H groups according to the association relationships between the samples in the sample set, where H is a positive integer;

Iterative processing module 903, for performing L iterative processings on the H groups until a convergence condition is met, and taking the label information of each sample after the last iterative processing as the processing result, wherein the label information of each sample characterizes whether that sample has the default label, and L is a positive integer;

Wherein the iterative processing module 903 includes:

Characteristic determination module 9031, for determining the group characteristics of each group according to the current label information of each sample;

Group determination module 9032, for determining a target group and a non-targeted group according to the group characteristics, wherein the target group is a group in which samples with the default label are aggregated, and the non-targeted group is any group among the one or more groups other than the target group;

Label adding module 9033, for adding the default label to the samples in the target group that do not have the default label;

Label deletion module 9034, for deleting the default label from the samples in the non-targeted group that have the default label.

In an optional implementation, the sample set division module 902 includes:

Network generation module, for generating a relational network graph with each single sample as a node according to the association relationships between the samples in the sample set;

Community division module, for performing community division on the relational network graph to obtain the H groups.

In an optional implementation, the community division module includes:

Wherein the access module includes:

In an optional implementation, the community division module includes:

Wherein the access module includes:

Enter-group processing module, for performing enter-group processing on each expanding node;

Wherein the enter-group processing module includes:

In an optional implementation, the relational network graph is a directed graph, the degree of each node of the relational network graph is the sum of its in-degree and out-degree, and the N edges point successively from the expanding node to the current node or from the current node to the expanding node.
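For a directed graph, the degree used to order the nodes can be computed as in the following sketch. The edge-list representation and the names `directed_degrees` and `access_order` are assumptions for illustration.

```python
from collections import Counter


def directed_degrees(edges):
    """Return the degree (in-degree plus out-degree) of every node in an edge list."""
    degree = Counter()
    for source, destination in edges:
        degree[source] += 1       # contributes to the out-degree of the source
        degree[destination] += 1  # contributes to the in-degree of the destination
    return dict(degree)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
degrees = directed_degrees(edges)
# Nodes ordered from the largest degree to the smallest (ties broken by name)
access_order = sorted(degrees, key=lambda n: (-degrees[n], n))
```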

In an optional implementation, the group characteristics include:

Group size and cluster label concentration; or

Cluster label concentration;

Wherein the group size is the number of all samples in the group corresponding to the group characteristics, and the cluster label concentration is the ratio of the number of samples having the default label in the group corresponding to the group characteristics to the number of all samples in that group.

In an optional implementation, the group determination module 9032 includes:

Third judgment module, for judging whether each of the group characteristics meets a preset condition;

Target group determination module, for determining the group corresponding to the group characteristics as the target group when the group characteristics meet the preset condition;

Non-targeted group determination module, for determining the group corresponding to the group characteristics as the non-targeted group when the group characteristics do not meet the preset condition.

In an optional implementation, the convergence condition includes:

L reaches a preset number; or

A relation among a, b, M and ε is satisfied (the formula is not reproduced in this text), where a is the number of default labels added by the current iterative processing, b is the number of default labels deleted by the current iterative processing, M is the sum, over all groups, of the number of samples having the default label before the current iterative processing, and ε is the second preset threshold.

In an optional implementation, H is a positive integer not less than 2.

In a fifth aspect, based on the same inventive concept, this specification embodiment provides a community division device, comprising:

Wherein the access module includes:

In a sixth aspect, based on the same inventive concept, this specification embodiment provides a community division device, comprising:

Wherein the access module includes:

Enter-group processing module, for performing enter-group processing on each expanding node;

Wherein the enter-group processing module includes:

In a seventh aspect, based on the same inventive concept as the sample label processing method and the community division methods in the foregoing embodiments, the present invention further provides a server. Referring to Fig. 10, the server includes a memory 1004, a processor 1002, and a computer program stored on the memory 1004 and executable on the processor 1002. When the processor 1002 executes the computer program, the steps of any one of the sample label processing method and the community division methods described above are implemented.

In Fig. 10, a bus architecture is represented by a bus 1000. The bus 1000 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors represented by the processor 1002 and memory represented by the memory 1004. The bus 1000 may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are all well known in the art and are therefore not described further herein. A bus interface 1005 provides an interface between the bus 1000 and a receiver 1001 and a transmitter 1003. The receiver 1001 and the transmitter 1003 may be the same element, namely a transceiver, which provides a unit for communicating with various other devices over a transmission medium. The processor 1002 is responsible for managing the bus 1000 and for general processing, and the memory 1004 may be used to store data used by the processor 1002 when performing operations.

In an eighth aspect, based on the same inventive concept as the sample label processing method and the community division methods in the foregoing embodiments, the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the sample label processing method and the community division methods described above are implemented.

This specification is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of this specification have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of this specification.

Obviously, those skilled in the art can make various modifications and variations to this specification without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of this specification and their technical equivalents, this specification is also intended to include them.

Claims

1. A sample label processing method, comprising:

Obtaining a sample set, wherein some of the samples in the sample set have a default label;

Dividing the sample set into H groups according to the association relationships between the samples in the sample set, where H is a positive integer;

Performing L iterative processings on the H groups until a convergence condition is met, and taking the label information of each sample after the last iterative processing as the processing result, wherein the label information of each sample characterizes whether that sample has the default label, and L is a positive integer;

Wherein each iterative processing comprises:

Determining a target group and a non-targeted group according to the group characteristics, wherein the target group is a group in which samples with the default label are aggregated, and the non-targeted group is any group among the one or more groups other than the target group;

2. The method according to claim 1, wherein dividing the sample set into H groups according to the association relationships between the samples in the sample set comprises:

Generating a relational network graph with each single sample as a node according to the association relationships between the samples in the sample set;

Performing community division on the relational network graph to obtain the H groups.

3. The method according to claim 2, wherein performing community division on the relational network graph to obtain the H groups comprises:

Calculating the degree of each node of the relational network graph;

Wherein accessing each node of the relational network graph comprises:

Judging whether the current node has joined any group;

Determining one or more expanding nodes according to the current node, wherein an expanding node is a node associated with the current node through N edges, and N is a positive integer;

Adding the one or more expanding nodes to the new group.

4. The method according to claim 2, wherein performing community division on the relational network graph to obtain the H groups comprises:

Calculating the degree of each node of the relational network graph;

Wherein accessing each node of the relational network graph comprises:

Judging whether the current node has joined any group;

Performing enter-group processing on each expanding node;

Wherein the enter-group processing comprises:

If the number of groups that the expanding node has joined is less than the first preset threshold, adding the expanding node to the new group.

5. The method according to claim 3 or 4, wherein the relational network graph is a directed graph, the degree of each node of the relational network graph is the sum of its in-degree and out-degree, and the N edges point successively from the expanding node to the current node or from the current node to the expanding node.

6. The method according to claim 1, wherein the group characteristics include:

Group size and cluster label concentration; or

Cluster label concentration;

Wherein the group size is the number of all samples in the group corresponding to the group characteristics, and the cluster label concentration is the ratio of the number of samples having the default label in the group corresponding to the group characteristics to the number of all samples in that group.

7. The method according to claim 1, wherein determining the target group and the non-targeted group according to the group characteristics comprises:

Judging whether each of the group characteristics meets a preset condition;

If the group characteristics meet the preset condition, determining the group corresponding to the group characteristics as the target group; otherwise, determining the group corresponding to the group characteristics as the non-targeted group.

8. The method according to claim 1, wherein the convergence condition includes:

L reaches a preset number; or

A relation among a, b, M and ε is satisfied (the formula is not reproduced in this text), where a is the number of default labels added by the current iterative processing, b is the number of default labels deleted by the current iterative processing, M is the sum, over all groups, of the number of samples having the default label before the current iterative processing, and ε is the second preset threshold.

9. The method according to claim 1, wherein H is a positive integer not less than 2.

10. A community division method, comprising:

Calculating the degree of each node of the relational network graph;

Wherein accessing each node of the relational network graph comprises:

Judging whether the current node has joined any group;

Adding the one or more expanding nodes to the new group.

11. A community division method, comprising:

Calculating the degree of each node of the relational network graph;

Wherein accessing each node of the relational network graph comprises:

Judging whether the current node has joined any group;

Performing enter-group processing on each expanding node;

Wherein the enter-group processing comprises:

12. A sample label processing device, comprising:

Sample set division module, for dividing the sample set into H groups according to the association relationships between the samples in the sample set, where H is a positive integer;

Iterative processing module, for performing L iterative processings on the H groups until a convergence condition is met, and taking the label information of each sample after the last iterative processing as the processing result, wherein the label information of each sample characterizes whether that sample has the default label, and L is a positive integer;

Wherein the iterative processing module includes:

Group determination module, for determining a target group and a non-targeted group according to the group characteristics, wherein the target group is a group in which samples with the default label are aggregated, and the non-targeted group is any group among the one or more groups other than the target group;

Label adding module, for adding the default label to the samples in the target group that do not have the default label;

Label deletion module, for deleting the default label from the samples in the non-targeted group that have the default label.

13. The device according to claim 12, wherein the sample set division module includes:

Network generation module, for generating a relational network graph with each single sample as a node according to the association relationships between the samples in the sample set;

14. The device according to claim 13, wherein the community division module includes:

Access module, for accessing each node of the relational network graph in turn in descending order of node degree;

Wherein the access module includes:

New group generation module, for generating a new group centered on the current node when the current node has not joined any group;

Expanding node determination module, for determining one or more expanding nodes according to the current node, wherein an expanding node is a node associated with the current node through N edges, and N is a positive integer;

15. The device according to claim 13, wherein the community division module includes:

Wherein the access module includes:

Enter-group processing module, for performing enter-group processing on each expanding node;

Wherein the enter-group processing module includes:

Second judgment module, for judging whether the number of groups that the expanding node has joined is less than the first preset threshold;

Second adding module, for adding the expanding node to the new group when the number of groups that the expanding node has joined is less than the first preset threshold.

16. The device according to claim 14 or 15, wherein the relational network graph is a directed graph, the degree of each node of the relational network graph is the sum of its in-degree and out-degree, and the N edges point successively from the expanding node to the current node or from the current node to the expanding node.

17. The device according to claim 12, wherein the group characteristics include:

Group size and cluster label concentration; or

Cluster label concentration;

18. The device according to claim 12, wherein the group determination module includes:

Target group determination module, for determining the group corresponding to the group characteristics as the target group when the group characteristics meet the preset condition;

Non-targeted group determination module, for determining the group corresponding to the group characteristics as the non-targeted group when the group characteristics do not meet the preset condition.

19. The device according to claim 12, wherein the convergence condition includes:

L reaches a preset number; or

20. The device according to claim 12, wherein H is a positive integer not less than 2.

21. A community division device, comprising:

Network generation module, for generating a relational network graph with each single sample as a node according to the association relationships between the samples in a sample set;

Wherein the access module includes:

22. A community division device, comprising:

Wherein the access module includes:

Enter-group processing module, for performing enter-group processing on each expanding node;

Wherein the enter-group processing module includes:

23. A server, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.

24. A computer-readable storage medium on which a computer program is stored, wherein the steps of the method of any one of claims 1 to 11 are implemented when the computer program is executed by a processor.