Detailed Description of Embodiments
The embodiments of this specification propose a new data mining method based on association relationships. All entities are divided into several connected subsets, such that each connected subset includes the entities that have an association relationship with each of its member entities. All connected subsets that contain at least two member entities of the mining target type are then used as the data source for data mining. This is equivalent to deleting the connected subsets that contain no member entity, or only one member entity, of the mining target type, which reduces the amount of data to be processed and increases the speed of data mining. Because all association relationships among member entities of the mining target type are retained, the effect of data mining is essentially unaffected.
The embodiments of this specification can run on any device with computing and storage capabilities, such as a mobile phone, tablet computer, PC (Personal Computer), notebook, or server. The functions in the embodiments of this specification can also be implemented by logical nodes running on two or more devices.
In the embodiments of this specification, association relationships between entities can be extracted from the data source used for data mining. The data source may be records of various network activities. A network activity may involve users, for example a user initiating a request through an account, a server responding to a user request, or user A purchasing goods from user B. It may also involve only non-user nodes in the network, for example a business server requesting data from a database server. An entity may be a participant in a network activity, or some or all of the resources required to carry out the network activity. Participants in a network activity may be user accounts, servers providing a particular network service, and so on. The resources used may be an identifier of the user device (that is, a unique device identifier, such as the Device-ID of an Android device or the unique device identifier of an Apple device), the IMEI (International Mobile Equipment Identity) of the user device, the identifier of the WiFi (Wireless Fidelity) network accessed by the user device, the user's mobile terminal number, the MAC (Media Access Control) address of the user device or of the device running the server, the IP address of the user device or of the device running the server, and so on. In some specific business processes, the resources may also be the user's ID card number, bank card number, and the like.
Because the participants in network activities and the resources used to carry them out are diverse, the entities involved in network activities usually have different types. How entity types are divided can be determined in a practical application scenario according to the influence of different entities on the data mining result; the embodiments of this specification do not limit this. For example, in a first application scenario, the number of devices used by an account has some influence on the mining result, so accounts can be treated as one entity type and the devices used by users as another. In a second application scenario, the same network activity has a different influence on the mining result depending on whether it is performed with a personal account or a collective account, so personal accounts can be treated as one entity type and collective accounts as another. In a third application scenario, the network activities recorded in the data source can be carried out without logging in, and whether the activities are performed by the same account or the same device essentially does not affect the mining result, so accounts and the devices used by users can be treated as a single entity type.
A network activity usually requires the participation of multiple entities, and a specific network activity establishes association relationships among the entities it involves. For example, if user A purchases goods from user B using mobile phone C, this purchase can establish an association relationship between each pair of the three entities user A, mobile phone C, and user B.
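The pairwise associations in the example above can be sketched as follows. This is a minimal illustration, assuming one activity record is simply a list of the entities it involves; the function name `associations_from_activity` and the entity labels are hypothetical and do not come from the specification.

```python
from itertools import combinations

def associations_from_activity(entities):
    """Return every unordered pair of distinct entities involved in one activity."""
    unique = sorted(set(entities))  # drop repeats so no entity is paired with itself
    return list(combinations(unique, 2))

# User A buys goods from user B using mobile phone C (labels are illustrative).
pairs = associations_from_activity(["user_A", "phone_C", "user_B"])
for left, right in pairs:
    print(left, "<->", right)
```

Whether every pair of entities in an activity should be associated is a design choice that depends on the application scenario; the sketch takes the all-pairs reading of the example.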
In most practical application scenarios, data mining is carried out mainly on entities of one or more specific types. In other words, the association relationships among entities of these specific types are the focus of data mining, whereas the association relationships between entities of the specific types and entities of other types, and among entities of other types, have a fairly limited influence on the data mining result. In the embodiments of this specification, these specific types are called mining target types. In a practical application scenario, which entity types serve as mining target types can be determined according to factors such as the specific requirements of the data mining, the division of entity types, and the influence of entities of different types on the mining result; this is not limited. For example, in an application scenario of identifying black-market gangs, the account is usually the mining target type. In an application scenario of predicting user consumption behavior from client devices, both the mobile phone and the tablet computer entity types can serve as mining target types.
In the embodiments of this specification, the flow of the data mining method based on association relationships is shown in Fig. 1.
In the embodiments of this specification, the data source used for data mining includes entities of at least two types, at least one of which is a mining target type. Based on the network activities recorded in the data source, association relationships are established among several entities of the same type or of different types.
It should be noted that, according to the characteristics of the practical application scenario and the data mining requirements, it is possible to select which participants in the network activities in the data source, and/or which resources used when carrying out the network activities, are treated as entities, and to determine which features a performed network activity must have for an association relationship to be established among these entities. This is not limited.
Step 110: according to the association relationships between entities, divide all entities into several connected subsets. Each connected subset includes at least one member entity, and a connected subset includes all entities that have an association relationship with each of its member entities.
From the network activities recorded in the data source, all entities involved in these network activities and the association relationships formed among them can be obtained. All entities are divided into several connected subsets according to the association relationships, so that all entities with association relationships become member entities of the same connected subset, and the member entities of one connected subset have no association relationship with the member entities of any other connected subset. That is, every entity associated with a member entity of a connected subset is itself a member entity of that connected subset. In this way, the member entities of each connected subset are connected to one another directly or indirectly through association relationships, while there is no association relationship between member entities of different connected subsets.
The embodiments of this specification do not limit the specific manner in which connected subsets are divided according to the association relationships between entities. Examples are given below.
In one implementation, a graph can be constructed with entities as nodes and association relationships as edges. Because the entities in the embodiments of this specification have at least two types, the constructed graph is a heterogeneous graph. Each maximal connected subgraph of the heterogeneous graph is found; each maximal connected subgraph corresponds to one connected subset, and all nodes of a maximal connected subgraph are all member entities of the corresponding connected subset. All nodes in the heterogeneous graph constitute the set of all entities, and obtaining the maximal connected subgraphs is the process of placing entities that have association relationships into the same connected subset. Therefore, each node in a maximal connected subgraph is a member entity of the corresponding connected subset, and the edges of the maximal connected subgraph correspond to the association relationships among the member entities of that connected subset.
The embodiments of this specification likewise do not limit the specific manner of generating the maximal connected subgraphs. For example, various existing connectivity algorithms can be used to obtain the maximal connected subgraphs of the heterogeneous graph.
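As one example of an existing connectivity algorithm, a breadth-first search can enumerate the maximal connected subgraphs. The sketch below is an assumption about how this might look, not an algorithm named in the specification; the function name `connected_components` and the `(type, id)` node encoding are hypothetical.

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Return the node sets of all maximal connected subgraphs via breadth-first search."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Nodes carry their entity type so that the graph is heterogeneous.
nodes = [("account", "u1"), ("account", "u2"), ("account", "u3"),
         ("device", "d1"), ("device", "d2")]
edges = [(("account", "u1"), ("device", "d1")),
         (("account", "u2"), ("device", "d1")),
         (("account", "u3"), ("device", "d2"))]
components = connected_components(nodes, edges)
```

Here accounts u1 and u2 share device d1 and fall into one connected subset, while u3 and d2 form another.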
As another example, the maximal connected subgraphs can be generated as follows. A new set is generated with the two endpoints of an edge in the heterogeneous graph as its elements. If at least one of the two endpoints is an element of some existing set, that existing set is merged into the new set (the existing set ceases to exist once it has been merged into the new set). After all existing sets have been traversed, the new set is added to the existing sets. After all edges in the heterogeneous graph have been traversed, all elements of each resulting existing set are taken as all nodes of one maximal connected subgraph. Specifically, take an edge in the heterogeneous graph and let its two endpoints be node a and node b; generate a new set T_ab with node a and node b as its elements. Examine the existing sets one by one: if node a is an element of some existing set P, merge P into the new set T_ab; if node b is an element of some existing set Q, merge Q into the new set T_ab. After all existing sets have been traversed, treat the new set T_ab as an existing set. Repeat this process for all edges in the heterogeneous graph; each resulting existing set corresponds to one maximal connected subgraph.
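The set-merging procedure above can be sketched in Python as follows. The function name `merge_edges_into_sets` and the sample edges are illustrative assumptions. Note that in this sketch a node that lies on no edge never appears in any set, so isolated entities would have to be added separately as single-member subsets.

```python
def merge_edges_into_sets(edges):
    """Group edge endpoints into node sets, one per maximal connected subgraph."""
    existing_sets = []
    for a, b in edges:
        new_set = {a, b}  # the new set T_ab for this edge
        remaining = []
        for existing in existing_sets:
            if a in existing or b in existing:
                new_set |= existing  # merged in; the old set ceases to exist
            else:
                remaining.append(existing)
        remaining.append(new_set)  # the new set becomes an existing set
        existing_sets = remaining
    return existing_sets

# The third edge joins the two sets created by the first two edges.
result = merge_edges_into_sets([("a", "b"), ("c", "d"), ("b", "c"), ("e", "f")])
print(sorted(sorted(s) for s in result))
```

Because the existing sets stay pairwise disjoint, each edge can merge at most two of them, but every edge still scans all existing sets.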
Step 120: perform data mining using the connected subsets that contain at least two member entities of the mining target type.
After all connected subsets have been obtained, the number of member entities of the mining target type in each connected subset is counted. If a connected subset contains no member entity of the mining target type, or only one such member entity, that connected subset is not used in data mining. In other words, all connected subsets that contain at least two member entities of the mining target type are used as the data source for data mining.
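The counting and filtering in step 120 can be sketched as follows. This is a minimal illustration under assumed data structures: `entity_type` is a hypothetical mapping from each entity to its type, and the `min_targets` parameter is added only for readability (the specification fixes the threshold at two).

```python
def select_subsets_for_mining(connected_subsets, entity_type, target_types, min_targets=2):
    """Keep only subsets with at least `min_targets` members of a mining target type."""
    selected = []
    for subset in connected_subsets:
        target_count = sum(1 for entity in subset if entity_type[entity] in target_types)
        if target_count >= min_targets:
            selected.append(subset)
    return selected

entity_type = {
    "acct1": "account", "acct2": "account", "acct3": "account",
    "dev1": "device", "dev2": "device", "dev3": "device", "dev4": "device",
}
connected_subsets = [
    {"acct1", "dev1", "dev2"},   # one account: dropped
    {"acct2", "acct3", "dev3"},  # two accounts: kept
    {"dev4"},                    # no account: dropped
]
selected = select_subsets_for_mining(connected_subsets, entity_type, {"account"})
```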
When a connected subset contains only one entity of the mining target type, what it reflects is the association relationships between that entity and entities of non-target types, and among entities of non-target types; it cannot reflect association relationships among entities of the mining target type. When a connected subset contains no entity of the mining target type, what it reflects is the association relationships among entities of non-target types, and again it cannot reflect association relationships among entities of the mining target type. Because the association relationships between entities of the mining target type and entities of non-target types, and among entities of non-target types, have a fairly limited influence on the data mining result, performing data mining after deleting these two kinds of connected subsets from the data source reduces the amount of data to be processed and speeds up mining, while essentially not affecting the mining result.
The embodiments of this specification do not limit the specific manner of data mining, the algorithms used in data mining, and so on. For example, features can first be extracted from the connected subsets that contain at least two member entities of the mining target type, and the extracted features can then be used as the input of a machine learning model for data mining. Alternatively, the connected subsets that contain at least two member entities of the mining target type can be used directly as the input of a machine learning model for data mining.
As another example, in the aforementioned implementation that divides connected subsets based on maximal connected subgraphs, a graph algorithm can be used to extract network structure features from the maximal connected subgraphs that contain at least two nodes of the mining target type, and the extracted network structure features can then be used for further data mining.
It can be seen that, in the embodiments of this specification, all entities are divided into several connected subsets, each connected subset includes the entities that have an association relationship with each of its member entities, and among all connected subsets, those containing at least two member entities of the mining target type are used as the data source for data mining. This is equivalent to deleting the connected subsets that contain no member entity, or only one member entity, of the mining target type. It reduces the amount of data to be processed, speeds up data mining, and improves mining efficiency, while its influence on the data mining result is almost negligible.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In an application example of this specification, a third-party shopping platform offers users a return-shipping insurance service. After a user has purchased return-shipping insurance, the user can receive a certain amount of compensation for the return shipping cost when purchased goods are returned. To prevent black-market gangs from using return-shipping insurance to commit large-scale insurance fraud, fraudulent accounts operating as a gang need to be discovered in time.
Because the number of user devices a black-market gang uses is usually limited, it is inevitable that multiple accounts will use the same user device in the course of the fraud. Therefore, when performing data mining to discover black-market gangs, all login behavior records of accounts on user devices within a predetermined time period are used as the data source, and accounts and user devices are used as the two entity types. Because the purpose of the data mining is to discover the accounts of black-market gangs through the abnormal situation of multiple accounts logging in on the same user device, the account is used as the mining target type.
Each account in the login behavior records of the data source is taken as a node, each user device is taken as a node, and each login performed by an account on a user device (that is, the association relationship between an account node and a user device node) is taken as an edge, generating a heterogeneous graph with two types of nodes.
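The graph construction just described can be sketched as follows, assuming each login behavior record reduces to an `(account_id, device_id)` pair. The record format, the function name `build_login_graph`, and the sample identifiers are hypothetical.

```python
def build_login_graph(login_records):
    """Build a two-type heterogeneous graph from (account_id, device_id) login records."""
    nodes = set()
    edges = set()
    for account_id, device_id in login_records:
        # Tag each node with its type so an account and a device that happen
        # to share an identifier string remain distinct nodes.
        account_node = ("account", account_id)
        device_node = ("device", device_id)
        nodes.update((account_node, device_node))
        edges.add((account_node, device_node))  # repeated logins collapse to one edge
    return nodes, edges

records = [("u1", "d1"), ("u2", "d1"), ("u1", "d1"), ("u3", "d2")]
nodes, edges = build_login_graph(records)
```

Every edge joins an account node to a device node, so the resulting graph is bipartite.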
A new set is generated with the two endpoints of an edge in the heterogeneous graph as its elements. If at least one of the two endpoints is an element of some existing set, that existing set is merged into the new set. After all existing sets have been traversed, the new set is treated as an existing set. After all edges in the heterogeneous graph have been traversed, all elements of each resulting existing set are taken as all nodes of one maximal connected subgraph. In a specific example, one possible processing sequence is as follows:
Take the first edge from the heterogeneous graph and let its two endpoints be node a and node b; generate a new set T_ab. Because no existing set exists yet, the new set T_ab is added to the existing sets; after the addition, the existing sets consist of T_ab;
Take the second edge from the heterogeneous graph and let its two endpoints be node c and node d; generate a new set T_cd. Examine the existing sets one by one: if the elements of an existing set include node c, merge that existing set into the new set T_cd; if the elements of an existing set include node d, merge that existing set into the new set T_cd. Because neither node c nor node d is an element of the existing set T_ab, no merging is needed. The new set is added to the existing sets; after the addition, the existing sets are T_ab and T_cd;
Take the third edge from the heterogeneous graph and let its two endpoints be node a and node e; generate a new set T_ae. Examine the existing sets one by one: if the elements of an existing set include node a, merge that existing set into the new set T_ae; if the elements of an existing set include node e, merge that existing set into the new set T_ae. Because node a is an element of the existing set T_ab, T_ab is merged into the new set T_ae; after merging, T_ae has three elements: node a, node b, and node e. The new set T_ae is added to the existing sets; after the addition, the existing sets are T_ae and T_cd, and the former existing set T_ab no longer exists because it has been merged into T_ae.
After this process has been repeated for each remaining edge in the heterogeneous graph, each resulting existing set corresponds to one maximal connected subgraph, and all elements of each existing set are all nodes of the corresponding maximal connected subgraph.
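The three edges walked through above can be traced with a short sketch that records the existing sets after each edge. The function name `trace_set_merging` is hypothetical, and the sketch stops after the third edge because the remaining edges of the example are not given.

```python
def trace_set_merging(edges):
    """Yield a snapshot of the existing sets after each edge has been processed."""
    existing_sets = []
    for a, b in edges:
        new_set = {a, b}
        kept = []
        for existing in existing_sets:
            if a in existing or b in existing:
                new_set |= existing  # absorbed into the new set
            else:
                kept.append(existing)
        existing_sets = kept + [new_set]
        yield [sorted(s) for s in existing_sets]

# Edges (a, b), (c, d), (a, e) as in the worked example.
snapshots = list(trace_set_merging([("a", "b"), ("c", "d"), ("a", "e")]))
for step, snapshot in enumerate(snapshots, start=1):
    print(f"after edge {step}: {snapshot}")
```

The final snapshot holds the sets corresponding to T_cd and T_ae, with T_ab absorbed into T_ae, matching the description.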
After all maximal connected subgraphs have been obtained, the number of account-type nodes (that is, nodes of the mining target type) in each maximal connected subgraph is counted, and the maximal connected subgraphs that include only one account-type node are deleted. In this application example, a login must be performed with an account, and each login behavior record establishes an association relationship between at least one account and one user device, so there is no maximal connected subgraph that includes only user-device-type nodes and no account-type node.
Assume that two of the maximal connected subgraphs are as shown in Fig. 2 and Fig. 3 respectively, where a dot represents an account-type node and a rectangle represents a user-device-type node. The maximal connected subgraph shown in Fig. 2 has 4 user-device-type nodes and 1 account-type node, and the maximal connected subgraph shown in Fig. 3 has 2 user-device-type nodes and 2 account-type nodes. The maximal connected subgraph in Fig. 2 is therefore deleted, and the maximal connected subgraph in Fig. 3 is retained.
When performing data mining to discover black-market gangs, all maximal connected subgraphs that have not been deleted are used as the data source.
The inventors of the present application found in testing that, for a heterogeneous graph with hundreds of millions of nodes, network structure feature extraction with the Node2Vec (node vector modeling) algorithm took up to 45 hours when the data source included all login behavior records. When the maximal connected subgraphs containing at least two account-type nodes were used as the data source, the same extraction under the same parameters took only 8 hours, a very significant improvement.
Corresponding to the above flow, the embodiments of this specification further provide a data mining apparatus based on association relationships. The apparatus can be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, the apparatus in the logical sense is formed by the CPU (Central Processing Unit) of the device on which it resides reading the corresponding computer program instructions into memory and running them. At the hardware level, in addition to the CPU, memory, and storage shown in Fig. 4, the device on which the data mining apparatus based on association relationships resides typically also includes other hardware such as chips for transmitting and receiving wireless signals, and/or other hardware such as boards for implementing network communication functions.
Fig. 5 shows a data mining apparatus based on association relationships provided by an embodiment of this specification. The association relationships are established among several entities; the entities include at least two types, at least one of which is a mining target type. The apparatus includes a connected subset unit and a mining execution unit. The connected subset unit is configured to divide all entities into several connected subsets according to the association relationships between entities; each connected subset includes at least one member entity, and a connected subset includes all entities that have an association relationship with each of its member entities. The mining execution unit is configured to perform data mining using the connected subsets that contain at least two member entities of the mining target type.
In one example, the connected subset unit is specifically configured to: construct a heterogeneous graph with entities as nodes and association relationships as edges, generate several maximal connected subgraphs of the heterogeneous graph, and take all nodes of each maximal connected subgraph as the member entities of one connected subset.
In the above example, the connected subset unit generating several maximal connected subgraphs of the heterogeneous graph includes: generating a new set with the two endpoints of an edge in the heterogeneous graph as its elements; if at least one of the two endpoints is an element of some existing set, merging that existing set into the new set; after all existing sets have been traversed, adding the new set to the existing sets; and, after all edges in the heterogeneous graph have been traversed, taking all elements of each resulting existing set as all nodes of one maximal connected subgraph.
Optionally, the mining execution unit is specifically configured to: use a graph algorithm to extract network structure features from the maximal connected subgraphs that contain at least two nodes of the mining target type.
Optionally, the mining execution unit is specifically configured to: extract features from the connected subsets that contain at least two member entities of the mining target type; or use the connected subsets that contain at least two member entities of the mining target type as the input of a machine learning model.
Optionally, the entity types include: account and user device; the association relationship includes: an account performing a login on a user device; and the mining target type includes: account.
The embodiments of this specification provide a computer device that includes a memory and a processor. The memory stores a computer program that can be run by the processor; when running the stored computer program, the processor performs the steps of the data mining method based on association relationships in the embodiments of this specification. For a detailed description of the steps of the data mining method based on association relationships, refer to the foregoing content, which is not repeated here.
The embodiments of this specification provide a computer-readable storage medium on which computer programs are stored. When run by a processor, these computer programs perform the steps of the data mining method based on association relationships in the embodiments of this specification. For a detailed description of the steps of the data mining method based on association relationships, refer to the foregoing content, which is not repeated here.
The foregoing describes only preferred embodiments of this specification and is not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
Those skilled in the art will understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.