CN106777339A

CN106777339A - A method for identifying authors based on a heterogeneous network embedding model

Info

Publication number: CN106777339A
Application number: CN201710025800.2A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2017-01-13
Filing date: 2017-01-13
Publication date: 2017-05-31

Abstract

The present invention proposes a method for identifying authors based on a heterogeneous network embedding model. Its main components are: node embedding, a heterogeneous network embedding model, shared embedding, joint training, and identifying the author of an input paper. The process is as follows: an anonymous paper is first input and analyzed to determine its key information, and a feature representation is constructed. A task-guided and path-augmented heterogeneous network embedding model, comprising a task-specific embedding submodel and a path-augmented general embedding submodel, produces a joint objective on which joint training is performed. Finally, the possible authors are ranked and the top-ranked author is output as the true author. The invention overcomes the limitations of traditional embedding methods, which use general embeddings that are independent of the specific task, ignore network heterogeneity, and can only handle homogeneous networks. By using task-guided, path-augmented network embedding, it can identify true authors more accurately and efficiently than existing methods.

Description

A method for identifying authors based on a heterogeneous network embedding model

Technical field

The present invention relates to the field of author identification, and more particularly to a method for identifying authors based on a heterogeneous network embedding model.

Background art

Author identification is commonly used to automatically identify the possible authors of an anonymous paper or the source of an anonymous comment. It also has broad implications for general information retrieval and recommender systems, for example in settings where a queried document must be matched to known targets. Existing methods in this field mostly rely on problem-specific feature engineering, which is time-consuming and not transferable. Methods based on network embedding, in turn, face two main limitations: first, they use general-purpose embedding methods that are independent of the specific task; second, they can only handle homogeneous networks and ignore network heterogeneity. The technical means currently available for author identification therefore remain inadequate.

The present invention proposes a method for identifying authors based on a heterogeneous network embedding model, creating a task-guided and path-augmented heterogeneous network embedding model. Nodes are first embedded as vectors in a latent feature space; the embeddings are then shared and jointly trained according to a task-specific objective and a network-general objective. The author identification task provides explicit guidance that influences the network embedding through joint learning, and it provides implicit guidance for selecting the meta paths on which the network embedding is performed, so that the combined model outputs the true author. The invention overcomes the limitations of traditional embedding methods, which use general embeddings that are independent of the specific task, ignore network heterogeneity, and can only handle homogeneous networks. By using task-guided, path-augmented network embedding, it can identify true authors more accurately and efficiently than existing methods.

Summary of the invention

To address problems of existing methods such as being time-consuming and not transferable, the object of the present invention is to provide a method for identifying authors based on a heterogeneous network embedding model. It uses a task-guided and path-augmented heterogeneous network embedding model, comprising a task-specific embedding submodel and a path-augmented general embedding submodel, to rank the possible authors, so that true authors are identified more accurately and efficiently.

To solve the above problems, the present invention provides a method for identifying authors based on a heterogeneous network embedding model, which mainly comprises:

(1) node embedding;

(2) a heterogeneous network embedding model;

(3) shared embedding;

(4) joint training;

(5) identifying the author of an input paper.

In the node embedding, each node has a node type; node types include keywords, references, venues, time, and so on; and each node is embedded as a vector in a latent feature space.

The heterogeneous network embedding model consists of two major parts: a task-specific embedding for author identification and a path-augmented general network embedding. The two are combined into a unified framework, and meta paths are selected according to the author identification task.

Further, in the task-specific embedding for author identification, the model ranks possible authors according to the information of a given paper (such as keywords, references, and venue). Based on node embeddings, the model builds the feature representation of an anonymous paper p step by step, and the aggregated paper representation is finally used to score possible authors.

Further, in constructing the feature representation, the aggregation that builds the feature representation of paper p from node embeddings comprises two stages:

In the first stage, a feature vector is built for each t-th node type by averaging the embeddings of the nodes in X_p^{(t)}, the set of nodes of the t-th type associated with paper p, i.e.:

$$V_p^{(t)} = \sum_{n \in X_p^{(t)}} u_n \,/\, |X_p^{(t)}| \tag{1}$$

where V_p^{(t)} is the feature representation of the t-th node type (e.g., the keyword node type) and u_n is the embedding of the n-th node (e.g., a keyword node);

In the second stage, the feature vector of paper p is built as a weighted combination of the different node types:

$$V_p = \sum_{t} \omega_t V_p^{(t)} \tag{2}$$

The anonymous paper p is represented by this feature vector V_p, which can be used to score possible authors (embedding vectors) by computing a dot product. The score of a paper-author pair is defined as:

$$f(p, a) = u_a^T V_p = u_a^T \Big( \sum_{t} \omega_t V_p^{(t)} \Big) = u_a^T \Big( \sum_{t} \omega_t \sum_{n \in X_p^{(t)}} u_n \,/\, |X_p^{(t)}| \Big) \tag{3}$$
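The two-stage aggregation and the dot-product score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the dictionary layout, and the toy two-dimensional embeddings are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Stage 1: average the embeddings of one node type."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def paper_vector(nodes_by_type: Dict[str, List[str]],
                 embeddings: Dict[str, Vector],
                 type_weights: Dict[str, float]) -> Vector:
    """Stage 2: weighted sum of the per-type mean vectors."""
    dim = len(next(iter(embeddings.values())))
    v_p = [0.0] * dim
    for node_type, node_ids in nodes_by_type.items():
        if not node_ids:
            continue  # a type with no nodes contributes nothing
        v_t = mean_vector([embeddings[n] for n in node_ids])
        w_t = type_weights[node_type]
        v_p = [acc + w_t * x for acc, x in zip(v_p, v_t)]
    return v_p


def score(author_vec: Vector, v_p: Vector) -> float:
    """Dot product between an author embedding and the paper vector."""
    return sum(a * b for a, b in zip(author_vec, v_p))


# Hypothetical toy data: node ids, embeddings and weights are made up.
embeddings = {
    "kw:graph": [1.0, 0.0],
    "kw:embedding": [0.0, 1.0],
    "venue:X": [2.0, 2.0],
    "author:A": [1.0, 1.0],
}
nodes_by_type = {"keyword": ["kw:graph", "kw:embedding"], "venue": ["venue:X"]}
type_weights = {"keyword": 1.0, "venue": 0.5}

v_p = paper_vector(nodes_by_type, embeddings, type_weights)
print(v_p, score(embeddings["author:A"], v_p))
```

In this toy case the keyword mean is [0.5, 0.5] and the weighted venue vector is [1.0, 1.0], so the paper vector is [1.5, 1.5] and the score for author A is 3.0.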

To learn the parameters U and ω, stochastic gradient descent (SGD) is used with a ranking objective based on the hinge loss. For each triple (p, a, a′), where a is one of the true authors of paper p and a′ is not an author of paper p, the hinge loss is defined as:

$$\max(0, f(p, a') - f(p, a) + \zeta) \tag{4}$$

where ζ is a positive number, commonly called the margin. A loss penalty is incurred if the score f(p, a) of the positive pair does not exceed f(p, a′) by at least ζ. To draw the triples (p, a, a′) used in SGD, a paper p is randomly selected from X_p together with one of its authors a from A_p, and a negative sample a′ is then drawn from a predefined noise distribution.
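The hinge loss and the triple sampling step can be sketched as follows. The text does not specify the noise distribution, so this sketch assumes a uniform distribution over authors who did not write the paper; the names `hinge_loss`, `sample_triple`, and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def hinge_loss(score_pos: float, score_neg: float, margin: float) -> float:
    """Zero once the true author outscores the negative by the margin."""
    return max(0.0, score_neg - score_pos + margin)


def sample_triple(authors_of: Dict[str, List[str]],
                  all_authors: List[str],
                  rng: random.Random) -> Tuple[str, str, str]:
    """Draw (p, a, a') with a a true author of p and a' a non-author."""
    paper = rng.choice(sorted(authors_of))
    author = rng.choice(authors_of[paper])
    # Assumption: uniform noise distribution over non-authors of the paper.
    candidates = [a for a in all_authors if a not in authors_of[paper]]
    negative = rng.choice(candidates)
    return paper, author, negative


# Hypothetical toy data.
authors_of = {"p1": ["alice", "bob"], "p2": ["carol"]}
all_authors = ["alice", "bob", "carol", "dave"]

paper, author, negative = sample_triple(authors_of, all_authors, random.Random(0))
print(paper, author, negative)
print(hinge_loss(score_pos=2.0, score_neg=1.5, margin=1.0))
```

With a margin of 1.0, a positive score of 2.0 against a negative score of 1.5 still incurs a loss of 0.5, because the gap between the two scores is smaller than the margin.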

In the path-augmented general network embedding, the model extends existing network embedding techniques to incorporate different meta paths, in order to exploit the rich information in the heterogeneous network.

Instead of using the original adjacency matrices {E^(l)}, where l is an original link type or one-hop meta path (such as author → writes → paper), the path-augmented network embedding considers more diverse meta paths (such as author → writes → paper → contains → keyword) and uses the meta-path-augmented adjacency matrices {M^(r)} for network embedding, where each M^(r) represents the network connectivity under a specific meta path r. Each M^(r) is normalized so that the learned embeddings are not dominated by a few meta paths with large raw weights. Because there are infinitely many possible meta paths (including the original link types), a limited number of useful meta paths must be selected for network embedding.
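As an illustration of a path-augmented adjacency matrix, the sketch below builds M^(r) for the meta path author → paper → keyword. Two points are assumptions rather than statements from the text: M^(r) is computed as the product of the one-hop adjacency matrices along the path (a common path-count construction), and normalization divides by the total weight so that the entries sum to 1. All names and matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def meta_path_matrix(link_matrices: List[Matrix]) -> Matrix:
    """Assumption: chain the one-hop matrices along the meta path."""
    result = link_matrices[0]
    for m in link_matrices[1:]:
        result = matmul(result, m)
    return result


def normalize_total(m: Matrix) -> Matrix:
    """Assumption: scale so that all entries sum to 1."""
    total = sum(sum(row) for row in m)
    if total == 0:
        return [row[:] for row in m]
    return [[x / total for x in row] for row in m]


# Hypothetical toy network: 2 authors, 2 papers, 3 keywords.
E_author_paper = [[1.0, 1.0],
                  [0.0, 1.0]]
E_paper_keyword = [[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]]

M_raw = meta_path_matrix([E_author_paper, E_paper_keyword])
M = normalize_total(M_raw)
print(M_raw)
print(M)
```

After normalization, every meta path contributes the same total weight, which is one way to keep a dense path such as author → paper → keyword from overwhelming a sparse one.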

To learn embeddings that preserve the proximity between nodes induced by meta paths, a neighbor prediction framework is followed and the conditional neighbor distribution of each node is modeled. In a heterogeneous network, paths of many types can start from a node i, so the neighbor distribution of a node is conditioned jointly on the node i and the given path type r, and is defined as:

$$P(j \mid i; r) = \frac{\exp(u_i^T u_j)}{\sum_{j' \in DST(r)} \exp(u_i^T u_{j'})} \tag{5}$$

where u_i is the embedding of node i and DST(r) denotes the set of all possible nodes at the destination end of path r.
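The neighbor distribution is a softmax over the destination nodes of the path. A minimal sketch, with hypothetical names and toy embeddings, and with the usual max-subtraction added for numerical stability (which does not change the result):

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def neighbor_distribution(u_i: Vector, dst: Dict[str, Vector]) -> Dict[str, float]:
    """Softmax of u_i . u_j over all destination nodes j of the path."""
    logits = {j: dot(u_i, u_j) for j, u_j in dst.items()}
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {j: math.exp(v - shift) for j, v in logits.items()}
    z = sum(exps.values())
    return {j: e / z for j, e in exps.items()}


# Hypothetical toy data: one author and two keyword destinations.
u_author = [1.0, 0.0]
keywords = {"graph": [2.0, 0.0], "vision": [0.0, 2.0]}

probs = neighbor_distribution(u_author, keywords)
print(probs)
```

Computing the denominator exactly requires a pass over every node in DST(r), which is why training in practice relies on sampled negative nodes instead of the full sum.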

To learn the parameters U and b, stochastic gradient descent (SGD) is used with the objective of maximizing the likelihood. The training procedure is as follows: a path r is sampled first; a link (i, j) is then sampled at random according to its weight in M^(r); the set of negative nodes {j′} used in Equation 6 is likewise sampled from a predefined noise distribution, for example the "smoothed" node degree distribution under the specific edge type; finally, the parameters U and b are updated according to their gradients so that the approximate sample log-likelihood is maximized.
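The sampling part of this procedure can be sketched as below. Several details are assumptions made for illustration only: paths are sampled uniformly, the "smoothed" degree distribution raises degrees to the power 0.75 (a convention borrowed from word embedding models, not stated in the text), and the data layout and function names are hypothetical. The gradient update itself is not shown.

```python
import random
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def smoothed_degree_distribution(degrees: Dict[str, float],
                                 power: float = 0.75) -> Dict[str, float]:
    """Assumption: noise probability proportional to degree ** power."""
    weights = {node: d ** power for node, d in degrees.items()}
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


def sample_training_example(links_by_path: Dict[str, Dict[Link, float]],
                            noise_by_path: Dict[str, Dict[str, float]],
                            num_negatives: int,
                            rng: random.Random) -> Tuple[str, str, str, List[str]]:
    """Sample a path r, a link (i, j) by weight, and negative nodes {j'}."""
    r = rng.choice(sorted(links_by_path))  # assumption: uniform over paths
    links = links_by_path[r]
    pairs = list(links)
    i, j = rng.choices(pairs, weights=[links[p] for p in pairs], k=1)[0]
    noise = noise_by_path[r]
    nodes = list(noise)
    negatives = rng.choices(nodes, weights=[noise[n] for n in nodes],
                            k=num_negatives)
    return r, i, j, negatives


# Hypothetical toy data for a single meta path.
links_by_path = {"A-P-K": {("alice", "graph"): 2.0, ("bob", "vision"): 1.0}}
noise_by_path = {"A-P-K": smoothed_degree_distribution({"graph": 2.0, "vision": 1.0})}

r, i, j, negatives = sample_training_example(links_by_path, noise_by_path,
                                             num_negatives=3,
                                             rng=random.Random(0))
print(r, i, j, negatives)
```

Smoothing with a power below 1 flattens the degree distribution, so rare destination nodes are drawn as negatives more often than their raw degree alone would allow.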

Further, in the unified framework, the task-specific embedding submodel and the path-augmented general embedding submodel are responsible for different aspects of the network: the former focuses on the information directly related to the specific task, while the latter is better at exploring more general and diverse information in the heterogeneous information network. The two are therefore placed in a unified framework and modeled together, and the two submodels are combined at the following two levels:

(1) a joint objective is produced by combining the task-specific and network-general objectives, and joint learning is performed; here the task provides explicit guidance for the network embedding;

(2) the meta paths used in the network-general embedding are chosen according to the author identification task; here the task provides implicit guidance for the network embedding by selecting meta paths.

In the shared embedding, the joint objective function is defined as a weighted linear combination of the two submodels plus a regularization term on the embeddings, and the embedding vectors are shared between the two submodels.

In this objective, ω ∈ [0,1] is the trade-off factor between the task-specific and network-general parts: when ω = 1, only the network-general embedding is used; when ω = 0, only the supervised (task-specific) embedding is used. The regularization term is added to avoid overfitting.
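One form of the joint objective that is consistent with this description can be written as follows. This is a reading of the prose, not necessarily the exact expression of Equation 7: the symbols for the task-specific objective, the network-general objective, and the regularizer Ω(U) are assumed names.

```latex
\mathcal{L}
  = (1 - \omega)\,\mathcal{L}_{\text{task-specific}}
  + \omega\,\mathcal{L}_{\text{network-general}}
  + \Omega(U)
```

Setting ω = 1 removes the task-specific term and setting ω = 0 removes the network-general term, which matches the two limiting cases described above; because U appears in both terms, gradients from either submodel update the same shared embeddings.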

In the joint training, the author identification task provides explicit guidance that influences the network embedding through joint learning, and, where network embedding is performed, it provides implicit guidance for selecting meta paths:

Explicit guidance: the objective in Equation 7 is optimized using asynchronous stochastic gradient descent (ASGD), in which samples are drawn at random and training runs in parallel. A purpose-designed sampling-based task scheduler makes it possible to learn from the two different data sources: a task is first drawn according to ω, a sample for the selected task is then drawn, and the parameters are updated according to that sample;
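The sampling-based task scheduler can be sketched as a simple loop. This single-threaded version only illustrates the scheduling logic; the parallel, asynchronous execution of ASGD is not shown, and the function names, the sample callbacks, and the update callback are hypothetical stand-ins.

```python
import random
from typing import Callable, Dict, Tuple


def draw_task(omega: float, rng: random.Random) -> str:
    """Pick the network-general task with probability omega, else the task-specific one."""
    return "network" if rng.random() < omega else "task"


def run_scheduler(omega: float,
                  steps: int,
                  sample_for: Dict[str, Callable[[], Tuple]],
                  update: Callable[[str, Tuple], None],
                  rng: random.Random) -> Dict[str, int]:
    """Draw a task, draw a sample for it, and apply one parameter update."""
    counts = {"task": 0, "network": 0}
    for _ in range(steps):
        task = draw_task(omega, rng)
        sample = sample_for[task]()
        update(task, sample)  # placeholder for the SGD step on shared embeddings
        counts[task] += 1
    return counts


# Hypothetical sample sources: a (p, a, a') triple and a (r, i, j) link.
sample_for = {
    "task": lambda: ("p", "a", "a_neg"),
    "network": lambda: ("r", "i", "j"),
}
updates = []
counts = run_scheduler(0.3, 1000, sample_for,
                       lambda task, sample: updates.append(task),
                       random.Random(0))
print(counts)
```

Because `random()` returns a value in [0, 1), ω = 1 always selects the network-general task and ω = 0 always selects the task-specific one, matching the two limiting cases of the trade-off factor.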

Implicit guidance: meta paths are selected with the author identification task as guidance, in the following two steps:

(1) Single-path performance: joint learning is run with network embedding based on a single path at a time, and the experiment is run for every candidate path;

(2) Greedy additive path selection: the paths are sorted by the performance obtained in step (1) (from best to worst) and added one by one to the selected pool; an experiment is run for each additive combination of paths, and the path combination with the best performance is then chosen.
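The two selection steps above can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for running joint training with a given set of meta paths and returning a validation score; the toy scoring function and path names are made up, and the sketch reads "additive combination" as each prefix of the ranked list.

```python
from typing import Callable, Dict, List, Tuple


def greedy_path_selection(candidates: List[str],
                          evaluate: Callable[[List[str]], float]
                          ) -> Tuple[List[str], float]:
    """Rank paths by single-path score, then grow the pool one path at a time."""
    # Step 1: single-path performance for every candidate.
    single = {r: evaluate([r]) for r in candidates}
    ranked = sorted(candidates, key=lambda r: single[r], reverse=True)

    # Step 2: add paths in ranked order and keep the best-scoring combination.
    pool: List[str] = []
    best_combo: List[str] = []
    best_score = float("-inf")
    for r in ranked:
        pool.append(r)
        current = evaluate(list(pool))
        if current > best_score:
            best_combo, best_score = list(pool), current
    return best_combo, best_score


# Hypothetical toy evaluation: each path adds a fixed gain to the score.
gains: Dict[str, float] = {"A-P-K": 5.0, "A-P-V": 3.0, "A-P-A": -2.0}


def toy_evaluate(paths: List[str]) -> float:
    return sum(gains[p] for p in paths)


best_combo, best_score = greedy_path_selection(list(gains), toy_evaluate)
print(best_combo, best_score)
```

This procedure needs one training run per candidate path plus one per prefix, so its cost grows linearly with the number of candidates rather than exponentially with the number of possible subsets.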

In identifying the author of an input paper, an anonymous paper p is input to the model, which treats the paper's information (such as keyword nodes, venue nodes, and reference nodes) as nodes in the network, aggregates this information to rank the possible authors of p, identifies the top-ranked author as the true author, and outputs the result.
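Once a paper vector has been aggregated, ranking reduces to sorting candidate authors by their dot-product score. A minimal sketch, with hypothetical names and toy embeddings:

```python
from typing import Dict, List, Tuple

Vector = List[float]


def rank_authors(v_p: Vector,
                 author_embeddings: Dict[str, Vector],
                 top_k: int) -> List[Tuple[str, float]]:
    """Score every candidate author against the paper vector and sort descending."""
    scores = {
        author: sum(a * b for a, b in zip(vec, v_p))
        for author, vec in author_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical toy data: an aggregated paper vector and three candidates.
v_paper = [1.5, 1.5]
author_embeddings = {
    "alice": [1.0, 1.0],
    "bob": [0.0, 1.0],
    "carol": [-1.0, 0.0],
}

ranking = rank_authors(v_paper, author_embeddings, top_k=2)
print(ranking)
```

Here alice scores 3.0, bob 1.5, and carol -1.5, so alice is returned first and would be reported as the identified author.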

Brief description of the drawings

Fig. 1 is a system flow chart of the method for identifying authors based on a heterogeneous network embedding model according to the present invention.

Fig. 2 is a schematic diagram of the heterogeneous network in the method for identifying authors based on a heterogeneous network embedding model according to the present invention.

Fig. 3 is an overview of the computation flow of the task-specific embedding structure for author identification in the method for identifying authors based on a heterogeneous network embedding model according to the present invention.

Fig. 4 is a flow chart of identifying the author of an input paper in the method for identifying authors based on a heterogeneous network embedding model according to the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments in this application and the features of those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a system flow chart of the method for identifying authors based on a heterogeneous network embedding model according to the present invention. The method mainly comprises node embedding, the heterogeneous network embedding model, shared embedding, joint training, and identifying the author of an input paper.

In the node embedding, each node has a node type; node types include keywords, references, venues, time, and so on; and each node is embedded as a vector in a latent feature space.

The heterogeneous network embedding model consists of two major parts: a task-specific embedding for author identification and a path-augmented general network embedding. The two are combined into a unified framework, and meta paths are selected according to the author identification task.

In the task-specific embedding for author identification, the model ranks possible authors according to the information of a given paper (such as keywords, references, and venue). Based on node embeddings, the model builds the feature representation of an anonymous paper p step by step, and the aggregated paper representation is finally used to score possible authors.

The path-augmented general network embedding model extends existing network embedding techniques to incorporate different meta paths, in order to exploit the rich information in the heterogeneous network.

The task-specific embedding submodel and the path-augmented general embedding submodel are responsible for different aspects of the network: the former focuses on the information directly related to the specific task, while the latter is better at exploring more general and diverse information in the heterogeneous information network. The two are therefore placed in a unified framework and modeled together, and the two submodels are combined at two levels: joint learning on a joint objective, through which the task provides explicit guidance for the network embedding, and task-guided selection of meta paths, through which the task provides implicit guidance.

In the shared embedding, the joint objective function is defined as a weighted linear combination of the two submodels plus a regularization term on the embeddings, and the embedding vectors are shared between the two submodels.

In the joint training, the author identification task provides explicit guidance that influences the network embedding through joint learning, and, where network embedding is performed, it provides implicit guidance for selecting meta paths.

Explicit guidance: the objective in Equation 7 is optimized using asynchronous stochastic gradient descent (ASGD), in which samples are drawn at random and training runs in parallel. A purpose-designed sampling-based task scheduler makes it possible to learn from the two different data sources: a task is first drawn according to ω, a sample for the selected task is then drawn, and the parameters are updated according to that sample.

In identifying the author of an input paper, an anonymous paper p is input to the model, which treats the paper's information (such as keyword nodes, venue nodes, and reference nodes) as nodes in the network, aggregates this information to rank the possible authors of p, identifies the top-ranked author as the true author, and outputs the result.

Fig. 2 is a schematic diagram of the heterogeneous network in the method for identifying authors based on a heterogeneous network embedding model according to the present invention. Each node represents a node type and each link represents a link type. Multiple meta paths are defined in the schema. Taking paper → keyword ← paper and paper → time ← paper as examples, the diagram shows that in a heterogeneous network even two nodes of the same type (such as papers) can carry different semantics along different paths.

Fig. 3 is an overview of the computation flow of the task-specific embedding structure for author identification in the method for identifying authors based on a heterogeneous network embedding model according to the present invention. Nodes are first embedded, so that each node is mapped to the latent feature space; the feature vector of paper p is then built as a weighted combination of the different node types. Finally, possible authors (embedding vectors) are scored by computing a dot product and ranked according to this score.

Fig. 4 is a flow chart of identifying the author of an input paper in the method for identifying authors based on a heterogeneous network embedding model according to the present invention. When identifying the author of a paper, the paper is first input to the model and analyzed to determine its key information; a feature representation is constructed; the possible authors are ranked; and the top-ranked author is output, giving the user an efficient and accurate identification of the paper's author.

For those skilled in the art, the present invention is not limited to the details of the above embodiments and can be realized in other specific forms without departing from its spirit and scope. Moreover, those skilled in the art may make various changes and modifications to the invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A method for identifying authors based on a heterogeneous network embedding model, characterized in that it mainly comprises: node embedding (1); a heterogeneous network embedding model (2); shared embedding (3); joint training (4); and identifying the author of an input paper (5).

2. The node embedding (1) according to claim 1, characterized in that each node has a node type, node types include keywords, references, venues, time, and so on, and each node is embedded as a vector in a latent feature space.

3. The heterogeneous network embedding model (2) according to claim 1, characterized in that the model consists of two major parts: a task-specific embedding for author identification and a path-augmented general network embedding; the two are combined into a unified framework, and meta paths are selected according to the author identification task.

4. The task-specific embedding for author identification according to claim 3, characterized in that the model ranks possible authors according to the information of a given paper (such as keywords, references, and venue); based on node embeddings, the model builds the feature representation of an anonymous paper p step by step, and the aggregated paper representation is finally used to score possible authors.

5. The construction of the feature representation according to claim 4, characterized in that the aggregation that builds the feature representation of paper p from node embeddings comprises two stages:

In the first stage, a feature vector is built for each t-th node type by averaging the embeddings of the nodes in X_p^{(t)}, i.e.:

$$V_p^{(t)} = \sum_{n \in X_p^{(t)}} u_n \,/\, |X_p^{(t)}| \tag{1}$$

where V_p^{(t)} is the feature representation of the t-th node type (e.g., the keyword node type) and u_n is the embedding of the n-th node (e.g., a keyword node);

In the second stage, the feature vector of paper p is built as a weighted combination of the different node types:

$$V_p = \sum_{t} \omega_t V_p^{(t)} \tag{2}$$

The anonymous paper p is represented by this feature vector V_p, which can be used to score possible authors (embedding vectors) by computing a dot product; the score of a paper-author pair is defined as:

$$f(p, a) = u_a^T V_p = u_a^T \Big( \sum_{t} \omega_t V_p^{(t)} \Big) = u_a^T \Big( \sum_{t} \omega_t \sum_{n \in X_p^{(t)}} u_n \,/\, |X_p^{(t)}| \Big) \tag{3}$$

To learn the parameters U and ω, stochastic gradient descent (SGD) is used with a ranking objective based on the hinge loss; for each triple (p, a, a′), where a is one of the true authors of paper p and a′ is not an author of paper p, the hinge loss is defined as:

$$\max(0, f(p, a') - f(p, a) + \zeta) \tag{4}$$

where ζ is a positive number, commonly called the margin; a loss penalty is incurred if the score f(p, a) of the positive pair does not exceed f(p, a′) by at least ζ; to draw the triples (p, a, a′) used in SGD, a paper p is randomly selected from X_p together with one of its authors a from A_p, and a negative sample a′ is then drawn from a predefined noise distribution.

6. The path-augmented general network embedding according to claim 3, characterized in that the model extends existing network embedding techniques to incorporate different meta paths, in order to exploit the rich information in the heterogeneous network;

Instead of using the original adjacency matrices {E^(l)}, where l is an original link type or one-hop meta path (such as author → writes → paper), the path-augmented network embedding considers more diverse meta paths (such as author → writes → paper → contains → keyword) and uses the meta-path-augmented adjacency matrices {M^(r)} for network embedding, where each M^(r) represents the network connectivity under a specific meta path r; each M^(r) is normalized so that the learned embeddings are not dominated by a few meta paths with large raw weights; because there are infinitely many possible meta paths (including the original link types), a limited number of useful meta paths must be selected for network embedding;

To learn embeddings that preserve the proximity between nodes induced by meta paths, a neighbor prediction framework is followed and the conditional neighbor distribution of each node is modeled; in a heterogeneous network, paths of many types can start from a node i, so the neighbor distribution of a node is conditioned jointly on the node i and the given path type r, and is defined as:

$$P(j \mid i; r) = \frac{\exp(u_i^T u_j)}{\sum_{j' \in DST(r)} \exp(u_i^T u_{j'})} \tag{5}$$

where u_i is the embedding of node i and DST(r) denotes the set of all possible nodes at the destination end of path r;

To learn the parameters U and b, stochastic gradient descent (SGD) is used with the objective of maximizing the likelihood; the training procedure is as follows: a path r is sampled first; a link (i, j) is then sampled at random according to its weight in M^(r); the set of negative nodes {j′} used in Equation 6 is likewise sampled from a predefined noise distribution, for example the "smoothed" node degree distribution under the specific edge type; finally, the parameters U and b are updated according to their gradients so that the approximate sample log-likelihood is maximized.

7. The unified framework according to claim 3, characterized in that the task-specific embedding submodel and the path-augmented general embedding submodel are responsible for different aspects of the network: the former focuses on the information directly related to the specific task, while the latter is better at exploring more general and diverse information in the heterogeneous information network; the two are therefore placed in a unified framework and modeled together, and the two submodels are combined at the following two levels:

(1) a joint objective is produced by combining the task-specific and network-general objectives, and joint learning is performed; here the task provides explicit guidance for the network embedding;

(2) the meta paths used in the network-general embedding are chosen according to the author identification task; here the task provides implicit guidance for the network embedding by selecting meta paths.

8. The shared embedding (3) according to claim 1, characterized in that the joint objective function is defined as a weighted linear combination of the two submodels plus a regularization term on the embeddings, and the embedding vectors are shared between the two submodels;

In this objective, ω ∈ [0,1] is the trade-off factor between the task-specific and network-general parts: when ω = 1, only the network-general embedding is used; when ω = 0, only the supervised (task-specific) embedding is used; the regularization term is added to avoid overfitting.

9. The joint training (4) according to claim 1, characterized in that the author identification task provides explicit guidance that influences the network embedding through joint learning, and, where network embedding is performed, it provides implicit guidance for selecting meta paths;

Explicit guidance: the objective in Equation 7 is optimized using asynchronous stochastic gradient descent (ASGD), in which samples are drawn at random and training runs in parallel; a purpose-designed sampling-based task scheduler makes it possible to learn from the two different data sources: a task is first drawn according to ω, a sample for the selected task is then drawn, and the parameters are updated according to that sample;

Implicit guidance: meta paths are selected with the author identification task as guidance, in the following two steps:

(1) Single-path performance: joint learning is run with network embedding based on a single path at a time, and the experiment is run for every candidate path;

(2) Greedy additive path selection: the paths are sorted by the performance obtained in step (1) (from best to worst) and added one by one to the selected pool; an experiment is run for each additive combination of paths, and the path combination with the best performance is then chosen.

10. The identification of the author of an input paper (5) according to claim 1, characterized in that an anonymous paper p is input to the model, which treats the paper's information (such as keyword nodes, venue nodes, and reference nodes) as nodes in the network, aggregates this information to rank the possible authors of p, identifies the top-ranked author as the true author, and outputs the result.