CN116561688B

CN116561688B - Emerging technology identification method based on dynamic graph anomaly detection

Info

Publication number: CN116561688B
Application number: CN202310517066.7A
Authority: CN
Inventors: 庄越挺; 宗畅; 邵健; 鲁伟明
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2023-05-09
Filing date: 2023-05-09
Publication date: 2024-03-22
Anticipated expiration: 2043-05-09
Also published as: CN116561688A

Abstract

The invention discloses an emerging technology identification method based on dynamic graph anomaly detection. The method rests on the assumption that an emerging technology is a novel combination of existing technologies. It constructs dynamic graph data over technical fields and uses multiple space-time coupling features together with a self-attention deep neural network to represent the relations between technical-field nodes as feature vectors that fuse structural information and time sequence information. From these vectors it calculates an anomaly score for each technology combination, treats the high-scoring combinations as a candidate set of emerging technical fields, and obtains the final emerging technical fields through manual judgment. The method makes full use of the coupled spatial and temporal information of the dynamic graph in both the feature input and the neural network, outperforms other recent comparable methods on the conventional anomaly detection task, and is applied in a novel way to the emerging technology identification task, where it screens candidate fields and significantly reduces the cost of solving the task.

Description

Emerging technology identification method based on dynamic graph anomaly detection

Technical Field

The invention relates to the fields of artificial intelligence, data mining, deep learning, anomaly detection, emerging technology identification and the like, in particular to an emerging technology identification method based on dynamic graph anomaly detection.

Background

The emerging technical field is often formed by innovative combination of the prior art, and the accurate identification of the emerging technical field can help enterprises and technicians to quickly find new investment and research directions, so that the method has remarkable social value. The problem of the identification of the emerging technical field can be modeled as an abnormal detection task of the technical combination relation, namely, a dynamic graph is constructed from technological big data such as patents and projects, the technical field is taken as a node in the graph, the co-occurrence relation among the technologies is taken as an edge in the graph, and the possible emerging technical field is found out by mining the abnormal degree of the technical combination relation in the graph, so that various downstream business scenes are assisted.

In the prior art, the invention patent with the application number of CN202210014102.3 discloses a dual self-attention-based dynamic graph anomaly detection method, which applies structure self-attention to a vertex sequence obtained by random walk sampling of a graph, and further extracts structural features and time sequence features of the dynamic graph to detect anomaly edges, so that the extraction of the structural features is enhanced by introducing more important nodes focused by a self-attention mechanism, further, the evolution mode of the vertex is learned, the time sequence features are extracted, and a better effect is achieved on anomaly detection tasks through double attention. The invention patent with the application number of CN202210019006.8 discloses a dynamic graph anomaly detection method based on a community structure, which is used for reconstructing the distances between nodes in communities and among communities by detecting an evolved community of a dynamic graph, so that the characteristics of the nodes in the same community are similar and the distances between communities are far, thereby effectively solving the problem of anomaly detection tasks. The invention patent with the application number of CN202210530965.6 provides a method and a device for identifying an emerging technology based on large-scale corpus, wherein the method is characterized in that key words are extracted from documents, emerging score values are obtained through the number of candidate documents and related information of the key words, and then the obtained candidate emerging technology key words are subjected to a dynamic backtracking method to obtain the target emerging technology field. 
The invention patent with application number of CN201710356745.5 provides an emerging technology identification method based on patent citation, the method obtains a main classification number with highest coupling degree by calculating the co-citation coupling degree of the patent citation in the last two years, further marks the newly built classification number as an emerging technology, circularly completes technology identification of all patents to obtain labeling data for training a classification model, further predicts the subsequent patent technology, and obtains a better effect on the emerging technology identification task.

However, existing dynamic graph anomaly detection methods do not fuse time sequence features and spatial structure features deeply enough, so their detection performance is limited. In addition, anomaly detection has not yet been applied to the task of emerging technology identification. The detection method therefore needs to be further improved and verified.

Disclosure of Invention

The invention aims to solve the problem of low recognition and detection performance for emerging technologies in the prior art, and provides an emerging technology identification method based on dynamic graph anomaly detection.

The specific technical scheme adopted by the invention is as follows:

an emerging technology identification method based on dynamic graph anomaly detection, comprising:

S1, constructing technical text data into a technical dynamic graph, wherein the graph nodes are technical fields, the edges are co-occurrence relations among the technical fields, and the timestamps are the publication dates of the technical texts; taking each edge in the technical dynamic graph as a center edge, and extracting the neighbor subgraph corresponding to each edge through subgraph sampling; the node set of the neighbor subgraph comprises the two nodes forming the center edge and all first-order neighbor nodes of the two nodes, and the edges of the neighbor subgraph are the edges among all nodes in the subgraph;

S2, for the neighbor subgraph corresponding to each edge in the technical dynamic graph, calculating, for each node in the subgraph, multi-level node features consisting of a time-space independent feature set and a time-space coupling feature set, projecting the multi-level node features into a feature space by means of weight parameters, and aggregating them to obtain the space-time feature vector corresponding to each node;

S3, corresponding node sets of each neighbor subgraph in the technical dynamic graph are spliced in a time sequence to form a dynamic graph node sequence, and space-time feature vectors of all nodes in the dynamic graph node sequence are fused with space-time two-dimensional position coding information to obtain fusion features of all nodes; inputting the fusion characteristics of each node in the dynamic graph node sequence into a self-attention network deep learning model for depth representation calculation, and aggregating the depth representation vectors of all nodes in the dynamic graph node sequence to obtain a depth representation vector corresponding to the center edge of each neighbor subgraph;

S4, inputting the depth representation vector corresponding to each edge in the latest snapshot of the technical dynamic graph into a multi-layer perceptron deep learning model, converting the depth representation vector of each edge into a corresponding anomaly score, and, using the anomaly score as the screening criterion, selecting the edges of the latest snapshot whose anomaly scores rank highest when sorted from high to low, wherein the combination of the two technical fields joined by the co-occurrence relation of each selected edge is an emerging technology candidate field.
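As a non-limiting illustration of the screening at the end of S4, the following minimal sketch ranks edges by anomaly score and keeps the highest-ranked ones. The function name `top_k_candidates`, the cutoff `k`, and the sample field pairs and scores are hypothetical; the text does not specify how many edges are retained.

```python
def top_k_candidates(scores, k):
    """Return the k edges with the highest anomaly scores.

    scores: dict mapping (field_a, field_b) -> anomaly score in [0, 1].
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [edge for edge, _ in ranked[:k]]


# Hypothetical scores for edges of the latest snapshot (illustrative only).
scores = {
    ("field_a", "field_b"): 0.91,
    ("field_a", "field_c"): 0.12,
    ("field_b", "field_d"): 0.67,
}
candidates = top_k_candidates(scores, k=2)
```

The retained pairs form the candidate set only; the text leaves the final decision to a later review step.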

Preferably, the technical text is a patent document; in the constructed patent technical dynamic graph, the nodes are patent CPC classification codes, the edges are the combination relations among the first three CPC codes of the patent document, and the timestamp is the patent publication date.

Preferably, the technical text is a project text; in the constructed project technical dynamic graph, the nodes are project technical keywords, the edges are the combination relations among the first five keywords of the project document, and the timestamp is the project publication date.
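The edge construction described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration, assuming each document is available as a (date, ordered list of codes or keywords) pair; the function name `cooccurrence_edges`, the record layout, and the sample codes and dates are assumptions made for the example, not part of the original disclosure.

```python
from itertools import combinations


def cooccurrence_edges(records, top_n=3):
    """Turn (date, [field codes]) records into timestamped undirected edges.

    Only the first `top_n` distinct codes of each document are combined
    (3 for patent CPC codes, 5 for project keywords in the text).
    """
    edges = []
    for date, codes in records:
        # Drop repeated codes while keeping their original order.
        leading = list(dict.fromkeys(codes))[:top_n]
        for a, b in combinations(leading, 2):
            u, v = sorted((a, b))  # canonical order for an undirected edge
            edges.append((u, v, date))
    # ISO-formatted dates sort chronologically as plain strings.
    return sorted(edges, key=lambda e: e[2])


# Hypothetical sample records (codes and dates are illustrative only).
records = [
    ("2023-02-01", ["G06N3/08", "G06F16/35", "G06Q10/04", "H04L9/40"]),
    ("2023-01-15", ["G06N3/08", "G06F16/35"]),
]
edges = cooccurrence_edges(records, top_n=3)
```

Calling the same function with `top_n=5` corresponds to the project-keyword variant.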

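Preferably, in step S1, during subgraph sampling, for each edge $e_{ij}^t$ in the technical dynamic graph, its two constituent nodes $v_i^t$ and $v_j^t$ and all of their neighbor nodes are selected; these nodes and the edges among them form the neighbor subgraph corresponding to that edge, and any node in the neighbor subgraph is expressed as follows:

$$v_k^t \in \{v_i^t, v_j^t\} \cup \mathcal{N}(v_i^t) \cup \mathcal{N}(v_j^t)$$

wherein $v_k^t$ is the $k$-th node at time $t$, and $\mathcal{N}(v_i^t)$ and $\mathcal{N}(v_j^t)$ are the first-order neighbor node sets of nodes $v_i^t$ and $v_j^t$, respectively.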
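The subgraph sampling of S1 can be illustrated with the following minimal sketch, which assumes one snapshot is stored as a dictionary from each node to the set of its first-order neighbors. The function name `neighbor_subgraph` and the sample graph are hypothetical.

```python
def neighbor_subgraph(adj, i, j):
    """Return (nodes, edges) of the neighbor subgraph centered on edge (i, j).

    adj: dict mapping node -> set of first-order neighbors in one snapshot.
    """
    # The two center-edge nodes plus all of their first-order neighbors.
    nodes = {i, j} | adj[i] | adj[j]
    # Keep every edge whose two endpoints both lie inside the node set.
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adj[u]
        if v in nodes
    }
    return nodes, edges


# Hypothetical snapshot: a-b is the center edge; e and f lie outside its neighborhood.
adj = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}
nodes, edges = neighbor_subgraph(adj, "a", "b")
```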

Preferably, in the step S2, for the neighbor subgraphs corresponding to each edge in the technical dynamic graph, the method for calculating the space-time feature vector corresponding to each node is as follows:

S21, calculating, for any node $v_k^t$ of the neighbor subgraph whose center edge joins nodes $v_i^t$ and $v_j^t$, a time-space independent feature set consisting of a global spatial feature, a local spatial feature and an existence-time feature, wherein:

the global spatial feature is represented by the PageRank value of the node in the global graph, and is calculated as follows:

$$f_g(v_k^t) = \mathrm{PageRank}(v_k^t, S^t)$$

wherein: $S^t$ is the snapshot of the global technical dynamic graph at time $t$, and $\mathrm{PageRank}(\cdot)$ is the PageRank value calculation function;

the local spatial feature is represented by the minimum distance from the node to the constituent nodes of the center edge, and is calculated as follows:

$$f_l(v_k^t) = \min\big(\mathrm{Dist}(v_k^t, v_i^t),\ \mathrm{Dist}(v_k^t, v_j^t)\big)$$

wherein: $\mathrm{Dist}(\cdot)$ is the shortest-path distance calculation function, and $\min(\cdot)$ is the minimum function;

the existence-time feature is represented by the time span for which the center edge of the subgraph containing the node has existed, and is calculated as follows:

$$f_t(v_k^t) = t - t_{start}$$

wherein: $t_{start}$ is the first time point at which the center edge of the subgraph containing $v_k^t$ was generated;
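The three time-space independent features of S21 can be sketched as below. This is a minimal illustration on a single snapshot stored as a node-to-neighbor-set dictionary. The damping factor 0.85, the fixed iteration count, and the function names are assumptions for the example; the text only names a PageRank function and a shortest-path function without fixing their implementation.

```python
from collections import deque


def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an undirected graph with no isolated nodes."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        rank = {
            v: (1 - damping) / n
            + damping * sum(rank[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
    return rank


def dist(adj, src, dst):
    """Shortest-path length by breadth-first search (inf if unreachable)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")


def independent_features(adj, k, i, j, t, t_start):
    """Features of node k relative to center edge (i, j) at time t."""
    f_g = pagerank(adj)[k]                       # global spatial feature
    f_l = min(dist(adj, k, i), dist(adj, k, j))  # local spatial feature
    f_t = t - t_start                            # existence-time feature
    return f_g, f_l, f_t


# Hypothetical path graph a-b-c-d with center edge (a, b).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
f_g, f_l, f_t = independent_features(adj, "d", "a", "b", t=5, t_start=2)
```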

S22, calculating a time-space coupling feature set consisting of a distance change feature, an interaction change feature and a common-neighbor change feature, wherein:

the distance change feature is represented by the change over time of the distance between the two constituent nodes of the center edge of the subgraph containing the node, and the calculation formula is as follows:

wherein: $\mathrm{Dist}(\cdot)$ is the shortest-path distance calculation function, used to calculate the shortest distance between the two constituent nodes of the edge at time point $t - \Delta t$; $\Delta t$ is the time step over which the feature change is observed;

the interaction change feature is represented by the change over time of the degrees of the constituent nodes of the center edge of the subgraph containing the node, and the calculation formula is as follows:

wherein: $\mathrm{Deg}(\cdot)$ is the degree calculation function, used to calculate the degrees of the constituent nodes of the center edge on the snapshots of the technical dynamic graph at different moments;

the common-neighbor change feature is represented by the change over time of the number of common neighbors of the constituent nodes of the center edge of the subgraph containing the node, and the calculation formula is as follows:

wherein: $v$ is a node in the intersection of the neighbor node sets of the edge's constituent nodes $v_i^t$ and $v_j^t$;
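One possible reading of the three time-space coupling features of S22 is sketched below. The exact formulas are not recoverable from the text, so the sketch assumes each feature is a simple difference between the snapshot at $t - \Delta t$ and the snapshot at $t$: the drop in shortest-path distance, the increase in summed degree, and the increase in the number of common neighbors. These sign conventions, the summation over both endpoints, and all function names are assumptions made for illustration.

```python
from collections import deque


def dist(adj, src, dst):
    """Shortest-path length by BFS; inf if a node is absent or unreachable."""
    if src not in adj or dst not in adj:
        return float("inf")
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")


def coupled_features(adj_prev, adj_now, i, j):
    """Change features of center edge (i, j) between two snapshots."""
    empty = set()
    # Assumed: distance at t - dt minus distance at t (1 once the edge exists).
    f_d = dist(adj_prev, i, j) - dist(adj_now, i, j)
    # Assumed: change in the summed degree of the two endpoints.
    deg_now = len(adj_now.get(i, empty)) + len(adj_now.get(j, empty))
    deg_prev = len(adj_prev.get(i, empty)) + len(adj_prev.get(j, empty))
    f_i = deg_now - deg_prev
    # Assumed: change in the number of common neighbors of the endpoints.
    common_now = adj_now.get(i, empty) & adj_now.get(j, empty)
    common_prev = adj_prev.get(i, empty) & adj_prev.get(j, empty)
    f_n = len(common_now) - len(common_prev)
    return f_d, f_i, f_n


# Hypothetical snapshots: edge a-b and node d appear between t - dt and t.
adj_prev = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
adj_now = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
f_d, f_i, f_n = coupled_features(adj_prev, adj_now, "a", "b")
```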

S23, for any node $v_k^t$ in the neighbor subgraph, projecting each feature in the time-space independent feature set and the time-space coupling feature set into a feature vector space through learnable weight parameters, and then aggregating the projections to obtain the space-time feature vector corresponding to node $v_k^t$, the calculation formula being as follows:

wherein: $W_g$, $W_l$, $W_t$ are the learnable weight parameters for projecting the time-space independent features, and $W_d$, $W_i$, $W_n$ are the weight parameters for projecting the time-space coupling features.
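The projection and aggregation of S23 can be illustrated as follows. This is a minimal sketch assuming each scalar feature is multiplied by its own weight vector and the six projections are summed; the text says only that the projections are aggregated, so the choice of summation, the vector dimension, and the randomly initialized weights are assumptions.

```python
import random


def project_and_aggregate(features, weights):
    """Project scalar features with per-feature weight vectors and sum them.

    features: dict name -> scalar value.
    weights:  dict name -> list of floats (one weight vector per feature).
    """
    dim = len(next(iter(weights.values())))
    x = [0.0] * dim
    for name, value in features.items():
        for p in range(dim):
            x[p] += weights[name][p] * value
    return x


# Stand-ins for W_g, W_l, W_t, W_d, W_i, W_n (random here; learned in practice).
rng = random.Random(0)
names = ["g", "l", "t", "d", "i", "n"]
weights = {name: [rng.uniform(-1, 1) for _ in range(4)] for name in names}
# Hypothetical feature values for one node.
features = {"g": 0.12, "l": 1.0, "t": 3.0, "d": 1.0, "i": 4.0, "n": 1.0}
x = project_and_aggregate(features, weights)
```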

Preferably, in the step S3, the method for obtaining the depth representation vector corresponding to the center edge of each neighbor subgraph is as follows:

S31, splicing the node sets of each neighbor subgraph in the different snapshots of the technical dynamic graph in chronological order to form a dynamic graph node sequence with a total length of $(C+2) \times T$,

wherein: $\cup$ is the splicing operator, $C$ is the total number of neighbor nodes of the two constituent nodes of the center edge, these being all first-order neighbors of the two nodes, and $T$ is the number of timestamps contained in the technical dynamic graph, that is, the total number of snapshots;
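The sequence assembly of S31 can be sketched as below. The sketch assumes the same $C$ neighbor nodes are listed in every snapshot and that nodes within a snapshot are ordered as the two center-edge nodes followed by the sorted neighbors; the ordering and the function name `node_sequence` are assumptions.

```python
def node_sequence(i, j, neighbors, num_snapshots):
    """Concatenate the (C + 2) subgraph nodes over T snapshots in time order."""
    per_snapshot = [i, j] + sorted(neighbors)
    return [
        (node, t)
        for t in range(1, num_snapshots + 1)
        for node in per_snapshot
    ]


# Hypothetical center edge (a, b) with C = 2 neighbors over T = 3 snapshots.
seq = node_sequence("a", "b", {"c", "d"}, num_snapshots=3)
```

With $C = 2$ and $T = 3$ the sequence has $(2 + 2) \times 3 = 12$ entries.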

S32, for each node $v_k^t$ in the dynamic graph node sequence, summing its absolute spatial position projection and relative spatial position projection, and concatenating the resulting spatial position projection with the temporal position projection to obtain the space-time two-dimensional position encoding $PE(v_k^t)$, the calculation formula being as follows:

$$PE(v_k^t) = \big(W_{abs}\,PE_{abs} + W_{rel}\,PE_{rel}\big) \,\Vert\, W_{tmp}\,PE_{tmp}$$

wherein: $\Vert$ is the vector concatenation operator, and $W_{abs}$, $W_{rel}$, $W_{tmp}$ are three learnable projection matrices;

$PE_{abs}$ is the absolute spatial position encoding of node $v_k^t$, calculated as follows:

wherein: $RW = AD^{-1}$ is the random walk operation result matrix, $A$ is the adjacency matrix of the technical dynamic graph, and $D^{-1}$ is the inverse of the degree matrix of the technical dynamic graph; $RW_{kk}$ is the value in the $k$-th row and $k$-th column of the random walk operation result matrix, and a superscript on $RW_{kk}$ denotes a matrix power;

$PE_{rel}$ is the relative spatial position encoding of node $v_k^t$, calculated as follows:

$PE_{tmp}$ is the temporal position encoding of node $v_k^t$, calculated as follows:

wherein: $t$ is the current timestamp of the subgraph containing node $v_k^t$;
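The random-walk quantities named in S32 can be illustrated with the sketch below, which builds $RW = AD^{-1}$ and collects the diagonal entries of its successive powers for each node. Reading the encoding as the list $[RW_{kk}, RW^2_{kk}, \ldots, RW^m_{kk}]$ and the number of steps $m$ are assumptions; the text states only that $RW_{kk}$ is a diagonal entry and that its superscript denotes a power.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [
        [sum(X[r][m] * Y[m][c] for m in range(n)) for c in range(n)]
        for r in range(n)
    ]


def random_walk_encoding(A, steps):
    """Return, for every node k, the diagonal entries of RW^1 .. RW^steps."""
    n = len(A)
    deg = [sum(A[r][c] for r in range(n)) for c in range(n)]
    # RW = A D^{-1}: each column of A is divided by that node's degree.
    RW = [
        [A[r][c] / deg[c] if deg[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]
    enc = [[] for _ in range(n)]
    power = RW
    for _ in range(steps):
        for k in range(n):
            enc[k].append(power[k][k])
        power = matmul(power, RW)
    return enc


# Hypothetical triangle graph on three nodes.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
enc = random_walk_encoding(A, steps=3)
```

On the triangle, a walker cannot return in one step, returns with probability 0.5 in two steps, and with probability 0.25 in three steps.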

S33, fusing the space-time feature vector of each node in the dynamic graph node sequence with its space-time two-dimensional position encoding, and splicing the fused features into the input feature sequence of the model:

wherein: $(\cdot)^{\top}$ is the matrix transpose operator;

S34, inputting the input feature sequence into a multi-layer self-attention network with a total of $P$ layers, and computing a depth representation of the input feature sequence through the multi-layer self-attention mechanism, wherein the depth representation in any $l$-th self-attention layer is computed as follows:

first, calculating the attention weight $A^{(l)}$ by the following formula:

wherein: $\mathrm{Softmax}(\cdot)$ is the Softmax function, and $l$ is the index of the current network layer, $l \in [1, P]$; the input to the first layer is the initial input feature sequence;

then performing a layer normalization operation and a feed-forward network calculation on the obtained result to obtain the depth representation vector output by the current network layer, the calculation formula being as follows:

$$H^{(l)} = \mathrm{LN}(A^{(l)} + Q^{(l)}),$$

wherein: $\mathrm{LN}(\cdot)$ is the layer normalization operation, and $\mathrm{FFN}(\cdot)$ is the feed-forward network calculation;
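A single self-attention layer of the kind referred to in S34 can be sketched as follows. This is a simplified illustration using scaled dot-product attention with identity query, key and value projections, followed by a residual connection and layer normalization; the learned projections and the feed-forward network of the actual model are omitted, and the scaling by $\sqrt{d}$ is a common convention assumed here rather than taken from the text.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attention_layer(X):
    """One simplified self-attention layer over a sequence of row vectors X."""
    d = len(X[0])
    # Pairwise scaled dot products between all sequence positions.
    scores = [
        [sum(q * k for q, k in zip(xq, xk)) / math.sqrt(d) for xk in X]
        for xq in X
    ]
    weights = [softmax(row) for row in scores]
    # Attention output: weighted sum of the value vectors (here X itself).
    attended = [
        [sum(w * x[p] for w, x in zip(w_row, X)) for p in range(d)]
        for w_row in weights
    ]
    # Residual connection followed by layer normalization.
    H = [
        layer_norm([a + q for a, q in zip(att, xq)])
        for att, xq in zip(attended, X)
    ]
    return weights, H


# Hypothetical 3-node input sequence with 3-dimensional features.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
weights, H = attention_layer(X)
```

Stacking $P$ such layers, each fed with the previous layer's output, gives the multi-layer form described in the text.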

S35, performing a mean aggregation operation on the depth representation vectors output by the last self-attention layer to obtain the dynamic graph feature of the center edge $e_{ij}^t$, expressed as follows:

wherein: each term of the mean is the representation vector, within the depth representation, of the $n$-th node of the dynamic graph node sequence, and $L = (C+2) \times T$ is the length of that sequence.

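Preferably, in step S4, the depth representation vector $z(e_{ij}^T)$ of each edge $e_{ij}^T$ in the latest snapshot of the technical dynamic graph is converted into the corresponding anomaly score $f(e_{ij}^T)$ by the following expression:

$$f(e_{ij}^T) = \mathrm{Sigmoid}\big(\mathrm{MLP}(z(e_{ij}^T))\big)$$

wherein $\mathrm{Sigmoid}(\cdot)$ is the Sigmoid function, and $\mathrm{MLP}(\cdot)$ is the multi-layer perceptron model.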
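The scoring step can be sketched end to end as mean pooling of the node representations, a small multi-layer perceptron, and a Sigmoid. The one-hidden-layer ReLU perceptron, its hand-picked weights, and the function names below are assumptions for illustration; the text does not give the perceptron's depth or width.

```python
import math


def mean_pool(H):
    """Average a list of node representation vectors into one edge vector."""
    n = len(H)
    return [sum(column) / n for column in zip(*H)]


def mlp(z, W1, b1, w2, b2):
    """One hidden ReLU layer followed by a scalar linear output (assumed shape)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, z)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def anomaly_score(H, params):
    """Sigmoid(MLP(mean of node representations)), a value in (0, 1)."""
    return sigmoid(mlp(mean_pool(H), *params))


# Hypothetical node representations and hand-picked parameters.
H = [[1.0, 3.0], [3.0, 1.0]]
params = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -3.0], [2.0, 1.0], -1.0)
score = anomaly_score(H, params)
```

With these illustrative parameters the perceptron output is 0, so the score is exactly at the midpoint of the Sigmoid.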

Preferably, in step S4, the screened emerging technology candidate fields are sent to a manual review terminal for review, and the final emerging technical fields are generated according to the manual review result.

Preferably, before the emerging technology recognition framework formed by the S1-S4 is used for actual reasoning, the learnable parameters of each network layer need to be optimized by pre-constructed positive and negative samples in a training stage.

Preferably, the error loss used to train the emerging technology recognition framework is expressed as follows:

where $N$ is the total number of samples, and the two score terms are the anomaly scores of the positive samples and the negative samples, respectively.
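One loss consistent with this description is a binary cross-entropy over the anomaly scores, sketched below. It assumes that positive samples (generated, non-existent edges) are pushed toward a score of 1 and negative samples (existing edges) toward 0, averaged over all $N$ samples. The exact expression is not recoverable from the text, so this form is an assumption.

```python
import math


def bce_loss(pos_scores, neg_scores, eps=1e-12):
    """Binary cross-entropy: positives should score near 1, negatives near 0."""
    n = len(pos_scores) + len(neg_scores)
    total = sum(math.log(max(s, eps)) for s in pos_scores)
    total += sum(math.log(max(1.0 - s, eps)) for s in neg_scores)
    return -total / n


# Hypothetical anomaly scores before and after training.
untrained = bce_loss([0.5, 0.5], [0.5, 0.5])
trained = bce_loss([0.9, 0.8], [0.1, 0.2])
```

A model that cannot separate the two classes scores every edge near 0.5, which gives a loss of about $\ln 2$; better separation lowers it.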

Compared with the prior art, the invention has the following beneficial effects:

The invention rests on the assumption that an emerging technology is a novel combination of existing technologies. It constructs dynamic graph data over technical fields and uses multiple space-time coupling features together with a self-attention deep neural network to represent the relations between technical-field nodes as feature vectors that fuse structural information and time sequence information. From these vectors it calculates an anomaly score for each technology combination, treats the high-scoring combinations as a candidate set of emerging technical fields, and obtains the final emerging technical fields through manual judgment. The method makes full use of the coupled spatial and temporal information of the dynamic graph in both the feature input and the neural network, outperforms other recent comparable methods on the conventional anomaly detection task, and is applied in a novel way to the emerging technology identification task, where it screens candidate fields and significantly reduces the cost of solving the task. The method can further support business scenarios such as selection of technical research directions, investment in technical fields, and analysis of technical development.

Drawings

FIG. 1 is a flowchart of an emerging technology identification method based on dynamic graph anomaly detection in an embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the invention, whereby the invention is not limited to the specific embodiments disclosed below. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.

In a preferred embodiment of the present invention, there is provided an emerging technology identification method based on dynamic graph anomaly detection, comprising the steps of:

s1, constructing technical text data into a technical dynamic graph, wherein graph nodes are technical fields, edges are co-occurrence relations among the technical fields, and time stamps are dates of technical text disclosure; taking each edge in the technical dynamic graph as a center edge, and extracting neighbor subgraphs corresponding to each edge through subgraph sampling; the node set of the neighbor subgraph comprises two nodes forming a center edge and all first-order neighbor nodes of the two nodes, and the edges in the neighbor subgraph are edges between all nodes in the subgraph.

In the embodiment of the invention, the technical text can be a patent document; in the constructed patent technical dynamic graph, the nodes are patent CPC classification codes, the edges are the combination relations among the first three CPC codes of the patent document, and the timestamps are patent publication dates. The technical text can also be a project text; in the constructed project technical dynamic graph, the nodes are project technical keywords, the edges are the combination relations among the first five keywords of the project document, and the timestamps are project publication dates.

In the embodiment of the present invention, during subgraph sampling, for each edge $e_{ij}^t$ in the technical dynamic graph, its two constituent nodes $v_i^t$ and $v_j^t$ and all of their neighbor nodes are selected; these nodes and the edges among them form the neighbor subgraph corresponding to that edge, and any node in the neighbor subgraph is expressed as follows:

It should be noted that, since technical texts are continually being published, the technical dynamic graph has a different snapshot at each moment. The present invention assumes that an emerging technology is a novel combination of existing technologies, so each edge in the technical dynamic graph corresponds to one technology combination.

S2, for the neighbor subgraph corresponding to each edge in the technical dynamic graph, calculating, for each node in the subgraph, multi-level node features consisting of a time-space independent feature set and a time-space coupling feature set, projecting the multi-level node features into a feature space by means of weight parameters, and aggregating them to obtain the space-time feature vector corresponding to each node.

In the embodiment of the invention, the method for calculating the space-time feature vector corresponding to each node aiming at the neighbor subgraphs corresponding to each edge in the technical dynamic graph is as follows:

wherein: $S^t$ is the snapshot of the global technical dynamic graph at time $t$, and $\mathrm{PageRank}(\cdot)$ is the PageRank value calculation function;

S23, for any node $v_k^t$ in the neighbor subgraph, projecting each feature in the time-space independent feature set and the time-space coupling feature set into a feature vector space through learnable weight parameters, and then aggregating the projections to obtain the space-time feature vector corresponding to node $v_k^t$, the calculation formula being as follows:

S3, corresponding node sets of each neighbor subgraph in the technical dynamic graph are spliced in a time sequence to form a dynamic graph node sequence, and space-time feature vectors of all nodes in the dynamic graph node sequence are fused with space-time two-dimensional position coding information to obtain fusion features of all nodes; and inputting the fusion characteristics of each node in the dynamic graph node sequence into a self-attention network deep learning model to perform depth representation calculation, and aggregating depth representation vectors of all nodes in the dynamic graph node sequence to obtain depth representation vectors corresponding to the center edges of each neighbor subgraph.

In the embodiment of the invention, the method for obtaining the depth representation vector corresponding to the center edge of each neighbor subgraph is as follows:

$PE_{abs}$ is the absolute spatial position encoding of node $v_k^t$, calculated as follows:

wherein: $RW = AD^{-1}$ is the random walk operation result matrix, $A$ is the adjacency matrix of the technical dynamic graph, and $D^{-1}$ is the inverse of the degree matrix of the technical dynamic graph; $RW_{kk}$ is the value in the $k$-th row and $k$-th column of the random walk operation result matrix, and a superscript on $RW_{kk}$ denotes a matrix power;

$PE_{rel}$ is the relative spatial position encoding of node $v_k^t$, calculated as follows:

$PE_{tmp}$ is the temporal position encoding of node $v_k^t$, calculated as follows:

wherein: $(\cdot)^{\top}$ is the matrix transpose operator;

S34, inputting the input feature sequence into a multi-layer self-attention network with a total of $P$ layers, and computing a depth representation of the input feature sequence through the multi-layer self-attention mechanism, wherein the depth representation in any $l$-th self-attention layer is computed as follows:

first, calculating the attention weight $A^{(l)}$ by the following formula:

wherein: $\mathrm{Softmax}(\cdot)$ is the Softmax function, and $l$ is the index of the current network layer, $l \in [1, P]$; the input to the first layer is the initial input feature sequence;

$$H^{(l)} = \mathrm{LN}(A^{(l)} + Q^{(l)}),$$

S35, performing a mean aggregation operation on the depth representation vectors output by the last self-attention layer to obtain the dynamic graph feature of the center edge $e_{ij}^t$ corresponding to the input feature sequence, expressed as follows:

It should be noted that, since the technical dynamic graph has different snapshots, screening for emerging technologies is mainly concerned with the emerging technologies at the latest moment, and therefore only the edges in the latest snapshot of the technical dynamic graph (i.e. $t = T$) need to be identified.

In the embodiment of the present invention, in S4, the depth representation vector of each edge $e_{ij}^T$ in the latest snapshot of the technical dynamic graph is converted into the corresponding anomaly score by the following expression:

In addition, the screened emerging technology candidate fields may contain misjudgments, so the screening result is preferably sent to a manual review terminal for review, and the final emerging technical fields are generated according to the manual review result.

It should be noted that steps S1 to S4 together constitute a model framework for emerging technology identification. Before the model framework is used for actual inference, the learnable parameters of each network layer need to be optimized in a training stage using pre-constructed positive and negative samples. In the embodiment of the present invention, the error loss used to train the emerging technology recognition framework is expressed as follows:

where $N$ is the total number of samples, and the two score terms are the anomaly scores of the positive samples and the negative samples, respectively.

The emerging technology identification method based on dynamic graph anomaly detection described in S1 to S4 above is applied below to a specific example to illustrate its training procedure and technical effect.

Examples

As shown in fig. 1, in this embodiment, the method for identifying an emerging technology based on the anomaly detection of a dynamic graph includes the following steps:

1. Technical dynamic graph data construction and sampling link

The emerging technologies of interest in this embodiment are of two kinds: patents in the artificial intelligence field and projects in the cancer medical field. Technical texts of both kinds therefore need to be collected in advance to construct the datasets.

For the collected artificial intelligence patent data and cancer medical project data, a technical dynamic graph dataset is constructed for each of the two technical subjects using data processing methods, and training and test sets containing neighbor subgraphs and time sequences are constructed using graph computation and random sampling.

The technical dynamic graph data construction and sampling link consists of two sub-links: technical dynamic graph construction, and training and test sample set construction.

1.1 Technical dynamic graph construction link

This link comprises three data processing steps: acquisition of recent technical texts, generation of co-occurrence relations between technical fields, and conversion of technical fields to IDs; the data processing methods all come from open-source toolkits.

In the technical dynamic graph constructed in this link, the nodes are technical fields, the edges are co-occurrence relations among the technical fields, and the timestamps are the publication dates of the technical texts. For the patent technical dynamic graph, the nodes are patent CPC classification codes, the edges are the combination relations among the first three CPC codes of a patent document, and the timestamps are patent publication dates; for the project technical dynamic graph, the nodes are project technical keywords, the edges are the combination relations among the first five keywords of the project document, and the timestamps are project publication dates. The technical dynamic graphs constructed from the two kinds of data are as follows:

1.2 Training and test sample set construction link

The training set and the test set constructed in this link are obtained by sorting the N edges of the dynamic graph by timestamp and then cutting them into units of M edges, which gives a time sequence of length N/M. The first half of the time sequence is the training set and the second half is the test set, each with a time sequence length of N/M/2. The specific values of N and M can be adjusted and optimized according to the actual dataset and the input requirements of the model.
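The time-ordered split described above can be sketched as follows. The sketch assumes edges are (node, node, timestamp) tuples and that each unit of M edges is kept as a separate, non-cumulative snapshot; whether snapshots accumulate earlier edges is not stated in the text. The function name `split_snapshots` and the sample edges are hypothetical.

```python
def split_snapshots(edges, m):
    """Sort edges by timestamp, cut into units of m edges, split in half.

    Returns (train_snapshots, test_snapshots); with N edges there are
    about N / m snapshots in total and N / m / 2 in each half.
    """
    ordered = sorted(edges, key=lambda e: e[2])
    snapshots = [ordered[s:s + m] for s in range(0, len(ordered), m)]
    half = len(snapshots) // 2
    return snapshots[:half], snapshots[half:]


# Hypothetical N = 8 edges with integer timestamps, cut with M = 2.
edges = [
    ("a", "b", 3), ("a", "c", 1), ("b", "c", 2), ("c", "d", 4),
    ("a", "d", 6), ("b", "d", 5), ("d", "e", 8), ("a", "e", 7),
]
train, test = split_snapshots(edges, m=2)
```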

For positive and negative sample label construction, all edges on the graph at the last time point of the training set are regarded as negative samples, and an equal number of edges that do not exist on the graph are randomly generated as positive samples, giving the final training dataset. Likewise, all edges on the graph at the last time point of the test set are regarded as negative samples, and edges that do not exist on the graph, numbering 10% of the negative samples, are randomly generated as positive samples, giving the final test dataset.
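The sample construction can be illustrated with the sketch below, in which existing edges are the negative samples and randomly drawn non-existent node pairs are the positive samples. The `ratio` parameter (1.0 for training, 0.1 for testing in the text), the uniform sampling of node pairs, the fixed seed, and the function name `build_samples` are assumptions for the example.

```python
import random


def build_samples(nodes, existing_edges, ratio, seed=0):
    """Return (positives, negatives) for one snapshot.

    negatives: the existing edges; positives: generated non-existent edges,
    numbering about ratio * len(negatives), with at least one.
    """
    rng = random.Random(seed)
    negatives = {tuple(sorted(e)) for e in existing_edges}
    node_list = sorted(nodes)
    target = max(1, round(ratio * len(negatives)))
    available = len(node_list) * (len(node_list) - 1) // 2 - len(negatives)
    if target > available:
        raise ValueError("not enough non-existent node pairs to sample")
    positives = set()
    while len(positives) < target:
        u, v = rng.sample(node_list, 2)
        candidate = tuple(sorted((u, v)))
        if candidate not in negatives:
            positives.add(candidate)
    return sorted(positives), sorted(negatives)


# Hypothetical snapshot with 5 nodes and 4 existing edges.
nodes = {"a", "b", "c", "d", "e"}
existing = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
train_pos, train_neg = build_samples(nodes, existing, ratio=1.0)
test_pos, test_neg = build_samples(nodes, existing, ratio=0.5)
```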

2. Multi-level space-time coupling characteristic calculation link

Using prior knowledge of the dynamic graph structure and time sequence, combined with general knowledge of anomalous behavior, multi-level node features covering time-space independent features and time-space coupling features are constructed according to the degree of space-time coupling; the feature values are calculated, projected into a feature space using weight parameters, and aggregated to obtain the node features.

This link consists of four sub-links: subgraph sampling, the time-space independent feature set, the time-space coupling feature set, and node feature projection and aggregation.

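2.1 Subgraph sampling

For the two constituent nodes of each edge, all of their neighbor nodes are selected, and the subgraph formed by these nodes and the edges among them is used as the sample for feature extraction, wherein any node in the subgraph is expressed as follows:

$$v_k^t \in \{v_i^t, v_j^t\} \cup \mathcal{N}(v_i^t) \cup \mathcal{N}(v_j^t)$$

wherein $v_k^t$ is the $k$-th node at time $t$, $v_i^t$ and $v_j^t$ are the two constituent nodes of edge $e_{ij}^t$, and $\mathcal{N}(v_i^t)$ and $\mathcal{N}(v_j^t)$ are the first-order neighbor node sets of the two nodes, respectively.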

2.2 Time-space independent feature set

The time-space independent feature set consists of three node-oriented spatial or temporal features: a global spatial feature, a local spatial feature, and an existence-time feature. The global spatial feature is represented by the PageRank value of the node in the global graph, as follows:

$$f_g(v_k^t) = \mathrm{PageRank}(v_k^t, S^t)$$

wherein $S^t$ is the snapshot of the global dynamic graph at time $t$, $v_k^t$ is any node in the subgraph, and $\mathrm{PageRank}(\cdot)$ is the PageRank value calculation function. The local spatial feature is represented by the minimum distance from the node to the constituent nodes $v_i^t$ and $v_j^t$ of the center edge, as follows:

$$f_l(v_k^t) = \min\big(\mathrm{Dist}(v_k^t, v_i^t),\ \mathrm{Dist}(v_k^t, v_j^t)\big)$$

wherein $\mathrm{Dist}(\cdot)$ is the shortest-path distance calculation function, and $\min(\cdot)$ is the minimum function. The existence-time feature is represented by the time span for which the center edge of the subgraph containing the node has existed, as follows:

$$f_t(v_k^t) = t - t_{start}$$

wherein $t_{start}$ is the first time point at which the center edge of the subgraph containing $v_k^t$ was generated, and $t$ is the current time point at which the feature is calculated.

2.3 time-space coupled feature set

The time-space coupled feature set consists of three node-oriented space-time evolution features: the distance change feature, the interaction change feature, and the co-neighbor change feature. The distance change feature is represented by the change over time of the distance between the two nodes forming the center edge of the node's subgraph, as follows:

where Dist(·) is the shortest-path distance calculation function, used to compute the shortest distance between the two nodes forming the edge at time point $t - \Delta t$; at the current time point $t$ the distance is 1 because the edge exists. The interaction change feature is represented by the change over time of the degree of the nodes forming the center edge of the node's subgraph, as follows:

where Deg(·) is the degree calculation function, used to compute the degrees of the edge-forming nodes at the different time points. Finally, the co-neighbor change feature is represented by the change over time of the number of common neighbors of the nodes forming the center edge of the node's subgraph, as follows:

where $v$ is a node in the intersection of the neighbor node sets of the two edge-forming nodes. In the above feature calculations the value of $\Delta t$ can be varied; in this patent $\Delta t$ is set to 1, that is, the features capture the change between the previous time point and the current one.
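A hedged sketch of the three time-space coupled features follows. The sign conventions are assumptions, since the text does not fix them: the distance change is taken as the previous distance minus the current distance of 1, and the degree and co-neighbor changes as current minus previous. Summing the degrees of both edge-forming nodes and capping unreachable distances at `max_dist` are also assumptions; all names are hypothetical.

```python
from collections import deque


def shortest_dist(adj, src, dst, max_dist=10):
    """BFS distance, capped at max_dist when a node is absent or unreachable."""
    if src not in adj or dst not in adj:
        return max_dist
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        u, d = queue.popleft()
        for v in adj[u]:
            if v == dst:
                return min(d + 1, max_dist)
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))
    return max_dist


def degree_sum(adj, i, j):
    return len(adj.get(i, set())) + len(adj.get(j, set()))


def common_neighbors(adj, i, j):
    return len(adj.get(i, set()) & adj.get(j, set()))


def coupled_features(prev_adj, curr_adj, i, j):
    """Distance, interaction and co-neighbor change for center edge (i, j)."""
    f_d = shortest_dist(prev_adj, i, j) - 1  # current distance is 1 (edge exists)
    f_i = degree_sum(curr_adj, i, j) - degree_sum(prev_adj, i, j)
    f_n = common_neighbors(curr_adj, i, j) - common_neighbors(prev_adj, i, j)
    return f_d, f_i, f_n


# Snapshot at t-1: a and b are only connected through c.
prev_snapshot = {"a": {"c"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"b"}}
# Snapshot at t: edges a-b and a-d have appeared.
curr_snapshot = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
features = coupled_features(prev_snapshot, curr_snapshot, "a", "b")
```

With Δt = 1, the pair `(a, b)` was two hops apart before the edge appeared, gained three incident edges in total, and gained one common neighbor.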

2.4 node feature projection and aggregation

Each feature in the time-space independent feature set and the time-space coupled feature set is projected into a feature vector space by a learnable parameter, and the projections are aggregated to obtain the final node feature vector, expressed as follows:

where $W_g$, $W_l$, $W_t$ are the weight parameters for projecting the time-space independent features, and $W_d$, $W_i$, $W_n$ are the weight parameters for projecting the time-space coupled features.

3. Self-attention network space-time characterization calculation stage

For each edge, a node sequence of the dynamic graph is constructed. A self-attention deep learning model that incorporates two-dimensional space-time position encoding then computes a deep characterization of the input node feature sequence, and the characterization results are aggregated over the sequence to obtain the feature vector of the edge.

This stage consists of three sub-steps: input sequence construction, two-dimensional position encoding, and edge feature encoding.

3.1 input sequence construction

For an input sample of the dynamic graph and the subgraph in which it lies, the nodes are arranged in time order to form the input sequence, expressed as follows:

where $\cup$ is the concatenation operator, $C$ is the number of sampled neighbor nodes, $T$ is the number of timestamps, and the total length of the input sequence is $(C+2) \times T$.
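The sequence layout can be illustrated with a short sketch. It assumes that each timestamp contributes the two edge-forming nodes followed by C sampled neighbors, that neighbors are chosen in sorted order, and that missing slots are padded with `None`; the sampling and padding rules are assumptions, as is the function name.

```python
def build_input_sequence(snapshots, i, j, num_neighbors):
    """Concatenate, over time, the two edge nodes plus C sampled neighbors.

    snapshots: list of dicts (node -> set of neighbors), ordered by time.
    Returns a list of (node, timestamp) pairs of length (C + 2) * T.
    """
    sequence = []
    for t, adj in enumerate(snapshots):
        neighbors = sorted((adj.get(i, set()) | adj.get(j, set())) - {i, j})
        neighbors = neighbors[:num_neighbors]
        neighbors += [None] * (num_neighbors - len(neighbors))  # pad short lists
        sequence.extend((node, t) for node in [i, j] + neighbors)
    return sequence


snapshots = [
    {"a": {"c"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"b"}},
    {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}},
]
C = 3
sequence = build_input_sequence(snapshots, "a", "b", C)
```

With C = 3 and T = 2 the sequence has (3 + 2) × 2 = 10 entries.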

3.2 two-dimensional position coding

For each node in the dynamic graph input sequence, its position in the dynamic graph is represented by a combined encoding of its positions along the two dimensions of space and time.

Spatial position information: the spatial position of a node is obtained by combining its absolute position in the graph in which it lies with its relative position in the subgraph in which it lies. The absolute position of a node is represented by a vector obtained from higher-order random walk operations, as follows:

where $RW = AD^{-1}$ is the random walk matrix, $A$ is the adjacency matrix of the graph, $D^{-1}$ is the inverse of the degree matrix of the graph, $RW_{kk}$ is the entry in row $k$, column $k$ of that matrix, and a superscript on $RW$ denotes a matrix power.
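A minimal sketch of a random-walk absolute position follows. It assumes the encoding of node k collects the diagonal entries of successive powers of RW = AD^-1, which matches the definitions above but is not confirmed by a surviving formula; the number of powers (`steps`) and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_walk_encoding(adj_matrix, steps):
    """Return, for each node k, [RW_kk, (RW^2)_kk, ..., (RW^steps)_kk]."""
    n = len(adj_matrix)
    degree = [sum(adj_matrix[r][c] for r in range(n)) for c in range(n)]
    # RW = A D^{-1}: column c of A is divided by the degree of node c.
    rw = [
        [adj_matrix[r][c] / degree[c] if degree[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]
    power = rw
    encoding = [[] for _ in range(n)]
    for _ in range(steps):
        for k in range(n):
            encoding[k].append(power[k][k])
        power = matmul(power, rw)
    return encoding


# Path graph 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
encoding = random_walk_encoding(A, steps=3)
```

On this bipartite path graph a walk can only return to its start after an even number of steps, so the odd-power entries are zero.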

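In addition, the relative position of a node is obtained from its positional relation to the center-edge nodes in the subgraph, as follows:

$$p_{rel}(v_k^t) = \begin{cases} 0, & \text{if } v_k^t \text{ is a node of the center edge} \\ 1, & \text{if } v_k^t \text{ is a common neighbor of the two center-edge nodes} \\ 2, & \text{otherwise} \end{cases}$$

That is, if the node is one of the nodes forming the center edge, its position is 0; if it is a common neighbor of the two nodes forming the center edge, its position is 1; otherwise its position is 2.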
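The three-way rule for the relative position translates directly into code. The sketch below assumes a snapshot stored as a dictionary of neighbor sets; the function name is hypothetical.

```python
def relative_position(adj, node, i, j):
    """Relative position of `node` with respect to the center edge (i, j).

    0: node is one of the two center-edge nodes
    1: node is a common neighbor of both center-edge nodes
    2: any other node in the subgraph
    """
    if node in (i, j):
        return 0
    if node in adj.get(i, set()) & adj.get(j, set()):
        return 1
    return 2


snapshot = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
positions = {v: relative_position(snapshot, v, "a", "b") for v in snapshot}
```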

Time position information: the time position of a node is the current timestamp number of the graph in which it lies, expressed as follows:

$$p_{tmp}(v_k^t) = t$$

where $t$ is the current timestamp number of the graph in which node $v_k^t$ lies.

Two-dimensional position combination encoding: the space-time two-dimensional position encoding of a node is obtained by concatenating the sum of its absolute spatial position projection and relative spatial position projection with its time position projection, expressed as follows:

$$p(v_k^t) = \big(p_{abs}(v_k^t)\,W_{abs} + p_{rel}(v_k^t)\,W_{rel}\big) \oplus \big(p_{tmp}(v_k^t)\,W_{tmp}\big)$$

where $p_{abs}$, $p_{rel}$ and $p_{tmp}$ denote the absolute spatial position, relative spatial position and time position of the node, $\oplus$ is the vector concatenation operator, $W_{abs}$ and $W_{rel}$ are the projection matrices of the absolute and relative positions, with dimensions $1 \times d/2$ and $m \times d/2$ respectively, $W_{tmp}$ is the projection matrix of the time position, with dimension $1 \times d/2$, and $d$ is the vector dimension of the whole position encoding.
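The combination rule (sum the two spatial projections, then concatenate the temporal projection) can be sketched as below. The matrix shapes here are chosen only so that the products are defined, with an absolute position vector of length 2 and d = 4; they are illustrative assumptions, as are all weight values and names.

```python
def vec_mat(vec, mat):
    """Row vector times matrix, using plain lists."""
    return [
        sum(vec[r] * mat[r][c] for r in range(len(vec)))
        for c in range(len(mat[0]))
    ]


def position_encoding(p_abs, p_rel, p_tmp, w_abs, w_rel, w_tmp):
    """(p_abs W_abs + p_rel W_rel) concatenated with (p_tmp W_tmp)."""
    spatial = [
        a + b
        for a, b in zip(vec_mat(p_abs, w_abs), vec_mat([p_rel], w_rel))
    ]
    temporal = vec_mat([p_tmp], w_tmp)
    return spatial + temporal  # list concatenation: d/2 + d/2 = d entries


# Hypothetical projection matrices for d = 4 (each half has d/2 = 2 columns).
w_abs = [[1.0, 0.0], [0.0, 2.0]]
w_rel = [[0.5, 0.5]]
w_tmp = [[1.0, -1.0]]
code = position_encoding([0.0, 0.5], 1, 3, w_abs, w_rel, w_tmp)
```

The first half of the result carries spatial information and the second half carries temporal information, so neither overwrites the other.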

3.3 edge feature coding

This sub-step consists of three parts: input sequence feature computation, self-attention network characterization calculation, and edge feature aggregation calculation.

3.3.1 input sequence feature computation

The input sequence features are obtained by summing, for each node in the input sequence, its node input feature and its position encoding feature, expressed as follows:

where $(\cdot)^T$ is the matrix transpose operator.

3.3.2 self-attention network characterization calculation

First, following the self-attention layer of the Transformer model, the input sequence features are deeply characterized by a multi-layer self-attention network, expressed as follows:

where Softmax(·) is the softmax function, $l \in [1, P]$ is the index of the current network layer, and $P$ is the total number of layers of the multi-layer self-attention network. When $l = 1$, the layer input is the initial input feature sequence.

Next, a layer normalization operation and a feed-forward network calculation are applied to the result, expressed as follows:

$H^{(l)} = \mathrm{LN}(A^{(l)} + Q^{(l)})$,

where LN(·) is the layer normalization operation and FFM(·) is the feed-forward network calculation.
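One layer of the computation described above can be sketched in plain Python. The attention step is the standard scaled dot-product form from the Transformer, and the first normalization follows the surviving formula LN(A + Q). The ReLU feed-forward block with a second residual and normalization is an assumption based on the standard Transformer layer, since that formula is not recoverable here; weights and names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def layer_norm(m, eps=1e-5):
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def encoder_layer(h, w_q, w_k, w_v, w_1, w_2):
    """Self-attention, LN(A + Q), then an assumed ReLU feed-forward block."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    attn = matmul(softmax_rows(scores), v)           # A^(l)
    h1 = layer_norm(add(attn, q))                    # H^(l) = LN(A^(l) + Q^(l))
    hidden = [[max(0.0, x) for x in row] for row in matmul(h1, w_1)]
    return layer_norm(add(matmul(hidden, w_2), h1))  # assumed residual + LN


identity = [[1.0, 0.0], [0.0, 1.0]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 sequence positions, dim 2
output = encoder_layer(sequence, identity, identity, identity, identity, identity)
```

Stacking P such layers, each taking the previous layer's output as input, gives the multi-layer network; the embodiment reported in the text uses two layers.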

3.3.3 edge feature coding

A mean aggregation is applied to the depth characterization result of the input sequence to obtain the dynamic graph feature of the center edge corresponding to that sequence, expressed as follows:

$$z = \frac{1}{N} \sum_{n=1}^{N} h_n$$

where $h_n$ is the characterization vector, within the depth characterization result, of the $n$-th node of the corresponding dynamic graph node sequence, and $N$ is the length of that sequence.
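Mean aggregation over the sequence is a column-wise average of the per-node characterization vectors. A minimal sketch, with a hypothetical function name and toy values:

```python
def mean_pool(characterizations):
    """Average the per-node vectors of a sequence into one edge vector."""
    count = len(characterizations)
    return [sum(column) / count for column in zip(*characterizations)]


edge_vector = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```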

4. Combination relation anomaly score and error calculation stage

The feature vector of each edge is converted into an anomaly score with a multi-layer perceptron deep learning model. The error loss between the anomaly score and the label is computed using the positive and negative sample labels obtained by sampling in step 1, and the learnable parameters of the preceding stages are updated by backpropagation.

This stage consists of three sub-steps: edge anomaly score calculation, error loss calculation, and model parameter optimization.

4.1 edge anomaly score calculation

The feature vector of the edge is converted into an anomaly score with a multi-layer perceptron model. The score measures how anomalous the combination relation represented by the edge is, and is expressed as follows:
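As an illustration of turning an edge vector into a score, the sketch below uses a perceptron with one hidden ReLU layer and a sigmoid output. The number of layers, the activations and the weight values are all assumptions; the text only states that a multi-layer perceptron is used.

```python
import math


def mlp_score(z, w1, b1, w2, b2):
    """Hidden layer with ReLU, then a sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(zi * w for zi, w in zip(z, column)) + bias)
        for column, bias in zip(zip(*w1), b1)
    ]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters: 2 inputs -> 2 hidden units -> 1 score.
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, -1.0]
b2 = 0.0
score = mlp_score([1.0, 2.0], w1, b1, w2, b2)
```

With these weights the logit is 1 - 2 = -1, so the score is sigmoid(-1), roughly 0.269.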

4.2 error loss calculation

The error loss between the anomaly score and the label is computed using the positive and negative sample labels obtained by prior sampling, expressed as follows:
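A common choice for a loss between a score in (0, 1) and a binary label is binary cross-entropy. The sketch below assumes that choice; the loss formula itself is not recoverable from the text, so this is an illustration rather than the patent's definition.

```python
import math


def bce_loss(scores, labels, eps=1e-12):
    """Mean binary cross-entropy between scores in (0, 1) and 0/1 labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(scores)


uninformed = bce_loss([0.5, 0.5], [1, 0])
confident = bce_loss([0.99, 0.01], [1, 0])
```

A score of 0.5 for every sample gives a loss of ln 2, and the loss shrinks as scores move toward the correct labels.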

4.3 model parameter optimization

Based on the error loss, the learnable parameters of the preceding stages are updated by backpropagation, and an optimized model is obtained after multiple iterations. This model is used to identify technology point combinations in emerging technical fields.

In this embodiment, AUC (area under the ROC curve) and AP (average precision) are used to evaluate model performance, and tests are run on two self-constructed datasets, artificial intelligence patents and medical projects. The generated positive samples amount to 10% of the negative samples, training runs for 300 epochs, the feature vector dimension is set to 32, the number of self-attention network layers is set to 2, and the learning rate is set to 0.001. The resulting test metrics are:

The models compared with this method are state-of-the-art deep learning models for the dynamic graph anomaly detection task, including AddGraph (https://www.ijcai.org/proceedings/2019/614), StrGNN (https://dl.acm.org/doi/10.1145/3459637.3481955) and TADDY (https://ieeexplore.ieee.org/document/9599560/). It can be seen that the method achieves a clear performance improvement over these existing models.
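For readers unfamiliar with the two metrics named above, here is a small standard-library sketch. AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count half), and AP as the mean of the precision values at each positive in the ranked list. The scores and labels are toy values, not results from the patent.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of precision@rank taken at the rank of every positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
ap = average_precision(toy_scores, toy_labels)
```

In this toy case three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75, and the positives sit at ranks 1 and 3, giving an AP of (1 + 2/3) / 2.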

5. Candidate emerging technical field combination output stage

After the optimized model parameters are obtained through multi-step iterative learning, the anomaly scores of the technology combination relations output by the model are sorted in descending order, the top K highest-scoring technology combinations are taken, and the emerging technical fields are then obtained through manual verification for use in downstream tasks.
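The top-K screening described above amounts to a descending sort on the anomaly score. A minimal sketch, with hypothetical technology names and scores:

```python
def top_k_candidates(edge_scores, k):
    """Return the k technology combinations with the highest anomaly scores."""
    ranked = sorted(edge_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical anomaly scores for three technology combinations.
edge_scores = {
    ("Tech A", "Tech B"): 0.91,
    ("Tech A", "Tech C"): 0.20,
    ("Tech B", "Tech D"): 0.74,
}
candidates = top_k_candidates(edge_scores, k=2)
```

The returned list is only a candidate set; in the method it is passed to manual verification before any field is reported as emerging.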

With the optimized model, anomaly detection can be performed in the application stage on the artificial intelligence patent technology classification dynamic graph and on the cancer medical project technical keyword dynamic graph. The top K technology combinations by anomaly score are output as candidate emerging technical fields, and the final emerging technical fields are obtained through manual verification for use in downstream application scenarios. Taking K = 10, the resulting candidate emerging technology combinations are as follows:

The analysis shows that some of the technical fields represented by the identified combinations have already emerged and become popular, such as Knowledge Reasoning + Natural Language Query, Vehicle Adapting Control + Visual or Acoustic Aids, and Cancer Intervention and Surveillance + Automation; these results indicate that the method helps identify hot research fields. Other combinations point to entirely new technical fields, such as Brain Neoplasms + Cell Linear and Androgen Receptor + Biological Markers; these results indicate that the method helps suggest technical fields that may lead future research directions.

The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.

Claims

1. An emerging technology identification method based on dynamic graph anomaly detection is characterized by comprising the following steps:

S2. For the neighbor subgraph corresponding to each edge in the technology dynamic graph, calculating the multi-level node features of each node in the neighbor subgraph, including a time-space independent feature set and a time-space coupled feature set, projecting the multi-level node features into a feature space with weight parameters, and obtaining the space-time feature vector of each node through aggregation;

S4. Inputting the depth characterization vector of each edge in the latest snapshot of the technology dynamic graph into a multi-layer perceptron deep learning model, converting it into a corresponding anomaly score, and, using the anomaly score as the screening criterion, selecting the several edges with the highest anomaly scores in the latest snapshot, wherein the combination of the two technical fields in the co-occurrence relation corresponding to each selected edge is a candidate emerging technical field;

in step S1, during sub-sampling, for the two nodes forming each edge in the technology dynamic graph, all of their neighbor nodes are selected; these nodes and their edges form the neighbor subgraph corresponding to that edge, and the k-th node of the neighbor subgraph at time t is expressed as follows:

wherein the two neighbor sets are the first-order neighbor node sets of the two nodes forming the edge;

in step S2, for the neighbor subgraph corresponding to each edge in the technology dynamic graph, the space-time feature vector of each node is calculated as follows:

the global spatial feature is represented by the PageRank value of the node in the global graph, and is calculated as follows:

the local spatial feature is represented by the minimum distance from the node to the nodes forming the edge, and is calculated as follows:

the existence time feature is represented by the time span over which the center edge of the node's subgraph has existed, and is calculated as follows:

the distance change feature is represented by the change over time of the distance between the nodes forming the center edge of the node's subgraph, and is calculated as follows:

the interaction change feature is represented by the change over time of the degree of the nodes forming the center edge of the node's subgraph, and is calculated as follows:

the co-neighbor change feature is represented by the change over time of the number of common neighbors of the nodes forming the center edge of the node's subgraph, and is calculated as follows:

wherein $W_g$, $W_l$, $W_t$ are learnable weight parameters for projecting the time-space independent features, and $W_d$, $W_i$, $W_n$ are learnable weight parameters for projecting the time-space coupled features.

2. The emerging technology identification method based on dynamic graph anomaly detection according to claim 1, wherein the technical text is a patent document, and in the constructed patent technology dynamic graph the nodes are patent CPC classification codes, the edges are the combination relations among the first three CPC codes of a patent document, and the timestamp is the patent publication date.

3. The emerging technology identification method based on dynamic graph anomaly detection according to claim 1, wherein the technical text is a project text, and in the constructed project technology dynamic graph the nodes are project technical keywords, the edges are the combination relations among the first five keywords of a project document, and the timestamp is the project publication date.

4. The emerging technology identification method based on dynamic graph anomaly detection according to claim 1, wherein in S3, the depth characterization vector corresponding to the center edge of each neighbor subgraph is obtained as follows:

S32. For each node in the dynamic graph node sequence, the absolute spatial position projection and the relative spatial position projection are summed, and the sum is concatenated with the time position projection to obtain the space-time two-dimensional position encoding information, calculated as follows:

wherein, for the node, the calculation is as follows:

and, for the node, the calculation formula is as follows:

wherein $(\cdot)^T$ is the matrix transpose operator;

first, the attention weight $A^{(l)}$ is calculated as follows:

wherein Softmax(·) is the softmax function, $l \in [1, P]$ is the index of the current network layer, and when $l = 1$ the layer input is the initial input feature sequence;

$H^{(l)} = \mathrm{LN}(A^{(l)} + Q^{(l)})$,

5. The emerging technology identification method based on dynamic graph anomaly detection according to claim 4, wherein in S4, the depth characterization vector of each edge in the latest snapshot of the technology dynamic graph is converted into the corresponding anomaly score, expressed as:

6. The emerging technology identification method based on dynamic graph anomaly detection according to claim 4, wherein in S4, the screened candidate emerging technical fields are sent to a manual review terminal for review, and the final emerging technical fields are generated in combination with the manual review result.

7. The emerging technology identification method based on dynamic graph anomaly detection according to claim 1, wherein, before actual inference, the learnable parameters of each network layer in the emerging technology identification framework formed by S1 to S4 are optimized in a training stage using pre-constructed positive and negative samples.

8. The emerging technology identification method based on dynamic graph anomaly detection according to claim 7, wherein the error loss used to train the emerging technology identification framework is expressed as follows: