CN116595197A - Link prediction method and system for patent classification number associated knowledge graph - Google Patents


Info

Publication number
CN116595197A
Authority
CN
China
Prior art keywords
matrix
node
graph
nodes
classification number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310840162.5A
Other languages
Chinese (zh)
Other versions
CN116595197B (en)
Inventor
陈伟坚
修宇璇
陈博奎
梁京昊
曹可欣
刘兴禄
任欣悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202310840162.5A priority Critical patent/CN116595197B/en
Publication of CN116595197A publication Critical patent/CN116595197A/en
Application granted granted Critical
Publication of CN116595197B publication Critical patent/CN116595197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a link prediction method and system for a patent classification number association knowledge graph. The method comprises the following steps: S1, constructing a patent classification number association knowledge graph; S2, establishing a graph convolutional neural network comprising a feature extraction layer, a graph convolution layer and an output layer, which takes the adjacency matrix of the knowledge graph constructed in step S1 as input and outputs a score matrix of the same size as the adjacency matrix; S3, taking the knowledge graph constructed in step S1 as the initial graph and removing part of its edges to obtain a training graph, then training the network of step S2 with the training graph as input and the initial graph as the label; S4, performing link prediction based on the network trained in step S3. The method can significantly improve the accuracy of link prediction on the patent classification number association knowledge graph without relying on patent text data.

Description

Link prediction method and system for patent classification number associated knowledge graph
Technical Field
The invention relates to the technical fields of artificial intelligence, information management, and knowledge graphs, and in particular to a link prediction method and a link prediction system for a patent classification number association knowledge graph.
Background
A knowledge graph is a network structure that represents relationships between things as a combination of nodes and edges and can be used to express knowledge. In a knowledge graph, nodes (also called "entities") represent specific things, and edges represent relationships or connections between entities.
The patent classification number association knowledge graph is a special knowledge graph that represents association relations among the subdivided technical fields of a specific industry or technical topic: nodes represent technical fields, and an edge represents an association between the two technical fields it connects. Such a graph is generally constructed as follows: first, all patents of a specific industry or technical topic are collected, and all classification numbers of these patents are taken as the nodes of the graph. If two classification numbers appear together in one or more patents, an edge is established between the corresponding nodes, and the weight of the edge is set to the number of times the two classification numbers co-occur in patents.
Link prediction on the patent classification number association knowledge graph means predicting potential connections that may form in the future between node pairs that are currently unconnected in the graph. Such potential connections often indicate opportunities for crossover and fusion between different technical fields. Link prediction on this graph is therefore valuable to researchers: it can reveal unknown, latent relations between subdivided technical fields and provide new ideas and directions for innovation. It can also help governments and enterprises discover trends and hot-spot problems in a technical field, providing strong support for related decision making.
Existing methods for link prediction on the patent classification number association knowledge graph fall roughly into two classes. The first class can be summarized as link prediction indices based on topological similarity. Such methods assume that link formation follows a particular mechanism. For example, the common neighbors (CN) index assumes that if two technical fields share many neighbor nodes in the graph, a potential association is more likely to exist between them. The preferential attachment (PA) index assumes that the probability of a technical field establishing new edges is proportional to the number of edges it has already established. However, such methods are not accurate in practical applications, because they rely on a priori assumptions about the link formation mechanism of the graph, and these assumptions are difficult to satisfy in practice. In real data, the formation of links between technical fields may follow complex mechanisms that are hard to reduce to one or a few explicit rules.
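The two heuristic indices mentioned above can be computed directly from the adjacency structure. As a minimal illustration (not part of the patent itself; names and toy data are assumptions), the CN and PA scores for an unweighted, symmetric graph can be sketched as:

```python
import numpy as np

def cn_pa_scores(adj):
    """Common-neighbor (CN) and preferential-attachment (PA) link
    prediction scores for a binary, symmetric adjacency matrix."""
    A = (np.asarray(adj) > 0).astype(float)  # binarize any edge weights
    cn = A @ A                               # (A^2)_ij = number of common neighbors of i and j
    deg = A.sum(axis=1)                      # node degrees
    pa = np.outer(deg, deg)                  # PA_ij = deg(i) * deg(j)
    return cn, pa

# Toy graph with edges 0-1, 0-2, 1-2, 2-3
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1
cn, pa = cn_pa_scores(A)
# Nodes 0 and 3 are unconnected but share one common neighbor (node 2)
```

Both indices score every unconnected pair with a single closed-form rule, which is exactly the prior assumption the patent argues real co-classification data need not obey.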
The second class can be summarized as link prediction models based on graph neural networks. These methods avoid prior assumptions about the rules governing edge formation between technical fields and instead learn those rules automatically with a graph neural network. However, such models still predict poorly on the patent classification number association knowledge graph. One main reason is the lack of node features: the nodes of the graph are technical fields represented by patent classification numbers, which are essentially category labels. Beyond assigning each node a unique one-hot encoding vector, no richer node features can be provided. Since a graph neural network essentially aggregates features from neighboring nodes, the lack of node features reduces model accuracy.
Currently, the most advanced approach to the lack of node features is to generate feature vectors for the nodes from patent text data. Such methods train a Doc2vec (document vectorization) model on all patent texts corresponding to each patent classification number and use the vector generated by the Doc2vec model as the feature vector of that classification number. In practical applications, however, acquiring and cleaning massive patent text data and training the Doc2vec model consume substantial labor and computing resources, making this approach costly.
In addition, another reason for the poor link prediction accuracy of graph neural networks on the patent classification number association knowledge graph is the presence of many isolated nodes. In the early stage of an industry's (or technical topic's) development, its graph may contain few technical fields (i.e., nodes). Over time, as technology develops, the graph expands to include more technical fields; in other words, the expansion of the graph is accompanied by the establishment of connections to previously isolated nodes. These isolated nodes hinder message propagation and feature aggregation in the graph neural network, reducing its learning ability and accuracy.
Disclosure of Invention
Aiming at the technical problems of the prior art, namely the excessive cost caused by reliance on massive patent text data or the low accuracy, the primary purpose of the invention is to provide a link prediction method for the patent classification number association knowledge graph.
A further object of the invention is to provide a link prediction system for the patent classification number association knowledge graph, comprising a processor and a memory, the memory storing a computer program executable by the processor to implement the link prediction method. The technical problems of the invention are solved by the following technical scheme:
A link prediction method of patent classification number association knowledge graph includes the following steps:
s1, constructing a patent classification number association knowledge graph;
S2, establishing a graph convolutional neural network comprising a feature extraction layer, a graph convolution layer and an output layer, taking the adjacency matrix of the patent classification number association knowledge graph constructed in step S1 as input and a score matrix of the same size as the adjacency matrix as output;
S3, taking the patent classification number association knowledge graph constructed in step S1 as the initial graph, and removing part of the edges of the initial graph to obtain a training graph; training the graph convolutional neural network of step S2 with the training graph as input and the initial graph as the label;
S4, performing link prediction based on the graph convolutional neural network trained in step S3.
In some embodiments, in step S1, constructing the patent classification number association knowledge graph specifically comprises: for a given target industry or target technical topic, retrieving all related patents and constructing the graph according to the International Patent Classification (IPC) codes of the patents. Each node of the graph represents an IPC4 code, i.e. the first four characters of an IPC classification number, which identify the IPC subclass; an edge between two nodes indicates that their IPC4 codes appear together in at least one patent; and the weight of the edge is the number of times the two IPC4 codes appear together in patents.
In some embodiments, in step S2, the feature extraction layer performs a series of transformations on the adjacency matrix of the patent classification number association knowledge graph to obtain a number of transformation matrices, and then performs feature extraction over all transformation matrices to obtain the feature matrix of the nodes, in which each row represents the feature vector of one node.
In some embodiments, the adjacency matrix of the patent classification number association knowledge graph is an N×N matrix A, where A_{ij} = w_{ij} indicates that node i and node j are connected by a link of weight w_{ij}, A_{ij} = 0 indicates that no link is observed between node i and node j, and N is the number of nodes in the knowledge graph constructed in step S1.
The series of transformations comprises at least the following: (1) two binarization matrices B^{sec} and B^{cls}, where B^{sec}_{ij} = 1 indicates that the IPC4 codes of node i and node j belong to the same IPC section and B^{sec}_{ij} = 0 that they do not, while B^{cls}_{ij} = 1 indicates that the IPC4 codes of node i and node j belong to the same IPC class and B^{cls}_{ij} = 0 that they do not; (2) the outer product of the degree vector, d d^T, where d is an N×1 vector, d_i is the degree of node i, and ^T denotes transposition; (3) the powers A^k of the adjacency matrix for k = 2, ..., K.
The feature extraction over all transformation matrices is based on a multi-layer perceptron MLP_1 with input dimension T and output dimension 1, where T is the number of transformation matrices obtained from the transformations selected in step S2. Denoting the transformation matrices by M^{(1)}, ..., M^{(T)}, MLP_1 computes, for every pair of nodes (i, j), a feature value x_{ij} from the values at the corresponding position of the T transformation matrices, i.e. x_{ij} = MLP_1(M^{(1)}_{ij}, ..., M^{(T)}_{ij}). The output of the feature extraction layer is the N×N feature matrix X = (x_{ij}).
In some embodiments, the transformations yield the T transformation matrices as follows: the first transformation yields 2 matrices, the second yields 1, and the third yields K-1, each of size N×N, giving in total T = K+2 transformation matrices M^{(1)}, ..., M^{(T)}.
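Assuming each node carries its four-character IPC4 code (so that the IPC section is the first character and the IPC class the first three characters), the transformation matrices described above can be sketched as follows; the helper names and toy inputs are illustrative, not from the patent:

```python
import numpy as np

def transformation_matrices(A, ipc4_codes, K=3):
    """Build the T = K + 2 transformation matrices of the feature
    extraction layer: B_sec, B_cls, the degree outer product d d^T,
    and the adjacency powers A^2 ... A^K."""
    A = np.asarray(A, dtype=float)
    n = len(ipc4_codes)
    sections = [c[0] for c in ipc4_codes]   # e.g. 'G' from 'G06F'
    classes = [c[:3] for c in ipc4_codes]   # e.g. 'G06' from 'G06F'
    B_sec = np.array([[float(sections[i] == sections[j])
                       for j in range(n)] for i in range(n)])
    B_cls = np.array([[float(classes[i] == classes[j])
                       for j in range(n)] for i in range(n)])
    d = (A > 0).sum(axis=1, keepdims=True).astype(float)  # degree vector, n x 1
    ddT = d @ d.T                                         # outer product of degrees
    powers = [np.linalg.matrix_power(A, k) for k in range(2, K + 1)]
    return [B_sec, B_cls, ddT] + powers

# Two nodes joined by one edge, both in section 'G' and class 'G06'
A = np.array([[0., 1.], [1., 0.]])
mats = transformation_matrices(A, ["G06F", "G06N"], K=3)
# 2 + 1 + (3 - 1) = 5 transformation matrices
```

Each matrix is N×N, so the T values at one position (i, j) form the length-T input that the per-pair perceptron maps to the feature value x_{ij}.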
In some embodiments, in step S2, the graph convolution layer takes each node of the patent classification number association knowledge graph as a target node, aggregates the feature vectors of the other nodes connected to the target node according to the connection relations between nodes to obtain the aggregated feature vector of the target node, and then stacks the aggregated feature vectors of all nodes to obtain the aggregated feature matrix of the nodes.
In some embodiments, the feature vectors of the other nodes connected to the target node are aggregated according to three connection structures: (1) the connection relations represented by the adjacency matrix A of the patent classification number association knowledge graph; (2) the connection relations represented by the binarization matrix B^{sec}, where B^{sec}_{ij} = 1 indicates that the IPC4 codes of node i and node j belong to the same IPC section and B^{sec}_{ij} = 0 that they do not; (3) the connection relations represented by the binarization matrix B^{cls}, where B^{cls}_{ij} = 1 indicates that the IPC4 codes of node i and node j belong to the same IPC class and B^{cls}_{ij} = 0 that they do not.
In method (1), aggregating the feature vectors of the other nodes connected to the target node gives the aggregated feature matrix
H_A = σ( D~^{-1/2} (A + I) D~^{-1/2} X W_A ),
where I is the identity matrix, A~ = A + I is the sum of the adjacency matrix and the identity matrix, D = diag(d_1, ..., d_N) is the diagonal matrix of node degrees, D~ = D + I is the sum of that diagonal matrix and the identity matrix, A^ = D~^{-1/2} A~ D~^{-1/2} is the normalized adjacency matrix, X is the feature matrix output by the feature extraction layer, W_A is an N×N learnable parameter matrix, H_A is the aggregated feature matrix produced by the graph convolution, and σ is a ReLU or Sigmoid activation function.
In method (2), the aggregated feature matrix is
H_sec = σ( D~_sec^{-1/2} B~^{sec} D~_sec^{-1/2} X W_sec ),
where B~^{sec} = B^{sec} + I, D~_sec is the diagonal matrix of node degrees computed from B~^{sec}, i.e. (D~_sec)_{ii} = Σ_j B~^{sec}_{ij}, and W_sec is an N×N learnable parameter matrix.
In method (3), the aggregated feature matrix is
H_cls = σ( D~_cls^{-1/2} B~^{cls} D~_cls^{-1/2} X W_cls ),
where B~^{cls} = B^{cls} + I, D~_cls is the diagonal matrix of node degrees computed from B~^{cls}, i.e. (D~_cls)_{ii} = Σ_j B~^{cls}_{ij}, and W_cls is an N×N learnable parameter matrix.
In some embodiments, in the graph convolution layer, the three aggregated feature matrices H_A, H_sec and H_cls are fused by a multi-layer perceptron MLP_2 with input dimension 3 and output dimension 1, which computes, for every pair of nodes (i, j), a value h_{ij} from the values at the corresponding position of the three aggregated feature matrices:
h_{ij} = MLP_2( (H_A)_{ij}, (H_sec)_{ij}, (H_cls)_{ij} ).
The output of the graph convolution layer is the N×N matrix H = (h_{ij}).
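The normalized propagation rule shared by the three aggregation methods, H = σ(D~^{-1/2}(A + I)D~^{-1/2} X W), can be sketched as follows (an illustrative NumPy sketch with assumed names, not the patent's implementation):

```python
import numpy as np

def gcn_aggregate(A, X, W, activation=lambda z: np.maximum(z, 0.0)):
    """One graph-convolution aggregation step:
    H = sigma( D~^{-1/2} (A + I) D~^{-1/2} X W ), ReLU by default."""
    A = np.asarray(A, dtype=float)
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_tilde = A_tilde.sum(axis=1)               # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # normalized connection structure
    return activation(A_hat @ X @ W)

# Two-node toy graph; X and W are identity matrices for readability
A = np.array([[0., 1.], [1., 0.]])
H = gcn_aggregate(A, np.eye(2), np.eye(2))
# The same routine is applied with B_sec or B_cls in place of A, so an
# isolated node under A can still aggregate features through nodes
# sharing its IPC section or class.
```

This design choice is what lets prior community structure (shared IPC section or class) carry features to nodes that have no observed co-classification edges yet.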
In some embodiments, in step S2, the output layer computes a score for each pair of nodes from the feature matrix output by the feature extraction layer and the aggregated feature matrix output by the graph convolution layer, the score representing the likelihood that a link exists between the pair of nodes. Preferably, the feature matrix X output by the feature extraction layer and the aggregated feature matrix H output by the graph convolution layer are added and normalized by a Sigmoid layer, giving as final output the N×N score matrix S = Sigmoid(X + H).
In some embodiments, in step S3, the patent classification number association knowledge graph constructed in step S1 is taken as the initial graph, part of its edges are removed to obtain a training graph, and the graph convolutional neural network of step S2 is trained with the training graph as input and the initial graph as the label, specifically comprising:
S3-1, taking the patent classification number association knowledge graph constructed in step S1 as the initial graph G = (V, E), where V is the set of all N nodes and E is the set of all edges;
S3-2, randomly removing part of the edges to obtain the training patent co-classification graph G' = (V, E'), where E' is the set of remaining edges of G'; the adjacency matrix of G' is denoted A';
S3-3, taking A' as the input of the graph neural network and the adjacency matrix A of the initial graph G (binarized) as the training label, and training the network by stochastic gradient descent; the training loss is the weighted binary cross-entropy
Loss = - Σ_{i,j} [ β A_{ij} log S_{ij} + (1 - A_{ij}) log(1 - S_{ij}) ],
where S is the output of the graph neural network and β is a weight balancing the numbers of positive and negative samples.
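The weighted binary cross-entropy above can be sketched numerically as follows (a NumPy illustration under the assumption that the label is the binarized adjacency of the initial graph; names and sample values are assumed):

```python
import numpy as np

def weighted_bce(S, A_label, beta, eps=1e-12):
    """Weighted binary cross-entropy between the score matrix S
    (network output, entries in (0, 1)) and the binarized adjacency
    A_label of the initial graph; beta up-weights the scarce positive
    entries (observed edges) against the many negative ones."""
    S = np.clip(S, eps, 1.0 - eps)           # numerical safety for log
    pos = beta * A_label * np.log(S)         # observed-edge term
    neg = (1.0 - A_label) * np.log(1.0 - S)  # non-edge term
    return -np.sum(pos + neg)

S = np.array([[0.9, 0.1],
              [0.2, 0.8]])
A_label = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
loss = weighted_bce(S, A_label, beta=5.0)
```

Because real co-classification graphs are sparse, almost all entries of the label are 0; without the β weight the network could minimize the loss by predicting no links at all.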
In some embodiments, in step S4, performing link prediction based on the graph convolutional neural network trained in step S3 specifically comprises: inputting the patent classification number association knowledge graph constructed in step S1 into the trained network to obtain the score matrix; ranking, in descending order of score, every pair of nodes between which no edge is observed in the graph constructed in step S1; and selecting the top L pairs as the L links most likely to appear.
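The ranking step of S4, sorting all unobserved node pairs by score and keeping the top L, can be sketched as (an illustrative sketch with assumed names):

```python
import numpy as np

def top_l_links(scores, A, L):
    """Return the L unobserved node pairs (i < j) with the highest
    scores, i.e. the L links predicted as most likely to appear."""
    n = A.shape[0]
    candidates = [(scores[i, j], i, j)
                  for i in range(n) for j in range(i + 1, n)
                  if A[i, j] == 0]            # keep pairs with no observed edge
    candidates.sort(reverse=True)             # descending by score
    return [(i, j) for _, i, j in candidates[:L]]

S = np.array([[0.0, 0.9, 0.3],
              [0.9, 0.0, 0.7],
              [0.3, 0.7, 0.0]])
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
# Pair (0,1) already has an edge, so the top prediction is (1,2)
print(top_l_links(S, A, 1))  # → [(1, 2)]
```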
The invention also provides a link prediction system for the patent classification number association knowledge graph, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the above link prediction method when executing the computer program.
Compared with the prior art, the invention has the beneficial effects that:
the invention adaptively generates the feature vector of the node by establishing a graph roll neural network comprising a feature extraction layer, a graph roll layer and an output layer, wherein the feature extraction layer is used for generating the feature vector of the node by various transformations of an adjacency matrix based on a patent classification number-associated knowledge graph; the graph roll stacking utilizes the prior community structure in the patent classification number association knowledge graph, so that the isolated nodes can transfer and aggregate the characteristics among other nodes belonging to the same community through the graph roll stacking, and the accuracy of the link prediction of the patent classification number association knowledge graph can be remarkably improved on the premise of not depending on patent text data.
Other advantages of embodiments of the present invention are further described below.
Drawings
FIG. 1 is a flowchart of the link prediction method of the patent classification number association knowledge graph in an embodiment of the invention;
FIG. 2 is a schematic diagram of the graph convolutional neural network structure of the embodiment of the invention;
FIG. 3 is a comparison of the performance of the link prediction method of the patent classification number association knowledge graph according to an embodiment of the invention with 12 other common link prediction methods for a search range of 10 years;
FIG. 4 is the corresponding comparison for a search range of 15 years;
FIG. 5 is the corresponding comparison for a search range of 20 years;
FIG. 6 is the corresponding comparison for a search range of 25 years.
Detailed Description
The application will be further described with reference to the following drawings in conjunction with the preferred embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that, in this embodiment, the terms of left, right, upper, lower, top, bottom, etc. are merely relative terms, or refer to the normal use state of the product, and should not be considered as limiting.
The embodiment of the application provides a link prediction method of a patent class number association knowledge graph based on a graph neural network, which is shown in fig. 1 and comprises the following steps:
S1, constructing a patent classification number association knowledge graph;
S2, taking the adjacency matrix of the patent classification number association knowledge graph constructed in step S1 as input and a score matrix of the same size as the adjacency matrix as output, establishing a graph convolutional neural network comprising at least a feature extraction layer, a graph convolution layer and an output layer;
S3, taking the patent classification number association knowledge graph constructed in step S1 as the initial graph, and removing part of the edges of the initial graph to obtain a training graph; training the graph convolutional neural network of step S2 with the training graph as input and the initial graph as the label;
S4, performing link prediction based on the graph convolutional neural network trained in step S3.
The method solves the technical problems of the existing link prediction methods for the patent classification number association knowledge graph, namely the excessive cost caused by reliance on massive patent text data, or the low accuracy. The link prediction method provided by the embodiment of the invention can significantly improve the accuracy of link prediction on the patent classification number association knowledge graph without relying on patent text data.
Specifically, the link prediction method provided by the embodiment of the invention comprises a novel graph neural network structure, shown in fig. 2, which comprises a feature extraction layer, a graph convolution layer and an output layer. The feature extraction layer adaptively generates the feature vectors of the nodes from various transformations of the adjacency matrix of the patent classification number association knowledge graph, solving the lack of node features without relying on patent text data. The graph convolution layer exploits the prior community structure of the graph, so that isolated nodes can propagate and aggregate features with other nodes belonging to the same community through the graph convolution, which significantly improves the accuracy of link prediction without relying on patent text data.
The link prediction method of the patent classification number association knowledge graph provided by the embodiment of the invention, as shown in fig. 1, comprises the following steps:
step S1: and constructing a patent classification number association knowledge graph. The specific operation is as follows:
For a given target industry or target technical topic, all relevant patents are retrieved, and the patent classification number association knowledge graph is constructed according to the International Patent Classification (IPC) code of each patent. In the graph, each node represents an IPC4 code, i.e. the first four characters of an IPC classification number, which identify the IPC subclass; an edge between two nodes indicates that their IPC4 codes appear together in at least one patent; and the weight of the edge is the number of times the two IPC4 codes appear together in patents.
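As a minimal sketch of this construction step (variable names and sample IPC4 codes are assumed, not from the patent), building the weighted co-classification graph from each patent's list of IPC4 codes:

```python
from itertools import combinations
from collections import Counter

def build_coclassification_graph(patent_ipc4_lists):
    """Each patent is given as a list of IPC4 codes (first four
    characters of its IPC classification numbers).  Nodes are IPC4
    codes; the weight of an edge between two codes is the number of
    patents in which both appear together."""
    nodes = set()
    edge_weights = Counter()
    for codes in patent_ipc4_lists:
        uniq = sorted(set(codes))      # count each pair at most once per patent
        nodes.update(uniq)
        for a, b in combinations(uniq, 2):
            edge_weights[(a, b)] += 1
    return nodes, dict(edge_weights)

patents = [["G06F", "G06N"], ["G06F", "G06N", "H04L"], ["G06F"]]
nodes, edges = build_coclassification_graph(patents)
# ("G06F", "G06N") co-occur in two patents, so that edge has weight 2
```

The returned edge-weight dictionary maps directly to the adjacency matrix A used by the network, with A_{ij} equal to the co-occurrence count and 0 for unobserved pairs.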
Step S2: establishing the graph convolutional neural network. Specifically, the input of the network is the adjacency matrix of the patent classification number association knowledge graph constructed in step S1, and the output is a score matrix of the same size as the adjacency matrix, in which each score represents the likelihood of a connection being established between the corresponding pair of nodes. The network comprises at least a feature extraction layer, a graph convolution layer and an output layer.
The feature extraction layer first performs a series of transformations on the adjacency matrix of the patent classification number association knowledge graph to obtain a number of transformation matrices, and then performs feature extraction over all transformation matrices to obtain the feature matrix of the nodes. Each row of this matrix is the feature vector of one node.
The adjacency matrix of the patent classification number association knowledge graph is $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = w > 0$ indicates that there is a link of weight $w$ between node $i$ and node $j$, $A_{ij} = 0$ indicates that no link is observed between node $i$ and node $j$, and $N$ is the number of nodes in the patent classification number association knowledge graph constructed in step S1;
The series of transformations of the adjacency matrix of the patent classification number association knowledge graph is as follows:
(1) Two binarization matrices $B^{\mathrm{sec}}$ and $B^{\mathrm{cls}}$, where $B^{\mathrm{sec}}_{ij} = 1$ indicates that the IPC4 codes (IPC subclasses) corresponding to node $i$ and node $j$ belong to the same IPC section, $B^{\mathrm{sec}}_{ij} = 0$ indicates that they do not belong to the same IPC section, $B^{\mathrm{cls}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC class, and $B^{\mathrm{cls}}_{ij} = 0$ indicates that they do not belong to the same IPC class. (2) The product matrix of the degree vectors $d d^{\top}$, where $d \in \mathbb{R}^{N \times 1}$ is the vector whose entry $d_i$ is the degree of node $i$, and $\top$ denotes transposition. (3) The powers of the adjacency matrix from 2 to $K$: $A^2, A^3, \dots, A^K$, where $K \geq 2$.
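The three kinds of transformations can be sketched as below. This is an illustrative numpy sketch, not the patented code; the choice $K = 4$ and the use of weighted degrees are assumptions (the patent leaves $K$ as a hyperparameter).

```python
import numpy as np

def adjacency_transforms(A, ipc4_codes, K=4):
    """Return the T = K + 2 transformation matrices of the feature
    extraction layer: same-section, same-class, degree product, powers."""
    sections = [c[0] for c in ipc4_codes]   # IPC section, e.g. 'B' in 'B29C'
    classes = [c[:3] for c in ipc4_codes]   # IPC class, e.g. 'B29' in 'B29C'
    n = len(ipc4_codes)
    B_sec = np.array([[float(sections[i] == sections[j]) for j in range(n)]
                      for i in range(n)])
    B_cls = np.array([[float(classes[i] == classes[j]) for j in range(n)]
                      for i in range(n)])
    d = A.sum(axis=1, keepdims=True)        # (weighted) degree vector
    D_prod = d @ d.T                        # rank-one product d d^T
    powers = [np.linalg.matrix_power(A, k) for k in range(2, K + 1)]
    return [B_sec, B_cls, D_prod] + powers  # T = K + 2 matrices in total
```

For a two-node graph with codes "B29C" and "B33Y", both nodes share section B (so $B^{\mathrm{sec}}$ is all ones) but differ in class (so $B^{\mathrm{cls}}$ is the identity).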
The feature extraction over all the transformation matrices is based on a Multi-Layer Perceptron (MLP) whose input dimension is $T$ and whose output dimension is 1, where $T = K + 2$ is the number of transformations selected in step S2. For every pair of nodes $(i, j)$, the MLP takes the values at the corresponding position of the $T$ transformation matrices $M^{(1)}, \dots, M^{(T)}$ and computes a feature value $H_{ij}$, i.e. $H_{ij} = \mathrm{MLP}(M^{(1)}_{ij}, \dots, M^{(T)}_{ij})$. The output of the feature extraction layer is the matrix $H$ of size $N \times N$.
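The position-wise MLP can be sketched as follows; it applies the same small network at every matrix position. The two-layer shape, ReLU choice, and fixed weights are assumptions for illustration (in the patent the weights are learned).

```python
import numpy as np

def positionwise_mlp(mats, W1, b1, w2, b2):
    """Apply a two-layer MLP (T -> hidden -> 1) independently at each
    (i, j) position of the T stacked transformation matrices, giving the
    N x N feature matrix H with H_ij = MLP(M1_ij, ..., MT_ij)."""
    X = np.stack(mats, axis=-1)            # shape (N, N, T)
    hidden = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer
    return hidden @ w2 + b2                # shape (N, N)
```

With identity weights and a summing output layer, the result is simply the element-wise sum of the (non-negative) input matrices, which makes the position-wise behaviour easy to check by hand.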
The graph convolution layer aggregates the feature vectors of the other nodes connected to each target node. Specifically, the graph convolution layer takes each node in the patent classification number association knowledge graph in turn as the target node, aggregates the feature vectors of the other nodes connected to it according to the connection relations between the nodes to obtain the aggregate feature vector of the target node, and then stacks the aggregate feature vectors of all nodes to obtain the aggregate feature matrix of the nodes.
The aggregation of the feature vectors of the other nodes connected to the target node can follow three methods:
Method (1) is based on the connection relations between nodes represented by the adjacency matrix $A$ of the patent co-classification graph.
The graph convolution operation that aggregates the node feature vectors yields the aggregate feature matrix $Z^{(A)} = \sigma(\hat{A} H W^{(A)})$, where $I$ is the identity matrix, $\tilde{A} = A + I$ is the sum of the adjacency matrix of the patent co-classification graph and the identity matrix, $D$ is the diagonal matrix formed by the degrees of the nodes, $\tilde{D} = D + I$ is the sum of this diagonal matrix and the identity matrix, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the normalized adjacency matrix, $H$ is the matrix output by the feature extraction layer (i.e., the feature matrix of the nodes before the graph convolution operation), $W^{(A)}$ is a learnable parameter matrix of size $N \times N$, $Z^{(A)}$ is the node aggregate feature matrix produced by the graph convolution of method (1), and $\sigma$ is an activation function such as ReLU or Sigmoid.
Method (2) is based on the connection relations between nodes represented by the binarization matrix $B^{\mathrm{sec}}$, where $B^{\mathrm{sec}}_{ij} = 1$ indicates that the IPC4 codes (IPC subclasses) corresponding to node $i$ and node $j$ belong to the same IPC section, and $B^{\mathrm{sec}}_{ij} = 0$ indicates that they do not. The graph convolution operation that aggregates the node feature vectors yields the aggregate feature matrix $Z^{(\mathrm{sec})} = \sigma((\tilde{D}^{\mathrm{sec}})^{-1/2} \tilde{B}^{\mathrm{sec}} (\tilde{D}^{\mathrm{sec}})^{-1/2} H W^{(\mathrm{sec})})$, where $\tilde{B}^{\mathrm{sec}} = B^{\mathrm{sec}} + I$, $\tilde{D}^{\mathrm{sec}}$ is the diagonal matrix of node degrees computed from $\tilde{B}^{\mathrm{sec}}$, i.e. $\tilde{D}^{\mathrm{sec}}_{ii} = \sum_j \tilde{B}^{\mathrm{sec}}_{ij}$, $W^{(\mathrm{sec})}$ is a learnable parameter matrix of size $N \times N$, and $Z^{(\mathrm{sec})}$ is the node aggregate feature matrix produced by the graph convolution of method (2).
Method (3) is based on the connection relations between nodes represented by the binarization matrix $B^{\mathrm{cls}}$, where $B^{\mathrm{cls}}_{ij} = 1$ indicates that the IPC4 codes (IPC subclasses) corresponding to node $i$ and node $j$ belong to the same IPC class, and $B^{\mathrm{cls}}_{ij} = 0$ indicates that they do not.
The graph convolution operation that aggregates the node feature vectors yields the aggregate feature matrix $Z^{(\mathrm{cls})} = \sigma((\tilde{D}^{\mathrm{cls}})^{-1/2} \tilde{B}^{\mathrm{cls}} (\tilde{D}^{\mathrm{cls}})^{-1/2} H W^{(\mathrm{cls})})$, where $\tilde{B}^{\mathrm{cls}} = B^{\mathrm{cls}} + I$, $\tilde{D}^{\mathrm{cls}}$ is the diagonal matrix of node degrees computed from $\tilde{B}^{\mathrm{cls}}$, i.e. $\tilde{D}^{\mathrm{cls}}_{ii} = \sum_j \tilde{B}^{\mathrm{cls}}_{ij}$, and $W^{(\mathrm{cls})}$ is a learnable parameter matrix of size $N \times N$.
The relation between $Z^{(A)}$, $Z^{(\mathrm{sec})}$ and $Z^{(\mathrm{cls})}$ is that all three are results of graph convolution. They differ in the graph convolution operations they undergo, both in the graph (i.e., the connection relation matrices $A$, $B^{\mathrm{sec}}$ and $B^{\mathrm{cls}}$) and in the convolution kernel (i.e., the parameter matrices $W^{(A)}$, $W^{(\mathrm{sec})}$ and $W^{(\mathrm{cls})}$).
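Since the three methods share one normalization scheme and differ only in the connection matrix and kernel, the shared aggregation can be sketched as a single helper. This is an assumption-laden numpy sketch (function name, ReLU as the activation) rather than the patented code.

```python
import numpy as np

def gcn_aggregate(B, H, W):
    """One graph-convolution aggregation sigma(D^-1/2 (B+I) D^-1/2 H W),
    where B is any of the three connection matrices (A, B_sec, B_cls),
    H the feature matrix, W the learnable kernel, and sigma = ReLU."""
    B_tilde = B + np.eye(B.shape[0])             # add self-loops
    d = B_tilde.sum(axis=1)                      # degrees incl. self-loop
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    B_hat = D_inv_sqrt @ B_tilde @ D_inv_sqrt    # normalized adjacency
    return np.maximum(B_hat @ H @ W, 0.0)        # ReLU activation
```

On a two-node graph with one edge and identity $H$ and $W$, the normalized adjacency is the constant matrix $0.5$, so every aggregated entry equals $0.5$.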
In the graph convolution layer, the three aggregate feature matrices $Z^{(A)}$, $Z^{(\mathrm{sec})}$ and $Z^{(\mathrm{cls})}$ are combined by an MLP whose input dimension is 3 and whose output dimension is 1. For every pair of nodes $(i, j)$, the MLP takes the values at the corresponding position of the three aggregate feature matrices and computes a value $Z_{ij}$, i.e. $Z_{ij} = \mathrm{MLP}(Z^{(A)}_{ij}, Z^{(\mathrm{sec})}_{ij}, Z^{(\mathrm{cls})}_{ij})$. The output of the graph convolution layer is the matrix $Z$ of size $N \times N$.
The output layer computes a score for each pair of nodes based on the feature matrix output by the feature extraction layer and the aggregate feature matrix output by the graph convolution layer; the score represents the likelihood that a link exists between that pair of nodes. Preferably, the matrix $H$ output by the feature extraction layer and the matrix $Z$ output by the graph convolution layer are added and normalized by a Sigmoid layer, giving the final output: the score matrix $S = \mathrm{Sigmoid}(H + Z)$ of size $N \times N$.
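The output layer itself is a one-liner; the sketch below (numpy assumed) only shows the add-then-sigmoid normalization described above.

```python
import numpy as np

def output_scores(H, Z):
    """Output layer: score matrix S = Sigmoid(H + Z); S_ij in (0, 1) is
    the predicted likelihood of a link between node i and node j."""
    return 1.0 / (1.0 + np.exp(-(H + Z)))
```

When $H + Z = 0$ the score is exactly 0.5, and large positive logits push the score toward 1, as expected of a sigmoid.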
Step S3: training the graph roll-up neural network of the step S2. Specifically, the patent classification number association knowledge graph constructed in the step S1 is firstly used as an initial graph, and then part of continuous edges in the initial graph are removed to obtain a training graph. And training the graph convolution neural network of the step S2 by taking the training graph as input and the initial graph as output. The method specifically comprises the following steps:
S3-1, taking the patent classification number association knowledge graph constructed in step S1 as the initial graph $G = (V, E)$, where $V$ is the set of all $N$ nodes and $E$ is the set of all edges;
S3-2, randomly removing part of the edges of $G$ to obtain the training patent co-classification graph $G' = (V', E')$, where $V'$ is the set of all nodes of $G'$ and $E'$ is the set of all edges of $G'$. The adjacency matrix of $G'$ is denoted $A'$.
S3-3, with $A'$ as the input of the graph neural network and the adjacency matrix $A$ of $G$ as the training label, the graph neural network is trained by stochastic gradient descent. The training loss function is the binary cross-entropy (BCE) loss: $\mathcal{L} = -\sum_{i,j} \left[ w \cdot \mathbb{1}(A_{ij} > 0) \log S_{ij} + \mathbb{1}(A_{ij} = 0) \log(1 - S_{ij}) \right]$, where $S$ is the output of the graph neural network and $w$ is the weight balancing the numbers of positive and negative samples.
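The balanced BCE loss of step S3-3 can be sketched as follows (a numpy sketch with an assumed mean reduction and a small epsilon for numerical safety; neither detail is specified in the patent).

```python
import numpy as np

def weighted_bce(S, A, pos_weight):
    """Binary cross-entropy with a positive-class weight: labels are 1
    where the initial graph has an edge (A_ij > 0), 0 elsewhere."""
    Y = (A > 0).astype(float)
    eps = 1e-12  # avoid log(0)
    return -np.mean(pos_weight * Y * np.log(S + eps)
                    + (1 - Y) * np.log(1 - S + eps))
```

With uniform scores of 0.5 and weight 1, every entry contributes $-\log 0.5$, so the loss is $\log 2 \approx 0.693$.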
Step S4: and (3) carrying out link prediction based on the graph convolution neural network obtained through training in the step (S3). Inputting the patent classification number correlation knowledge graph constructed in the step S1 into the graph convolution neural network trained in the step S3 to obtain a scoring matrix. And (3) according to the scores, carrying out descending arrangement on each pair of nodes without the observed continuous edges in the patent classification number association knowledge graph constructed in the step (S1), and selecting the former L pairs of nodes as L links most likely to occur.
A link prediction system of patent class number association knowledge graph comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor realizes the link prediction method when executing the computer program.
Examples: in this embodiment, the initial year of the target industry (i.e., the application year of the earliest related patent) is denoted $t_0$. The patent classification number association knowledge graph is constructed and link prediction is performed for each year $t$ (where $t > t_0$), with the goal of predicting the new links that appear in the following $\Delta$ years ($\Delta$ is the "forecasting horizon").
This embodiment selects the 3D printing industry as the target industry. 3D printing is a manufacturing method that creates three-dimensional objects by stacking materials layer by layer; it involves the fields of material science, mechanical engineering, computer science, design and manufacturing, and has been widely applied in the medical, aerospace, automobile manufacturing, construction, consumer goods and other industries. According to the Wohlers report (a 3D printing industry report), the market size of the global 3D printing industry reached 12.758 billion dollars in 2020, a 7.5% increase over 2019.
In this embodiment, step S1 is specifically: a patent search query is constructed for the target industry, 3D printing: TAB=((3D OR 3-D OR (3 ADJ dimension) OR (three ADJ2 dimension) OR additive) NEAR (print OR fabricat OR manufactur)). The relevant patents are retrieved from the Derwent Innovation database to obtain the patents of the 3D printing industry, and the application year, publication number and IPC classification numbers of each patent are exported. For the 3D printing industry, the starting year is $t_0 = 1986$. In this embodiment, the whole 4-character IPC code (IPC4) is used as the node to construct the initial patent co-classification graph $G = (V, E)$, where $V$ is the set of all $N$ nodes and $E$ is the set of all edges. The adjacency matrix of $G$ is denoted $A \in \mathbb{R}^{N \times N}$, where $N$ is the number of nodes in the patent classification number association knowledge graph and $A_{ij}$ is the number of patents with application year from $t_0$ to $t$ in which the two IPC4 codes corresponding to node $i$ and node $j$ appear simultaneously. In particular, $A_{ij} = 0$ indicates that the two IPC4 codes corresponding to node $i$ and node $j$ never appear simultaneously in a patent with application year from $t_0$ to $t$.
In step S2, the graph neural network shown in fig. 1 is constructed. The graph neural network includes a feature extraction layer, a graph convolution layer and an output layer. The input of the graph neural network is the adjacency matrix $A$ of the patent co-classification graph, and the output is a score matrix $S$ of the same size as the adjacency matrix, where each score $S_{ij}$ represents the likelihood that a link forms between node $i$ and node $j$.
In the feature extraction layer, the following transformations are first performed:
(1) Two binarization matrices $B^{\mathrm{sec}}$ and $B^{\mathrm{cls}}$, where $B^{\mathrm{sec}}_{ij} = 1$ indicates that the IPC4 codes (IPC subclasses) corresponding to node $i$ and node $j$ belong to the same IPC section, $B^{\mathrm{sec}}_{ij} = 0$ indicates that they do not belong to the same IPC section, $B^{\mathrm{cls}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC class, and $B^{\mathrm{cls}}_{ij} = 0$ indicates that they do not belong to the same IPC class.
(2) The product matrix of the degree vectors $d d^{\top}$, where $d \in \mathbb{R}^{N \times 1}$ is the vector whose entry $d_i$ is the degree of node $i$, and $\top$ denotes transposition.
(3) The powers of the adjacency matrix from 2 to $K$: $A^2, A^3, \dots, A^K$, where $K \geq 2$.
In total there are $T$ transformations: the first kind contributes 2, the second kind 1, and the third kind $K - 1$, and each transformation yields a matrix of size $N \times N$, denoted $M^{(t)}$; thus $T = K + 2$ matrices $M^{(1)}, \dots, M^{(T)}$ are obtained.
Then, feature extraction is performed over all the transformation matrices by a Multi-Layer Perceptron (MLP) whose input dimension is $T$ and whose output dimension is 1. For every pair of nodes $(i, j)$, the MLP takes the values at the corresponding position of the $T$ transformation matrices $M^{(1)}, \dots, M^{(T)}$ and computes a feature value $H_{ij}$, i.e. $H_{ij} = \mathrm{MLP}(M^{(1)}_{ij}, \dots, M^{(T)}_{ij})$.
The output of the feature extraction layer is the matrix $H$ of size $N \times N$.
In the graph convolution layer, the following three methods are used to aggregate the node feature vectors:
(1) Based on the connection relations between nodes represented by the adjacency matrix $A$ of the patent co-classification graph, the node feature vectors are aggregated by the graph convolution operation $Z^{(A)} = \sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H W^{(A)})$, where $\tilde{A} = A + I$, $\tilde{D}$ is the diagonal matrix formed by the node degrees of $\tilde{A}$ (i.e. $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), $I$ is the identity matrix, $H$ is the matrix output by the feature extraction layer, and $W^{(A)}$ is a learnable parameter matrix of size $N \times N$.
(2) Based on the connection relations between nodes represented by the binarization matrix $B^{\mathrm{sec}}$ (where $B^{\mathrm{sec}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC section, and $B^{\mathrm{sec}}_{ij} = 0$ indicates that they do not), the node feature vectors are aggregated by the graph convolution operation $Z^{(\mathrm{sec})} = \sigma((\tilde{D}^{\mathrm{sec}})^{-1/2} \tilde{B}^{\mathrm{sec}} (\tilde{D}^{\mathrm{sec}})^{-1/2} H W^{(\mathrm{sec})})$, where $\tilde{B}^{\mathrm{sec}} = B^{\mathrm{sec}} + I$, $\tilde{D}^{\mathrm{sec}}$ is the diagonal matrix of node degrees computed from $\tilde{B}^{\mathrm{sec}}$ (i.e. $\tilde{D}^{\mathrm{sec}}_{ii} = \sum_j \tilde{B}^{\mathrm{sec}}_{ij}$), and $W^{(\mathrm{sec})}$ is a learnable parameter matrix of size $N \times N$.
(3) Based on the connection relations between nodes represented by the binarization matrix $B^{\mathrm{cls}}$ (where $B^{\mathrm{cls}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC class, and $B^{\mathrm{cls}}_{ij} = 0$ indicates that they do not), the node feature vectors are aggregated by the graph convolution operation $Z^{(\mathrm{cls})} = \sigma((\tilde{D}^{\mathrm{cls}})^{-1/2} \tilde{B}^{\mathrm{cls}} (\tilde{D}^{\mathrm{cls}})^{-1/2} H W^{(\mathrm{cls})})$, where $\tilde{B}^{\mathrm{cls}} = B^{\mathrm{cls}} + I$, $\tilde{D}^{\mathrm{cls}}$ is the diagonal matrix of node degrees computed from $\tilde{B}^{\mathrm{cls}}$ (i.e. $\tilde{D}^{\mathrm{cls}}_{ii} = \sum_j \tilde{B}^{\mathrm{cls}}_{ij}$), and $W^{(\mathrm{cls})}$ is a learnable parameter matrix of size $N \times N$.
In the graph convolution layer, the three aggregate feature matrices $Z^{(A)}$, $Z^{(\mathrm{sec})}$ and $Z^{(\mathrm{cls})}$ are combined by an MLP whose input dimension is 3 and whose output dimension is 1. For every pair of nodes $(i, j)$, the MLP takes the values at the corresponding position of the three aggregate feature matrices and computes a value $Z_{ij}$, i.e. $Z_{ij} = \mathrm{MLP}(Z^{(A)}_{ij}, Z^{(\mathrm{sec})}_{ij}, Z^{(\mathrm{cls})}_{ij})$. The output of the graph convolution layer is the matrix $Z$ of size $N \times N$.
In the output layer, the matrix $H$ output by the feature extraction layer and the matrix $Z$ output by the graph convolution layer are added and normalized by a Sigmoid layer, giving the final output: the score matrix $S = \mathrm{Sigmoid}(H + Z)$ of size $N \times N$.
Step S3 is specifically: part of the edges are randomly removed to obtain the training patent co-classification graph $G' = (V', E')$, where $V'$ is the set of all nodes of $G'$ and $E'$ is the set of all edges of $G'$. The adjacency matrix of $G'$ is denoted $A'$. With $A'$ as the input of the graph neural network and the adjacency matrix $A$ of $G$ as the training label, the graph neural network is trained by stochastic gradient descent. The training loss function is the binary cross-entropy (BCE) loss $\mathcal{L} = -\sum_{i,j} \left[ w \cdot \mathbb{1}(A_{ij} > 0) \log S_{ij} + \mathbb{1}(A_{ij} = 0) \log(1 - S_{ij}) \right]$, where $S$ is the output of the graph neural network and $w$ is the weight balancing the numbers of positive and negative samples.
Step S4 is specifically: technical opportunity identification is performed based on the graph convolutional neural network trained in step S3. The patent co-classification graph constructed in step S1 is input into the graph convolutional neural network trained in step S3 to obtain the score matrix. Every pair of nodes without an observed edge in the patent co-classification graph constructed in step S1 is sorted in descending order of score, and the top $L$ pairs of nodes are selected as the identified technical fields with technology-fusion opportunities. Take $t = 2005$, $\Delta = 5$ and $L = 5$ as an example, i.e., based on the patent data from 1986 to 2005, the possible technical associations in the following 5 years are predicted, and the 5 node pairs with the highest scores are output as the identified technical fields with technology-fusion opportunities in 2006. Inputting the patent co-classification graph constructed from the 1986-2005 patent data into the trained graph neural network, the 5 node pairs with the highest scores are:
C09D (coating composition) -H01L (semiconductor device)
B01J (chemical or physical method) -C08L (composition of high molecular compound)
A61K (formulation for medicine, dentistry or dressing) -H01L (semiconductor device)
B29C (shaping of plastics or materials in the plastic state) -C07C (acyclic or carbocyclic compounds)
C07C (acyclic or carbocyclic compound) -H01L (semiconductor device)
This embodiment takes these 5 node pairs as the 5 links most likely to appear in the 3D printing patent classification number association knowledge graph from 2006 to 2011. (Among them, A61K (formulation for medicine, dentistry or dressing) - H01L (semiconductor device) is a prediction error: the pair does not co-occur in 3D printing patents from 2006 to 2011. The other four links are predicted correctly.)
According to the result of the link prediction, enterprise management personnel and research personnel in the 3D printing industry can find potential technical opportunities by combining own expertise and extensive patent file retrieval. For example, the combination of two technologies, C09D (coating composition) -H01L (semiconductor device), in the 3D printing industry may represent a technical opportunity to manufacture electronic components using 3D printing and coating compositions; the combination of two technologies, B01J (chemical or physical method) -C08L (composition of polymer compound), in the 3D printing industry may mean technical opportunities for the preparation method of polymer compound for 3D printing, etc.
In addition, the result of the link prediction also prompts the technical field needing to pay attention to for enterprise managers and developers in the 3D printing industry. For example, H01L (semiconductor device) appears most frequently in the results of link prediction, which may mean that the manufacture of semiconductor devices is an important development direction in the future of 3D printing technology. In addition, C07C (acyclic or carbocyclic compound), C08L (polymeric compound composition), and C09D (coating composition) may be three 3D printing materials that require significant attention.
Finally, the link prediction method provided by the embodiment of the invention is compared with 12 other common link prediction methods to illustrate the beneficial effects of the link prediction method of the patent classification number association knowledge graph provided by the embodiment of the invention.
The 12 link prediction methods used as comparison baselines comprise 9 link prediction indices based on topological-structure similarity and 3 link prediction models based on graph neural networks. The 9 link prediction indices based on topological-structure similarity are: Weighted Common Neighbor (WCN), Weighted Adamic-Adar (WAA), Weighted Resource Allocation (WRA), reliable-route Weighted Common Neighbor (rWCN), reliable-route Weighted Adamic-Adar (rWAA), reliable-route Weighted Resource Allocation (rWRA), the Jaccard (JC) index, the Clustering Coefficient for Link Prediction (CCLP), and the Linear Optimization (LO) index. The definitions of the 9 topological-similarity link prediction indices are shown in table 1, where $\Gamma(x)$ denotes the set of all neighbor nodes of node $x$ and $C(x)$ denotes the clustering coefficient of node $x$.
The nine link prediction indices based on topological-structure similarity are shown in Table 1. In the table, $\Gamma(x)$ denotes the set of all neighbor nodes of node $x$; $C(x)$ denotes the clustering coefficient of node $x$; $w_{xy}$, $w_{xz}$ and $w_{zy}$ denote the weights of the edges between nodes $x$ and $y$, nodes $x$ and $z$, and nodes $z$ and $y$ respectively, a weight of 0 indicating that no edge exists between the two nodes; $k_x$ denotes the degree of node $x$, i.e. $k_x = |\Gamma(x)|$; and $\alpha$ is a manually set parameter.
The three link prediction models based on graph neural networks are the Variational Graph Auto-Encoder (VGAE), the Adversarially Regularized Graph Autoencoder (ARGA) and the generative link prediction graph neural network GraphLP. The references for these three models are as follows:
VGAE:T. N. Kipf, M. Welling, Variational graph auto-encoders, arXiv preprint arXiv:1611.07308.
ARGA:S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, C. Zhang, Adversarially regularized graph autoencoder for graph embedding, arXiv preprint arXiv:1802.04407.
GraphLP:X. Xian, T. Wu, X. Ma, S. Qiao, Y. Shao, C. Wang, L. Yuan, Y. Wu, Generative graph neural networks for link prediction, arXiv preprint arXiv:2301.00169.
In the comparative experiments, the forecasting horizon was set to $\Delta \in \{1, 2, \dots, 10\}$, which covers short-, medium- and long-term predictions. At the same time, $t - t_0$ was set to 10/15/20/25, with the maximum of $t - t_0$ set to 25 to ensure that the following holds: $t_0 + (t - t_0) + \Delta \leq 2021$.
This is because the patent data of 2022 may be incomplete, since there is a delay of about 1.5 years from application to disclosure.
In this embodiment, the Area Under the receiver operating characteristic Curve (AUC) is used as the accuracy index of prediction performance. The AUC is computed as $\mathrm{AUC} = \frac{1}{|E^{\mathrm{new}}| \cdot |U|} \sum_{e \in E^{\mathrm{new}}} \sum_{u \in U} \left( \mathbb{1}(s_e > s_u) + 0.5 \cdot \mathbb{1}(s_e = s_u) \right)$, where $E^{\mathrm{new}}$ is the set of links newly formed from year $t$ to year $t + \Delta$, $U$ is the set of node pairs forming no link from year $t$ to year $t + \Delta$, and $s_e$, $s_u$ are the scores assigned to pairs $e$ and $u$.
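This pairwise AUC can be sketched directly (a minimal sketch assuming the standard link-prediction AUC with ties counted as 0.5; the patent's exact sampling scheme is not specified).

```python
def auc_score(scores_new, scores_none):
    """AUC over pairs: the probability that a newly formed link is
    scored higher than a non-link, counting ties as 0.5."""
    wins = sum((s > u) + 0.5 * (s == u)
               for s in scores_new for u in scores_none)
    return wins / (len(scores_new) * len(scores_none))
```

For example, with new-link scores [0.9, 0.8] and non-link scores [0.5, 0.9], the four comparisons contribute 1, 0.5, 1 and 0, giving AUC = 2.5 / 4 = 0.625.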
The prediction performance of the method proposed by the embodiment of the invention and the other 12 common link prediction methods is compared in tables 2-5. In tables 2-5, each row corresponds to a different link prediction algorithm, and each column shows the experimental results for one setting of the parameters $t - t_0$ and $\Delta$. The best prediction performance in each column is marked in bold, and the second-best is underlined. It can be seen that the method proposed by the embodiment of the invention achieves performance superior to the other 12 common link prediction methods.
As shown in fig. 3 and table 2, the prediction performance of the method proposed by the embodiment of the invention and the other 12 common link prediction methods is compared when $t - t_0 = 10$.
As shown in fig. 4 and table 3, the prediction performance is compared when $t - t_0 = 15$.
As shown in fig. 5 and table 4, the prediction performance is compared when $t - t_0 = 20$.
As shown in fig. 6 and table 5, the prediction performance is compared when $t - t_0 = 25$.
Figs. 3-6 provide more detailed comparisons between the method proposed by the invention and the other 12 common link prediction methods. Each of figs. 3-6 corresponds to one value of $t - t_0$ (10/15/20/25); the x-axis represents $\Delta$, ranging over the integers from 1 to 10, and the y-axis represents the link-prediction accuracy index AUC. Each curve in a figure corresponds to one particular link prediction method.
The AUC obtained by the method provided by the embodiment of the invention is higher than that of the other 12 common link prediction methods in most cases, which shows that the method can significantly improve the accuracy of link prediction on the patent classification number association knowledge graph without depending on patent text data.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and the same should be considered to be within the scope of the invention.

Claims (13)

1. A link prediction method of a patent classification number association knowledge graph is characterized by comprising the following steps:
s1, constructing a patent classification number association knowledge graph;
s2, taking an adjacent matrix of the patent classification number association knowledge graph constructed in the step S1 as input, and taking a scoring matrix with the same size as the adjacent matrix as output, and establishing a graph convolution neural network comprising a feature extraction layer, a graph convolution layer and an output layer;
S3, taking the patent classification number association knowledge graph constructed in step S1 as the initial graph, and removing part of the edges in the initial graph to obtain the training graph; training the graph convolutional neural network of step S2 with the training graph as input and the initial graph as output;
s4, carrying out link prediction based on the graph convolution neural network obtained through training in the step S3.
2. The link prediction method of the patent classification number association knowledge graph according to claim 1, wherein, in step S1, constructing the patent classification number association knowledge graph specifically comprises: for a given target industry or target technical topic, retrieving all relevant patents, and constructing the patent classification number association knowledge graph according to the International Patent Classification (IPC) code of each patent; each node in the patent classification number association knowledge graph represents the first four characters IPC4 of an IPC classification number, i.e. an IPC subclass; an edge between two nodes indicates that the IPC4 codes corresponding to the two nodes appear simultaneously in at least one patent, and the number of patents in which the two corresponding IPC4 codes appear simultaneously is the weight of the edge.
3. The link prediction method of the patent classification number association knowledge graph according to claim 1, wherein, in step S2, the feature extraction layer performs a series of transformations on the adjacency matrix of the patent classification number association knowledge graph to obtain several transformation matrices, and performs feature extraction on all the transformation matrices to obtain the feature matrix of the nodes; wherein each row of the feature matrix represents the feature vector of one node.
4. The link prediction method of the patent classification number association knowledge graph according to claim 3, wherein, in performing the series of transformations on the adjacency matrix of the patent classification number association knowledge graph, the adjacency matrix is $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = w > 0$ indicates that there is a link of weight $w$ between node $i$ and node $j$, $A_{ij} = 0$ indicates that no link is observed between node $i$ and node $j$, and $N$ is the number of nodes in the patent classification number association knowledge graph constructed in step S1;
the series of transformations comprises at least the following transformations: (1) two binarization matrices $B^{\mathrm{sec}}$ and $B^{\mathrm{cls}}$, where $B^{\mathrm{sec}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC section, $B^{\mathrm{sec}}_{ij} = 0$ indicates that they do not belong to the same IPC section, $B^{\mathrm{cls}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC class, and $B^{\mathrm{cls}}_{ij} = 0$ indicates that they do not belong to the same IPC class; (2) the product matrix of the degree vectors $d d^{\top}$, where $d \in \mathbb{R}^{N \times 1}$ is the vector whose entry $d_i$ is the degree of node $i$, and $\top$ denotes transposition; (3) the powers of the adjacency matrix from 2 to $K$: $A^2, A^3, \dots, A^K$, where $K \geq 2$;
the feature extraction over all transformation matrices is based on a multi-layer perceptron MLP whose input dimension is $T$ and whose output dimension is 1, where $T$ is the number of transformations selected in step S2; for every pair of nodes $(i, j)$, the MLP takes the values at the corresponding position of the $T$ transformation matrices $M^{(1)}, \dots, M^{(T)}$ and computes a feature value $H_{ij}$, i.e. $H_{ij} = \mathrm{MLP}(M^{(1)}_{ij}, \dots, M^{(T)}_{ij})$;
the output of the feature extraction layer is the matrix $H$ of size $N \times N$.
5. The link prediction method of the patent classification number association knowledge graph according to claim 3 or 4, wherein obtaining the several transformation matrices is specifically obtaining $T$ transformation matrices; wherein the first kind of transformation contributes 2, the second kind 1, and the third kind $K - 1$; each transformation yields a matrix of size $N \times N$, denoted $M^{(t)}$, finally giving the $T = K + 2$ transformation matrices $M^{(1)}, \dots, M^{(T)}$.
6. The link prediction method of the patent classification number association knowledge graph according to claim 1, wherein, in step S2, the graph convolution layer takes each node in the patent classification number association knowledge graph as the target node, aggregates the feature vectors of the other nodes connected to the target node according to the connection relations between the nodes to obtain the aggregate feature vector of the target node, and stacks the aggregate feature vectors of all nodes to obtain the aggregate feature matrix of the nodes.
7. The link prediction method of the patent classification number association knowledge graph according to claim 4 or 6, wherein the aggregation of the feature vectors of the other nodes connected to the target node comprises the following three methods: (1) based on the connection relations between nodes represented by the adjacency matrix $A$ of the patent classification number association knowledge graph; (2) based on the connection relations between nodes represented by the binarization matrix $B^{\mathrm{sec}}$, where $B^{\mathrm{sec}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC section, and $B^{\mathrm{sec}}_{ij} = 0$ indicates that they do not belong to the same IPC section; (3) based on the connection relations between nodes represented by the binarization matrix $B^{\mathrm{cls}}$, where $B^{\mathrm{cls}}_{ij} = 1$ indicates that the IPC4 codes corresponding to node $i$ and node $j$ belong to the same IPC class, and $B^{\mathrm{cls}}_{ij} = 0$ indicates that they do not belong to the same IPC class.
8. The link prediction method of the patent classification number association knowledge graph according to claim 7, wherein in method (1), the feature vectors of the other nodes connected with the target node are aggregated to obtain the aggregate feature matrix of the nodes by the expression:

$Z_A = \sigma(\hat{A} H W_A)$

wherein the matrix $I$ is the identity matrix, $\tilde{A} = A + I$ is the sum of the adjacency matrix of the patent classification number association knowledge graph and the identity matrix, $D$ is the diagonal matrix composed of the degrees of the nodes, $\tilde{D} = D + I$ is the sum of the diagonal degree matrix and the identity matrix, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the normalized adjacency matrix, $H$ is the feature matrix output by the feature extraction layer, $W_A$ is a learnable parameter matrix, $Z_A$ is the aggregate feature matrix of the nodes obtained by graph convolution, and $\sigma$ is an activation function chosen as ReLU or Sigmoid;

in method (2), the feature vectors of the other nodes connected with the target node are aggregated to obtain the aggregate feature matrix of the nodes by the expression:

$Z_B = \sigma(\hat{B} H W_B)$

wherein $\tilde{B} = B + I$, $\hat{B} = \tilde{D}_B^{-1/2} \tilde{B} \tilde{D}_B^{-1/2}$, $\tilde{D}_B$ is the diagonal matrix of node degrees computed from the matrix $\tilde{B}$, i.e. $(\tilde{D}_B)_{ii} = \sum_j \tilde{B}_{ij}$, and $W_B$ is a learnable parameter matrix of the same size as $W_A$;

in method (3), the feature vectors of the other nodes connected with the target node are aggregated to obtain the aggregate feature matrix of the nodes by the expression:

$Z_C = \sigma(\hat{C} H W_C)$

wherein $\tilde{C} = C + I$, $\hat{C} = \tilde{D}_C^{-1/2} \tilde{C} \tilde{D}_C^{-1/2}$, $\tilde{D}_C$ is the diagonal matrix of node degrees computed from the matrix $\tilde{C}$, i.e. $(\tilde{D}_C)_{ii} = \sum_j \tilde{C}_{ij}$, and $W_C$ is a learnable parameter matrix of the same size as $W_A$.
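The normalization and aggregation in method (1) of claim 8 can be sketched as follows. This is a minimal pure-Python illustration, not the patented implementation: the 3-node adjacency matrix, feature matrix, and weight matrix are illustrative assumptions.

```python
import math

# Toy example of symmetric normalization A_hat = D~^{-1/2} (A + I) D~^{-1/2}
# followed by one aggregation step Z = ReLU(A_hat @ H @ W).
# All concrete values below are illustrative, not from the patent.

def normalize_adjacency(A):
    n = len(A)
    # A~ = A + I (add self-loops)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]              # diagonal of D~ = D + I
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    return [[inv_sqrt[i] * A_tilde[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def relu(X):
    return [[max(0.0, v) for v in row] for row in X]

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]          # toy adjacency matrix of a 3-node graph
H = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]         # toy feature matrix from the feature extraction layer
W = [[0.5, -1.0],
     [0.5,  0.5]]        # toy learnable parameter matrix

A_hat = normalize_adjacency(A)
Z = relu(matmul(matmul(A_hat, H), W))   # aggregate feature matrix Z_A
```

The same routine applies unchanged to methods (2) and (3) by substituting the binarization matrices $B$ or $C$ for $A$.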
9. The link prediction method of the patent classification number association knowledge graph according to claim 8, wherein the three aggregate feature matrices $Z_A$, $Z_B$ and $Z_C$ in the graph convolution layer are combined by a learnable layer with input dimension 3 and output dimension 1: according to the values at the corresponding positions of the three aggregate feature matrices, a value $z_{ij}$ is calculated for every pair of nodes $(i, j)$ by the expression:

$z_{ij} = w_1 [Z_A]_{ij} + w_2 [Z_B]_{ij} + w_3 [Z_C]_{ij}$

wherein $w_1$, $w_2$ and $w_3$ are the learnable weights of the layer, and the output of the graph convolution layer is a matrix $Z$ of size $N \times N$.
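The 3-to-1 combination of claim 9 can be sketched as a per-position weighted sum. The weights and the 2x2 matrices below are illustrative assumptions, not values from the patent.

```python
# Toy combination of the three aggregate matrices with a learnable layer of
# input dimension 3 and output dimension 1: for every node pair (i, j), the
# values at position (i, j) of Z_A, Z_B, Z_C are mixed into a single z_ij.

def combine(Z_A, Z_B, Z_C, w):
    n = len(Z_A)
    return [[w[0] * Z_A[i][j] + w[1] * Z_B[i][j] + w[2] * Z_C[i][j]
             for j in range(n)] for i in range(n)]

Z_A = [[0.2, 0.8], [0.8, 0.1]]   # illustrative aggregate matrices
Z_B = [[0.5, 0.5], [0.5, 0.5]]
Z_C = [[0.0, 1.0], [1.0, 0.0]]
w = [0.5, 0.3, 0.2]              # learnable weights of the 3 -> 1 layer

Z = combine(Z_A, Z_B, Z_C, w)
# Z[0][1] = 0.5*0.8 + 0.3*0.5 + 0.2*1.0 = 0.75
```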
10. The link prediction method of the patent classification number association knowledge graph according to claim 1, wherein in step S2, the output layer calculates a score for each pair of nodes based on the feature matrix output by the feature extraction layer and the aggregate feature matrix output by the graph convolution layer, the score representing the likelihood that a link exists between the pair of nodes; preferably, the feature matrix $H$ output by the feature extraction layer and the aggregate feature matrix $Z$ output by the graph convolution layer are added and normalized by a Sigmoid layer to obtain the final output, a score matrix $S$ of size $N \times N$.
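The scoring step of claim 10 can be sketched as an elementwise add followed by a Sigmoid. Here both inputs are assumed to already be pairwise $N \times N$ matrices; the concrete values are illustrative.

```python
import math

# Toy output layer: add the pairwise matrix derived from the feature extraction
# layer to the aggregate matrix from the graph convolution layer, then squash
# with a Sigmoid so each entry S[i][j] lies in (0, 1) and can be read as a
# link likelihood.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(H_pair, Z):
    n = len(Z)
    return [[sigmoid(H_pair[i][j] + Z[i][j]) for j in range(n)] for i in range(n)]

H_pair = [[0.0, 2.0], [2.0, 0.0]]   # illustrative pairwise feature matrix
Z      = [[0.0, 1.0], [1.0, 0.0]]   # illustrative aggregate matrix

S = score(H_pair, Z)   # S[0][1] = sigmoid(3.0)
```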
11. The link prediction method of the patent classification number association knowledge graph according to claim 1, wherein in step S3, the patent classification number association knowledge graph constructed in step S1 is used as an initial graph, and part of the edges in the initial graph are removed to obtain a training graph; taking the training graph as input and the initial graph as output, the graph convolution neural network in step S2 is trained, specifically comprising the following steps:
S3-1, taking the patent classification number association knowledge graph constructed in step S1 as the initial graph $G = (V, E)$, wherein $V$ is the set of all $N$ nodes and $E$ is the set of all edges;
s3-2, random removalPart of the training patent is connected with the edge to obtain the training patent co-classification atlas +.>, wherein ,is made of->Is a set of all nodes of->Is made of->A set of all connected edges of (a); will->The adjacency matrix is marked->
S3-3, taking $G'$ as the input of the graph neural network and the adjacency matrix $A$ of $G$ as the training label of the graph neural network, training the graph neural network by the stochastic gradient descent method, wherein the training loss function is the binary cross-entropy loss function with the expression:

$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{N} \left[ \beta A_{ij} \log S_{ij} + (1 - A_{ij}) \log(1 - S_{ij}) \right]$

wherein $S$ is the output of the graph neural network and $\beta$ is the weight balancing the numbers of positive and negative samples.
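The class-balanced binary cross-entropy loss of step S3-3 can be sketched as follows; positive entries ($A_{ij} = 1$) are up-weighted by $\beta$ to offset the sparsity of links. The score and label matrices below are illustrative assumptions.

```python
import math

# Toy weighted binary cross-entropy over a score matrix S and label matrix A.
# beta up-weights the (rarer) positive entries where a link exists.

def weighted_bce(S, A, beta):
    loss = 0.0
    for s_row, a_row in zip(S, A):
        for s, a in zip(s_row, a_row):
            loss -= beta * a * math.log(s) + (1 - a) * math.log(1 - s)
    return loss

A = [[0, 1], [1, 0]]              # illustrative labels: adjacency of the initial graph
S = [[0.1, 0.9], [0.8, 0.2]]      # illustrative scores predicted by the network

loss = weighted_bce(S, A, beta=2.0)
```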
12. The link prediction method of the patent classification number association knowledge graph according to claim 1, wherein in step S4, link prediction is performed based on the graph convolution neural network trained in step S3, specifically comprising: inputting the patent classification number association knowledge graph constructed in step S1 into the graph convolution neural network trained in step S3 to obtain the score matrix; arranging every pair of nodes without an observed edge in the patent classification number association knowledge graph constructed in step S1 in descending order of score, and selecting the top L pairs of nodes as the L links most likely to occur.
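The prediction step of claim 12 reduces to ranking unobserved node pairs by score. A minimal sketch, with an illustrative 4-node score matrix and adjacency matrix:

```python
# Rank every node pair that has no observed edge by its score, descending,
# and keep the top L pairs as the predicted links.

def top_l_links(S, A, L):
    n = len(S)
    candidates = [(S[i][j], i, j)
                  for i in range(n) for j in range(i + 1, n)
                  if A[i][j] == 0]                 # only pairs without an observed edge
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [(i, j) for _, i, j in candidates[:L]]

A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]                    # illustrative observed adjacency
S = [[0.0, 0.9, 0.7, 0.2],
     [0.9, 0.0, 0.4, 0.6],
     [0.7, 0.4, 0.0, 0.8],
     [0.2, 0.6, 0.8, 0.0]]            # illustrative score matrix

links = top_l_links(S, A, L=2)   # pairs (0,1) and (2,3) are excluded: already linked
```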
13. A link prediction system of the patent classification number association knowledge graph, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the link prediction method according to any one of claims 1 to 12 when executing the computer program.
CN202310840162.5A 2023-07-10 2023-07-10 Link prediction method and system for patent classification number associated knowledge graph Active CN116595197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310840162.5A CN116595197B (en) 2023-07-10 2023-07-10 Link prediction method and system for patent classification number associated knowledge graph

Publications (2)

Publication Number Publication Date
CN116595197A true CN116595197A (en) 2023-08-15
CN116595197B CN116595197B (en) 2023-11-07

Family

ID=87611969


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235281A (en) * 2023-09-22 2023-12-15 武汉贝塔世纪科技有限公司 Multi-element data management method and system based on knowledge graph technology

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100398A (en) * 2020-08-31 2020-12-18 清华大学 Patent blank prediction method and system
CN114357086A (en) * 2021-08-31 2022-04-15 黑龙江阳光惠远信息技术有限公司 Patent IPC classification number recommendation method and device based on knowledge graph
WO2022142027A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
CN115982385A (en) * 2023-02-07 2023-04-18 广东技术师范大学 Knowledge graph based relation graph neural network patent quality assessment method


Similar Documents

Publication Publication Date Title
CN108132968B (en) Weak supervision learning method for associated semantic elements in web texts and images
Sener et al. Learning transferrable representations for unsupervised domain adaptation
Wang et al. Machine learning in big data
Zhu et al. An integrated feature selection and cluster analysis techniques for case-based reasoning
Neelamegam et al. Classification algorithm in data mining: An overview
Guo et al. Supplier selection based on hierarchical potential support vector machine
Da Silva et al. Active learning paradigms for CBIR systems based on optimum-path forest classification
CN106991296B (en) Integrated classification method based on randomized greedy feature selection
Jin et al. Automatic image annotation using feature selection based on improving quantum particle swarm optimization
Pan et al. Clustering of designers based on building information modeling event logs
CN116595197B (en) Link prediction method and system for patent classification number associated knowledge graph
CN113378913A (en) Semi-supervised node classification method based on self-supervised learning
CN108734223A (en) The social networks friend recommendation method divided based on community
Naderipour et al. Fuzzy community detection on the basis of similarities in structural/attribute in large-scale social networks
Ashraf et al. Difference sequence-based distance measure for intuitionistic fuzzy sets and its application in decision making process
CN115546525A (en) Multi-view clustering method and device, electronic equipment and storage medium
Li et al. Deepgraph: Graph structure predicts network growth
KR101467707B1 (en) Method for instance-matching in knowledge base and device therefor
CN111831758B (en) Node classification method and device based on rapid hierarchical attribute network representation learning
CN116450938A (en) Work order recommendation realization method and system based on map
CN115982645A (en) Method, device, processor and computer-readable storage medium for realizing data annotation based on machine learning in trusted environment
Presotto et al. Weakly supervised learning through rank-based contextual measures
Bahrami et al. Automatic image annotation using an evolutionary algorithm (IAGA)
CN115730248A (en) Machine account detection method, system, equipment and storage medium
CN113159976B (en) Identification method for important users of microblog network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant