CN114781553A - Unsupervised patent clustering method based on parallel multi-graph convolution neural network - Google Patents

Unsupervised patent clustering method based on parallel multi-graph convolution neural network

Info

Publication number
CN114781553A
Authority
CN
China
Prior art keywords
graph
attention
vector
patent data
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210695144.8A
Other languages
Chinese (zh)
Other versions
CN114781553B (en)
Inventor
韩蒙
梁兵
况欢
陈灏毅
陈唯
林昶廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Binjiang Research Institute Of Zhejiang University
Original Assignee
Binjiang Research Institute Of Zhejiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Binjiang Research Institute Of Zhejiang University filed Critical Binjiang Research Institute Of Zhejiang University
Priority to CN202210695144.8A priority Critical patent/CN114781553B/en
Publication of CN114781553A publication Critical patent/CN114781553A/en
Application granted granted Critical
Publication of CN114781553B publication Critical patent/CN114781553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network. On the basis of constructing 4 types of patent graphs and the coding vectors that a self-encoder produces for the patent data, graph convolution operations fully combine the 4 types of patent graphs with the coding vectors to comprehensively extract effective feature vectors of the patent data. A parallel single-graph self-attention module assigns weights to each type of feature vector, raising the importance of the salient features of each single graph to obtain single-graph attention vectors. A multi-graph attention module then fuses and learns from the single-graph attention vectors of all types, assigning larger weights to the important single graphs, so that the resulting global attention vector integrates feature information from multiple aspects and improves clustering precision.

Description

Unsupervised patent clustering method based on parallel multi-graph convolution neural network
Technical Field
The invention belongs to the technical field of patent classification, and particularly relates to an unsupervised patent clustering method based on a parallel multi-graph convolution neural network.
Background
Analysis of patent data reveals specific market development trends and the innovation strength of organizations. People often use information such as patent titles, keywords and CPC (Cooperative Patent Classification) codes to search patents on various intellectual property platforms. The CPC code is an extension of the IPC (International Patent Classification) and is jointly managed by the EPO (European Patent Office) and the United States Patent and Trademark Office. It is divided into nine sections, A-H and Y, which are in turn divided into classes, subclasses, groups and subgroups, with approximately 250,000 classification entries. Whichever institution processes and approves a patent determines the classification code assigned to the invention, and once the patent application is approved, the CPC code can no longer be changed. It is therefore extremely important for a patent applicant to predict the patent's CPC code in advance.
At present, the classification of patent CPC codes is mostly done manually: examiners check patent titles, abstracts and full texts to match the corresponding CPC codes, which is very tedious for patent examiners and prone to error.
Some scholars have studied NLP (Natural Language Processing) techniques and classify patents through word-embedding systems and machine-learning classification models, which improves the speed and accuracy of patent classification and reduces labor cost.
Other scholars have studied deep learning methods for classifying patents, including Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs) and Graph Convolutional Neural Networks (GCNs). A graph convolutional neural network introduces graph embedding to take the structural information of the original patent samples into account and uses convolution operations on the graph to exploit the important relations among nodes, so that the model achieves better cognition and patent classification capability. However, labeled training samples for fine-grained classification are scarce, which leaves supervised models with insufficient classification performance and is not enough to realize fine-grained classification of CPC codes.
Patent document CN109446319A discloses a K-means-based clustering analysis method for biomedical patents, which selects four important evaluation indexes in patent analysis, namely patent application volume, patent grant volume, patent growth rate and patent validity rate, as clustering variables for clustering analysis, so as to deeply mine the associations in the data and better group the patent data; however, it cannot classify patent CPC codes.
Disclosure of Invention
In view of the above, the invention provides an unsupervised patent clustering method based on a parallel multi-graph convolution neural network, which improves the precision of a model for finely classifying patents and improves the accuracy of patent classification under unsupervised learning.
In order to achieve the above object, an embodiment of the present invention provides an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network, including the following steps:
vectorizing the patent data to be clustered to obtain vectorized patent data;
constructing multiple types of patent graphs from the vectorized patent data, wherein the multiple types of patent graphs comprise a KNN patent graph constructed based on patent similarity, a co-applicant patent graph, a co-inventor patent graph and a co-keyword patent graph;
computing the patent data to be clustered with a model constructed based on unsupervised learning, comprising the following steps: performing vector encoding on each piece of vectorized patent data with the encoder contained in a self-encoder to obtain coding vectors; extracting, in parallel, the feature vectors of each type of patent graph combined with the coding vectors by using the graph convolutional neural networks contained in a parallel graph convolutional neural network module; computing, in parallel, a single-graph attention vector from each type of feature vector by using the single-graph self-attention layers contained in a parallel single-graph self-attention module; and computing a global attention vector for each patent datum from the single-graph attention vectors of all types by using a multi-graph attention module;
and clustering the global attention vectors of all the patent data to obtain a clustering result.
In one embodiment, each patent data item includes the invention title, the abstract, the applicant and the inventor, and these data are vectorized to obtain the vectorized patent data.
In one embodiment, when constructing the multiple types of patent graphs, each patent is taken as a node and its vectorized patent data as the node attribute, and the connecting edges between nodes are constructed differently for the different types of patent graphs, including:
for the KNN patent graph, calculating the similarity between every pair of patent data and, for each patent, selecting the k patent data with the largest similarity values as neighborhood patent data for constructing the connecting edges between nodes, that is, constructing a connecting edge between the nodes corresponding to each patent and each of its neighborhood patent data;
aiming at the patent drawings of the common applicant, constructing connecting edges among nodes corresponding to the common applicant;
aiming at the patent drawings of the common inventors, constructing connecting edges among nodes corresponding to the common inventors;
and aiming at the common keyword patent graph, constructing a connecting edge between nodes corresponding to the common keywords.
In one embodiment, the encoder comprises L encoding layers, and the input vectorized patent data is vector-encoded by the encoding layers to obtain the coding vector output by each layer;
each graph convolutional neural network corresponding to each type of patent graph comprises L graph convolution layers, so that the number of graph convolution layers equals the number of encoding layers; each graph convolution layer first assigns weights to the coding vector output by the corresponding encoding layer and the feature vector output by the previous graph convolution layer, then takes the weighted feature vector as the input of the current graph convolution operation and performs the graph convolution with the adjacency matrix of the corresponding type of patent graph to output a feature vector, expressed by the formulas:

\tilde{Z}_v^{(l-1)} = (1 - \epsilon) Z_v^{(l-1)} + \epsilon H^{(l-1)}

Z_v^{(l)} = \mathrm{ReLU}\left( D^{-1/2} \tilde{A}_v D^{-1/2} \tilde{Z}_v^{(l-1)} W_v^{(l)} \right)

wherein l denotes the index of the network layer, v denotes the index of the patent graph type, \epsilon denotes the weight for balancing the importance of the coding vector and the feature vector, H^{(l-1)} denotes the coding vector output by the (l-1)-th encoding layer, Z_v^{(l-1)} and Z_v^{(l)} respectively denote the feature vectors output by the (l-1)-th and l-th graph convolution operations for the v-th type of patent graph, \tilde{Z}_v^{(l-1)} denotes the weighted feature vector, W_v^{(l)} denotes the weight of the l-th graph convolution operation for the v-th type of patent graph, \tilde{A}_v = A_v + I denotes the sum of the adjacency matrix A_v of the v-th type of patent graph and the identity matrix, D denotes the degree matrix of \tilde{A}_v, and ReLU() denotes the ReLU activation function;
for the first graph convolution layer, Z_v^{(0)} = X, the node matrix representing each type of patent graph.
In one embodiment, each single-graph self-attention layer calculates a single-graph attention vector in parallel from each type of feature vector, comprising the following steps: firstly, the attention weight of the features is calculated from each type of feature vector, and then activation calculation is performed on each type of feature vector according to the attention weight to obtain the single-graph attention vector corresponding to that type of feature vector.
In one embodiment, calculating the global attention vector of each patent datum from the single-graph attention vectors of all types using the multi-graph attention module comprises: firstly, performing a nonlinear transformation on each type of single-graph attention vector to obtain a multi-layer attention value of each type; then normalizing each type's multi-layer attention value relative to the multi-layer attention values of all types to obtain a global attention weight of each type; and finally performing a weighted summation of the single-graph attention vectors of each type according to the global attention weights to obtain the global attention vector of each patent datum.
In one embodiment, the model requires parameter optimization before being applied, including:
decoding the coding vector output by the encoder by using a decoder contained in the self-encoder to obtain reconstructed patent data corresponding to each vectorized patent data;
constructing total loss, namely constructing reconstruction loss based on vectorization patent data input by a self-encoder and output reconstruction patent data, constructing multi-graph correlation loss based on attention vectors of all kinds of single graphs, and taking weighted summation of the reconstruction loss and the multi-graph correlation loss as the total loss;
and optimizing the model parameters by using the total loss and adopting an unsupervised learning mode to obtain a model with optimized parameters.
In one embodiment, the constructing of the reconstruction loss based on the vectorized patent data input from the encoder and the reconstructed patent data output from the encoder includes: and constructing reconstruction loss according to the squares of Euclidean norms between vectorized patent data and reconstructed patent data corresponding to all the patent data.
In one embodiment, constructing the multi-graph correlation loss based on the single-graph attention vectors of all types includes: firstly, calculating the autocorrelation similarity of each type of single-graph attention vector; and then constructing the multi-graph correlation loss according to the square of the Euclidean norm between the autocorrelation similarities of any two types of single-graph attention vectors.
In one embodiment, the unsupervised patent clustering method further comprises:
and performing CPC code classification on each patent data according to the clustering result, wherein the CPC code classification comprises the following steps: patent data belonging to the same cluster are considered to have the same CPC code, and when the CPC of one patent data in the cluster is judged manually, the CPC codes of all other patent data of the cluster can be obtained.
Compared with the prior art, the method has the beneficial effects that at least:
on the basis of constructing 4 types of patent drawings and coding vectors of patent data from a coder, 4 types of patent drawings and coding vectors are fully extracted through a drawing convolution operation, effective feature vectors of the patent data are comprehensively extracted, weights are distributed to each type of feature vectors through a parallel single-drawing self-attention module, the importance degree of important features of a single drawing is improved to obtain a single-drawing attention vector, the single-drawing attention vectors of all types are fused through a multi-drawing attention module for learning, and larger weights are distributed to the important single drawing, so that the obtained global attention vector integrates multi-aspect feature information, and the clustering precision is improved.
The model is constructed based on unsupervised learning, the generalization performance of the model to the deep clustering of the patent data is improved under the condition that the fine classification labels are true, the comprehensiveness of the model in feature extraction is improved, and the effectiveness of the patent data clustering is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network provided by an embodiment;
FIG. 2 is a schematic structural diagram of a model provided by the embodiment;
FIG. 3 is a schematic structural diagram of each of the convolutional layers provided in the embodiments;
FIG. 4 is a schematic structural diagram of each single-drawing self-attention layer provided by the embodiment;
fig. 5 is a schematic structural diagram of a multi-graph attention module according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
To address the problem that a supervised classification model has insufficient classification performance when too few labeled training samples are available for fine-grained patent classification, and the problem that classifying patents from a single, one-sided patent graph leaves the classification model with insufficient generalization and therefore inaccurate results, the embodiment provides an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network, which improves the precision with which the model finely classifies patents and the accuracy of patent classification under unsupervised learning.
Fig. 1 is a flowchart of an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network according to an embodiment. As shown in fig. 1, the unsupervised patent clustering method based on the parallel multi-graph convolutional neural network provided in the embodiment includes the following steps:
step 1, vectorizing the patent data to be clustered to obtain vectorized patent data.
In the embodiment, each piece of patent data to be clustered corresponds to one patent document and includes the title, abstract, applicant and inventor of the patent. These data are vectorized to obtain the vectorized patent data, which is expressed as a 1-dimensional vector group.
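The disclosure does not fix a particular vectorization technique for the title, abstract, applicant and inventor fields. As a hedged illustration only (the TF-IDF choice, the field concatenation and the 2000-feature cap are assumptions, not part of the embodiment), the per-patent 1-dimensional vectors could be built as follows:

```python
# Hypothetical vectorization sketch: TF-IDF over concatenated patent fields.
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_patents(patents):
    """patents: list of dicts with 'title', 'abstract', 'applicant' and 'inventor' strings."""
    texts = [" ".join([p["title"], p["abstract"], p["applicant"], p["inventor"]])
             for p in patents]
    vectorizer = TfidfVectorizer(max_features=2000)   # feature cap is an arbitrary assumption
    X = vectorizer.fit_transform(texts).toarray()     # N x d matrix, one 1-D vector per patent
    return X, vectorizer
```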
And 2, constructing multiple types of patent diagrams according to the vectorized patent data.
In an embodiment, the multiple types of patent graphs comprise a KNN (K-nearest-neighbor) patent graph constructed based on patent similarity, a co-applicant patent graph, a co-inventor patent graph and a co-keyword patent graph. When constructing the patent graphs, each patent is taken as a node and its vectorized patent data as the node attribute; the connecting edges between nodes differ according to the type of patent graph, and so do the construction methods, including:
for the KNN patent graph, the similarity between every pair of patent data is calculated, and for each patent the k patent data with the largest similarity values are selected as neighborhood patent data; connecting edges are then constructed between the node of each patent and the nodes of its neighborhood patent data, thereby forming the KNN patent graph.
In one embodiment, the cosine similarity between any two patent data can be calculated, and the patent data corresponding to the k largest cosine similarities are selected as neighborhood patent data to construct the connecting edges between nodes.
In the embodiment, for the co-applicant patent graph, connecting edges are constructed between nodes that share an applicant, forming the co-applicant patent graph; for the co-inventor patent graph, connecting edges are constructed between nodes that share an inventor, forming the co-inventor patent graph; and for the co-keyword patent graph, connecting edges are constructed between nodes that share a keyword, forming the co-keyword patent graph. The keywords are extracted from the invention titles and abstract contents.
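A minimal sketch of the four graph constructions described above, assuming cosine similarity for the KNN patent graph and set intersection of applicants, inventors or keywords for the co-attribute graphs (function names and the default k are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def knn_patent_graph(X, k=10):
    """KNN patent graph: connect each patent to the k patents with the largest cosine similarity."""
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    A = np.zeros_like(sim)
    neighbours = np.argsort(-sim, axis=1)[:, :k]   # indices of the k largest similarities per row
    for i, neigh in enumerate(neighbours):
        A[i, neigh] = 1.0
    return np.maximum(A, A.T)                      # symmetrize the connecting edges

def co_attribute_graph(attribute_sets):
    """Co-applicant / co-inventor / co-keyword graph: edge when two patents share an attribute.
    attribute_sets: one set of applicants, inventors or keywords per patent."""
    n = len(attribute_sets)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if attribute_sets[i] & attribute_sets[j]:
                A[i, j] = A[j, i] = 1.0
    return A
```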
And 3, calculating the patent data to be clustered by using the model constructed based on unsupervised learning to obtain the global attention vector of each patent data.
Fig. 2 is a schematic structural diagram of the model provided by the embodiment. As shown in fig. 2, the constructed model includes a self-encoder comprising an encoder and a decoder, a parallel graph convolutional neural network module, a parallel single-graph self-attention module, and a multi-graph attention module. The encoder is used to vector-encode the vectorized patent data to obtain coding vectors; the decoder is used to decode the coding vectors to obtain reconstructed patent data; the parallel graph convolutional neural network module is used to extract, in parallel, the feature vectors of each type of patent graph combined with the coding vectors; the parallel single-graph self-attention module is used to compute single-graph attention vectors in parallel from each type of feature vector; and the multi-graph attention module is used to compute the global attention vector of each patent datum from the single-graph attention vectors of all types.
In one embodiment, the encoder includes L encoding layers; the input vectorized patent data is vector-encoded by the successive encoding layers to obtain the coding vector output by each layer, expressed by the formula:

H^{(l)} = \mathrm{ReLU}\left( W_e^{(l)} H^{(l-1)} + b_e^{(l)} \right)

wherein l is the index of the encoding layer, ReLU() denotes the ReLU activation function, W_e^{(l)} and b_e^{(l)} denote the weight and bias of the l-th encoding layer, and H^{(l-1)} and H^{(l)} denote the coding vectors output by the (l-1)-th and l-th encoding layers respectively; in particular, when l = 1, i.e. for the first encoding layer, H^{(0)} = X, the input vectorized patent data. The encoding layers can be fully connected layers, and the obtained coding vectors can be used to enhance the data representation of the patent graphs.
In an embodiment, the decoder has the same number of layers as the encoder and includes L decoding layers; the input coding vector is vector-decoded by the successive decoding layers, and the decoded vector output by the last decoding layer is taken as the reconstructed patent data, which is used to construct the reconstruction loss and is expressed by the formula:

\hat{H}^{(l)} = \mathrm{ReLU}\left( W_d^{(l)} \hat{H}^{(l-1)} + b_d^{(l)} \right)

wherein W_d^{(l)} and b_d^{(l)} denote the weight and bias of the l-th decoding layer, and \hat{H}^{(l-1)} and \hat{H}^{(l)} denote the decoded vectors output by the (l-1)-th and l-th decoding layers respectively; in particular, when l = 1, i.e. for the first decoding layer, \hat{H}^{(0)} is the input coding vector.
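A minimal PyTorch sketch of the self-encoder described above. The layer widths in dims are assumptions; the disclosure only states that the encoder and decoder each consist of L layers of the form ReLU(Wx + b) and that every encoding layer's output is reused by the graph convolution branch:

```python
import torch
import torch.nn as nn

class PatentAutoEncoder(nn.Module):
    """Stacked fully connected self-encoder; returns all coding vectors plus the reconstruction."""
    def __init__(self, in_dim, dims=(500, 500, 200, 10)):
        super().__init__()
        enc, dec, prev = [], [], in_dim
        for d in dims:                                # L encoding layers: H^(l) = ReLU(W_e H^(l-1) + b_e)
            enc += [nn.Linear(prev, d), nn.ReLU()]
            prev = d
        for d in reversed(dims[:-1]):                 # mirrored decoding layers
            dec += [nn.Linear(prev, d), nn.ReLU()]
            prev = d
        dec += [nn.Linear(prev, in_dim)]              # last decoding layer reconstructs the input
        self.encoder_layers = nn.ModuleList(enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        hiddens, h = [], x
        for layer in self.encoder_layers:
            h = layer(h)
            if isinstance(layer, nn.ReLU):
                hiddens.append(h)                     # keep each layer's coding vector for the GCN branch
        return hiddens, self.decoder(h)               # coding vectors H^(1..L) and reconstruction X_hat
```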
In an embodiment, the parallel graph convolutional neural network module includes as many graph convolutional neural networks as there are types of patent graphs, i.e. 4 graph convolutional neural networks for the 4 types of patent graphs; the 4 graph convolutional neural networks respectively extract, in parallel, the feature vectors of the 4 types of patent graphs combined with the coding vectors to obtain the feature vectors of the 4 types of patent graphs.
In the embodiment, each graph convolutional neural network corresponding to each type of patent graph includes L graph convolution layers, i.e. the number of graph convolution layers equals the number of encoding layers. As shown in fig. 3, each graph convolution layer comprises a weight assignment operation and a graph convolution operation: each graph convolution layer first assigns weights to the coding vector output by the corresponding encoding layer (the encoding layer with the same index as the convolution layer) and the feature vector output by the previous graph convolution layer, then takes the weighted feature vector as the input of the current graph convolution operation and performs the graph convolution with the adjacency matrix of the corresponding type of patent graph to output a feature vector, expressed by the formulas:

\tilde{Z}_v^{(l-1)} = (1 - \epsilon) Z_v^{(l-1)} + \epsilon H^{(l-1)}

Z_v^{(l)} = \mathrm{ReLU}\left( D^{-1/2} \tilde{A}_v D^{-1/2} \tilde{Z}_v^{(l-1)} W_v^{(l)} \right)

wherein l is the index of the network layer (encoding layer or graph convolution layer), and v is the index of the patent graph type, namely the KNN patent graph, the co-applicant patent graph, the co-inventor patent graph and the co-keyword patent graph; \epsilon denotes the weight for balancing the importance of the coding vector and the feature vector; H^{(l-1)} denotes the coding vector output by the (l-1)-th encoding layer; Z_v^{(l-1)} and Z_v^{(l)} respectively denote the feature vectors output by the (l-1)-th and l-th graph convolution operations for the v-th type of patent graph; \tilde{Z}_v^{(l-1)} denotes the weighted feature vector; W_v^{(l)} denotes the weight of the l-th graph convolution operation for the v-th type of patent graph; \tilde{A}_v = A_v + I denotes the sum of the adjacency matrix A_v of the v-th type of patent graph and the identity matrix; D denotes the degree matrix of \tilde{A}_v; ReLU() denotes the ReLU activation function; in particular, when l = 1, i.e. for the first graph convolution layer, Z_v^{(0)} = X, the node matrix representing each type of patent graph.
In the embodiment, by combining the coding vectors of the self-encoder with the graph information of each type of patent graph, the parallel graph convolutional neural network module improves the feature aggregation capability of the model and comprehensively obtains the distinctive features of the patent data.
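A hedged PyTorch sketch of one such graph convolution layer; the (1 - epsilon) / epsilon split used to mix the previous feature vector with the corresponding coding vector is an assumed reading of the balance weight described above, and the symmetric normalization follows the adjacency-plus-identity formulation:

```python
import torch
import torch.nn as nn

def normalize_adjacency(A):
    """Compute D^(-1/2) (A + I) D^(-1/2) for one patent graph's adjacency matrix A."""
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

class GraphConvLayer(nn.Module):
    """One graph convolution layer that mixes the previous feature vector with the
    coding vector of the encoding layer of the same index before propagation."""
    def __init__(self, in_dim, out_dim, epsilon=0.5):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.epsilon = epsilon                                          # balance weight between Z and H

    def forward(self, Z_prev, H_prev, A_norm):
        Z_mixed = (1 - self.epsilon) * Z_prev + self.epsilon * H_prev   # weight assignment step
        return torch.relu(A_norm @ self.weight(Z_mixed))                # graph convolution step
```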
In the embodiment, the parallel single-graph self-attention module comprises as many single-graph self-attention layers as there are types of patent graphs, i.e. 4 single-graph self-attention layers for the 4 types of patent graphs, and the 4 single-graph self-attention layers respectively calculate the 4 types of single-graph attention vectors from the 4 types of feature vectors in parallel.
In the embodiment, as shown in fig. 4, each single-graph self-attention layer corresponding to each type of patent graph comprises an attention-weight calculation operation and an activation calculation operation: the attention weight of the features is first calculated from each type of feature vector, and activation calculation is then performed on each type of feature vector according to the attention weight to obtain the single-graph attention vector corresponding to that type of feature vector, expressed by the formulas:

att_{v,i} = \tanh\left( W_{att} z_{v,i} + b_{att} \right)

s_{v,i} = \mathrm{Sigmoid}(att_{v,i}) \odot z_{v,i}

wherein i denotes the index of the patent data; z_{v,i}, att_{v,i} and s_{v,i} respectively denote the feature vector, the attention weight and the single-graph attention vector corresponding to the i-th patent data in the v-th type of patent graph; W_{att} and b_{att} respectively denote the weight and bias of the attention-weight calculation; tanh() denotes the hyperbolic tangent function; and Sigmoid() denotes the Sigmoid activation function.
In an embodiment, each attention layer of the parallel single-graph self-attention module can assign higher weights to the important features of a single patent graph, so that the obtained single-graph attention vector focuses more on the characteristic information embodied by its own graph type.
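A minimal PyTorch sketch of one single-graph self-attention layer under the tanh/Sigmoid composition assumed in the formula above (an elementwise attention weight is computed from each feature vector and gated back onto it):

```python
import torch
import torch.nn as nn

class SingleGraphSelfAttention(nn.Module):
    """Single-graph self-attention layer: attention-weight calculation followed by activation."""
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(dim, dim)                # W_att, b_att of the attention-weight calculation

    def forward(self, Z_v):                           # Z_v: (N, dim) feature vectors of one patent graph
        att = torch.tanh(self.att(Z_v))               # attention weight per feature
        return torch.sigmoid(att) * Z_v               # single-graph attention vectors s_v
```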
In an embodiment, the multi-graph attention module is configured to compute the global attention vector from the single-graph attention vectors of all types. As shown in fig. 5, the multi-graph attention module comprises a nonlinear transformation operation, a global attention weight calculation operation and a global attention vector calculation operation: first, a nonlinear transformation is applied to each type of single-graph attention vector to obtain a multi-layer attention value for each type; then each type's multi-layer attention value is normalized relative to the multi-layer attention values of all types to obtain the global attention weight of each type; finally, the single-graph attention vectors of each type are weighted and summed according to the global attention weights to obtain the global attention vector of each patent datum, expressed by the formulas:

w_{v,i} = q^{T} \tanh\left( W_m s_{v,i} + b_m \right)

\alpha_{v,i} = \frac{\exp(w_{v,i})}{\sum_{v'=1}^{V} \exp(w_{v',i})}

g_i = \sum_{v=1}^{V} \alpha_{v,i}\, s_{v,i}

wherein q denotes the shared attention vector and the superscript T denotes transposition; W_m and b_m respectively denote the weight and bias of the nonlinear transformation operation; and w_{v,i}, \alpha_{v,i} and g_i respectively denote the multi-layer attention value, the global attention weight and the global attention vector corresponding to the i-th patent data in the v-th type of patent graph, V being the number of patent graph types.
In the embodiment, the multi-graph attention module allocates higher weight to the important single-graph attention vector, so that the feature extraction capability of the model is improved, and the deep clustering capability is further improved.
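A hedged PyTorch sketch of the multi-graph attention module: the V single-graph attention vectors are stacked, passed through a nonlinear transformation against a shared attention vector, normalized across graph types, and summed (the hidden width of the transformation is an assumption):

```python
import torch
import torch.nn as nn

class MultiGraphAttention(nn.Module):
    """Fuses the V single-graph attention vectors into one global attention vector per patent."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.transform = nn.Linear(dim, hidden)       # nonlinear transformation W_m, b_m
        self.q = nn.Parameter(torch.randn(hidden))    # shared attention vector q

    def forward(self, S):                             # S: (V, N, dim) single-graph attention vectors
        w = torch.tanh(self.transform(S)) @ self.q    # (V, N) multi-layer attention values
        alpha = torch.softmax(w, dim=0)               # global attention weights, normalized over the V graphs
        return (alpha.unsqueeze(-1) * S).sum(dim=0)   # (N, dim) global attention vectors g_i
```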
In an embodiment, the constructed model needs parameter optimization before being applied, including: constructing the total loss, namely constructing the reconstruction loss based on the vectorized patent data input to the self-encoder and the reconstructed patent data it outputs, constructing the multi-graph correlation loss based on the single-graph attention vectors of all types, and taking the weighted sum of the reconstruction loss and the multi-graph correlation loss as the total loss; and optimizing the model parameters with the total loss in an unsupervised learning manner to obtain the parameter-optimized model. The total loss Loss_final is expressed as:

Loss_{final} = \alpha \cdot Loss_{Reconstruction} + \beta \cdot Loss_{Multi\text{-}graph}

wherein \alpha and \beta are hyperparameters determined by unsupervised learning.
In an embodiment, the reconstruction loss Loss_Reconstruction is constructed from the vectorized patent data input to the self-encoder and the reconstructed patent data it outputs, specifically: the reconstruction loss is constructed from the square of the Euclidean norm between the vectorized patent data and the reconstructed patent data over all patent data, expressed by the formula:

Loss_{Reconstruction} = \frac{1}{2N} \lVert X - \hat{X} \rVert_2^2 = \frac{1}{2N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2

wherein x_i and \hat{x}_i respectively denote the vectorized patent data and the reconstructed patent data of the i-th patent, X and \hat{X} respectively denote the vectorized patent data and the reconstructed patent data of all patents, N denotes the total number of patent data, and \lVert \cdot \rVert_2^2 denotes the square of the Euclidean norm.
In the embodiment, the multi-graph correlation loss Loss_Multi-graph is constructed from the single-graph attention vectors of all types, specifically: the autocorrelation similarity of each type of single-graph attention vector is first calculated; the multi-graph correlation loss is then constructed from the square of the Euclidean norm between the autocorrelation similarities of any two types of single-graph attention vectors, expressed by the formulas:

S_v = \tilde{S}_v \tilde{S}_v^{T}

Loss_{Multi\text{-}graph} = \sum_{v=1}^{V} \sum_{t=v+1}^{V} \lVert S_v - S_t \rVert_2^2

wherein \tilde{S}_v and S_v respectively denote the normalized result of the v-th type of single-graph attention vector and its autocorrelation similarity with respect to itself, t denotes the index of another type of single-graph attention vector, S_t denotes the autocorrelation similarity of the t-th type of single-graph attention vector, and V denotes the number of patent graph types.
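A hedged sketch of the total loss as described above; the 1/(2N) scaling of the reconstruction term and the default alpha and beta values are assumptions, since the disclosure states only that the hyperparameters are determined by unsupervised learning:

```python
import torch

def reconstruction_loss(X, X_hat):
    """Squared Euclidean norm between input and reconstructed patent data, scaled by 1/(2N)."""
    return ((X - X_hat) ** 2).sum() / (2 * X.size(0))

def multi_graph_correlation_loss(S_list):
    """S_list: V tensors of shape (N, dim), one per patent graph type.
    Pairwise squared distance between the graphs' autocorrelation similarity matrices."""
    sims = []
    for S in S_list:
        S_norm = torch.nn.functional.normalize(S, dim=1)   # row-normalize the attention vectors
        sims.append(S_norm @ S_norm.t())                   # (N, N) autocorrelation similarity
    loss = 0.0
    for v in range(len(sims)):
        for t in range(v + 1, len(sims)):
            loss = loss + ((sims[v] - sims[t]) ** 2).sum()
    return loss

def total_loss(X, X_hat, S_list, alpha=1.0, beta=0.1):
    """Loss_final = alpha * reconstruction loss + beta * multi-graph correlation loss."""
    return alpha * reconstruction_loss(X, X_hat) + beta * multi_graph_correlation_loss(S_list)
```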
The total loss provided by the embodiment fuses the reconstruction loss and the multi-graph loss, and the generalization performance of the model to the deep clustering of the patent data is improved, so that the effectiveness of the classification of the CPC codes of the patents is improved.
The model with the total loss optimized through unsupervised learning has strong generalization capability, a comprehensive global attention vector can be obtained, and the global attention vector can realize effective and reliable classification of the CPC codes of the patents.
In the embodiment, the parameter-optimized model computes the patent data to be clustered through the following processes: each piece of vectorized patent data is vector-encoded by the encoder contained in the self-encoder to obtain coding vectors; the feature vectors of each type of patent graph combined with the coding vectors are extracted in parallel by the graph convolutional neural networks contained in the parallel graph convolutional neural network module; a single-graph attention vector is computed in parallel from each type of feature vector by the single-graph self-attention layers contained in the parallel single-graph self-attention module; and the global attention vector of each patent datum is computed from the single-graph attention vectors of all types by the multi-graph attention module.
And 4, clustering the global attention vectors of all the patent data to obtain a clustering result.
In the embodiment, a clustering operation is performed on the global attention vectors corresponding to the patent data to obtain the clustering result. Each cluster contains the global attention vectors of several patent data, and because each global attention vector comprehensively expresses the characteristics of its patent data, the patent data whose global attention vectors fall in the same cluster share very similar characteristics, can be considered to belong to the same class, and have the same CPC code. The clustering algorithm can adopt k-means clustering or a similar algorithm.
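A minimal sketch of the final clustering step with k-means from scikit-learn; the number of clusters is left open by the disclosure and is passed in here as a parameter:

```python
from sklearn.cluster import KMeans

def cluster_patents(global_vectors, n_clusters):
    """global_vectors: (N, dim) array of global attention vectors, one per patent."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(global_vectors)             # cluster label per patent
```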
And 5, performing CPC code classification on each patent data according to the clustering result.
In the embodiment, the patent data belonging to the same cluster is considered to have the same CPC code, and when the CPC of one patent data in the cluster is judged manually, the CPC codes of all other patent data in the cluster can be obtained.
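A minimal sketch of propagating one manually judged CPC code to every other patent in the same cluster; the dictionary-based helper is illustrative and not part of the disclosure:

```python
def propagate_cpc_codes(labels, judged):
    """labels: cluster label per patent; judged: {patent_index: cpc_code} with one
    manually judged patent per cluster. Returns a CPC code (or None) for every patent."""
    cluster_to_cpc = {labels[i]: code for i, code in judged.items()}
    return [cluster_to_cpc.get(c) for c in labels]
```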
In summary, the unsupervised patent clustering method based on the parallel multi-graph convolutional neural network provided by the embodiment realizes deep clustering of patents by considering both the multi-graph information and the coding information of the patent data, improves the effectiveness and generalization of patent CPC code classification, and therefore has high application value for patent CPC code classification.
The technical solutions and advantages of the present invention have been described in detail in the foregoing detailed description, and it should be understood that the above description is only the most preferred embodiment of the present invention, and is not intended to limit the present invention, and any modifications, additions, and equivalents made within the scope of the principles of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An unsupervised patent clustering method based on a parallel multi-graph convolution neural network is characterized by comprising the following steps of:
vectorizing the patent data to be clustered to obtain vectorized patent data;
constructing multiple types of patent graphs from the vectorized patent data, wherein the multiple types of patent graphs comprise a KNN patent graph constructed based on patent similarity, a co-applicant patent graph, a co-inventor patent graph and a co-keyword patent graph;
computing the patent data to be clustered with a model constructed based on unsupervised learning, comprising the following steps: performing vector encoding on each piece of vectorized patent data with the encoder contained in a self-encoder to obtain coding vectors; extracting, in parallel, the feature vectors of each type of patent graph combined with the coding vectors by using the graph convolutional neural networks contained in a parallel graph convolutional neural network module; computing, in parallel, a single-graph attention vector from each type of feature vector by using the single-graph self-attention layers contained in a parallel single-graph self-attention module; and computing a global attention vector for each patent datum from the single-graph attention vectors of all types by using a multi-graph attention module;
and clustering the global attention vectors of all patent data to obtain a clustering result.
2. The unsupervised patent clustering method based on the parallel multigraph convolutional neural network as claimed in claim 1, wherein each patent data comprises the title, abstract, applicant, inventor, and these data are vectorized to obtain vectorized patent data.
3. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein when constructing the multiple types of patent graphs, each patent is taken as a node and its vectorized patent data as the node attribute, and the connecting edges between nodes differ according to the type of patent graph, as do the construction methods, comprising:
for the KNN patent graph, calculating the similarity between every pair of patent data and, for each patent, selecting the k patent data with the largest similarity values as neighborhood patent data for constructing the connecting edges between nodes, that is, constructing a connecting edge between the nodes corresponding to each patent and each of its neighborhood patent data;
aiming at the patent drawings of the common applicant, constructing connecting edges among nodes corresponding to the common applicant;
aiming at the patent drawings of the common inventors, constructing connecting edges among nodes corresponding to the common inventors;
and aiming at the common keyword patent graph, constructing a connecting edge between nodes corresponding to the common keywords.
4. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein the encoder comprises L encoding layers, and the input vectorized patent data is vector-encoded by the encoding layers to obtain the coding vector output by each layer;
each graph convolutional neural network corresponding to each type of patent graph comprises L graph convolution layers, so that the number of graph convolution layers equals the number of encoding layers; each graph convolution layer first assigns weights to the coding vector output by the corresponding encoding layer and the feature vector output by the previous graph convolution layer, then takes the weighted feature vector as the input of the current graph convolution operation and performs the graph convolution with the adjacency matrix of the corresponding type of patent graph to output a feature vector, expressed by the formulas:

\tilde{Z}_v^{(l-1)} = (1 - \epsilon) Z_v^{(l-1)} + \epsilon H^{(l-1)}

Z_v^{(l)} = \mathrm{ReLU}\left( D^{-1/2} \tilde{A}_v D^{-1/2} \tilde{Z}_v^{(l-1)} W_v^{(l)} \right)

wherein l denotes the index of the network layer, v denotes the index of the patent graph type, \epsilon denotes the weight for balancing the importance of the coding vector and the feature vector, H^{(l-1)} denotes the coding vector output by the (l-1)-th encoding layer, Z_v^{(l-1)} and Z_v^{(l)} respectively denote the feature vectors output by the (l-1)-th and l-th graph convolution operations for the v-th type of patent graph, \tilde{Z}_v^{(l-1)} denotes the weighted feature vector, W_v^{(l)} denotes the weight of the l-th graph convolution operation for the v-th type of patent graph, \tilde{A}_v = A_v + I denotes the sum of the adjacency matrix A_v of the v-th type of patent graph and the identity matrix, D denotes the degree matrix of \tilde{A}_v, and ReLU() denotes the ReLU activation function;
for the first graph convolution layer, Z_v^{(0)} = X, the node matrix representing each type of patent graph.
5. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network of claim 1, wherein each single-graph self-attention layer calculates a single-graph attention vector in parallel from each type of feature vector, comprising: firstly calculating the attention weight of the features from each type of feature vector, and then performing activation calculation on each type of feature vector according to the attention weight to obtain the single-graph attention vector corresponding to that type of feature vector.
6. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network of claim 1, wherein calculating the global attention vector of each patent datum from the single-graph attention vectors of all types by using the multi-graph attention module comprises: firstly, performing a nonlinear transformation on each type of single-graph attention vector to obtain a multi-layer attention value of each type; then normalizing each type's multi-layer attention value relative to the multi-layer attention values of all types to obtain a global attention weight of each type; and finally performing a weighted summation of the single-graph attention vectors of each type according to the global attention weights to obtain the global attention vector of each patent datum.
7. The unsupervised patent clustering method based on the parallel multigraph convolutional neural network as claimed in claim 1, wherein the model needs parameter optimization before being applied, comprising:
decoding the coding vector output by the encoder by using a decoder contained in the self-encoder to obtain reconstructed patent data corresponding to each vectorized patent data;
constructing total loss, namely constructing reconstruction loss based on vectorization patent data input by a self-encoder and output reconstruction patent data, constructing multi-graph correlation loss based on attention vectors of all classes of single graphs, and taking weighted summation of the reconstruction loss and the multi-graph correlation loss as the total loss;
and optimizing the model parameters by using the total loss and adopting an unsupervised learning mode to obtain a model with optimized parameters.
8. The unsupervised patent clustering method based on the parallel multigraph convolutional neural network as claimed in claim 7, wherein the constructing of reconstruction loss based on vectorized patent data input from an encoder and reconstructed patent data output comprises: and constructing the reconstruction loss according to the square of the Euclidean norm between the vectorized patent data and the reconstructed patent data corresponding to all the patent data.
9. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network of claim 7, wherein constructing the multi-graph correlation loss based on the single-graph attention vectors of all types comprises: firstly, calculating the autocorrelation similarity of each type of single-graph attention vector; and then constructing the multi-graph correlation loss according to the square of the Euclidean norm between the autocorrelation similarities of any two types of single-graph attention vectors.
10. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein the unsupervised patent clustering method further comprises:
performing CPC code classification on each patent data according to the clustering result, wherein the CPC code classification comprises the following steps: patent data belonging to the same cluster are considered to have the same CPC code, and when the CPC of one patent data in the cluster is judged manually, the CPC codes of all other patent data in the cluster can be obtained.
CN202210695144.8A 2022-06-20 2022-06-20 Unsupervised patent clustering method based on parallel multi-graph convolution neural network Active CN114781553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210695144.8A CN114781553B (en) 2022-06-20 2022-06-20 Unsupervised patent clustering method based on parallel multi-graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210695144.8A CN114781553B (en) 2022-06-20 2022-06-20 Unsupervised patent clustering method based on parallel multi-graph convolution neural network

Publications (2)

Publication Number Publication Date
CN114781553A true CN114781553A (en) 2022-07-22
CN114781553B CN114781553B (en) 2023-04-07

Family

ID=82421156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210695144.8A Active CN114781553B (en) 2022-06-20 2022-06-20 Unsupervised patent clustering method based on parallel multi-graph convolution neural network

Country Status (1)

Country Link
CN (1) CN114781553B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160155018A1 (en) * 2014-11-28 2016-06-02 Honda Motor Co., Ltd. Image analysis device, method for creating image feature information database, and design similarity determination apparatus and method
CN111373392A (en) * 2017-11-22 2020-07-03 花王株式会社 Document sorting device
CN110162703A (en) * 2019-05-13 2019-08-23 腾讯科技(深圳)有限公司 Content recommendation method, training method, device, equipment and storage medium
CN113326372A (en) * 2021-05-13 2021-08-31 贵阳业勤中小企业促进中心有限公司 Intellectual property data analysis method based on technical position
CN113378913A (en) * 2021-06-08 2021-09-10 电子科技大学 Semi-supervised node classification method based on self-supervised learning
CN113362160A (en) * 2021-06-08 2021-09-07 南京信息工程大学 Federal learning method and device for credit card anti-fraud
CN113468291A (en) * 2021-06-17 2021-10-01 中国科学技术大学 Patent network representation learning-based automatic patent classification method
CN113486934A (en) * 2021-06-22 2021-10-08 河北工业大学 Attribute graph deep clustering method of hierarchical graph convolution network based on attention mechanism
CN113312500A (en) * 2021-06-24 2021-08-27 河海大学 Method for constructing event map for safe operation of dam
CN113254656A (en) * 2021-07-06 2021-08-13 北京邮电大学 Patent text classification method, electronic equipment and computer storage medium
CN113918711A (en) * 2021-07-29 2022-01-11 北京工业大学 Academic paper-oriented classification method based on multi-view and multi-layer attention
CN113722484A (en) * 2021-08-31 2021-11-30 平安国际智慧城市科技股份有限公司 Rumor detection method, device, equipment and storage medium based on deep learning
CN113869404A (en) * 2021-09-27 2021-12-31 北京工业大学 Self-adaptive graph volume accumulation method for thesis network data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DOHYUN KIM 等: "A Graph Kernel Approach for Detecting Core Patents and Patent Groups", 《IEEE》 *
吴洁 等: "基于图卷积网络的高质量专利自动识别方案研究", 《情报杂志》 *

Also Published As

Publication number Publication date
CN114781553B (en) 2023-04-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant