CN114781553A - Unsupervised patent clustering method based on parallel multi-graph convolution neural network - Google Patents
- Publication number
- CN114781553A (application CN202210695144.8A)
- Authority
- CN
- China
- Prior art keywords
- graph
- attention
- vector
- patent data
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network. Four types of patent graphs are constructed for the patent data, together with coding vectors produced by the encoder of an autoencoder. Graph convolution operations extract feature vectors from the four types of patent graphs combined with the coding vectors, so that effective features of the patent data are extracted comprehensively. A parallel single-graph self-attention module assigns weights to each type of feature vector, raising the importance of the salient features of each single graph and yielding a single-graph attention vector. A multi-graph attention module then fuses the single-graph attention vectors of all types and assigns larger weights to the more important single graphs. The resulting global attention vector integrates feature information from multiple aspects, which improves the clustering precision.
Description
Technical Field
The invention belongs to the technical field of patent classification, and particularly relates to an unsupervised patent clustering method based on a parallel multi-graph convolution neural network.
Background
Analysis of patent data can reveal market development trends and the innovation strength of organizations. People often use information such as patent names, keywords, and CPC (Cooperative Patent Classification) codes to search for patents on intellectual property platforms. The CPC is an extension of the IPC (International Patent Classification) and is jointly managed by the EPO (European Patent Office) and the United States Patent and Trademark Office. It is divided into nine sections, A-H and Y, which are in turn divided into classes, subclasses, groups and subgroups, with approximately 250000 classification entries. Whichever institution processes and approves the patent determines the classification code used for the invention. Once the patent application is approved, the CPC code cannot be changed. Therefore, it is extremely important for the patent applicant to predict the CPC code of a patent in advance.
At present, CPC codes are mostly assigned manually: the patent name, abstract and text are checked in order to match the corresponding CPC code, which is tedious for patent examiners and error-prone.
Some scholars have studied NLP (Natural Language Processing) technology and classify patents through a word embedding system and a machine learning classification model, which improves the speed and accuracy of patent classification and reduces the labor cost.
Scholars have also studied deep learning methods for classifying patents, including Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and Graph Convolutional Neural Networks (GCNs). A graph convolutional neural network introduces graph embedding to take the structural information of the original patent samples into account, and applies convolution on the graph to exploit the important relations among nodes, so that the model achieves better representation and patent classification capability. However, labeled training samples are scarce in fine-grained classification, so supervised models perform poorly and are not sufficient for fine-grained classification of CPC codes.
Patent document CN109446319A discloses a biomedical patent clustering analysis method based on K-means. It selects four important evaluation indexes in patent analysis (patent application amount, patent authorization amount, patent growth rate and patent effective rate) as clustering variables, so as to mine the associations between data and better classify the patent data, but it cannot classify patents by CPC code.
Disclosure of Invention
In view of the above, the invention provides an unsupervised patent clustering method based on a parallel multi-graph convolution neural network, which improves the precision of a model for finely classifying patents and improves the accuracy of patent classification under unsupervised learning.
In order to achieve the above object, an embodiment of the present invention provides an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network, including the following steps:
vectorizing the patent data to be clustered to obtain vectorized patent data;
constructing multiple types of patent graphs according to the vectorized patent data, wherein the multiple types of patent graphs comprise a KNN patent graph constructed based on the similarity of patents, a common-applicant patent graph, a common-inventor patent graph and a common-keyword patent graph;
processing the patent data to be clustered with a model constructed based on unsupervised learning, comprising: carrying out vector coding on each vectorized patent data by using the encoder contained in an autoencoder to obtain a coding vector; extracting, in parallel, the feature vectors of each type of patent graph combined with the coding vectors by using each graph convolutional neural network contained in the parallel graph convolutional neural network module; calculating, in parallel, a single-graph attention vector from each type of feature vector by using each single-graph self-attention layer contained in the parallel single-graph self-attention module; and calculating a global attention vector of each patent data from the single-graph attention vectors of all types by using a multi-graph attention module;
and clustering the global attention vectors of all the patent data to obtain a clustering result.
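The text does not name the clustering algorithm applied to the global attention vectors. As a minimal sketch, assuming plain k-means (Lloyd's algorithm) with a farthest-point initialisation, the final step could look like the following. The function name `kmeans`, the array `G` and the toy values are hypothetical and only illustrate the idea.

```python
import numpy as np

def kmeans(G, n_clusters, n_iter=100):
    """Plain Lloyd's k-means; k-means itself is an assumption, not named in the text.

    G: (N, d) array holding one global attention vector per patent.
    """
    # Farthest-point initialisation: start from row 0, then repeatedly pick
    # the row that is farthest from all centers chosen so far.
    centers = [G[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((G - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(G[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Squared distance of every vector to every center, shape (N, n_clusters).
        dist = ((G[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            G[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy global attention vectors forming two well-separated groups.
G = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
labels, centers = kmeans(G, n_clusters=2)
```

Any other clustering method that operates on fixed-length vectors could be substituted here without changing the rest of the pipeline.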
In one embodiment, each patent data includes an invention name, an abstract, an applicant, and an inventor, and these data are vectorized to obtain the vectorized patent data.
In one embodiment, when constructing the multiple types of patent graphs, each patent is used as a node and its vectorized patent data as the node attribute, and the connecting edges between nodes are constructed differently for each type of patent graph, including:
for the KNN patent graph, calculating the similarity between every two patent data over all the patent data, selecting for each patent data the patent data with the k largest similarity values as its neighborhood patent data, and constructing connecting edges between the corresponding nodes of the neighborhood patent data;
for the common-applicant patent graph, constructing connecting edges between nodes that share a common applicant;
for the common-inventor patent graph, constructing connecting edges between nodes that share a common inventor;
and for the common-keyword patent graph, constructing connecting edges between nodes that share a common keyword.
In one embodiment, the encoder comprises L coding layers, and the input vectorized patent data passes through the coding layers in turn to obtain the coding vector output by each layer;
each graph convolutional neural network corresponding to each type of patent graph comprises L graph convolution layers, so the number of graph convolution layers equals the number of coding layers. Each graph convolution layer first assigns weights to the coding vector output by the corresponding coding layer and the feature vector output by the previous graph convolution layer, then takes the weighted feature vector as the input of the current graph convolution operation, which is carried out in combination with the adjacency matrix of each type of patent graph to output the feature vector, expressed by a formula of the following form:
$$\tilde{Z}_v^{(l-1)} = (1-\epsilon)\, Z_v^{(l-1)} + \epsilon\, H^{(l-1)}$$
$$Z_v^{(l)} = \mathrm{ReLU}\left(D^{-\frac{1}{2}}\, \tilde{A}_v\, D^{-\frac{1}{2}}\, \tilde{Z}_v^{(l-1)}\, W_v^{(l)}\right)$$
wherein $l$ is the index of the network layer, $v$ is the index of the type of patent graph, $\epsilon$ is the weight balancing the importance of the coding vector and the feature vector, $H^{(l-1)}$ is the coding vector output by the $(l-1)$-th coding layer, $Z_v^{(l-1)}$ and $Z_v^{(l)}$ are the feature vectors output by the $(l-1)$-th and $l$-th graph convolution operations for the $v$-th type of patent graph, $\tilde{Z}_v^{(l-1)}$ is the feature vector after weight assignment, $W_v^{(l)}$ is the weight of the $l$-th graph convolution operation for the $v$-th type of patent graph, $\tilde{A}_v = A_v + I$ is the sum of the adjacency matrix $A_v$ of the $v$-th type of patent graph and the identity matrix $I$, $D$ is the diagonal degree matrix of $\tilde{A}_v$, and ReLU() is the ReLU activation function;
for the first graph convolution layer, $\tilde{Z}_v^{(0)} = X$, where $X$ is the node matrix of each type of patent graph.
In one embodiment, calculating, in parallel, a single-graph attention vector from each type of feature vector with each single-graph self-attention layer comprises: first calculating the attention weight of the features from each type of feature vector, and then performing an activation calculation on each type of feature vector according to the attention weight, so as to obtain the single-graph attention vector corresponding to each type of feature vector.
In one embodiment, calculating a global attention vector for each patent data from the single-graph attention vectors of all types using the multi-graph attention module comprises: first, applying a nonlinear transformation to each type of single-graph attention vector to obtain a multi-layer attention value for each type; then normalizing the multi-layer attention value of each type relative to those of all types to obtain a global attention weight for each type; and finally, computing the weighted sum of the single-graph attention vectors of all types according to their global attention weights to obtain the global attention vector of each patent data.
In one embodiment, the model requires parameter optimization before being applied, including:
decoding the coding vector output by the encoder by using the decoder contained in the autoencoder to obtain reconstructed patent data corresponding to each vectorized patent data;
constructing a total loss, namely constructing a reconstruction loss based on the vectorized patent data input to the autoencoder and the reconstructed patent data it outputs, constructing a multi-graph correlation loss based on the single-graph attention vectors of all types, and taking the weighted sum of the reconstruction loss and the multi-graph correlation loss as the total loss;
and optimizing the model parameters with the total loss in an unsupervised learning manner to obtain a model with optimized parameters.
In one embodiment, constructing the reconstruction loss based on the vectorized patent data input to the autoencoder and the reconstructed patent data it outputs includes: constructing the reconstruction loss from the squared Euclidean norm between the vectorized patent data and the reconstructed patent data over all the patent data.
In one embodiment, constructing the multi-graph correlation loss based on the single-graph attention vectors of all types includes: first, calculating the autocorrelation similarity of each type of single-graph attention vector; and then constructing the multi-graph correlation loss from the squared Euclidean norm between the autocorrelation similarities of any two types of single-graph attention vectors.
In one embodiment, the unsupervised patent clustering method further comprises:
performing CPC code classification on each patent data according to the clustering result, comprising: patent data belonging to the same cluster are considered to have the same CPC code, so that once the CPC code of one patent data in a cluster has been judged manually, the CPC codes of all other patent data in that cluster are obtained.
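The propagation of one manually judged CPC code to the rest of its cluster can be sketched in a few lines. The function name `propagate_cpc`, the toy cluster labels and the choice to leave unjudged clusters as `None` are illustrative assumptions, not part of the described method.

```python
def propagate_cpc(cluster_labels, manual_cpc):
    """Give every patent the CPC code judged manually for its cluster.

    cluster_labels: list of cluster ids, one per patent.
    manual_cpc: dict mapping a patent index to its manually judged CPC code.
    Patents in clusters without any judged member receive None.
    """
    cluster_to_cpc = {}
    for idx, code in manual_cpc.items():
        # Keep the first judged code seen for each cluster.
        cluster_to_cpc.setdefault(cluster_labels[idx], code)
    return [cluster_to_cpc.get(c) for c in cluster_labels]

# Five patents in three clusters; only patents 0 and 3 were judged manually.
labels = [0, 1, 0, 1, 2]
cpc = propagate_cpc(labels, {0: "G06F18/23", 3: "G06N3/08"})
```

In this toy run, patents 0 and 2 share one code, patents 1 and 3 share another, and patent 4 stays unlabelled because nobody has judged its cluster yet.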
Compared with the prior art, the method has the beneficial effects that at least:
Four types of patent graphs are constructed for the patent data, together with coding vectors produced by the encoder of the autoencoder. Graph convolution operations extract the information of the four types of patent graphs and the coding vectors, so that effective feature vectors of the patent data are extracted comprehensively. The parallel single-graph self-attention module assigns weights to each type of feature vector and raises the importance of the salient features of each single graph to obtain a single-graph attention vector. The multi-graph attention module fuses the single-graph attention vectors of all types and assigns larger weights to the important single graphs, so that the resulting global attention vector integrates feature information from multiple aspects and the clustering precision is improved.
The model is constructed based on unsupervised learning. When fine-grained classification labels are scarce, this improves the generalization performance of the model in deep clustering of patent data, improves the comprehensiveness of feature extraction, and further improves the effectiveness of patent data clustering.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network provided by an embodiment;
FIG. 2 is a schematic structural diagram of a model provided by the embodiment;
FIG. 3 is a schematic structural diagram of each graph convolution layer provided by the embodiment;
FIG. 4 is a schematic structural diagram of each single-graph self-attention layer provided by the embodiment;
FIG. 5 is a schematic structural diagram of the multi-graph attention module provided by the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Two problems are addressed: the insufficient classification performance of supervised classification models caused by too few labeled training samples in fine-grained patent classification, and the inaccurate patent classification caused by the limited generalization of classification models that rely on a single type of patent graph. To this end, the embodiment provides an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network, which improves the precision of the model in fine-grained patent classification and the accuracy of patent classification under unsupervised learning.
Fig. 1 is a flowchart of an unsupervised patent clustering method based on a parallel multi-graph convolutional neural network according to an embodiment. As shown in fig. 1, the unsupervised patent clustering method based on the parallel multi-graph convolutional neural network provided in the embodiment includes the following steps:
Step 1, vectorizing the patent data to be clustered to obtain vectorized patent data.
In the embodiment, each piece of patent data to be clustered corresponds to one patent document and includes the name, abstract, applicant and inventor of the patent. These data are vectorized to obtain the vectorized patent data, which is expressed in the form of a 1-dimensional vector group.
Step 2, constructing multiple types of patent graphs according to the vectorized patent data.
In an embodiment, the multiple types of patent graphs comprise a KNN (K-nearest-neighbor) patent graph constructed based on the similarity of patents, a common-applicant patent graph, a common-inventor patent graph and a common-keyword patent graph. When constructing the patent graphs, each patent is used as a node and its vectorized patent data as the node attribute; the connecting edges between nodes are constructed differently for each type of patent graph, as follows:
For the KNN patent graph, the similarity between every two patent data is calculated over all the patent data; for each patent data, the patent data with the k largest similarity values are selected as its neighborhood patent data, and connecting edges are constructed between the corresponding nodes of the neighborhood patent data, so as to form the KNN patent graph.
In one embodiment, the cosine similarity between every two patent data can be calculated, and the patent data with the k largest cosine similarities are selected as the neighborhood patent data for constructing the connecting edges between nodes.
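As an illustration of the KNN patent graph, the following sketch builds an adjacency matrix from cosine similarity with NumPy. It assumes that each patent is linked to its own k most similar patents and that the resulting graph is symmetrised; the function name `knn_graph` and the toy vectors are hypothetical.

```python
import numpy as np

def knn_graph(X, k):
    """Build a symmetric KNN adjacency matrix from cosine similarity.

    X: (N, d) array of vectorized patent data, one row per patent.
    k: number of most similar patents kept as neighbours of each patent.
    """
    # Normalise the rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)          # a patent is not its own neighbour

    # Column indices of the k largest similarities in every row.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    n = X.shape[0]
    A = np.zeros((n, n), dtype=float)
    A[np.repeat(np.arange(n), k), top_k.ravel()] = 1.0

    # Assumption: keep an edge if either endpoint selected the other.
    return np.maximum(A, A.T)

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_knn = knn_graph(X, k=1)
```

With k=1 the four toy patents pair up as (0, 1) and (2, 3), since each pair points in nearly the same direction.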
In the embodiment, for the common-applicant patent graph, connecting edges are constructed between nodes that share a common applicant; for the common-inventor patent graph, connecting edges are constructed between nodes that share a common inventor; and for the common-keyword patent graph, connecting edges are constructed between nodes that share a common keyword. The keywords are extracted from the invention names and the abstract contents.
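The three attribute-based graphs follow the same rule, so one helper can build all of them. The sketch below assumes each patent carries a set of applicants, inventors or keywords and links two patents whenever the sets intersect; `shared_attribute_graph` and the toy records are hypothetical.

```python
import numpy as np
from itertools import combinations

def shared_attribute_graph(attribute_sets):
    """Connect two patents whenever their attribute sets intersect.

    attribute_sets: list of sets, one per patent
                    (applicants, inventors or keywords).
    """
    n = len(attribute_sets)
    A = np.zeros((n, n), dtype=float)
    for i, j in combinations(range(n), 2):
        if attribute_sets[i] & attribute_sets[j]:
            A[i, j] = A[j, i] = 1.0
    return A

# Hypothetical toy records, not taken from the patent.
applicants = [{"Org A"}, {"Org A", "Org B"}, {"Org C"}]
inventors = [{"Li"}, {"Wang"}, {"Li", "Zhao"}]
keywords = [{"graph", "clustering"}, {"battery"}, {"clustering"}]

A_applicant = shared_attribute_graph(applicants)
A_inventor = shared_attribute_graph(inventors)
A_keyword = shared_attribute_graph(keywords)
```

The pairwise loop is quadratic in the number of patents; for large collections an inverted index from attribute to patents would avoid comparing every pair.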
Step 3, processing the patent data to be clustered with the model constructed based on unsupervised learning to obtain the global attention vector of each patent data.
Fig. 2 is a schematic structural diagram of the model provided by the embodiment. As shown in fig. 2, the constructed model includes an autoencoder comprising an encoder and a decoder, a parallel graph convolutional neural network module, a parallel single-graph self-attention module, and a multi-graph attention module. The encoder performs vector coding on the vectorized patent data to obtain a coding vector; the decoder decodes the coding vector to obtain reconstructed patent data; the parallel graph convolutional neural network module extracts, in parallel, the feature vectors of each type of patent graph combined with the coding vectors; the parallel single-graph self-attention module calculates, in parallel, a single-graph attention vector from each type of feature vector; and the multi-graph attention module calculates the global attention vector of each patent data from the single-graph attention vectors of all types.
In one embodiment, the encoder includes L coding layers, and the input vectorized patent data passes through the coding layers in turn to obtain the coding vector output by each layer, expressed by a formula of the following form:
$$H^{(l)} = \mathrm{ReLU}\left(W_e^{(l)} H^{(l-1)} + b_e^{(l)}\right)$$
wherein $l$ is the index of the coding layer, ReLU() is the ReLU activation function, $W_e^{(l)}$ and $b_e^{(l)}$ are the weight and bias of the $l$-th coding layer, and $H^{(l-1)}$ and $H^{(l)}$ are the coding vectors output by the $(l-1)$-th and $l$-th coding layers. In particular, when $l=1$, i.e. for the first coding layer, $H^{(0)}$ is the input vectorized patent data. Each coding layer can be a fully connected layer, and the resulting coding vectors can be used to enhance the data representation of the patent graphs.
In an embodiment, the decoder has the same number of layers as the encoder, i.e. it includes L decoding layers. The input coding vector passes through the decoding layers in turn, and the decoded vector output by the last decoding layer is taken as the reconstructed patent data, which is used for constructing the reconstruction loss, expressed by a formula of the following form:
$$\hat{H}^{(l)} = \mathrm{ReLU}\left(W_d^{(l)} \hat{H}^{(l-1)} + b_d^{(l)}\right)$$
wherein $W_d^{(l)}$ and $b_d^{(l)}$ are the weight and bias of the $l$-th decoding layer, and $\hat{H}^{(l-1)}$ and $\hat{H}^{(l)}$ are the decoded vectors output by the $(l-1)$-th and $l$-th decoding layers. In particular, when $l=1$, i.e. for the first decoding layer, $\hat{H}^{(0)}$ is the input coding vector.
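A minimal NumPy sketch of the encoder and decoder forward passes follows. It assumes fully connected layers with ReLU on every layer, stores patents as rows (so weights multiply on the right), and uses untrained random weights; the layer sizes and the helper names `init_layers` and `forward` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layers(sizes, rng):
    """Create (weight, bias) pairs for fully connected layers of the given sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, H):
    """Apply every layer with ReLU and keep the output of each layer."""
    outputs = []
    for W, b in layers:
        H = relu(H @ W + b)
        outputs.append(H)
    return outputs

rng = np.random.default_rng(0)
X = rng.random((5, 8))                   # 5 patents, 8-dimensional vectors
encoder = init_layers([8, 6, 4], rng)    # L = 2 coding layers
decoder = init_layers([4, 6, 8], rng)    # L = 2 decoding layers

H = forward(encoder, X)                  # coding vector of every coding layer
X_hat = forward(decoder, H[-1])[-1]      # reconstructed patent data
```

Keeping the output of every coding layer matters because each graph convolution layer consumes the coding vector of its matching coding layer, not only the final one.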
In an embodiment, the parallel graph convolutional neural network module includes as many graph convolutional neural networks as there are types of patent graphs, i.e. 4 graph convolutional neural networks for the 4 types of patent graphs. The 4 networks extract features, in parallel, from the 4 types of patent graphs combined with the coding vectors to obtain the feature vectors of the 4 types of patent graphs.
In the embodiment, each graph convolutional neural network corresponding to each type of patent graph includes L graph convolution layers, i.e. the number of graph convolution layers equals the number of coding layers. As shown in fig. 3, each graph convolution layer includes a weight assignment operation and a graph convolution operation: it first assigns weights to the coding vector output by the corresponding coding layer and the feature vector output by the previous graph convolution layer, then takes the weighted feature vector as the input of the current graph convolution operation, which is carried out in combination with the adjacency matrix of each type of patent graph to output the feature vector, expressed by a formula of the following form:
$$\tilde{Z}_v^{(l-1)} = (1-\epsilon)\, Z_v^{(l-1)} + \epsilon\, H^{(l-1)}$$
$$Z_v^{(l)} = \mathrm{ReLU}\left(D^{-\frac{1}{2}}\, \tilde{A}_v\, D^{-\frac{1}{2}}\, \tilde{Z}_v^{(l-1)}\, W_v^{(l)}\right)$$
wherein $l$ is the index of the network layer (coding layer or graph convolution layer), $v$ is the index of the type of patent graph, namely the KNN patent graph, the common-applicant patent graph, the common-inventor patent graph and the common-keyword patent graph, $\epsilon$ is the weight balancing the importance of the coding vector and the feature vector, $H^{(l-1)}$ is the coding vector output by the corresponding coding layer, $Z_v^{(l-1)}$ and $Z_v^{(l)}$ are the feature vectors output by the $(l-1)$-th and $l$-th graph convolution operations for the $v$-th type of patent graph, $\tilde{Z}_v^{(l-1)}$ is the feature vector after weight assignment, $W_v^{(l)}$ is the weight of the $l$-th graph convolution operation for the $v$-th type of patent graph, $\tilde{A}_v = A_v + I$ is the sum of the adjacency matrix $A_v$ of the $v$-th type of patent graph and the identity matrix $I$, $D$ is the diagonal degree matrix of $\tilde{A}_v$, and ReLU() is the ReLU activation function. In particular, when $l=1$, i.e. for the first graph convolution layer, $\tilde{Z}_v^{(0)} = X$, where $X$ is the node matrix of each type of patent graph.
In the embodiment, by combining the coding vectors of the autoencoder with the graph information of each type of patent graph, the parallel graph convolutional neural network module improves the feature aggregation capability of the model and comprehensively captures the distinctive features of the patent data.
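One graph convolution layer with weight assignment can be sketched as below for a single patent graph. The convex mix `(1 - eps) * Z_prev + eps * H_prev` is an assumed form of the weight assignment, the normalisation is the usual symmetric one with self-loops, and all names, shapes and random values are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, Z_prev, H_prev, W, eps=0.5):
    """One graph convolution layer preceded by weight assignment.

    Z_prev: features from the previous graph convolution layer, shape (N, d).
    H_prev: coding vectors from the matching coding layer, same shape.
    eps:    balance weight; the convex mix below is an assumed form.
    """
    Z_mixed = (1.0 - eps) * Z_prev + eps * H_prev
    return np.maximum(A_norm @ Z_mixed @ W, 0.0)   # ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # one patent graph
X = rng.random((3, 4))              # node matrix (vectorized patent data)
H1 = rng.random((3, 5))             # stand-in for the first coding layer output
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 2))

A_norm = normalize_adjacency(A)
Z1 = gcn_layer(A_norm, X, X, W1)    # first layer: the input is the node matrix X
Z2 = gcn_layer(A_norm, Z1, H1, W2)
```

Running the same two calls once per adjacency matrix, with separate weights for each graph, gives the four parallel branches of the module.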
In the embodiment, the parallel single-graph self-attention module comprises as many single-graph self-attention layers as there are types of patent graphs, i.e. 4 single-graph self-attention layers for the 4 types of patent graphs. The 4 layers calculate, in parallel, 4 types of single-graph attention vectors from the 4 types of feature vectors.
In the embodiment, as shown in fig. 4, each single-graph self-attention layer corresponding to each type of patent graph includes an attention weight calculation operation and an activation calculation operation: first, the attention weight of the features is calculated from each type of feature vector, and then an activation calculation is performed on each type of feature vector according to the attention weight, so as to obtain the single-graph attention vector corresponding to each type of feature vector, expressed by a formula of the following form:
$$a_v^i = \frac{\exp\left(\tan\left(W_v^{a} z_v^i + b_v^{a}\right)\right)}{\sum_{m}\exp\left(\tan\left(W_v^{a} z_v^m + b_v^{a}\right)\right)}$$
$$s_v^i = \mathrm{Sigmoid}\left(a_v^i \odot z_v^i\right)$$
wherein $i$ and $m$ are indexes of the patent data, $z_v^i$, $a_v^i$ and $s_v^i$ are the feature vector, the attention weight and the single-graph attention vector corresponding to the $i$-th patent data in the $v$-th type of patent graph, $W_v^{a}$ and $b_v^{a}$ are the weight and bias of the attention weight calculation, $\odot$ is the element-wise product, tan() is the tan function (presumably the hyperbolic tangent, tanh), and Sigmoid() is the Sigmoid activation function.
In an embodiment, each attention layer of the parallel single-graph self-attention module can assign a higher weight to important features of a single patent graph, so that the obtained single-graph attention vector focuses more on characteristic information embodied by the category of the single-graph attention vector.
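The two operations of a single-graph self-attention layer (attention weight calculation, then activation) can be sketched as follows. Several details are assumptions made only for illustration: the score uses tanh, the weights are normalised with a softmax over the patents, and the activation is a Sigmoid of the element-wise product of weights and features. The function name and toy inputs are hypothetical.

```python
import numpy as np

def single_graph_attention(Z, W, b):
    """Weight the features of one patent graph, then activate them.

    Z: (N, d) feature vectors produced for one type of patent graph.
    W: (d, d) weight and b: (d,) bias of the attention weight calculation.
    The softmax over patents and the element-wise product are assumptions.
    """
    scores = np.tanh(Z @ W + b)                          # (N, d), bounded scores
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over patents
    S = 1.0 / (1.0 + np.exp(-(weights * Z)))             # Sigmoid activation
    return weights, S

rng = np.random.default_rng(0)
Z = rng.random((6, 3))                                   # 6 patents, 3 features
weights, S = single_graph_attention(Z, rng.normal(size=(3, 3)), np.zeros(3))
```

One such layer is applied independently to the feature vectors of each of the four patent graphs, each with its own weight and bias.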
In an embodiment, the multi-graph attention module is configured to compute a global attention vector from the single-graph attention vectors of all types. As shown in fig. 5, the multi-graph attention module includes a nonlinear transformation calculation operation, a global attention weight calculation operation, and a global attention vector calculation operation: first, a nonlinear transformation is applied to each type of single-graph attention vector to obtain a multi-layer attention value for each type; then the multi-layer attention value of each type is normalized relative to those of all types to obtain a global attention weight for each type; and finally, the single-graph attention vectors of all types are summed with their global attention weights to obtain the global attention vector of each patent data, expressed by a formula of the following form:
$$\omega_v^i = q^{T} \tan\left(W s_v^i + b\right)$$
$$\lambda_v^i = \frac{\exp\left(\omega_v^i\right)}{\sum_{t=1}^{V}\exp\left(\omega_t^i\right)}$$
$$g^i = \sum_{v=1}^{V} \lambda_v^i\, s_v^i$$
wherein $q$ is the shared attention vector, the superscript $T$ denotes transposition, $W$ and $b$ are the weight and bias of the nonlinear transformation calculation operation, written here with the tan() function of the single-graph self-attention layer, $s_v^i$ is the single-graph attention vector of the $i$-th patent data in the $v$-th type of patent graph, $V$ is the number of types of patent graphs, $\omega_v^i$ and $\lambda_v^i$ are the multi-layer attention value and the global attention weight corresponding to the $i$-th patent data in the $v$-th type of patent graph, and $g^i$ is the global attention vector of the $i$-th patent data.
In the embodiment, the multi-graph attention module allocates higher weight to the important single-graph attention vector, so that the feature extraction capability of the model is improved, and the deep clustering capability is further improved.
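The three operations of the multi-graph attention module (nonlinear transformation, normalisation across graph types, weighted sum) can be sketched with NumPy as below. The tanh nonlinearity, the softmax normalisation, the hidden size and all names are assumptions made for illustration.

```python
import numpy as np

def multi_graph_attention(S_list, W, b, q):
    """Fuse single-graph attention vectors into one global vector per patent.

    S_list: list of V arrays of shape (N, d), one per type of patent graph.
    W: (d, h) weight, b: (h,) bias, q: (h,) shared attention vector.
    """
    S = np.stack(S_list, axis=0)                      # (V, N, d)
    omega = np.tanh(S @ W + b) @ q                    # (V, N) multi-layer attention values
    omega = omega - omega.max(axis=0, keepdims=True)  # numerical stability
    lam = np.exp(omega)
    lam = lam / lam.sum(axis=0, keepdims=True)        # normalise across the V graphs
    G = (lam[:, :, None] * S).sum(axis=0)             # (N, d) weighted sum
    return lam, G

rng = np.random.default_rng(0)
S_list = [rng.random((5, 3)) for _ in range(4)]       # 4 graph types, 5 patents
W, b, q = rng.normal(size=(3, 6)), np.zeros(6), rng.normal(size=6)
lam, G = multi_graph_attention(S_list, W, b, q)
```

Because the weights of each patent sum to one across the graph types, the global attention vector is a convex combination of that patent's single-graph attention vectors.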
In an embodiment, the parameters of the constructed model need to be optimized before the model is applied, including: constructing a total loss, which includes constructing a reconstruction loss based on the vectorized patent data input to the autoencoder and the reconstructed patent data it outputs, constructing a multi-graph correlation loss based on the single-graph attention vectors of all types, and taking the weighted sum of the reconstruction loss and the multi-graph correlation loss as the total loss; and optimizing the model parameters with the total loss in an unsupervised learning manner to obtain a model with optimized parameters, wherein the total loss $Loss_{final}$ is expressed as:
$$Loss_{final} = \alpha\, Loss_{Reconstruction} + \beta\, Loss_{Multi\text{-}graph}$$
wherein $\alpha$ and $\beta$ are hyper-parameters determined by unsupervised learning.
In an embodiment, the reconstruction loss $Loss_{Reconstruction}$ is constructed from the vectorized patent data input to the autoencoder and the reconstructed patent data it outputs. Specifically, the reconstruction loss is constructed from the square of the Euclidean norm between the vectorized patent data and the reconstructed patent data over all patent data, expressed as:

$$Loss_{Reconstruction} = \frac{1}{N}\sum_{i=1}^{N}\left\| x_i - \hat{x}_i \right\|_2^2 = \frac{1}{N}\left\| X - \hat{X} \right\|_F^2$$

where $x_i$ and $\hat{x}_i$ respectively represent the vectorized patent data and the reconstructed patent data corresponding to the $i$-th patent data, $X$ and $\hat{X}$ respectively represent the vectorized patent data and the reconstructed patent data corresponding to all patent data, $N$ represents the total number of patent data, $\|\cdot\|_2^2$ represents the square of the Euclidean norm, and $\|\cdot\|_F^2$ represents the result of that norm taken over all patent data.
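As a worked illustration of the reconstruction loss, the sketch below sums the squared Euclidean distance between each vectorized patent and its reconstruction and averages over the patents. The function name `reconstruction_loss` and the averaging by the number of patents are assumptions made for the example; a different constant factor would only rescale the loss.

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    """Mean over patents of the squared Euclidean distance between input and reconstruction.

    X, X_hat: arrays of shape (N, d), one row per patent.
    """
    diff = np.asarray(X, dtype=float) - np.asarray(X_hat, dtype=float)
    # Sum of squared entries over all rows, divided by the number of patents N (assumed scaling).
    return float((diff ** 2).sum() / diff.shape[0])
```

For example, with `X = [[1, 2], [3, 4]]` and `X_hat = [[1, 0], [0, 4]]`, the squared differences sum to 13 and the loss is 13 / 2 = 6.5.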
In an embodiment, the multi-graph correlation loss $Loss_{Multi\text{-}graph}$ is constructed from the single-graph attention vectors of all classes. Specifically, the autocorrelation similarity of each class of single-graph attention vector is calculated first; then the multi-graph correlation loss is constructed from the square of the Euclidean norm between the autocorrelation similarities of any two classes of single-graph attention vectors, expressed as:

$$S_v = \tilde{F}_v \tilde{F}_v^{T}$$

$$Loss_{Multi\text{-}graph} = \sum_{v=1}^{V}\ \sum_{t=1,\ t\neq v}^{V}\left\| S_v - S_t \right\|^2$$

where $\tilde{F}_v$ and $S_v$ respectively represent the normalized result of the $v$-th class of single-graph attention vector and its autocorrelation similarity with respect to itself, $t$ represents an index over the autocorrelation similarities of the single-graph attention vectors, $S_t$ and $S_v$ respectively represent the autocorrelation similarities of the $t$-th and $v$-th classes of single-graph attention vectors, $\|\cdot\|^2$ represents the square of the Euclidean norm, and $V$ represents the number of classes of patent graph.
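The multi-graph correlation loss can be sketched as below. The sketch assumes that "normalization" means row-wise L2 normalization, that the autocorrelation similarity is the product of the normalized matrix with its own transpose, and that the loss sums the squared entry-wise difference over every ordered pair of distinct graph classes; the function names are hypothetical.

```python
import numpy as np

def self_similarity(F):
    """Autocorrelation similarity of one class of single-graph attention vectors.

    F: (N, d) matrix, one row per patent.
    """
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    F_norm = F / np.maximum(norms, 1e-12)   # row-wise L2 normalization (assumed)
    return F_norm @ F_norm.T                # (N, N) patent-to-patent similarity

def multi_graph_correlation_loss(single_graph_vectors):
    """Sum of squared differences between the similarities of every two graph classes."""
    sims = [self_similarity(np.asarray(F, dtype=float)) for F in single_graph_vectors]
    loss = 0.0
    for v in range(len(sims)):
        for t in range(len(sims)):
            if t != v:
                # Squared Euclidean norm of the difference, taken over all matrix entries.
                loss += float(((sims[v] - sims[t]) ** 2).sum())
    return loss
```

The loss is zero when every graph class induces the same patent-to-patent similarity structure, so minimizing it pushes the parallel branches toward a consistent view of which patents are alike.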
The total loss provided by this embodiment combines the reconstruction loss and the multi-graph correlation loss, which improves the generalization performance of the model for deep clustering of patent data and thereby the effectiveness of CPC code classification of patents.
The model optimized with the total loss through unsupervised learning has strong generalization capability and yields a comprehensive global attention vector, which enables effective and reliable CPC code classification of patents.
In an embodiment, after parameter optimization, the computation performed by the model on the patent data to be clustered comprises: performing vector encoding on each vectorized patent data with the encoder of the autoencoder to obtain encoding vectors; extracting, in parallel, the feature vectors of each class of patent graph in combination with the encoding vectors using the graph convolutional neural networks of the parallel graph convolutional neural network module; computing, in parallel, a single-graph attention vector from each class of feature vector using the single-graph self-attention layers of the parallel single-graph self-attention module; and computing the global attention vector of each patent data from the single-graph attention vectors of all classes using the multi-graph attention module.
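To make the second step concrete, the sketch below shows one graph convolution layer that combines the encoder's encoding vectors with the previous layer's feature vectors for a single class of patent graph. It is an illustration under assumptions: the function name `fused_gcn_layer`, the mixing rule `(1 - eps) * Z_prev + eps * H_prev`, and the symmetric normalization of the adjacency matrix with self-loops are choices made for the example rather than details confirmed by the text above.

```python
import numpy as np

def fused_gcn_layer(A, Z_prev, H_prev, W, eps=0.5):
    """One graph convolution step on one class of patent graph.

    A: (N, N) adjacency matrix of the patent graph.
    Z_prev: (N, d) feature vectors from the previous graph convolution layer.
    H_prev: (N, d) encoding vectors from the matching encoder layer.
    W: (d, d_out) layer weight; eps balances encoding and graph features (assumed form).
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # add self-loops
    degree = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degree))  # D^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization

    Z_mix = (1.0 - eps) * Z_prev + eps * H_prev  # weighted combination of the two inputs
    return np.maximum(A_hat @ Z_mix @ W, 0.0)    # ReLU activation
```

Running one such layer per graph class on the same encoding vectors gives the parallel branches whose outputs feed the single-graph self-attention layers.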
Step 4: cluster the global attention vectors of all patent data to obtain a clustering result.
In an embodiment, a clustering operation is performed on the global attention vectors corresponding to the patent data to obtain a clustering result. Each cluster contains the global attention vectors of a plurality of patent data, and each global attention vector comprehensively expresses the features of its patent data. The patent data in a cluster obtained from the global attention vectors therefore have highly similar features, can be regarded as belonging to the same class, and are assigned the same CPC code. The clustering algorithm may be, for example, k-means clustering.
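As one possible realization of this step, the sketch below runs a plain Lloyd-style k-means on the global attention vectors. The function name `kmeans`, the random initialization from data points, and the fixed iteration cap are assumptions made for the example; any standard k-means implementation could be substituted.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X (one global attention vector per patent) into k clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (N, k).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The returned `labels` array gives the cluster index of each patent, which is the clustering result used for the subsequent CPC code assignment.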
Step 5: perform CPC code classification on each patent data according to the clustering result.
In an embodiment, the patent data belonging to the same cluster are considered to have the same CPC code, so once the CPC code of one patent data in a cluster has been determined manually, the CPC codes of all other patent data in that cluster are obtained.
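The propagation of a manually determined CPC code to the rest of its cluster can be sketched as follows. The function name `propagate_cpc` and the data layout (a list of cluster ids plus a dictionary of manually labelled patents) are hypothetical; if several patents of the same cluster have been labelled, this sketch simply keeps the first label it sees.

```python
def propagate_cpc(cluster_labels, known_cpc):
    """Assign each patent the CPC code manually determined for its cluster.

    cluster_labels: list with the cluster id of each patent.
    known_cpc: dict mapping the index of a manually judged patent to its CPC code.
    Returns a list of CPC codes, with None for clusters that have no judged patent.
    """
    cluster_to_cpc = {}
    for patent_index, code in known_cpc.items():
        # The first manually judged patent of a cluster fixes that cluster's code.
        cluster_to_cpc.setdefault(cluster_labels[patent_index], code)
    return [cluster_to_cpc.get(cluster) for cluster in cluster_labels]
```

For example, `propagate_cpc([0, 0, 1, 1, 2], {0: "G06F18/23", 3: "G06N3/045"})` labels the first two patents `G06F18/23`, the next two `G06N3/045`, and leaves the last one as `None` because no patent in its cluster has been judged yet.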
In summary, the unsupervised patent clustering method based on a parallel multi-graph convolutional neural network provided by this embodiment achieves deep clustering of patents by taking into account both the multi-graph information and the encoding information of patent data, improves the effectiveness and generalization of CPC code classification of patents, and has high application value for CPC code classification of patents.
The technical solutions and advantages of the present invention have been described in detail in the foregoing embodiments. It should be understood that the above description is only the most preferred embodiment of the present invention and is not intended to limit the present invention; any modifications, additions, and equivalent substitutions made within the scope of the principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. An unsupervised patent clustering method based on a parallel multi-graph convolution neural network is characterized by comprising the following steps of:
vectorizing the patent data to be clustered to obtain vectorized patent data;
constructing multiple classes of patent graphs from the vectorized patent data, the multiple classes of patent graphs comprising a KNN patent graph constructed based on patent similarity, a common-applicant patent graph, a common-inventor patent graph, and a common-keyword patent graph;
computing on the patent data to be clustered using a model constructed based on unsupervised learning, comprising: performing vector encoding on each vectorized patent data with the encoder of an autoencoder to obtain encoding vectors; extracting, in parallel, the feature vectors of each class of patent graph in combination with the encoding vectors using the graph convolutional neural networks of a parallel graph convolutional neural network module; computing, in parallel, a single-graph attention vector from each class of feature vector using the single-graph self-attention layers of a parallel single-graph self-attention module; and computing the global attention vector of each patent data from the single-graph attention vectors of all classes using a multi-graph attention module;
and clustering the global attention vectors of all patent data to obtain a clustering result.
2. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein each patent data comprises a title, an abstract, an applicant, and an inventor, and these data are vectorized to obtain the vectorized patent data.
3. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein, when constructing the multiple classes of patent graphs, each patent is taken as a node and the vectorized patent data is taken as the node attribute, and the connecting edges between nodes are constructed differently depending on the class of patent graph, including:
for the KNN patent graph, computing the similarity between every two patent data over all patent data, selecting the patent data corresponding to the top k similarities as neighborhood patent data according to the similarity values, and using them to construct the connecting edges between nodes, that is, constructing a connecting edge between any two nodes corresponding to the neighborhood patent data;
for the common-applicant patent graph, constructing connecting edges between nodes that share a common applicant;
for the common-inventor patent graph, constructing connecting edges between nodes that share a common inventor;
and for the common-keyword patent graph, constructing connecting edges between nodes that share a common keyword.
4. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein the encoder comprises L encoding layers, and the input vectorized patent data is vector-encoded by the encoding layers to obtain the encoding vector output by each layer;
the graph convolutional neural network corresponding to each class of patent graph comprises L graph convolution layers, equal in number to the encoding layers; each graph convolution layer first weights the encoding vector output by the corresponding encoding layer and the feature vector output by the previous graph convolution layer, then takes the weighted feature vector as the input of the current graph convolution operation and performs the graph convolution operation in combination with the adjacency matrix of the class of patent graph to output a feature vector, expressed by the formulas:

$$\tilde{Z}_v^{(l-1)} = (1-\varepsilon)\, Z_v^{(l-1)} + \varepsilon\, H^{(l-1)}$$

$$Z_v^{(l)} = \mathrm{ReLU}\!\left( D^{-\frac{1}{2}}\, \tilde{A}_v\, D^{-\frac{1}{2}}\, \tilde{Z}_v^{(l-1)}\, W_v^{(l)} \right)$$

where $l$ represents the index of the network layer, $v$ represents the index of the class of patent graph, $\varepsilon$ represents the weight that balances the importance of the encoding vector and the feature vector, $H^{(l-1)}$ represents the encoding vector output by the $(l-1)$-th encoding layer, $Z_v^{(l-1)}$ and $Z_v^{(l)}$ respectively represent the feature vectors output by the $(l-1)$-th and $l$-th graph convolution operations corresponding to the $v$-th class of patent graph, $\tilde{Z}_v^{(l-1)}$ represents the weighted feature vector, $W_v^{(l)}$ represents the weight of the $l$-th graph convolution operation corresponding to the $v$-th class of patent graph, $\tilde{A}_v$ represents the sum of the adjacency matrix $A_v$ of the $v$-th class of patent graph and the identity matrix, $D$ represents the diagonal (degree) matrix of $\tilde{A}_v$, and $\mathrm{ReLU}(\cdot)$ represents the ReLU activation function.
5. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network of claim 1, wherein computing, in parallel, the single-graph attention vector from each class of feature vector by each single-graph self-attention layer comprises: first computing the attention weights of the features from each class of feature vector, and then performing an activation computation on each class of feature vector according to the attention weights to obtain the single-graph attention vector corresponding to that class of feature vector.
6. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network of claim 1, wherein computing the global attention vector of each patent data from the single-graph attention vectors of all classes using the multi-graph attention module comprises: first, applying a non-linear transformation to each class of single-graph attention vector to obtain the multi-layer attention value of that class; then, normalizing the multi-layer attention value of each class with respect to the multi-layer attention values of all classes to obtain the global attention weight of that class; and finally, weighting and summing the single-graph attention vectors of all classes according to their global attention weights to obtain the global attention vector of each patent data.
7. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein the parameters of the model need to be optimized before the model is applied, comprising:
decoding the encoding vector output by the encoder with the decoder of the autoencoder to obtain the reconstructed patent data corresponding to each vectorized patent data;
constructing a total loss, including constructing a reconstruction loss based on the vectorized patent data input to the autoencoder and the reconstructed patent data it outputs, constructing a multi-graph correlation loss based on the single-graph attention vectors of all classes, and taking the weighted sum of the reconstruction loss and the multi-graph correlation loss as the total loss;
and optimizing the model parameters with the total loss in an unsupervised learning manner to obtain a parameter-optimized model.
8. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 7, wherein constructing the reconstruction loss based on the vectorized patent data input to the autoencoder and the reconstructed patent data it outputs comprises: constructing the reconstruction loss from the square of the Euclidean norm between the vectorized patent data and the reconstructed patent data over all patent data.
9. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network of claim 7, wherein constructing the multi-graph correlation loss based on the single-graph attention vectors of all classes comprises: first, calculating the autocorrelation similarity of each class of single-graph attention vector; and then constructing the multi-graph correlation loss from the square of the Euclidean norm between the autocorrelation similarities of any two classes of single-graph attention vectors.
10. The unsupervised patent clustering method based on the parallel multi-graph convolutional neural network as claimed in claim 1, wherein the method further comprises:
performing CPC code classification on each patent data according to the clustering result, comprising: considering the patent data belonging to the same cluster to have the same CPC code, so that once the CPC code of one patent data in a cluster has been determined manually, the CPC codes of all other patent data in that cluster are obtained.
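For illustration only, the graph construction recited in claim 3 can be sketched as adjacency matrices built with NumPy. The sketch rests on assumptions that are not stated in the claims: cosine similarity is used as the similarity measure, each patent is linked to its k most similar patents and the result is made symmetric, and the applicants, inventors, or keywords of each patent are given as Python sets. The function names `knn_adjacency` and `shared_attribute_adjacency` are hypothetical.

```python
import numpy as np

def knn_adjacency(X, k):
    """KNN patent graph: link each patent to its k most similar patents.

    X: (N, d) vectorized patent data; cosine similarity is an assumed measure.
    """
    X = np.asarray(X, dtype=float)
    X_norm = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = X_norm @ X_norm.T
    np.fill_diagonal(sim, -np.inf)                 # a patent is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]    # indices of the top-k similarities
    n = len(X)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, neighbors[i]] = 1
    return np.maximum(A, A.T)                      # make the graph undirected

def shared_attribute_adjacency(attribute_sets):
    """Link two patents when they share an applicant, inventor, or keyword.

    attribute_sets: list of sets, one per patent.
    """
    n = len(attribute_sets)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if attribute_sets[i] & attribute_sets[j]:
                A[i, j] = A[j, i] = 1
    return A
```

Calling `shared_attribute_adjacency` once each with the applicant sets, the inventor sets, and the keyword sets yields the three attribute-based graphs, so together with `knn_adjacency` the four classes of patent graph share one node set and differ only in their edges.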
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210695144.8A CN114781553B (en) | 2022-06-20 | 2022-06-20 | Unsupervised patent clustering method based on parallel multi-graph convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114781553A (en) | 2022-07-22 |
CN114781553B (en) | 2023-04-07 |
Family
ID=82421156
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210695144.8A Active CN114781553B (en) | 2022-06-20 | 2022-06-20 | Unsupervised patent clustering method based on parallel multi-graph convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114781553B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN114781553B (en) | 2023-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||