CN116932765A - Patent text multi-level classification method and equipment based on graph neural network - Google Patents

Patent text multi-level classification method and equipment based on graph neural network

Info

Publication number
CN116932765A
Authority
CN
China
Prior art keywords
features
classification
classification result
level
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311187657.9A
Other languages
Chinese (zh)
Other versions
CN116932765B (en)
Inventor
王军雷
王亮亮
季南
郭少杰
冀然
刘兰
丁强
龙悦
叶晓雪
王静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongqi Intellectual Property Guangzhou Co ltd
China Automobile Information Technology Tianjin Co ltd
Original Assignee
Zhongqi Intellectual Property Guangzhou Co ltd
China Automobile Information Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongqi Intellectual Property Guangzhou Co ltd, China Automobile Information Technology Tianjin Co ltd filed Critical Zhongqi Intellectual Property Guangzhou Co ltd
Priority to CN202311187657.9A priority Critical patent/CN116932765B/en
Publication of CN116932765A publication Critical patent/CN116932765A/en
Application granted granted Critical
Publication of CN116932765B publication Critical patent/CN116932765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/18Legal services
    • G06Q50/184Intellectual property management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Technology Law (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a patent text multi-level classification method and device based on a graph neural network, and belongs to the technical field of deep learning. A plurality of words, multi-level classification number features and enterprise features are extracted from a patent text to be classified; a label vector is input into a graph convolutional neural network and combined with a hierarchical adjacency matrix carrying prior probabilities to obtain label features with hierarchical relations; a feature extraction module extracts word embedding vectors for the plurality of words to obtain global associated features; the global associated features, the multi-level classification number features of the patent to be classified and the embedded representation features of the enterprise are fused by attention and classified through the graph convolutional neural network to obtain a first classification result; the label features and the global associated features are fused and classified to obtain a second classification result; and the first classification result and the second classification result are fused, so that automatic classification of patent texts is realized and classification accuracy is improved.

Description

Patent text multi-level classification method and equipment based on graph neural network
Technical Field
The application relates to the technical field of deep learning, in particular to a patent text multi-level classification method and device based on a graph neural network.
Background
A patent can be classified into different technical fields according to the technical solution it records. At present, patent text classification is carried out by the following two methods.
The first method is as follows: the patent is read manually and assigned to different classification numbers. The second method is as follows: valuable content is identified and extracted by a natural language understanding model and then matched against a domain-specific keyword dictionary to determine the technical field of the patent. For example, if the patent title contains keywords such as "combustion engine" and the corresponding structural devices, the patent is classified under F02.
The first method is labor-intensive and inefficient; the second method, which relies on keyword matching, improves efficiency to a certain extent, but its accuracy is limited and manual spot checks are often still required.
The present application has been made in view of the above-described drawbacks.
Disclosure of Invention
In order to solve the above technical problems, the application provides a patent text multi-level classification method and device based on a graph neural network, which realize automatic classification of patent texts and improve classification accuracy.
The embodiment of the application provides a patent text multi-level classification method based on a graph neural network, which comprises the following steps:
s1, extracting a plurality of words, multi-level classification number features and enterprise features from a patent text to be classified;
s2, randomizing a label vector, and inputting the label vector into a graph convolution neural network to combine a hierarchical adjacency matrix with prior probability to obtain a label feature with a hierarchical relationship;
s3, extracting word embedding vectors of the words by adopting a feature extraction module, and extracting features by using a Bi-LSTM+self-intent model to obtain global associated features;
s4, fusing the global associated features, namely the multi-level classification number features of the to-be-classified patent and enterprise features by using the degree, and then obtaining a first classification result through a graph convolution neural network and a softmax layer;
s5, fusing the tag features and the global associated features, and then passing through a softmax layer to obtain a second classification result;
s6, fusing the first classification result and the second classification result to obtain a final multi-stage classification result.
Optionally, the feature extraction module includes: an embedding layer, a Bi-LSTM and a self-attention model;
the step S3 comprises the following steps:
inputting the words in parallel to the embedding layer to obtain a plurality of first features;
inputting the first features into the Bi-LSTM to obtain second features;
and inputting the second features into the self-attention model to obtain a plurality of global associated features.
Optionally, the multi-level classification number at least comprises a multi-level IPC classification number and a multi-level CPC classification number;
the step S4 comprises the following steps:
inputting the multi-level classification number features of different types and the global association features into different attention models to obtain global association features focusing on the multi-level IPC classification number and global association features focusing on the multi-level CPC classification number;
and processing the global associated features focusing on the multi-level IPC classification number, the global associated features focusing on the multi-level CPC classification number and the enterprise features through a concatenate layer, a fully connected layer, a graph convolutional neural network, another fully connected layer and a softmax layer to obtain a first classification result.
Optionally, the features of the enterprise to which the patent text to be classified belongs include: the normalized proportion of the number of patents published by the enterprise under each classification number to the total number of patents of the enterprise, and/or a category-type embedded representation feature of the enterprise.
Optionally, the step S6 includes:
and adding the first classification result and the second classification result to obtain a final multi-level classification result.
Optionally, the initial parameters of the graph convolution neural network are determined according to the hierarchical relation and distribution of the technical field, and the initial parameters are updated in the training process;
the first classification result, the second classification result and the multi-stage classification result are technical fields.
The embodiment of the application also provides electronic equipment, which comprises: a memory and a processor;
wherein one or more computer programs are stored in the memory, the one or more computer programs comprising instructions; the instructions, when executed by the processor, cause the processor to perform any of the graph neural network-based patent text multi-level classification methods.
The method and the device provided by the application have the following technical effects:
1) The label vector is input into the graph convolution neural network to obtain label characteristics with hierarchical relations, so that a technical field classification result with hierarchical relations is obtained.
2) The global association features and the multi-level classification number features are fused and classified, so that the prior knowledge of the existing classification number is utilized to obtain a more accurate technical field classification result.
3) By fusing the first classification result and the second classification result, more accurate classification can be performed by combining the classification numbers, the hierarchical relations and the global associated features of the patent text, with the enhanced attention over the text assisting the prediction.
4) A label structure representation with good performance is obtained by using the GCN model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of the patent text multi-level classification method based on a graph neural network of the present application;
FIG. 2 is a schematic diagram of a plurality of model architectures corresponding to FIG. 1 according to an embodiment of the present application;
fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the application, are within the scope of the application.
Example 1
Fig. 1 is a flowchart of a patent text multi-level classification method based on a graph neural network, which is suitable for classifying patent text into multi-level technical fields and includes the following operations:
s1, extracting a plurality of words, multi-level classification number features and enterprise features from patent texts to be classified.
The patent text to be classified includes the bibliographic items, the claims, the description and the abstract of the description, from which a plurality of words are extracted. The extraction strategy is not limited; for example, a plurality of words may be extracted from the title and claim 1, so as to characterize the technical field to which the patent text to be classified belongs. The multi-level classification number features include IPC features and CPC features. The enterprise is the applicant/patentee of the patent and is likewise extracted in advance from the patent text to be classified. The enterprise features include: the normalized proportion of the number of patents published by the enterprise under each classification number to the total number of patents of the enterprise, and/or a category-type embedded representation feature of the enterprise.
The category-type embedded representation feature of the enterprise may be represented as a code, and this code is input into an embedding layer to be converted into a vector. The category-type embedded representation feature represents the technical field category of the enterprise; it is distinct from IPC and CPC and is represented only by a code. In practical applications, the code is obtained through training on the technical fields of the patents published by the enterprise: the closer a patent to be classified is to the technical fields of those published patents, the closer its code is.
The application introduces the features of the enterprise to which the patent text to be classified belongs into the technical field classification, considering that the business field of an enterprise tends to be concentrated; for example, the classification numbers of a vehicle enterprise's patents are mostly B60W. The enterprise features therefore help bring the predicted technical field of the patent text to be classified closer to the enterprise's actual business field, which is beneficial to improving classification accuracy.
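As an illustrative sketch only (the function name, the use of classification-number strings and the toy data below are assumptions, not part of the application), the normalized occupancy feature described above could be computed as follows:

```python
from collections import Counter

def enterprise_occupancy_feature(published_classes, all_classes):
    """Normalized share of the enterprise's published patents per classification number.

    published_classes: classification numbers (e.g. "B60W") of the patents
        already published by the enterprise.
    all_classes: ordered list of all classification numbers used as feature slots.
    Returns a list of floats summing to 1.0 (all zeros if the enterprise has no patents).
    """
    counts = Counter(published_classes)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_classes)
    return [counts.get(c, 0) / total for c in all_classes]

# Example: a vehicle enterprise whose patents are mostly classified under B60W.
feature = enterprise_occupancy_feature(
    ["B60W", "B60W", "B60W", "G06F"], ["B60W", "G06F", "H01C"]
)
# feature -> [0.75, 0.25, 0.0]
```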
S2, randomizing the label vector, and inputting the label vector into a graph convolution neural network to combine the hierarchical adjacency matrix with prior probability to obtain the label feature with hierarchical relation.
Label vectors are initialized through a random function, and the number of label vectors is the number of all technical fields.
The randomly initialized label vectors are input into a graph convolutional neural network (GCN), and label features are comprehensively acquired from three different directions (top-down, bottom-up and self-loop) by combining a hierarchical adjacency matrix carrying prior knowledge; the label features are fused with the global associated features and classified through a softmax layer, the label vectors are continuously updated according to the loss function, and label vector features with hierarchical relations are finally output. The adjacency matrix of the GCN is determined according to the hierarchical relations and distribution of the technical fields and therefore carries prior knowledge. For example, the random label vectors are input into a GCN whose adjacency matrix encodes, in advance, the hierarchical relations and distribution proportions of the technical fields counted from patent statistics; for instance, technical field A accounts for 40%, technical field A1 accounts for 60% and technical field A2 accounts for 40%.
Since the hierarchical relations and distribution proportions of the technical fields of all published patents differ from the actual situation of the patent samples or the patents to be classified, these initial parameters also need to be updated during training, which will be described in detail in the following embodiments.
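A minimal sketch of how such a prior adjacency matrix and one graph-convolution step over the label vectors might look is given below; the three-label toy hierarchy, the symmetric normalization and all tensor names are assumptions for illustration and are not prescribed by the application.

```python
import torch

# Assumed toy hierarchy: technical field A with children A1 and A2,
# whose prior distribution under A is 60% / 40%.
labels = ["A", "A1", "A2"]
label_num, label_dim = len(labels), 300

# Hierarchical adjacency with prior probabilities: top-down edges weighted by
# the child's prior share, bottom-up edges set to 1, plus self-loops.
A = torch.eye(label_num)
A[0, 1], A[0, 2] = 0.6, 0.4   # top-down (prior distribution of A1/A2 under A)
A[1, 0], A[2, 0] = 1.0, 1.0   # bottom-up

# Randomly initialized label vectors; they are updated later by the loss function.
H = torch.randn(label_num, label_dim)
W = torch.randn(label_dim, label_dim, requires_grad=True)

# One graph-convolution step with symmetric normalization D^-1/2 A D^-1/2,
# yielding label features that already mix in hierarchy information.
D_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))
H_next = torch.relu(D_inv_sqrt @ A @ D_inv_sqrt @ H @ W)   # [label_num, label_dim]
```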
And S3, extracting word embedding vectors of the words by adopting a feature extraction module, and extracting features by using a Bi-LSTM + self-attention model to obtain global associated features.
The feature extraction module comprises: an embedding layer, a bidirectional long short-term memory network (Bi-LSTM) and a self-attention model. Specifically, the plurality of words are input into the embedding layer in parallel to obtain a plurality of first features (i.e., word embedding vectors); the first features are input into the Bi-LSTM to obtain second features; and the second features are input into the self-attention model to obtain a plurality of global associated features.
Bi-LSTM can better capture the bidirectional semantic dependencies of long sentences. Self-attention is used to compensate for the information that Bi-LSTM loses when the patent text is long, and thereby obtain the global associated features.
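A minimal PyTorch sketch of such a feature extraction module follows; the vocabulary size, the dimensions and the use of nn.MultiheadAttention as the self-attention layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Embedding -> Bi-LSTM -> self-attention, yielding global associated features."""
    def __init__(self, vocab_size=30000, embedding_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.self_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                               batch_first=True)

    def forward(self, word_ids):                      # [batch_size, seq_len]
        first = self.embedding(word_ids)              # [batch_size, seq_len, embedding_dim]
        second, _ = self.bilstm(first)                # [batch_size, seq_len, 2*hidden_dim]
        global_feat, _ = self.self_attn(second, second, second)
        return global_feat                            # [batch_size, seq_len, 2*hidden_dim]

# Example usage: a batch of 8 tokenized patent texts, each 256 words long.
word_ids = torch.randint(0, 30000, (8, 256))
features = FeatureExtractor()(word_ids)               # torch.Size([8, 256, 256])
```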
S4, fusing the global associated features with the multi-level classification number features of the patent to be classified and the enterprise features by means of attention, and then obtaining a first classification result through a graph convolutional neural network and a softmax layer.
S4 comprises the following steps: inputting the multi-level classification number features of different types and the global associated features into different attention models to obtain global associated features focusing on the multi-level IPC classification number and global associated features focusing on the multi-level CPC classification number; and processing the global associated features focusing on the multi-level IPC classification number, the global associated features focusing on the multi-level CPC classification number and the enterprise features through a concatenate layer, a fully connected layer, a graph convolutional neural network, another fully connected layer and a softmax layer to obtain a first classification result.
Specifically, only the multi-level IPC classification number or the multi-level CPC classification number may be used, or both may be used. Under the condition that the two are used, the multi-level classification number features of different types and the global association features are input into different attention models, so that the global association features focusing on the multi-level IPC classification number and the global association features focusing on the multi-level CPC classification number are obtained.
After the multi-level classification number features, the enterprise features and the global associated features are fused (through the concatenate layer), the fused features pass through a fully connected layer and are then input into the GCN as the initial values of the label features, so that label structure vectors with hierarchical relations are obtained. Since the fused features carry more information than the random label vectors, the accuracy of the hierarchy is improved.
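A condensed sketch of this first branch is shown below (attention over the IPC/CPC classification-number features, concatenation with the enterprise feature, a fully connected layer, a single graph-convolution layer over the label graph, another fully connected layer and softmax); the dimensions, the mean-pooling over the sequence and the one-layer GCN are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstBranch(nn.Module):
    def __init__(self, A_hat, feat_dim=256, cls_dim=64, ent_dim=32, label_num=40):
        super().__init__()
        self.A_hat = A_hat  # normalized hierarchical adjacency, [label_num, label_num]
        self.ipc_attn = nn.MultiheadAttention(feat_dim, 4, batch_first=True,
                                              kdim=cls_dim, vdim=cls_dim)
        self.cpc_attn = nn.MultiheadAttention(feat_dim, 4, batch_first=True,
                                              kdim=cls_dim, vdim=cls_dim)
        self.fc_in = nn.Linear(2 * feat_dim + ent_dim, label_num)   # after the concatenate layer
        self.gcn = nn.Linear(label_num, label_num)                  # one graph-convolution layer
        self.fc_out = nn.Linear(label_num, label_num)

    def forward(self, global_feat, ipc_feat, cpc_feat, ent_feat):
        # global_feat: [B, L, feat_dim]; ipc_feat/cpc_feat: [B, levels, cls_dim]; ent_feat: [B, ent_dim]
        ipc_aware, _ = self.ipc_attn(global_feat, ipc_feat, ipc_feat)   # focus on IPC numbers
        cpc_aware, _ = self.cpc_attn(global_feat, cpc_feat, cpc_feat)   # focus on CPC numbers
        fused = torch.cat([ipc_aware.mean(1), cpc_aware.mean(1), ent_feat], dim=-1)
        h = F.relu(self.fc_in(fused))                  # fully connected layer
        h = F.relu(self.gcn(h @ self.A_hat))           # propagate over the label graph
        return F.softmax(self.fc_out(h), dim=-1)       # first classification result
```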
S5, fusing the label features and the global associated features, and then passing through a softmax layer to obtain a second classification result.
The label features and the global associated features are input into an attention model to obtain label-aware global associated features; the label-aware global associated features are input into a fully connected layer for dimensionality reduction, and then into a softmax layer to obtain the second classification result.
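Correspondingly, a sketch of this second branch (label features attending to the global associated features, a dimension-reducing fully connected layer, and softmax); the dimensions and the per-label scoring layer are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondBranch(nn.Module):
    def __init__(self, feat_dim=256, label_dim=300, label_num=40):
        super().__init__()
        # Label features act as queries over the text's global associated features.
        self.attn = nn.MultiheadAttention(label_dim, 4, batch_first=True,
                                          kdim=feat_dim, vdim=feat_dim)
        self.fc = nn.Linear(label_dim, 1)   # dimensionality reduction: one score per label

    def forward(self, label_feat, global_feat):
        # label_feat: [label_num, label_dim] from the GCN; global_feat: [B, L, feat_dim]
        batch = global_feat.size(0)
        queries = label_feat.unsqueeze(0).expand(batch, -1, -1)        # [B, label_num, label_dim]
        label_aware, _ = self.attn(queries, global_feat, global_feat)  # label-aware text features
        logits = self.fc(label_aware).squeeze(-1)                      # [B, label_num]
        return F.softmax(logits, dim=-1)                               # second classification result
```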
S6, fusing the first classification result and the second classification result to obtain a final multi-level classification result.
The first classification result and the second classification result are added to obtain the final multi-level classification result. Specifically, the fusion weights may be predetermined or obtained by training.
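A small sketch of this fusion step, with the fusion weight modeled as a learnable scalar (the initialization value 0.5 is an assumption; a fixed, predetermined weight works the same way):

```python
import torch
import torch.nn as nn

class ResultFusion(nn.Module):
    """Weighted addition of the first and second classification results."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learned during training, or fixed

    def forward(self, first_result, second_result):
        return self.alpha * first_result + (1 - self.alpha) * second_result
```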
The first classification result, the second classification result and the multi-level classification result are all technical fields.
The method provided by the application has the following technical effects:
1) The label vector is input into the graph convolution neural network to obtain label characteristics with hierarchical relations, so that a technical field classification result with hierarchical relations is obtained.
2) The global association features and the multi-level classification number features are fused and classified, so that the prior knowledge of the existing classification number is utilized to obtain a more accurate technical field classification result.
3) By fusing the first classification result with the second classification result, more accurate classification can be performed in combination with the classification number, the hierarchical relationship and the global association feature of the patent text.
4) A label structure representation with good performance is obtained by using the GCN model.
Example two
Fig. 2 is a schematic diagram of a plurality of model structures corresponding to fig. 1 according to an embodiment of the present application.
The word vector model takes X1, X2, X3, ..., Xn as input-layer nodes, with input dimension [batch_size, seq_len]; the first features output by the word vector model have dimension [batch_size, seq_len, embedding_dim]. The second features obtained via Bi-LSTM have dimension [batch_size, seq_len, 2*hidden_dim], and the plurality of global associated features obtained via self-attention have dimension [batch_size, seq_len, 2*hidden_dim].
The IPC/CPC classification numbers are mapped into continuous vectors and then input, together with the global associated features, into attention models to obtain features of dimensions [batch_size, CPC_label_level_num, 2*hidden_dim] and [batch_size, IPC_label_level_num, 2*hidden_dim] respectively, which are input to a fully connected layer. Illustratively, an IPC classification number (primary, secondary, tertiary), e.g. B (performing operations; transporting) Index1 -> B60 (vehicles in general) Index2 -> B60Q (arrangement of signalling or lighting devices for vehicles in general, the mounting or supporting thereof, or circuits therefor) Index3, yields 3 features. A CPC classification number (primary, secondary, tertiary), e.g. H (ELECTRICITY) Index1 -> H01 (BASIC ELECTRIC ELEMENTS) Index2 -> H01C (RESISTORS) Index3, likewise yields 3 features.
The enterprise feature is mapped to a feature vector [batch_size, ipc_label_num], which is input to a fully connected layer. The category-type embedded representation feature of the enterprise is processed by an embedding layer and then combined with this feature vector [batch_size, ipc_label_num], the global associated feature focusing on the multi-level IPC classification number and the global associated feature focusing on the multi-level CPC classification number; these are input together into the concatenate layer, and then pass through a fully connected layer, the GCN, another fully connected layer and a softmax layer to obtain the first classification result.
The label vectors [label_num, label_dim] are randomly initialized; for example, if there are 40 labels and each label vector has dimension 300, the input is [40, 300]. The label vectors are input into the GCN model to obtain label features [label_num, label_dim], which contain the level and relation information of the labels. After the attention model, a feature of dimension [batch_size, label_num, 2*hidden_dim] is obtained; it is input into a fully connected layer and then a softmax layer to obtain the second classification result, and the first classification result and the second classification result are added to obtain the final classification result.
Here, batch_size is the number of patent texts in a batch; seq_len is the word-segmentation length; embedding_dim is the word vector dimension; hidden_dim is the hidden vector dimension of the Bi-LSTM; label_num is the number of labels; label_dim is the vector dimension of each label; CPC_label_level_num is the number of CPC classification-number levels; IPC_label_level_num is the number of IPC classification-number levels; and ipc_label_num is the number of IPC classification numbers of the patents published by the enterprise.
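For reference, the dimension bookkeeping above can be written out as a small runnable sketch; all concrete sizes below are illustrative assumptions.

```python
import torch

batch_size, seq_len, embedding_dim, hidden_dim = 8, 256, 300, 128
label_num, label_dim = 40, 300
ipc_label_level_num, cpc_label_level_num, ipc_label_num = 3, 3, 120

word_ids    = torch.randint(0, 30000, (batch_size, seq_len))                # input word IDs
first_feat  = torch.randn(batch_size, seq_len, embedding_dim)               # after the word vector model
second_feat = torch.randn(batch_size, seq_len, 2 * hidden_dim)              # after Bi-LSTM
global_feat = torch.randn(batch_size, seq_len, 2 * hidden_dim)              # after self-attention
ipc_aware   = torch.randn(batch_size, ipc_label_level_num, 2 * hidden_dim)  # attention over IPC numbers
cpc_aware   = torch.randn(batch_size, cpc_label_level_num, 2 * hidden_dim)  # attention over CPC numbers
ent_feat    = torch.randn(batch_size, ipc_label_num)                        # enterprise feature vector
label_vecs  = torch.randn(label_num, label_dim)                             # e.g. 40 labels x 300 dims
label_aware = torch.randn(batch_size, label_num, 2 * hidden_dim)            # after the label attention model
```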
The word vector model, the attention models, the Bi-LSTM, the self-attention, the fully connected layers, the softmax layers and the GCN shown in FIG. 2 constitute one large model that is trained end to end on the input patent text samples and their technical-field classification results.
Example III
Fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present application. The electronic device 500 comprises a bus 501, a processor 502, a communication interface 503 and a memory 504. The processor 502, the memory 504 and the communication interface 503 communicate via a bus 501.
Bus 501 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 3, but this does not mean that there is only one bus or only one type of bus.
The processor 502 may be any one or more of a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), a Microprocessor (MP), or a digital signal processor (digital signal processor, DSP).
The communication interface 503 is for communicating with the outside.
The memory 504 may include volatile memory, such as random access memory (RAM). The memory 504 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
The memory 504 stores executable code, and the processor 502 executes the code to perform the patent text multi-level classification method based on a graph neural network.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. As used in this specification, the terms "a," "an," "the," and/or "the" are not intended to be limiting, but rather are to be construed as covering the singular and the plural, unless the context clearly dictates otherwise. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method or apparatus comprising such elements.
It should also be noted that the positional or positional relationship indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the positional or positional relationship shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present application. Unless specifically stated or limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present application will be understood in specific cases by those of ordinary skill in the art.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the essence of the corresponding technical solutions from the technical solutions of the embodiments of the present application.

Claims (7)

1. The patent text multi-level classification method based on the graph neural network is characterized by comprising the following steps of:
s1, extracting a plurality of words, multi-level classification number features and enterprise features from a patent text to be classified;
s2, randomizing a label vector, and inputting the label vector into a graph convolution neural network to combine a hierarchical adjacency matrix with prior probability to obtain a label feature with a hierarchical relationship;
s3, extracting word embedding vectors of the words by adopting a feature extraction module, and extracting features by using a Bi-LSTM+self-intent model to obtain global associated features;
s4, fusing the global associated features, namely the multi-level classification number features of the to-be-classified patent and enterprise features by using the degree, and then obtaining a first classification result through a graph convolution neural network and a softmax layer;
s5, fusing the tag features and the global associated features, and then passing through a softmax layer to obtain a second classification result;
s6, fusing the first classification result and the second classification result to obtain a final multi-stage classification result.
2. The method of claim 1, wherein the feature extraction module comprises: an embedding layer, a Bi-LSTM and a self-attention model;
the step S3 comprises the following steps:
inputting the words in parallel to the embedding layer to obtain a plurality of first features;
inputting the first features into the Bi-LSTM to obtain second features;
and inputting the second features into the self-attention model to obtain a plurality of global associated features.
3. The method of claim 1, wherein the multi-level classification number comprises at least a multi-level IPC classification number and a multi-level CPC classification number;
the step S4 comprises the following steps:
inputting the multi-level classification number features of different types and the global association features into different attention models to obtain global association features focusing on the multi-level IPC classification number and global association features focusing on the multi-level CPC classification number;
and processing the global associated features focusing on the multi-level IPC classification number, the global associated features focusing on the multi-level CPC classification number and the enterprise features through a concatenate layer, a fully connected layer, a graph convolutional neural network, another fully connected layer and a softmax layer to obtain a first classification result.
4. The method of claim 3, wherein the features of the enterprise to which the patent text to be classified belongs include: the normalized proportion of the number of patents published by the enterprise under each classification number to the total number of patents of the enterprise, and/or a category-type embedded representation feature of the enterprise.
5. The method according to claim 1, wherein S6 comprises:
and adding the first classification result and the second classification result to obtain a final multi-level classification result.
6. The method according to claim 1, wherein initial parameters of the graph convolutional neural network are determined according to the hierarchical relations and distribution of the technical fields, and the initial parameters are updated during training;
the first classification result, the second classification result and the multi-level classification result are all technical fields.
7. An electronic device, comprising:
a memory and a processor;
wherein one or more computer programs are stored in the memory, the one or more computer programs comprising instructions; the instructions, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 6.
CN202311187657.9A 2023-09-15 2023-09-15 Patent text multi-level classification method and equipment based on graph neural network Active CN116932765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311187657.9A CN116932765B (en) 2023-09-15 2023-09-15 Patent text multi-level classification method and equipment based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311187657.9A CN116932765B (en) 2023-09-15 2023-09-15 Patent text multi-level classification method and equipment based on graph neural network

Publications (2)

Publication Number Publication Date
CN116932765A true CN116932765A (en) 2023-10-24
CN116932765B CN116932765B (en) 2023-12-08

Family

ID=88386413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311187657.9A Active CN116932765B (en) 2023-09-15 2023-09-15 Patent text multi-level classification method and equipment based on graph neural network

Country Status (1)

Country Link
CN (1) CN116932765B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113849655A (en) * 2021-12-02 2021-12-28 江西师范大学 Patent text multi-label classification method
CN113946678A (en) * 2021-09-26 2022-01-18 广州市伟时信息系统技术有限公司 Construction method of hierarchical classification model of government procurement items
CN114218389A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Long text classification method in chemical preparation field based on graph neural network
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114896388A (en) * 2022-03-07 2022-08-12 武汉理工大学 Hierarchical multi-label text classification method based on mixed attention
CN115238076A (en) * 2022-08-03 2022-10-25 江西理工大学 Method, device and storage medium for improving multi-level patent text classification effect

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946678A (en) * 2021-09-26 2022-01-18 广州市伟时信息系统技术有限公司 Construction method of hierarchical classification model of government procurement items
CN113849655A (en) * 2021-12-02 2021-12-28 江西师范大学 Patent text multi-label classification method
CN114218389A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Long text classification method in chemical preparation field based on graph neural network
CN114896388A (en) * 2022-03-07 2022-08-12 武汉理工大学 Hierarchical multi-label text classification method based on mixed attention
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN115238076A (en) * 2022-08-03 2022-10-25 江西理工大学 Method, device and storage medium for improving multi-level patent text classification effect

Also Published As

Publication number Publication date
CN116932765B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
CN107403198B (en) Official website identification method based on cascade classifier
AU2019219746A1 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
CN110298042A (en) Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN112801010A (en) Visual rich document information extraction method for actual OCR scene
US11055327B2 (en) Unstructured data parsing for structured information
CN109885824A (en) A kind of Chinese name entity recognition method, device and the readable storage medium storing program for executing of level
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN107844560A (en) A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing
CN111159385A (en) Template-free universal intelligent question-answering method based on dynamic knowledge graph
CN112231431B (en) Abnormal address identification method and device and computer readable storage medium
CN112883724A (en) Text data enhancement processing method and device, electronic equipment and readable storage medium
CN112417150A (en) Industry classification model training and using method, device, equipment and medium
CN115310443A (en) Model training method, information classification method, device, equipment and storage medium
CN113434688A (en) Data processing method and device for public opinion classification model training
CN111428501A (en) Named entity recognition method, recognition system and computer readable storage medium
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN115730581A (en) Method and device for extracting project and company basic information in bid document
CN115269834A (en) High-precision text classification method and device based on BERT
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN108549727B (en) User profit information pushing method based on web crawler and big data analysis
CN106598952A (en) System for detecting Chinese fuzzy constraint information scope based on convolutional neural network
CN113553431A (en) User label extraction method, device, equipment and medium
CN116932765B (en) Patent text multi-stage classification method and equipment based on graphic neural network
CN113360654A (en) Text classification method and device, electronic equipment and readable storage medium
CN113379169B (en) Information processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant