CN115758244A - Chinese patent IPC classification method based on SBERT - Google Patents
- Publication number
- CN115758244A (application CN202211445354.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- path
- sbert
- corpus data
- ipc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Chinese patent IPC classification method based on SBERT, comprising the following steps. Corpus data preprocessing: extract specific words from the patent text to form a first path of corpus data, and extract, class by class, term descriptions that express the corresponding classes from the IPC classification table as a second path of corpus data. Data enhancement: apply data augmentation to the corpus data. Text vectorization coding: input the first and second paths of corpus data into the first and second BERT pre-training models under the SBERT framework, respectively, to obtain vector representations of the patent text. Similarity comparison: select classification numbers according to the ranking of the computed similarities. The invention adopts SBERT, with its twin (Siamese) structure, as the framework for automatic Chinese patent classification: the patent text and the term descriptions in the IPC classification table serve as the two inputs to SBERT, both are vectorized by BERT, and the patent category is determined by how close the two vectors are. This reduces the amount of computation, improves classification accuracy, and can provide several related IPC classification numbers.
Description
Technical Field
The invention relates to the technical field of patent IPC classification, in particular to a Chinese patent IPC classification method based on SBERT.
Background
With the rapid development of science and technology, the number of patent applications worldwide increases year by year. During examination, applications must be classified by technical field for statistics and management; that is, each granted patent is assigned a class in the International Patent Classification (IPC) according to its technical content. At present this classification work still depends on manual effort, which imposes a huge workload on patent examiners. It is therefore worth studying how natural language processing can mine the semantic information in patent application texts to automate patent classification.
Automatic patent classification is currently realized mainly with deep learning networks. Fig. 1 shows a classification structure adopted in the prior art. It consists of two parts: a pre-training model and a Text-CNN classification head. The pre-training model produces a vectorized representation of the patent text, and the Text-CNN head performs the classification.
Using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model for text vectorization is the best-performing patent classification scheme at the current stage. BERT adopts a bidirectional Transformer encoder and learns word-level and sentence-level semantics through a masked language model and next-sentence prediction, respectively. A model trained this way gives BERT strong representational power for both sentences and words. It performs excellently in word-level Natural Language Processing (NLP) tasks such as named entity recognition, and in sentence-level NLP tasks such as question answering. The vector representation produced by BERT encoding captures latent deep semantic and syntactic information. The pre-training model is obtained by unsupervised pre-training on massive amounts of text and can subsequently be fine-tuned for a specific task.
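The masked-language-model objective mentioned above can be illustrated with a minimal sketch. This is a toy masking routine, not BERT's actual implementation (which also keeps or randomizes some selected positions at an 80/10/10 ratio):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Toy MLM masking: replace ~15% of the tokens with [MASK].

    Real BERT additionally keeps or randomizes some of the selected
    positions (80% [MASK] / 10% random token / 10% unchanged)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = "[MASK]"
    return masked, positions

masked, positions = mask_tokens(["the", "patent", "text", "is", "classified",
                                 "into", "an", "IPC", "class", "number"])
```

The model is then trained to predict the original tokens at the masked positions from the surrounding bidirectional context.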
Text-CNN is a variant of the CNN adapted for text classification tasks. As shown in FIG. 1, the right part of the figure depicts the hierarchy of Text-CNN, which mainly comprises convolution, pooling and fully connected parts. Text-CNN uses convolution kernels of three sizes: their width equals the word-vector length, and their heights are 3, 4 and 5. Each kernel slides down the word-vector matrix from top to bottom to perform the convolution. The convolved feature maps are pooled and concatenated into a one-dimensional text feature, and a fully connected layer finally classifies the patent text.
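The convolution-and-pooling pipeline described above can be sketched in plain Python. This is a dependency-free toy with fixed kernel weights; a real Text-CNN uses many learned kernels per height:

```python
def conv_maxpool(word_vecs, kernel):
    """Slide an (h x d) kernel down the (n x d) word-vector matrix
    (kernel width = word-vector length, as in Text-CNN), then 1-max pool."""
    h, d = len(kernel), len(kernel[0])
    scores = []
    for i in range(len(word_vecs) - h + 1):
        s = sum(kernel[r][c] * word_vecs[i + r][c]
                for r in range(h) for c in range(d))
        scores.append(s)
    return max(scores)  # max pooling over sliding positions

def text_cnn_features(word_vecs, kernels):
    """One pooled value per kernel, concatenated into a 1-D text feature."""
    return [conv_maxpool(word_vecs, k) for k in kernels]

# toy 4-word text with 2-dim word vectors and one height-3 kernel of ones
vecs = [[1, 0], [0, 1], [1, 1], [2, 0]]
feat = text_cnn_features(vecs, [[[1, 1]] * 3])
```

In the real model the concatenated features feed a fully connected layer that outputs class scores.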
The prior art adopts the currently best BERT pre-training model for patent text representation and obtains good patent-classification performance. However, it has the following defects when automating IPC classification:
1) On one hand, the existing method uses only historical patent data as prior knowledge and does not exploit the classification terms provided in the IPC classification table. The prior-art scheme is a typical deep-learning black-box method: the deep network acquires some discrimination ability by learning from a large amount of historical patent data, but the IPC classification table and the rule terms published by the national patent office, which are also prior knowledge, go unused.
2) On the other hand, the prior art gives only one main classification number when automatically classifying a patent text. In practice, when a patent is classified, 2-5 classification numbers are often assigned in addition to the main one. When an invention patent involves several distinct technical subjects that all constitute invention information, it should be classified under each subject and given multiple classification numbers, with the number that most fully represents the invention placed first as the main classification number. The prior art cannot provide multiple classification numbers when a patent involves multiple technical subjects.
To address these two defects of the prior art, the invention provides an SBERT-based method for automatic Chinese patent IPC classification.
Disclosure of Invention
The invention aims to provide an SBERT-based Chinese patent IPC classification method. It adopts SBERT, with its twin (Siamese) structure, as the framework for automatic Chinese patent classification: the patent text and the classification terms in the IPC classification table are fed into SBERT as two paths of data, each path is vectorized by BERT, and the patent category is judged from the similarity of the two resulting vectors. This reduces the amount of computation, improves classification accuracy, and can provide several IPC classification numbers.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
a Chinese patent IPC classification method based on SBERT comprises the following steps:
corpus data preprocessing: extracting specific words from the patent text to form a first path of corpus data, and extracting, class by class, term descriptions that express the corresponding classes from the IPC classification table as a second path of corpus data; the patent text may be a target patent text to be classified or a training sample, the training samples being obtained from historical patent texts;
performing data enhancement processing on the corpus data;
text vectorization coding: respectively inputting the augmented first and second paths of corpus data into the first and second BERT pre-training models under the SBERT framework for vectorization coding to obtain vector representations of the patent text, where the vector representations corresponding to the first path of corpus data form a feature set U and those corresponding to the second path form a feature set V;
similarity comparison: calculating similarity values between feature set U and feature set V to obtain a similarity ranking over the term descriptions of the different classes, selecting the IPC classification number whose term description ranks first as the main IPC classification number, and the IPC classification numbers whose term descriptions rank Nth (N > 1) as optional alternative IPC classification numbers.
Further, the data enhancement processing of the corpus data is specifically: during SBERT model training, data enhancement of the sample data is realized with the Dropout method, by feeding the same text through the BERT pre-training model multiple times.
Further, the vectorization encoding process is: each sentence of text is vectorized separately to obtain its own vector representation, and the vector representations of all sentences form the vector representation of the patent text.
Further, calculating the similarity values of feature set U and feature set V is specifically: computing the cosine similarity or Euclidean distance between feature set U and feature set V to obtain the similarity values.
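The two similarity measures named above can be written out directly (a minimal sketch; in practice these would be evaluated for every pair of vectors drawn from U and V):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = u.v / (|u| * |v|); higher means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance; lower means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity is scale-invariant (parallel vectors score 1 regardless of length), which is why it is the usual choice for comparing sentence embeddings.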
Further, the text vectorization encoding process further includes performing average pooling on the vector representations of the patent texts.
Further, the specific words are patent titles and abstracts.
After the scheme is adopted, the invention has the following beneficial effects:
1. On one hand, historical patent text data is used as the first path of corpus data; on the other hand, the term descriptions of the IPC classification table are fully used as the second path. For sentence-pair tasks, SBERT reduces the computation cost compared with BERT, while allowing each of the two text paths to use the maximum text length, with each sentence vectorized independently. Compared with the previous black-box method, this Chinese patent classification method is markedly more accurate and requires far less computation, so classification is faster and more precise.
2. Finally, through similarity comparison, the invention can provide TopN classification numbers as alternatives in addition to the Top1 main classification number. These alternatives give human reviewers room to choose. Moreover, TopN classification numbers with high similarity values reveal that the patent involves several different technical subjects; when assigning patent classification numbers, the numbers implying those other subjects should be given in addition to the main one.
The advantage of the invention is that SBERT with a twin structure serves as the framework for automatic Chinese patent classification. The framework has two inputs: one takes the text corresponding to the patent to be classified (title and abstract), the other takes the term descriptions in the IPC classification table. For a patent to be classified, the second path receives all classification terms, and similarity calculation yields a ranking over the different term descriptions; the Top1 classification number serves as the basis of classification and is selected as the main IPC classification number, while TopN classification numbers can be given as alternative IPC classification numbers representing the other subjects the patent involves.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of a prior art BERT and Text-CNN based patent classification framework;
FIG. 2 is a flow chart of a Chinese patent IPC classification method based on SBERT according to the embodiment of the present invention;
FIG. 3 is a block diagram of a Chinese patent classification framework based on SBERT according to an embodiment of the present invention;
FIG. 4 is a schematic representation of a patent vector based on a pre-trained model.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The BERT described in this embodiment, whose full name is Bidirectional Encoder Representations from Transformers, is a pre-trained language representation model. Instead of pre-training with a traditional one-way language model or by shallowly concatenating two one-way language models as before, it uses a Masked Language Model (MLM) so as to generate deep bidirectional language representations.
The SBERT described in this embodiment, also called Sentence-BERT, is a Siamese (twin) network whose sub-networks are BERT models sharing parameters. To compare the similarity of sentences A and B, the two sentences are input into the two BERT networks respectively, two vectors representing the sentences are output, and their similarity is then calculated. On the same principle, vector clustering can be used to realize unsupervised learning tasks.
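The parameter-sharing idea can be sketched as follows: both branches call the same encoder function, which stands in for the shared BERT weights. The encoder here is a hypothetical character-hashing stub, not a real BERT:

```python
import math

def shared_encoder(text, dim=16):
    """Stand-in for BERT: a deterministic bag-of-characters embedding.
    Both SBERT branches call this SAME function, i.e. shared parameters."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def sbert_similarity(sentence_a, sentence_b):
    """Encode each sentence independently, then compare with cosine."""
    u, v = shared_encoder(sentence_a), shared_encoder(sentence_b)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))
```

Because each sentence is encoded on its own, the resulting vectors can be cached and reused across comparisons, which is the source of SBERT's efficiency.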
As shown in fig. 2, an embodiment of the present invention provides a chinese patent IPC classification method based on SBERT, including the following steps:
S10, corpus data preprocessing: extract specific words from the patent text to form a first path of corpus data, and extract, class by class, term descriptions expressing the corresponding classes from the IPC classification table as a second path of corpus data. The patent texts may be target patent texts to be classified or training samples, the training samples being obtained from historical patent texts. For the first path, the extracted words are generally the patent title and abstract. For the second path, taking Section A of the IPC classification table as an example, terms are extracted from the Chinese definition column at the subclass level, then merged and sorted; the processed result is shown in Table 1:
TABLE 1 description of the terms of the subclass under section A
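The two-path preprocessing can be sketched as below; the subclass codes and term strings are illustrative placeholders, not entries taken from the actual Table 1:

```python
def first_path_corpus(patent):
    """First path: the specific words of a patent -- its title plus abstract."""
    return patent["title"] + "。" + patent["abstract"]

# Second path: per-class term descriptions extracted from the IPC table.
# These entries are hypothetical examples for Section A subclasses.
second_path_corpus = {
    "A01B": "soil working in agriculture or forestry",
    "A61K": "preparations for medical, dental or toilet purposes",
}

patent = {"title": "一种农用耕作装置", "abstract": "本发明涉及土壤耕作技术"}
text = first_path_corpus(patent)
```

At classification time, `text` is compared against every value in `second_path_corpus`, and the keys of the best-matching entries become the candidate IPC classification numbers.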
S20, data enhancement of the corpus data. During model training, when training samples are few or unbalanced, the training data must be augmented to prevent over-fitting and improve model accuracy. In this embodiment the positive samples are far fewer than the negative ones, so data enhancement is needed. Specifically, during SBERT training, training samples are augmented with Dropout: because the Dropout masks in the SBERT model are random, the same text yields different vector representations on different passes through the BERT model. This works better than more complex augmentation methods such as word deletion, or replacement based on synonyms or a masked language model, because deletion or replacement can change the original meaning of the text, whereas samples generated by the Dropout method are semantically identical to the originals while their vector representations differ.
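The Dropout augmentation above can be illustrated with a toy stochastic pass, mimicking how two forward passes of the same text through BERT with different random Dropout masks yield two different vectors (the base vector here is synthetic, not a real BERT embedding):

```python
import random

def dropout_pass(vec, p=0.1, seed=None):
    """One stochastic forward pass: drop each unit with probability p
    and rescale the survivors by 1/(1-p) (inverted dropout)."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]

base = [1.0] * 32                     # stand-in for a sentence embedding
view_a = dropout_pass(base, seed=1)   # first pass  -> one representation
view_b = dropout_pass(base, seed=2)   # second pass -> another representation
# (view_a, view_b) form a positive training pair: same meaning, different vectors
```

This is the same mechanism popularized by contrastive sentence-embedding methods: the two dropout views of one text are pulled together while views of different texts are pushed apart.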
S30, text vectorization coding: the augmented first and second paths of corpus data are input into the first and second BERT pre-training models under the SBERT framework, respectively. In the vectorization process each sentence of text is encoded separately to obtain its own vector representation, and the vector representations of all sentences form the vector representation of the patent text; the representations corresponding to the first path of corpus data form feature set U, and those corresponding to the second path form feature set V.
referring to fig. 3, based on the SBERT framework, for a sentence-pair task, the SBERT model can reduce the computation overhead compared to the BERT model, and at the same time, allows both the two paths of texts (the first path of corpus data and the second path of corpus data) to be input with the maximum text length. When using the BERT model, two sentence texts need to be stitched together to form a sentence pair using the symbol [ SEP ], and the number of sentence pairs (i.e., vectorized encoding times) is the square of the number of sentences. And by using an SBERT model, each sentence text is independently vectorized and coded, and finally similarity comparison is carried out by calculating cosine similarity or Euclidean distance and other methods. Therefore, compared with the BERT model, the invention adopts the SBERT model to vector the coding times less.
In addition, because the two input paths are vectorized by BERT separately on the two branches (the first and second BERT pre-training models), each path can use the maximum text length allowed by the BERT pre-training model. Within the SBERT model, sentence-level text can be vectorized in different ways. As shown in FIG. 4, each word of the patent text input into the pre-training model obtains a vector representation; R[CLS] represents the whole text, but the vector R[CLS] alone is often not accurate enough as a whole-text representation. This embodiment therefore further adopts average pooling, i.e., taking the mean of all word-vector representations by adding an average pooling layer at the output, to improve the quality of the vectorized encoding.
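The average-pooling layer described above reduces the matrix of token vectors to a single sentence vector; a minimal sketch (the token vectors are synthetic):

```python
def mean_pool(token_vecs):
    """Average-pool token vectors (n x d) into one d-dim sentence vector,
    used here in place of relying on the [CLS] vector R[CLS] alone."""
    n, d = len(token_vecs), len(token_vecs[0])
    return [sum(vec[i] for vec in token_vecs) / n for i in range(d)]

# three tokens, 2-dim vectors -> one 2-dim sentence vector
sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

In a real implementation the mean is usually taken only over non-padding tokens, weighting by the attention mask.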
S40, similarity comparison: calculate the similarity values between feature set U and feature set V to obtain a similarity ranking over the term descriptions of the different classes; select the IPC classification number whose term description ranks first (Top1) as the main IPC classification number, and the IPC classification numbers whose term descriptions rank Nth (TopN, N a positive integer, N > 1) as optional alternative IPC classification numbers. Specifically, the similarity values can be obtained by computing the cosine similarity or Euclidean distance between feature set U and feature set V.
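Step S40 can be sketched end-to-end: score the patent vector against every class-description vector and rank (the class codes and 2-dim vectors below are illustrative placeholders):

```python
import math

def rank_ipc_classes(patent_vec, class_vecs, top_n=3):
    """Return class numbers sorted by cosine similarity to the patent vector.
    The first entry is the main IPC number; the rest are alternatives."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return sorted(class_vecs,
                  key=lambda c: cos(patent_vec, class_vecs[c]),
                  reverse=True)[:top_n]

ranking = rank_ipc_classes([1.0, 0.0],
                           {"A01B": [1.0, 0.0],
                            "A61K": [0.0, 1.0],
                            "G06F": [1.0, 1.0]})
```

`ranking[0]` would be taken as the main IPC classification number and `ranking[1:]` as the optional alternatives offered to the examiner.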
Based on the above embodiment, the invention adopts SBERT with a twin structure as the framework for automatic Chinese patent classification. The framework has two inputs: one takes the text corresponding to the patent to be classified (title and abstract), the other takes the term descriptions in the IPC classification table. For a patent to be classified, the second path receives all classification terms, and similarity calculation yields a ranking over the different term descriptions; the Top1 classification number serves as the basis of classification and is selected as the main IPC classification number, while TopN classification numbers can be given as alternative IPC classification numbers representing the other subjects the patent involves.
In the description of the specification, references to "one embodiment," "some embodiments," "an example," "a specific example" or "an alternative embodiment" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-described embodiments do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the above-described embodiments should be included in the protection scope of the technical solution.
Claims (6)
1. A Chinese patent IPC classification method based on SBERT is characterized by comprising the following steps:
corpus data preprocessing: extracting specific words from the patent text to form a first path of corpus data, and extracting, class by class, term descriptions that express the corresponding classes from the IPC classification table as a second path of corpus data; the patent text may be a target patent text to be classified or a training sample, the training samples being obtained from historical patent texts;
performing data enhancement processing on the corpus data;
text vectorization coding: respectively inputting the augmented first and second paths of corpus data into the first and second BERT pre-training models under the SBERT framework for vectorization coding to obtain vector representations of the patent text, where the vector representations corresponding to the first path of corpus data form a feature set U and those corresponding to the second path form a feature set V;
similarity comparison: calculating similarity values between feature set U and feature set V to obtain a similarity ranking over the term descriptions of the different classes, selecting the IPC classification number whose term description ranks first as the main IPC classification number, and the IPC classification numbers whose term descriptions rank Nth (N > 1) as optional alternative IPC classification numbers.
2. The SBERT-based Chinese patent IPC classification method of claim 1, wherein the data enhancement processing of the corpus data is specifically: during SBERT model training, data enhancement of the sample data is realized with the Dropout method, by feeding the same text through the BERT pre-training model multiple times.
3. The SBERT-based Chinese patent IPC classification method of claim 1, wherein the vectorization coding process is as follows: each sentence of text is vectorized separately to obtain its own vector representation, and the vector representations of all sentences form the vector representation of the patent text.
4. The SBERT-based Chinese patent IPC classification method of claim 1, wherein calculating the similarity values of feature set U and feature set V is specifically: computing the cosine similarity or Euclidean distance between feature set U and feature set V to obtain the similarity values.
5. The SBERT-based Chinese patent IPC classification method of claim 1, wherein: the text vectorization coding process also comprises the step of carrying out average pooling on the vector representation of the patent text.
6. The SBERT-based Chinese patent IPC classification method of claim 1, wherein: the specific words are patent titles and abstracts.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211445354.8A CN115758244A (en) | 2022-11-18 | 2022-11-18 | Chinese patent IPC classification method based on SBERT |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115758244A true CN115758244A (en) | 2023-03-07 |
Family
ID=85373160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211445354.8A Pending CN115758244A (en) | 2022-11-18 | 2022-11-18 | Chinese patent IPC classification method based on SBERT |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115758244A (en) |
- 2022-11-18 CN CN202211445354.8A patent/CN115758244A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117371576A (en) * | 2023-09-12 | 2024-01-09 | 哈尔滨工业大学 | Patent authority prediction method, system, equipment and storage medium |
CN116912047A (en) * | 2023-09-13 | 2023-10-20 | 湘潭大学 | Patent structure perception similarity detection method |
CN116912047B (en) * | 2023-09-13 | 2023-11-28 | 湘潭大学 | Patent structure perception similarity detection method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112395393B (en) | Remote supervision relation extraction method based on multitask and multiple examples | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
Perez-Martin et al. | Improving video captioning with temporal composition of a visual-syntactic embedding | |
CN108197109A (en) | A kind of multilingual analysis method and device based on natural language processing | |
CN110765260A (en) | Information recommendation method based on convolutional neural network and joint attention mechanism | |
CN115758244A (en) | Chinese patent IPC classification method based on SBERT | |
CN108280064A (en) | Participle, part-of-speech tagging, Entity recognition and the combination treatment method of syntactic analysis | |
CN110362819B (en) | Text emotion analysis method based on convolutional neural network | |
CN110909736B (en) | Image description method based on long-term and short-term memory model and target detection algorithm | |
CN116701431A (en) | Data retrieval method and system based on large language model | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN115687626A (en) | Legal document classification method based on prompt learning fusion key words | |
CN112818110B (en) | Text filtering method, equipment and computer storage medium | |
CN112926340B (en) | Semantic matching model for knowledge point positioning | |
CN112861524A (en) | Deep learning-based multilevel Chinese fine-grained emotion analysis method | |
CN108170848A (en) | A kind of session operational scenarios sorting technique towards China Mobile's intelligent customer service | |
CN116756303A (en) | Automatic generation method and system for multi-topic text abstract | |
CN111008530A (en) | Complex semantic recognition method based on document word segmentation | |
CN114265937A (en) | Intelligent classification analysis method and system of scientific and technological information, storage medium and server | |
CN115858750A (en) | Power grid technical standard intelligent question-answering method and system based on natural language processing | |
CN114881043B (en) | Deep learning model-based legal document semantic similarity evaluation method and system | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN117932066A (en) | Pre-training-based 'extraction-generation' answer generation model and method | |
CN110674293A (en) | Text classification method based on semantic migration | |
CN113297380A (en) | Text classification algorithm based on self-attention mechanism and convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||