CN107122382B - Patent classification method based on specification - Google Patents

Patent classification method based on specification

Info

Publication number
CN107122382B
CN107122382B (application CN201710082677.8A)
Authority
CN
China
Prior art keywords
patents
class
ipc
calculating
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710082677.8A
Other languages
Chinese (zh)
Other versions
CN107122382A (en)
Inventor
朱玉全
金健
佘远程
石亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201710082677.8A priority Critical patent/CN107122382B/en
Publication of CN107122382A publication Critical patent/CN107122382A/en
Application granted granted Critical
Publication of CN107122382B publication Critical patent/CN107122382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a specification-based patent classification method, belonging to the field of text processing and data mining. The patent specification is first preprocessed; an inverted index file is then constructed, and feature words are selected with a feature selection method that combines information gain and word frequency; the weights of the feature words are calculated with an improved TF-IDF formula and patent feature vectors are constructed; a neighborhood set is then built for the training patents; and the patents are finally classified with an optimized KNN classifier. The method provides a new approach to classifying patent documents and lays a foundation for further research on intelligent retrieval of patent documents.

Description

Patent classification method based on specification
Technical Field
The invention relates to the application of computer analysis technology to patent documents, and in particular to a patent classification method that uses the patent specification.
Background
A patent is a concrete expression of technical innovation and enterprise value, and one of the important carriers, outcomes and sources of knowledge development and innovation; many inventions appear only in patent documents. According to statistics of the World Intellectual Property Organization (WIPO), 70-90% of the world's inventions are first disclosed in patent documents rather than in journals, papers or other media. In addition, to protect their interests, enterprises apply for patents as early as possible, so the most active and advanced technologies are often concentrated in patents, which contain 90-95% of the world's technical information. Patent documents are also written in considerable detail to facilitate examination; compared with other types of data they provide more information, are the most common record of technical innovation, and document the complete course of patent activity. They reflect not only the current state of technical activity in the various technical fields but also the development history of a specific field. Each patent document contains the specific technical solution of the filed invention, which plays an important role in enterprise innovation: an enterprise can learn about the latest research developments, avoid duplicated research, save research time and funding, draw inspiration for its own researchers, raise the starting point of innovation, and greatly shorten research schedules by referring to existing inventions.
With the continuous emergence of new research results and inventions in China, the number of patents has grown rapidly. By October 5, 2016, more than 5.98 million invention patents had been published in China, of which 2.2385 million had been granted. At an average size of 2 MB per patent, the total volume of patent data already exceeds ten terabytes. To manage these patent documents scientifically and to retrieve relevant documents quickly and conveniently, classification of patent documents is essential. At present most countries classify patent documents with the International Patent Classification (IPC), which has five hierarchical levels: section, class, subclass, main group and subgroup. The section is the highest level of the classification table; there are eight sections covering different technical fields, denoted by the single letters A to H, and each section contains a number of classes, each identified by the section letter followed by two digits, with further subdivision at the lower levels. For example, G06F21/00 denotes security arrangements for protecting computers, their components, programs or data against unauthorized activity, under Electric Digital Data Processing (G06F).
It follows that every invention patent that has been or will be published is assigned one or more classification numbers; for example, the classification number of the invention patent "A method for protecting private data in association rule mining" is G06F21/00. For an application that has not yet been filed, the classification number is unknown and must be determined. Current practice is to determine it from the field or content of the described subject matter and to rely on experts who read the application manually. With the rapid growth in patent applications (close to one million applications per year), this approach consumes large amounts of manpower and material resources, and the limits of an individual expert's knowledge make it difficult to guarantee consistent and accurate classification. The invention therefore provides a patent classification method based on the patent specification, which uses the information in published patent specifications to build a classifier or classification function and determine the class of a filed patent, thereby realizing automatic patent classification.
Disclosure of Invention
The invention aims to provide a patent classification method based on the patent specification, addressing the problem that existing patent classification methods cannot fully and effectively exploit the specification information in published invention patents. The method makes full use of the specification text and the corresponding classes of published invention patents to build classifiers or classification functions that determine the classes of newly filed patent applications, and provides corresponding optimizations for specification feature extraction and selection, classifier design, and other aspects of the construction process.
The technical scheme adopted by the invention is as follows: the patent classification method based on the patent document specification mainly comprises the following steps:
(1) Patent data preprocessing
Collect patent sample data, sample the IPC numbers, extract the specifications, and perform Chinese word segmentation and part-of-speech tagging. Symbols and numbers in the specification are removed (specifications contain a large number of paragraph numbers). Regular-expression matching is used to filter out stop words, function words, conjunctions and other words that are not useful for patent classification, and only keywords such as nouns, adjectives and verbs are retained.
(2) Building the inverted index file
Count the word frequency, position information, part-of-speech weight and inter-class distribution of each word, and construct an inverted index file from these statistics and the patent text information.
(3) Patent text feature selection
Calculate the feature values of the words from step (2) using a feature selection method that combines information gain and word frequency, sort the feature values, and select a certain number of feature words to represent the patent text.
Let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j. The feature value is calculated as shown in equation (1).
[Equation (1) appears as an image in the original]
Here TF expresses how strongly the word frequency within the patents influences feature selection. Let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2).
[Equation (2) appears as an image in the original]
IC in equation (1) expresses how dispersed a feature word is across the categories: the more dispersed the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of the feature word t_i in class c_j and TF(t_i) its total frequency; the mean of its frequency over all classes is also used, and the calculation is shown in equation (3).
[Equation (3) appears as an image in the original]
(4) Patent text vectorization
The method comprises the following steps:
① Weight calculation, as shown in equation (4).
[Equation (4) appears as an image in the original]
Here the first factor (shown as an image in the original) is the frequency of the feature word t in the given text; N denotes the number of patents in the whole patent sample set, n the number of patents in the sample set that contain the feature word t, C_t the part-of-speech weight coefficient corresponding to the part of speech of t, and P_t the position weight coefficient of t.
② Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text; the content of each patent text is represented in this way.
(5) Generating IPC hierarchy class feature vectors
The method comprises the following steps:
① Merge the category description of each subgroup into the category description of the main group to which it belongs, then perform word segmentation and stop-word removal.
② Combine the description of each main group, perform feature selection, and construct the class feature vectors at the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
③ Merge all the basic descriptions under the same subclass, perform feature selection, and construct the class feature vectors at the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
④ Merge all the basic descriptions under the same class, perform feature selection, and construct the class feature vectors at the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
(6) Constructing a neighborhood of patent samples
The method comprises the following steps:
① Calculate the similarity between the patents in the patent training set. The similarity is obtained as the cosine of the angle between the two vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is calculated as shown in equation (5).

sim(d_i, d_j) = Σ_{k=1}^{n} (W_ik × W_jk) / ( √(Σ_{k=1}^{n} W_ik²) × √(Σ_{k=1}^{n} W_jk²) )    (5)

where W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors and n is the dimension of the vectors.
② For patent d_i, sort all other patent samples d_j by similarity in descending order and select the first K samples to form the set D_i; D_i is called the neighborhood of patent d_i. The value of K is chosen according to the specific case.
(7) Similarity calculation of patents to be classified
The method comprises the following steps:
① Perform specification extraction, Chinese word segmentation, part-of-speech tagging and stop-word removal on the patent to be classified.
② Perform patent feature selection and vectorization.
③ Calculate the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and each IPC class feature vector.
④ Calculate the cosine similarity S_bj between the patent to be classified, B_j, and each patent in the patent training set.
⑤ Sort the training patents by the similarity value S_bj in descending order and select the top K patents as the neighborhood set of B_j.
(8) Classification decision
The method comprises the following steps:
① Calculate the size L(B_j, d_i) of the shared neighborhood between the patent to be classified, B_j, and each sample patent d_i, i.e. the number of patents common to the two neighborhood sets.
② Calculate the final weighted similarity between the patent to be classified and each IPC class, as shown in equation (6).
[Equation (6) appears as an image in the original]
where i denotes the category and p, k, α and β are adjustable parameters; under the system defaults p = 0.8, k = 0.95, α = 0.6 and β = 5.
③ Assign the patent to be classified to the class with the maximum similarity S(i).
The main beneficial effects of the invention are as follows:
(1) patent text feature selection aspect
Compared with the title and abstract of a patent, the content of the patent specification is richer and carries far more information. At the same time, the specification also contains a large amount of noise, and at the IPC subclass level different patents share much similar information, which hinders classification. The invention therefore improves the feature extraction and feature vectorization of the patent specification, reducing noise interference and improving the classification accuracy of patents.
(2) Design aspect of patent classification method
Because the volume of patent data is huge and the number of patent categories is extremely large, classification models whose training is slow are clearly unsuitable for patent classification. The invention therefore proposes a new nearest-neighbor classification algorithm and adds IPC description information to the classification process, further improving classification accuracy while maintaining classification speed.
Drawings
FIG. 1 is a block diagram of the structure in the embodiment of the present invention
FIG. 2 is a flow chart of constructing a patent vector space according to an embodiment of the present invention
FIG. 3 is a classification flow chart based on improved KNN in the embodiment of the present invention
Detailed Description
The patent classification method of the invention is described in detail below using patent documents as an example; the specific implementation process is as follows:
Step 1: Acquire the patent text data and perform text preprocessing on the patent specification; the preprocessing mainly consists of word segmentation and stop-word removal (a preprocessing sketch follows the sub-steps below).
① Obtain the IPC class descriptions, perform word segmentation and part-of-speech tagging on them, remove stop words, manually correct the segmentation results, and then build a user dictionary.
② Convert the format of the sampled patents and extract their specifications, add the user dictionary built in ① to the word-segmentation program, and then perform Chinese word segmentation and part-of-speech tagging on the specifications.
③ Use regular expressions to remove stop words, function words, conjunctions and other words in the patent specification that are not useful for patent classification, keeping only nouns, adjectives and verbs.
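The following Python sketch illustrates this preprocessing, assuming the jieba segmenter; the user-dictionary file name, the character filter and the set of retained part-of-speech tags are illustrative assumptions rather than part of the disclosed method.

```python
# Minimal preprocessing sketch for Step 1 (assumptions: jieba segmenter, illustrative file name).
import re
import jieba
import jieba.posseg as pseg

jieba.load_userdict("user_dict.txt")   # dictionary built from the IPC class descriptions (sub-step ①)

KEPT_POS = ("n", "v", "a")             # keep nouns, verbs and adjectives only

def preprocess_specification(text, stopwords):
    """Segment a Chinese specification and keep only content-bearing words."""
    # Remove symbols, digits and paragraph numbers such as [0012]: keep Chinese characters only.
    text = re.sub(r"[^\u4e00-\u9fa5]", " ", text)
    tokens = []
    for pair in pseg.cut(text):          # word segmentation with part-of-speech tagging
        word, flag = pair.word, pair.flag
        if word in stopwords:
            continue
        if flag and flag[0] in KEPT_POS: # discard function words, conjunctions, etc.
            tokens.append((word, flag))
    return tokens
```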
Step 2: Count the word frequency, position information, part-of-speech weight and inter-class distribution of each word, and construct an inverted index file from these statistics and the patent text information.
The inverted index file is built from the words retained in Step 1. The index consists of a vocabulary and posting lists: each vocabulary entry corresponds to one posting list, which records, for every patent in which the word occurs, the patent number together with the word frequency, the position weight and the part-of-speech weight. The position weight is calculated as follows:
[The position-weight formula appears as an image in the original]
where n is the total number of occurrences of the word in the specification and l_i is the weight of the position of the i-th occurrence; in this example the Technical Field section is weighted 1, the Background section 0.8 and all other positions 0.5. The part-of-speech weight is set to 2.5 for nouns and 1 for verbs and adjectives. The specific results are shown in Table 1, and a sketch of the index structure follows the table.
Table 1. User dictionary and inverted index (the table itself appears as an image in the original)
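A sketch of the inverted-index structure described above is given below; the posting-list field names, and the assumption that each token carries the section it came from, are illustrative.

```python
# Sketch of the Step 2 inverted index: word -> {patent_id: posting}, where each posting
# records term frequency, the position weights of the occurrences and the POS weight.
from collections import defaultdict

POSITION_WEIGHT = {"technical_field": 1.0, "background": 0.8, "other": 0.5}
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}     # noun 2.5, verb and adjective 1

def build_inverted_index(patents):
    """patents: {patent_id: [(word, pos_flag, section), ...]} as produced in Step 1."""
    index = defaultdict(dict)
    for pid, tokens in patents.items():
        for word, flag, section in tokens:
            posting = index[word].setdefault(
                pid, {"tf": 0, "pos_weights": [], "pos_tag_weight": POS_WEIGHT.get(flag[0], 1.0)})
            posting["tf"] += 1
            posting["pos_weights"].append(POSITION_WEIGHT.get(section, 0.5))
    return index
```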
Step 3: Calculate the feature value of each word using a feature selection method that combines information gain and word frequency, sort the feature values, and select a certain number of feature words to represent the patent text.
Information gain is known to favor low-frequency words, whereas applicants usually repeat particular terms to emphasize an innovation point, and such high-frequency words help classification. The invention therefore adopts a feature selection method that combines information gain with word frequency: the feature value of every word in each patent is first calculated according to equation (1), the words are then sorted by feature value in descending order, and the top 20 words are selected as the feature words of the patent (a sketch of the information-gain term follows the formulas below).
Let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j. The feature value is calculated as shown in equation (1).
[Equation (1) appears as an image in the original]
Here TF expresses how strongly the word frequency within the patents influences feature selection. Let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2).
[Equation (2) appears as an image in the original]
IC in equation (1) expresses how dispersed a feature word is across the categories: the more dispersed the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of the feature word t_i in class c_j and TF(t_i) its total frequency; the mean of its frequency over all classes is also used, and the calculation is shown in equation (3).
[Equation (3) appears as an image in the original]
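The information-gain part of this feature score can be computed from the contingency counts A_ij, B_ij, C_ij and D_ij defined above. Equation (1), which combines information gain with the TF and IC factors, is only available as an image in the original, so the sketch below shows the standard per-class information gain alone and does not reproduce the patented combination.

```python
# Standard binary information gain from the contingency counts (A, B, C, D):
# A = contains term and in class, B = contains term and not in class,
# C = lacks term and in class,  D = lacks term and not in class.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(A, B, C, D):
    N = A + B + C + D
    h_class = entropy([(A + C) / N, (B + D) / N])                      # H(c)
    h_term = entropy([A / (A + B), B / (A + B)]) if (A + B) else 0.0   # H(c | t present)
    h_noterm = entropy([C / (C + D), D / (C + D)]) if (C + D) else 0.0 # H(c | t absent)
    h_cond = (A + B) / N * h_term + (C + D) / N * h_noterm
    return h_class - h_cond                                            # IG = H(c) - H(c | t)
```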
Step 4: Calculate the weight of each patent feature word from the inverted index file using the improved TF-IDF formula, and construct the patent feature vectors.
The method specifically comprises the following steps:
① Weight calculation, as shown in equation (4).
[Equation (4) appears as an image in the original]
Here the first factor (shown as an image in the original) is the frequency of the feature word t in the given text; N denotes the number of patents in the whole patent sample set, n the number of patents in the sample set that contain the feature word t, C_t the part-of-speech weight coefficient corresponding to the part of speech of t, and P_t the position weight coefficient of t.
② Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text; the content of each patent text is represented in this way.
Because the word frequency, position weight and part-of-speech weight of each feature word are already recorded in the inverted index file, only the number of texts containing the feature word still needs to be counted, and the total number of texts is known. The specific results are shown in Table 2, and a weighting sketch follows the table.
Table 2. Patent feature vectors (the table itself appears as an image in the original)
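The sketch below illustrates one plausible reading of the improved TF-IDF weight of equation (4), multiplying term frequency, an IDF factor, the part-of-speech coefficient C_t and the position coefficient P_t; since equation (4) is only an image in the original, the exact combination and the IDF smoothing are assumptions.

```python
# Assumed "improved TF-IDF" weight: w = tf * idf * C_t * P_t, built on the index sketch above.
import math

def feature_weight(tf, N, n, c_t, p_t):
    """tf: frequency of word t in this patent; N: total patents; n: patents containing t."""
    idf = math.log((N + 1) / (n + 1))          # smoothed IDF (the smoothing is an assumption)
    return tf * idf * c_t * p_t

def patent_vector(feature_words, index, pid, N):
    """Weight vector (w_1, ..., w_n) of patent `pid` over its selected feature words."""
    weights = []
    for word in feature_words:
        posting = index[word].get(pid)
        if posting is None:
            weights.append(0.0)
            continue
        p_t = sum(posting["pos_weights"]) / len(posting["pos_weights"])  # assumed mean position weight
        weights.append(feature_weight(posting["tf"], N, len(index[word]),
                                      posting["pos_tag_weight"], p_t))
    return weights
```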
Step 5: Generate the IPC class feature vectors of each level. On the basis of Step 1, calculate the class weight of each vocabulary entry level by level, working upwards from the lower levels, using TF-IDF and treating each class description as a text, and then construct the class feature vectors of each level.
The method specifically comprises the following steps:
① Merge the category description of each subgroup into the category description of the main group to which it belongs, then perform word segmentation and stop-word removal.
② Combine the description of each main group, perform feature selection, and construct the class feature vectors at the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
③ Merge all the basic descriptions under the same subclass, perform feature selection, and construct the class feature vectors at the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
④ Merge all the basic descriptions under the same class, perform feature selection, and construct the class feature vectors at the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
For example, the words of all groups under the subclass A01B are combined into a vocabulary for A01B, and the same is done for the other subclasses under class A01; the weight of each word in the A01B vocabulary is then calculated and the feature vector of subclass A01B is finally constructed (a roll-up sketch follows).
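The level-by-level merging can be sketched as a roll-up over IPC symbol prefixes, as below; the symbol parsing and the dictionary layout are illustrative assumptions.

```python
# Roll up merged group descriptions from main group to subclass to class via IPC prefixes.
from collections import defaultdict

def roll_up(main_group_descriptions):
    """main_group_descriptions: {"A01B1/00": [tokens], "A01B3/00": [tokens], ...}"""
    subclasses, classes = defaultdict(list), defaultdict(list)
    for symbol, tokens in main_group_descriptions.items():
        subclass = symbol.split("/")[0].rstrip("0123456789")   # "A01B1/00" -> "A01B"
        subclasses[subclass].extend(tokens)
    for subclass, tokens in subclasses.items():
        classes[subclass[:3]].extend(tokens)                   # "A01B" -> "A01"
    # Each merged description is then treated as one text and weighted with TF-IDF.
    return main_group_descriptions, dict(subclasses), dict(classes)
```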
Step 6: Construct the neighborhood of each patent sample. Using the patent feature vectors from Step 4, calculate the similarity between each patent and every other patent, sort the similarities, and select the 100 most similar patents to form the neighborhood set of that patent.
The method specifically comprises the following steps:
① Calculate the similarity between the patents in the patent training set. The similarity is obtained as the cosine of the angle between the two vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is calculated as shown in equation (5).

sim(d_i, d_j) = Σ_{k=1}^{n} (W_ik × W_jk) / ( √(Σ_{k=1}^{n} W_ik²) × √(Σ_{k=1}^{n} W_jk²) )    (5)

where W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors and n is the dimension of the vectors.
② For patent d_i, sort all other patent samples d_j by similarity in descending order and select the first K samples to form the set D_i; D_i is called the neighborhood of patent d_i. The value of K is chosen according to the specific case.
Specific results are shown in Table 3, and a similarity sketch follows the table.
Table 3. Patent neighborhood sets (the table itself appears as an image in the original)
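A sketch of equation (5) and of the top-K neighborhood construction is given below; K = 100 follows the example above, and the dense vector layout is an assumption.

```python
# Cosine similarity between patent weight vectors (equation (5)) and the K-nearest neighborhood.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighborhood(pid, vectors, K=100):
    """Return the IDs of the K patents most similar to patent `pid`."""
    sims = [(other, cosine(vectors[pid], vectors[other]))
            for other in vectors if other != pid]
    sims.sort(key=lambda item: item[1], reverse=True)
    return [other for other, _ in sims[:K]]
```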
Step 7: Calculate the cosine similarity between the vector of the patent to be classified and each IPC class feature vector, and between the patent to be classified and each patent in the training set, and also compute the neighborhood set of the patent to be classified.
The method comprises the following steps:
① Perform preprocessing, feature selection, vectorization and data-format conversion on the patent to be classified.
② Perform patent feature selection and vectorization.
③ Calculate the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and each IPC class feature vector.
④ Calculate the cosine similarity S_bj between the patent to be classified, B_j, and each patent in the patent training set.
⑤ Sort the training patents by the similarity value S_bj in descending order and select the top K patents as the neighborhood set of B_j.
Step 8: Classification decision. First calculate the size of the shared neighborhood between the patent to be classified and each patent in the training set, i.e. the number of patents common to the two neighborhood sets. Then calculate the weighted similarity between the patent to be classified and each patent category and assign the patent to the category with the largest value (a decision sketch follows the sub-steps).
The method specifically comprises the following steps:
① Calculate the size L(B_j, d_i) of the shared neighborhood between the patent to be classified, B_j, and each sample patent d_i, i.e. the number of patents common to the two neighborhood sets.
② Calculate the final weighted similarity between the patent to be classified and each IPC class, as shown in equation (6).
[Equation (6) appears as an image in the original]
where i denotes the category and p, k, α and β are adjustable parameters; under the system defaults p = 0.8, k = 0.95, α = 0.6 and β = 5.
③ Assign the patent to be classified to the class with the maximum similarity S(i).
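Since equation (6) is only an image in the original, the decision sketch below is an illustrative stand-in: it scores each IPC class by the training-patent similarities S_bj scaled by the shared-neighborhood size and blends in the class-description similarity S_ai; the blending and the α weight are assumptions, not the patented formula.

```python
# Illustrative classification decision: shared-neighborhood-weighted KNN vote plus
# IPC class-description similarity (a stand-in for equation (6), not the disclosed formula).
from collections import defaultdict

def classify(candidate_nbrs, train_nbrs, s_ai, s_bj, labels, alpha=0.6):
    """candidate_nbrs: neighborhood set of the patent to be classified;
    train_nbrs: {pid: neighborhood set}; s_ai: {ipc_class: similarity};
    s_bj: {pid: similarity}; labels: {pid: ipc_class}."""
    score = defaultdict(float)
    for pid in candidate_nbrs:                                    # K nearest training patents
        shared = len(set(candidate_nbrs) & set(train_nbrs[pid]))  # L(B_j, d_i)
        score[labels[pid]] += s_bj[pid] * (1 + shared / len(candidate_nbrs))
    for ipc_class, sim in s_ai.items():                           # blend in class-description similarity
        score[ipc_class] = alpha * score[ipc_class] + (1 - alpha) * sim
    return max(score, key=score.get)                              # class with the largest score S(i)
```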
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A patent classification method based on a specification, characterized by comprising the following steps:
step 1, acquiring data of a patent text, and performing text preprocessing on a patent specification;
step 2, counting word frequency, position information, part-of-speech weight and inter-class distribution information of each word, and constructing an inverted index file by using the statistical values and the text information of the patent specification;
step 3, calculating the feature values of the words by using a feature selection method combining information gain and word frequency, sorting the feature values, and selecting a certain number of feature words to represent the text of the patent specification;
the calculation process of the feature value in step 3 is as follows:
let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j; the feature value is calculated as shown in equation (1):
[Equation (1) appears as an image in the original]
wherein TF expresses how strongly the word frequency within the patents influences feature selection; let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2):
[Equation (2) appears as an image in the original]
IC in equation (1) expresses how dispersed a feature word is across the categories: the more dispersed the word, the less representative it is and the smaller the value; let TF_j(t_i) be the frequency of the feature word t_i in class c_j and TF(t_i) its total frequency; the mean of its frequency over all classes is also used, and the calculation is shown in equation (3):
[Equation (3) appears as an image in the original]
step 4, calculating the weight of each patent feature word from the inverted index file using an improved TF-IDF formula, and finally constructing a patent feature vector;
the specific process of the step 4 is as follows:
step 4.1, weight calculation, as shown in equation (4):
[Equation (4) appears as an image in the original]
wherein the first factor (shown as an image in the original) is the frequency of the feature word t in the given text, N denotes the number of patents in the whole patent sample set, n denotes the number of patents in the sample set containing the feature word t, C_t denotes the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t denotes the position weight coefficient of the feature word;
step 4.2, sorting by weight in descending order, and constructing the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent specification text; the content of each patent specification text is represented in this way;
step 5, generating the IPC class feature vectors of each level: on the basis of step 1, calculating the class weight of each vocabulary entry level by level, working upwards from the lower levels, using TF-IDF and treating each class description as a text, and then constructing the class feature vectors of each level;
the specific process of the step 5 is as follows:
step 5.1, merging the category description of each subgroup into the category description of the main group to which it belongs, and performing word segmentation and stop-word removal;
step 5.2, combining the description of each main group and then performing feature selection, constructing the class feature vectors at the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, wherein A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
step 5.3, merging all basic descriptions under the same subclass and then performing feature selection, constructing the class feature vectors at the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, wherein A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
step 5.4, merging all basic descriptions under the same class and then performing feature selection, constructing the class feature vectors at the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, wherein A01 is the first class in the IPC and H99 is the last class in the IPC;
step 6, constructing the neighborhood of each patent sample: calculating the similarity between each patent and the other patents using the patent feature vectors from step 4, sorting the patent similarities, and selecting the K most similar patents to form the neighborhood set of the patent;
the specific process of the step 6 is as follows:
step 6.1, calculating the similarity between the patents in the patent training set; the similarity is obtained as the cosine of the angle between the vectors; let sim(d_i, d_j) denote the similarity between patent specification texts d_i and d_j, calculated as shown in equation (5):

sim(d_i, d_j) = Σ_{k=1}^{n} (W_ik × W_jk) / ( √(Σ_{k=1}^{n} W_ik²) × √(Σ_{k=1}^{n} W_jk²) )    (5)

wherein W_ik and W_jk denote the weights of the corresponding feature words in the patent vectors and n denotes the dimension of the vectors;
step 6.2, for patent d_i, sorting all other patent samples d_j by similarity in descending order, and selecting the first K patent samples to form the set D_i; D_i is called the neighborhood of patent d_i, and the value of K is chosen according to the specific case;
step 7, calculating the cosine similarity values between the vector of the patent to be classified and the IPC class feature vectors and between the patent to be classified and the patents in the training set, and calculating the neighborhood set of the patent to be classified;
step 8, firstly calculating the size of the shared neighborhood between the patent to be classified and each patent in the training set, namely the number of identical patents in the two neighborhood sets; then calculating the weighted sum of similarities between the patent to be classified and the patent categories, and after sorting the weighted sums, classifying the patent to be classified into the category with the largest value;
the specific process of the step 8 is as follows:
step 8.1, calculating the size L(B_j, d_i) of the shared neighborhood between the patent to be classified B_j and each sample patent d_i, namely the number of identical patents in the two neighborhood sets;
step 8.2, calculating the final weighted similarity between the patent to be classified and each IPC class, as shown in equation (6):
[Equation (6) appears as an image in the original]
wherein i denotes the category, and p, k, α and β are adjustable parameters; under the system defaults p = 0.8, k = 0.95, α = 0.6 and β = 5;
step 8.3, classifying the patent to be classified into the class with the maximum similarity S(i).
2. A specification-based patent classification method according to claim 1, characterized in that: the step 1 specifically comprises:
collecting patent sample data, sampling the IPC (International Patent Classification) numbers, extracting the specifications, performing Chinese word segmentation and part-of-speech tagging, and removing symbols and numbers in the specifications; regular-expression matching is used to filter out stop words, function words, conjunctions and other words of little use for patent classification, and only noun, adjective and verb keywords are retained.
3. A specification-based patent classification method according to claim 1, characterized in that: the specific process of the step 7 is as follows:
step 7.1, performing specification extraction, Chinese word segmentation, part-of-speech tagging and stop-word removal on the patent to be classified;
step 7.2, performing patent feature selection and vectorization;
step 7.3, calculating the cosine similarity S_ai between the feature vector of the patent to be classified B_j and each IPC class feature vector;
step 7.4, calculating the cosine similarity S_bj between the patent to be classified B_j and each patent in the patent training set;
step 7.5, sorting the training patents by the similarity value S_bj in descending order, and selecting the top K patents as the neighborhood set.
CN201710082677.8A 2017-02-16 2017-02-16 Patent classification method based on specification Active CN107122382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Publications (2)

Publication Number Publication Date
CN107122382A CN107122382A (en) 2017-09-01
CN107122382B true CN107122382B (en) 2021-03-23

Family

ID=59717475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082677.8A Active CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Country Status (1)

Country Link
CN (1) CN107122382B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN107844553B (en) * 2017-10-31 2021-07-27 浪潮通用软件有限公司 Text classification method and device
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564B (en) * 2017-12-12 2020-07-21 深圳和而泰数据资源与云技术有限公司 Information processing method, terminal and computer readable medium
CN108804512B (en) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 Text classification model generation device and method and computer readable storage medium
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263B (en) * 2018-10-10 2021-01-05 上海观安信息技术股份有限公司 Text classification method and electronic equipment
CN110019822B (en) * 2019-04-16 2021-07-06 中国科学技术大学 Few-sample relation classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113849655B (en) * 2021-12-02 2022-02-18 江西师范大学 Patent text multi-label classification method
CN116701633B (en) * 2023-06-14 2024-06-18 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Also Published As

Publication number Publication date
CN107122382A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN107122382B (en) Patent classification method based on specification
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
CN107944480B (en) Enterprise industry classification method
Liu et al. Text features extraction based on TF-IDF associating semantic
CN105512311B (en) A kind of adaptive features select method based on chi-square statistics
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN107844559A (en) A kind of file classifying method, device and electronic equipment
CN106599054B (en) Method and system for classifying and pushing questions
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN111680225B (en) WeChat financial message analysis method and system based on machine learning
CN101625680A (en) Document retrieval method in patent field
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN110705247A (en) Based on x2-C text similarity calculation method
CN106776695A (en) The method for realizing the automatic identification of secretarial document value
CN113342984A (en) Garden enterprise classification method and system, intelligent terminal and storage medium
CN109993216A (en) A kind of file classification method and its equipment based on K arest neighbors KNN
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN105205163A (en) Incremental learning multi-level binary-classification method of scientific news
CN114707003A (en) Method, equipment and storage medium for dissimilarity of names of thesis authors
CN110413985B (en) Related text segment searching method and device
CN106708920A (en) Screening method for personalized scientific research literature
CN116204647A (en) Method and device for establishing target comparison learning model and text clustering
CN115687960A (en) Text clustering method for open source security information
CN114117215A (en) Government affair data personalized recommendation system based on mixed mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant