CN107122382B - A description-based patent classification method - Google Patents

A description-based patent classification method

Info

Publication number
CN107122382B
CN107122382B CN201710082677.8A
Authority
CN
China
Prior art keywords
patents
feature
ipc
class
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710082677.8A
Other languages
Chinese (zh)
Other versions
CN107122382A
Inventor
朱玉全
金健
佘远程
石亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201710082677.8A priority Critical patent/CN107122382B/en
Publication of CN107122382A publication Critical patent/CN107122382A/en
Application granted granted Critical
Publication of CN107122382B publication Critical patent/CN107122382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a description-based patent classification method in the field of text processing and data mining. First, text preprocessing is performed on the patent specification. An inverted index file is then constructed, and feature words are selected with a feature selection method that combines information gain and word frequency. An improved TF-IDF formula is further used to compute the feature word weights and to build the patent feature vectors. The neighborhood sets of the training patents are then constructed, and finally an optimized KNN classifier is used to classify the patents. This work provides a new approach to patent document classification and lays a foundation for further research such as intelligent patent retrieval.

Description

A description-based patent classification method

Technical Field

The invention belongs to the application of computer analysis techniques to patent documents, and specifically relates to a patent classification method that uses patent specifications.

Background Art

Patents are a concrete manifestation of technological innovation and enterprise value, and one of the important carriers, outcomes and sources of knowledge development and innovation; many inventions appear only in patent documents. According to statistics from the World Intellectual Property Organization (WIPO), 70% to 90% of the world's inventions first appear in patent documents rather than in journals, papers or other publications. In addition, to protect their own interests, enterprises apply for patents as early as possible, so patents tend to concentrate the most active and advanced technologies and contain 90% to 95% of the world's technical information. At the same time, for the convenience of examination, patent documents are usually written in considerable detail. Compared with other types of material, patent documents therefore provide more information; they are the most common form of technological innovation output and record the complete course of patent activity. They reflect not only the current state of technical activity in each technical field but also the development history of a particular field. Patent documents contain the concrete technical solution of every patented invention and thus play a very important role in enterprise innovation: they allow enterprises to follow the latest research trends, avoid duplicated research, and save research time and funding, while also inspiring researchers, raising the starting point of innovation and, by drawing on earlier inventions, greatly shortening the research cycle.

With the continuous emergence of new research results and inventions in China, the number of patents has grown rapidly. As of October 5, 2016, more than 5.98 million invention patents had been published in China, of which 2.2385 million had been granted. If the average size of a patent is 2 MB, the patent data amounts to several hundred terabytes. To manage these patent documents scientifically and to retrieve related patents quickly and conveniently, the classification of patent documents is particularly important. At present, most countries use the International Patent Classification (IPC) to classify patent documents. The IPC has five hierarchical levels: section, class, subclass, main group and group. The section is the highest level of the classification table; by field it is divided into eight sections, labelled with the single letters A through H. Each section contains a number of classes, denoted by two digits, and the number of classes differs from section to section. For example, G06F21/00 denotes Physics - Electric digital data processing - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity.

It follows that every invention patent that has been or is about to be published must be assigned one or more corresponding classification symbols; for example, the invention patent "A method for protecting private data in association rule mining" is classified under G06F21/00. For a patent application about to be filed, the classification symbol is unknown and must be determined. The common practice is to determine it from the technical field or content described in the application, which requires experts to read the application manually. With the sharp increase in patent applications (now close to one million per year), this approach consumes considerable manpower and resources, and the limits of individual experts' knowledge make it difficult to guarantee consistent and accurate classification. For this reason, the present invention proposes a patent classification method based on the patent specification: it uses the information in the specifications of published invention patents to construct a classifier or classification function and uses it to determine the category of a patent application, thereby achieving automatic patent classification.

Summary of the Invention

The purpose of the present invention is to address the fact that existing patent classification methods cannot fully and effectively exploit the specification information in published invention patents. A patent classification method based on the patent specification is proposed that makes full use of the specification text of published invention patents together with their known categories to construct a classifier or classification function, which is then used to determine the category of a newly filed application. Corresponding optimizations are proposed for feature extraction and selection from the specification and for the design of the classifier.

The technical solution adopted by the present invention is as follows. The patent classification method based on the patent specification mainly comprises the following steps:

(1) Patent data preprocessing

Collect the patent sample data, extract each sample's IPC symbol and specification, and perform Chinese word segmentation and part-of-speech tagging. Remove symbols and numbers from the specification (specifications contain many paragraph numbers). Use regular-expression matching to filter out stop words, function words, conjunctions and other words of little use for patent classification, keeping only nouns, adjectives, verbs and similar keywords.
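
As an illustration only, the preprocessing above could be realized along the following lines; the sketch assumes the jieba segmenter, and the stop-word list and regular expression are placeholders rather than the ones used by the inventors.

```python
# Minimal preprocessing sketch (assumes the jieba library; stop-word list is illustrative).
import re
import jieba.posseg as pseg

STOPWORDS = {"的", "了", "和", "或者"}      # placeholder stop-word list
KEEP_POS_PREFIXES = ("n", "v", "a")         # nouns, verbs, adjectives in jieba's tag set

def preprocess(description):
    # strip digits and punctuation, e.g. paragraph numbers such as [0012]
    text = re.sub(r"[\d\[\]（）()。，、；;:：!！?？'\"‘’“”]+", " ", description)
    tokens = []
    for pair in pseg.cut(text):
        word, flag = pair.word, pair.flag
        if word.strip() and word not in STOPWORDS and flag.startswith(KEEP_POS_PREFIXES):
            tokens.append(word)
    return tokens
```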

(2) Building the inverted index file

Compute the term frequency, position information, part-of-speech weight and inter-class distribution of every word, and use these statistics together with the patent text to build an inverted index file.
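
A rough sketch of the index structure just described is given below; it stores one posting per (word, patent) pair with the term frequency, an averaged position weight and a part-of-speech weight. The field names and part-of-speech values are assumptions made for illustration.

```python
# Sketch of the inverted index: word -> {patent_id: posting with tf, position weight, POS weight}.
from collections import defaultdict

POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}   # noun / verb / adjective weights (assumed here)

def build_inverted_index(patents):
    """patents: {patent_id: [(word, pos_flag, section_weight), ...]} after preprocessing."""
    index = defaultdict(dict)
    for pid, tokens in patents.items():
        for word, flag, section_weight in tokens:
            posting = index[word].setdefault(
                pid, {"tf": 0, "section_sum": 0.0, "pos_weight": POS_WEIGHT.get(flag[:1], 1.0)}
            )
            posting["tf"] += 1
            posting["section_sum"] += section_weight
    # convert the accumulated section weights into an average position weight per document
    for postings in index.values():
        for posting in postings.values():
            posting["position_weight"] = posting.pop("section_sum") / posting["tf"]
    return index
```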

(3) Patent text feature selection

A feature selection method combining information gain and word frequency is used to compute a feature value for each word obtained in step (2); the feature values are sorted and a fixed number of feature words are selected to represent the patent text.

Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j. The feature value is then computed as shown in formula (1).

[Formula (1) is shown as an image in the original publication.]

Here TF reflects the influence of word frequency within patents on feature selection. Let m be the total number of categories in the training patents, N_j the number of patents in class c_j, and TF_jk the frequency of feature word t_i in patent P_k of class c_j; TF is then computed as shown in formula (2).

[Formula (2) is shown as an image in the original publication.]

IC in formula (1) measures how dispersed a feature word is across categories: the more evenly a word is spread over the classes, the less representative it is and the smaller the value. Let TF_j(t_i) denote the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and avgTF(t_i) its average frequency over all classes; IC is then computed as shown in formula (3).

[Formula (3) is shown as an image in the original publication.]
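
Formulas (1) to (3) are reproduced only as images in the published text, so their exact form cannot be recovered here. The sketch below is therefore an assumption: it computes a TF factor and an IC factor along the lines described above and combines them with a per-term information-gain score, illustrating the ingredients rather than the patented formula.

```python
# Illustrative (assumed) realisation of the TF and IC factors described for formulas (2) and (3).
import math

def tf_factor(freqs_by_class):
    """freqs_by_class: {class_label: [frequency of the term in each patent of that class]}."""
    m = len(freqs_by_class)
    if m == 0:
        return 0.0
    per_class_avg = [sum(freqs) / len(freqs) for freqs in freqs_by_class.values() if freqs]
    return sum(per_class_avg) / m

def ic_factor(class_totals):
    """class_totals: {class_label: total frequency of the term in that class}."""
    total = sum(class_totals.values())
    if total == 0:
        return 0.0
    mean = total / len(class_totals)
    spread = math.sqrt(sum((f - mean) ** 2 for f in class_totals.values()) / len(class_totals))
    return spread / total   # evenly spread terms get a small IC, concentrated terms a large one

def feature_value(info_gain, freqs_by_class, class_totals):
    # assumed combination: information gain scaled by the TF and IC factors
    return info_gain * tf_factor(freqs_by_class) * ic_factor(class_totals)
```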

(4) Patent text vectorization

This step includes:

① Weight calculation, as shown in formula (4); a code sketch follows this step.

[Formula (4) is shown as an image in the original publication.]

Here tf(t, d_j) denotes the frequency of feature word t in patent text d_j, N is the total number of patents in the sample set, n is the number of patents in the sample set that contain t, C_t is the part-of-speech weight coefficient of t, and P_t is its position weight coefficient.

② Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
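
Formula (4) is likewise available only as an image. The sketch below assumes the common form tf × log(N/n) scaled by the part-of-speech coefficient C_t and the position coefficient P_t, and should be read as an illustration rather than the published formula.

```python
# Assumed improved TF-IDF weighting and construction of the patent feature vector.
import math

def term_weight(tf, N, n, c_t, p_t):
    # tf * idf, scaled by part-of-speech and position coefficients (normalisation omitted)
    return tf * math.log(N / (n + 1)) * c_t * p_t

def patent_vector(term_stats, N, top_n=20):
    """term_stats: {term: (tf, n_docs_containing_term, c_t, p_t)} for one patent."""
    weights = {t: term_weight(tf, N, n, c, p) for t, (tf, n, c, p) in term_stats.items()}
    # keep the highest-weighted terms as the patent's space model vector
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```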

(5) Generating the IPC category feature vectors at each level

This step includes the following sub-steps; a code sketch follows the list:

① Merge the category description of each subgroup into the description of the main group to which it belongs, then perform word segmentation and stop-word removal.

② Merge the descriptions of each main group and perform feature selection to construct the category feature vectors of the IPC subclass level, represented as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last.

③ Merge all basic descriptions under the same subclass and perform feature selection to construct the category feature vectors of the IPC class level, represented as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last.

④ Merge all basic descriptions under the same class and perform feature selection to construct the category feature vectors of the IPC section level, represented as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last.
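
A rough sketch of the hierarchical merging in this step is given below. It assumes the category descriptions have already been tokenized and that a vectorise() function implementing the weighting of step (4) is available; both are assumptions made for illustration.

```python
# Merge IPC description texts upwards level by level and vectorise each merged text.
from collections import defaultdict

def merge_descriptions(entries, level_key):
    """entries: {ipc_symbol: token list}; level_key maps a symbol to its ancestor symbol,
    e.g. lambda s: s[:4] for the subclass level ('A01B1/00' -> 'A01B')."""
    merged = defaultdict(list)
    for symbol, tokens in entries.items():
        merged[level_key(symbol)].extend(tokens)
    return merged

def level_vectors(entries, level_key, vectorise):
    # one weighted term vector per merged category description
    return {symbol: vectorise(tokens)
            for symbol, tokens in merge_descriptions(entries, level_key).items()}
```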

(6) Constructing the patent sample neighborhoods

This step includes:

① Compute the similarity between the patents in the training set. The similarity is obtained as the cosine of the angle between their vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is computed as shown in formula (5) (a code sketch follows this list).

sim(d_i, d_j) = Σ_k (W_ik × W_jk) / ( √(Σ_k W_ik²) × √(Σ_k W_jk²) )    (5)

Here W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors, and n is the dimensionality of the vectors.

② Sort the similarities between d_i and all other patent samples d_j in descending order and select the top K samples to form a set D_i, called the neighborhood of patent d_i; the value of K depends on the application.
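
A minimal sketch of the neighborhood construction, assuming the patent vectors have been aligned to a common term order and stored as numpy arrays (an assumption made for illustration):

```python
# Cosine similarity (formula (5)) followed by a top-K cut to form each patent's neighborhood.
import numpy as np

def cosine_sim(v1, v2):
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

def neighborhood(i, vectors, K=100):
    """vectors: list of equal-length numpy arrays; returns the indices of the K most similar patents."""
    sims = [(j, cosine_sim(vectors[i], vectors[j])) for j in range(len(vectors)) if j != i]
    sims.sort(key=lambda item: item[1], reverse=True)
    return [j for j, _ in sims[:K]]
```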

(7) Similarity calculation for the patent to be classified

This step includes:

① Extract the specification of the patent to be classified and perform Chinese word segmentation, part-of-speech tagging and stop-word removal.

② Perform feature selection and vectorization for the patent.

③ Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC category feature vector.

④ Compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the training set.

⑤ Sort the training patents by similarity S_bj in descending order and select the top K patents as the neighborhood of B_j.

(8) Classification decision

This step includes (a code sketch follows the list):

① Compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and each sample patent d_i, that is, the number of patents common to their two neighborhood sets.

② Compute the final weighted similarity between the patent to be classified and each IPC category, as shown in formula (6).

[Formula (6) is shown as an image in the original publication.]

Here I denotes the category, and p, k, α and β are adjustable parameters; by default p = 0.8, k = 0.95, α = 0.6 and β = 5.

③ Assign the patent to be classified to the class with the largest similarity S(i).
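
Formula (6) is also reproduced only as an image, so the combination below is an assumption: it merely illustrates how the category similarity S_a, the neighbor similarities S_b and the shared neighborhood sizes could be blended into a per-class score; it is not the published weighting.

```python
# Assumed decision step: class-description similarity plus neighborhood-weighted votes.
def shared_neighborhood(nb_query, nb_sample):
    return len(set(nb_query) & set(nb_sample))

def classify(query_nb, S_a, neighbours, alpha=0.6):
    """
    query_nb   : neighborhood (patent ids) of the patent to classify
    S_a        : {ipc_class: similarity to that class's description vector}
    neighbours : [(patent_id, ipc_class, S_b, neighborhood_of_that_patent), ...] for the K nearest training patents
    """
    scores = dict(S_a)                       # start from the description-based similarity
    for pid, ipc_class, s_b, nb in neighbours:
        overlap = shared_neighborhood(query_nb, nb)
        scores[ipc_class] = scores.get(ipc_class, 0.0) + alpha * s_b * (1 + overlap / max(len(query_nb), 1))
    return max(scores, key=scores.get)       # class with the largest final score
```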

The main beneficial effects of the invention are as follows:

(1) Patent text feature selection

Compared with the title and abstract of a patent, the specification is much richer and carries far more information. For the same reason it also contains a large amount of noise; in particular, at levels below the IPC subclass, different patents share much similar information, which hinders classification. The invention therefore improves the feature extraction and feature vectorization of the specification, reducing noise interference and improving classification accuracy.

(2) Design of the patent classification method

Because the volume of patent data is enormous and the number of patent categories is very large, conventional classification models train too slowly and are clearly unsuitable for patent classification. The invention therefore proposes a new nearest-neighbor classification algorithm and incorporates IPC description information into the classification process, further improving the accuracy of patent classification while maintaining classification speed.

Brief Description of the Drawings

Fig. 1 is a structural block diagram of an embodiment of the invention.

Fig. 2 shows the construction flow of the patent vector space in an embodiment of the invention.

Fig. 3 is a flow chart of the classification based on the improved KNN in an embodiment of the invention.

Detailed Description of the Embodiments

The patent classification method of the invention is described in detail below using patent documents as an embodiment. The specific procedure is as follows:

Step 1: Obtain the patent text data and perform text preprocessing on the patent specifications, mainly word segmentation and stop-word removal.

① Obtain the descriptions of the IPC categories, perform word segmentation, part-of-speech tagging and stop-word removal on them, manually correct the segmentation results, and build a user dictionary.

② Perform format conversion and specification extraction on the sampled patents, add the user dictionary built in ① to the segmentation program, and then perform Chinese word segmentation and part-of-speech tagging on the specifications.

③ Use regular expressions to remove stop words, function words, conjunctions and other words of little use for classification from the specification, keeping only nouns, adjectives and verbs.

Step 2: Compute the term frequency, position information, part-of-speech weight and inter-class distribution of each word, and use these statistics together with the patent text to build the inverted index file.

The inverted index file is built from the words retained in step 1. The index structure consists of a vocabulary table and a posting table; each word has a posting table that records, for every patent in which it appears, the patent number together with the word's term frequency, position weight and part-of-speech weight in that document. The position weight is derived from the n occurrences of the word in the specification, where l_i is the weight of the section in which the i-th occurrence appears (the formula itself appears only as an image in the original). In this example the technical-field section is weighted 1, the background section 0.8 and all other sections 0.5; the part-of-speech weight is set to 2.5 for nouns and 1 for verbs and adjectives. The specific results are shown in Table 1.
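
The position-weight formula itself appears only as an image; the sketch below assumes it is the average of the per-occurrence section weights, which matches the description of n and l_i above but remains an assumption.

```python
# Assumed position weight: average of the section weights of all occurrences of a word.
SECTION_WEIGHT = {"technical_field": 1.0, "background": 0.8, "other": 0.5}

def position_weight(occurrence_sections):
    """occurrence_sections: list of section labels, one per occurrence of the word."""
    if not occurrence_sections:
        return 0.0
    weights = [SECTION_WEIGHT.get(section, 0.5) for section in occurrence_sections]
    return sum(weights) / len(weights)
```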

Table 1: Merged user dictionary and inverted index

[Table 1 is shown as an image in the original publication.]

Step 3: Use the feature selection method combining information gain and word frequency to compute the feature value of each word, sort the feature values, and select a fixed number of feature words to represent the patent text.

Information gain suffers from a low-frequency-word defect, whereas applicants tend to repeat particular words to emphasize an inventive point, and such high-frequency words are useful for classification. The invention therefore adopts a feature selection method that combines information gain and word frequency: the feature value of each word in a patent is first computed with formula (1), the words are then sorted by feature value in descending order, and the top 20 words are selected as the patent's feature words.

Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j. The feature value is then computed as shown in formula (1).

[Formula (1) is shown as an image in the original publication.]

Here TF reflects the influence of word frequency within patents on feature selection. Let m be the total number of categories in the training patents, N_j the number of patents in class c_j, and TF_jk the frequency of feature word t_i in patent P_k of class c_j; TF is then computed as shown in formula (2).

[Formula (2) is shown as an image in the original publication.]

IC in formula (1) measures how dispersed a feature word is across categories: the more evenly a word is spread over the classes, the less representative it is and the smaller the value. Let TF_j(t_i) denote the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and avgTF(t_i) its average frequency over all classes; IC is then computed as shown in formula (3).

[Formula (3) is shown as an image in the original publication.]

Step 4: Use the inverted index file to obtain the statistics of each patent feature word, compute the feature word weights with the improved TF-IDF formula, and finally construct the patent feature vectors.

This step specifically includes:

① Weight calculation, as shown in formula (4).

[Formula (4) is shown as an image in the original publication.]

Here tf(t, d_j) denotes the frequency of feature word t in patent text d_j, N is the total number of patents in the sample set, n is the number of patents in the sample set that contain t, C_t is the part-of-speech weight coefficient of t, and P_t is its position weight coefficient.

② Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.

The inverted index file already records each feature word's term frequency, position weight and part-of-speech weight, so only the number of texts containing the feature word needs to be counted, and the total number of texts is also known. The specific results are shown in Table 2.

Table 2: Patent feature vectors

[Table 2 is shown as an image in the original publication.]

Step 5: Generate the IPC category feature vectors at each level. On the basis of step 1, work upward level by level starting from the subclasses, computing the category weight of each word at the corresponding level; the weights are computed with TF-IDF, each category description is treated as a text, and the category feature vectors of each level are then constructed.

This step specifically includes:

① Merge the category description of each subgroup into the description of the main group to which it belongs, then perform word segmentation and stop-word removal.

② Merge the descriptions of each main group and perform feature selection to construct the category feature vectors of the IPC subclass level, represented as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last.

③ Merge all basic descriptions under the same subclass and perform feature selection to construct the category feature vectors of the IPC class level, represented as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last.

④ Merge all basic descriptions under the same class and perform feature selection to construct the category feature vectors of the IPC section level, represented as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last.

For example, the vocabularies of all groups under subclass A01B are merged into a single A01B vocabulary set (and likewise for the other subclasses under class A01); the weight of each word in the A01B vocabulary set is then computed, and finally the feature vector of subclass A01B is constructed.

Step 6: Construct the patent sample neighborhoods. Using the patent feature vectors from step 4, compute the similarity between each patent and every other patent, sort these similarities, and select the 100 most similar patents to form that patent's neighborhood set.

This step specifically includes:

① Compute the similarity between the patents in the training set. The similarity is obtained as the cosine of the angle between their vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is computed as shown in formula (5).

sim(d_i, d_j) = Σ_k (W_ik × W_jk) / ( √(Σ_k W_ik²) × √(Σ_k W_jk²) )    (5)

Here W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors, and n is the dimensionality of the vectors.

② Sort the similarities between d_i and all other patent samples d_j in descending order and select the top K samples to form a set D_i, called the neighborhood of patent d_i; the value of K depends on the application.

The specific results are shown in Table 3.

Table 3: Patent neighborhood sets

[Table 3 is shown as an image in the original publication.]

Step 7: Compute the cosine similarities between the vector of the patent to be classified and the IPC category feature vectors as well as the training-set patents, and likewise determine the neighborhood set of the patent to be classified.

This step includes:

① Perform preprocessing, feature selection, vectorization and data format conversion on the patent to be classified.

② Perform patent feature selection and vectorization.

③ Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC category feature vector.

④ Compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the training set.

⑤ Sort the training patents by similarity S_bj in descending order and select the top K patents as the neighborhood of B_j.

Step 8: Classification decision. First compute the shared neighborhood size between the patent to be classified and each training patent, that is, the number of patents common to their neighborhood sets. Then compute the weighted similarity sum between the patent to be classified and each patent category; after sorting the weighted sums, assign the patent to the class with the largest value.

This step specifically includes:

① Compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and each sample patent d_i, that is, the number of patents common to their two neighborhood sets.

② Compute the final weighted similarity between the patent to be classified and each IPC category, as shown in formula (6).

[Formula (6) is shown as an image in the original publication.]

Here I denotes the category, and p, k, α and β are adjustable parameters; by default p = 0.8, k = 0.95, α = 0.6 and β = 5.

③ Assign the patent to be classified to the class with the largest similarity S(i).

In the description of this specification, references to terms such as "one embodiment", "some embodiments", "exemplary embodiment", "example", "specific example" or "some examples" mean that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example, and the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (3)

1. A patent classification method based on the patent specification, characterized by comprising the following steps:
step 1, acquiring data of a patent text, and performing text preprocessing on a patent specification;
step 2, counting word frequency, position information, part-of-speech weight and inter-class distribution information of each word, and constructing an inverted index file by using the statistical values and the text information of the patent specification;
step 3, calculating the characteristic values of the words by using a characteristic selection method combining information gain and word frequency, sequencing the characteristic values, and selecting a certain number of characteristic words to represent the text of the patent specification;
the calculation process of the characteristic value in the step 3 is as follows:
let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j; the feature value is calculated as shown in formula (1):
[Formula (1) is shown as an image in the original publication.]
wherein TF represents the degree of influence of word frequency within the patents on feature selection; let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in formula (2):
[Formula (2) is shown as an image in the original publication.]
IC in formula (1) represents the degree of dispersion of a feature word among the categories; the more dispersed the word, the less representative it is and the smaller the value; let TF_j(t_i) denote the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and avgTF(t_i) its average frequency over all classes; IC is calculated as shown in formula (3):
[Formula (3) is shown as an image in the original publication.]
step 4, calculating the weight of each patent feature word by using the inverted index file, then calculating the weight of the feature word by using an improved TF-IDF formula, and finally constructing a patent feature vector;
the specific process of the step 4 is as follows:
step 4.1, weight calculation, wherein the calculation is shown as the formula (4):
[Formula (4) is shown as an image in the original publication.]
wherein tf(t, d) denotes the frequency of occurrence of the feature word t in the patent text d, N represents the number of patents in the whole patent sample set, n represents the number of patents in the sample set that contain the feature word t, C_t represents the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t represents the position weight coefficient of the feature word;
step 4.2, sorting the feature words by weight in descending order and constructing the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent specification text, which represents the content of each patent specification text;
step 5, generating IPC class characteristic vectors of each level, calculating the class weight of each vocabulary in the corresponding level layer by layer from subclasses upwards on the basis of the step 1, calculating the weight by using TF-IDF, regarding a class description as a text, and then constructing the class characteristic vectors of each level;
the specific process of the step 5 is as follows:
step 5.1, merging the category description of each subgroup into the category description of the main group to which the subgroup belongs, and performing word segmentation and stop-word removal;
step 5.2, combining the descriptions of each main group, then performing feature selection and constructing the class feature vectors of the IPC subclass level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, wherein A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
step 5.3, merging all basic descriptions under the same subclass, then performing feature selection and constructing the class feature vectors of the IPC class level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, wherein A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
step 5.4, merging all basic descriptions under the same class, then performing feature selection and constructing the class feature vectors of the IPC section level, expressed as {V_A01, V_A21, ..., V_H99}, wherein A01 is the first class in the IPC and H99 is the last class in the IPC;
step 6, constructing the neighborhood of each patent sample: calculating the similarity between each patent and the other patents by using the patent feature vectors of step 4, sorting the patent similarities, and selecting the K patents with the largest similarity to form the neighborhood set of the patent;
the specific process of the step 6 is as follows:
step 6.1, calculating the similarity between the patents in the patent training set; the similarity is obtained by calculating the cosine of the included angle between the vectors; let sim(d_i, d_j) represent the similarity between the patent specification texts d_i and d_j, calculated as shown in formula (5):
sim(d_i, d_j) = Σ_k (W_ik × W_jk) / ( √(Σ_k W_ik²) × √(Σ_k W_jk²) )    (5)
wherein W_ik and W_jk represent the weights of the corresponding feature words in the patent vectors, and n represents the dimension of the vectors;
step 6.2, sorting the similarities between d_i and all other patent samples d_j in descending order, and selecting the first K patent samples to form a set D_i, where D_i is referred to as the neighborhood of patent d_i and the value of K depends on the specific situation;
step 7, calculating cosine similarity values between the patent vectors to be classified, the IPC class characteristic vectors and the patents in the training set, and calculating neighborhood sets of the patents to be classified;
step 8, firstly calculating the size of the shared neighborhood between the patent to be classified and each patent in the training set, namely the number of identical patents in their neighborhood sets; then calculating the weighted similarity sum between the patent to be classified and each patent category, and after sorting the weighted sums, assigning the patent to be classified to the category with the largest value;
the specific process of the step 8 is as follows:
step 8.1, calculating the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and the sample patent d_i, namely the number of identical patents in the two neighborhood sets;
step 8.2, calculating the final weighted similarity between the patent to be classified and each IPC class, wherein the calculation formula is shown as formula (6):
[Formula (6) is shown as an image in the original publication.]
wherein, I represents the category, p, k, alpha and beta are adjustable parameters, and under the default condition of the system, p is 0.8, k is 0.95, alpha is 0.6 and beta is 5;
step 8.3, classifying the patent to be classified into the class with the maximum similarity S(i).
2. A specification-based patent classification method according to claim 1, characterized in that: the step 1 specifically comprises:
collecting patent sample data, extracting the IPC numbers of the samples and the specifications, performing Chinese word segmentation and part-of-speech tagging, and removing symbols and numbers from the specifications; regular matching is used to filter out words of little use for patent classification, such as stop words, function words and conjunctions, and only noun, adjective and verb keywords are retained.
3. A specification-based patent classification method according to claim 1, characterized in that: the specific process of the step 7 is as follows:
step 7.1, extracting the specification of the patent to be classified, performing Chinese word segmentation and part-of-speech tagging, and removing stop words;
step 7.2, selecting and vectorizing patent features;
step 7.3, calculating the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector;
step 7.4, calculating the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set;
step 7.5, sorting the above training patents in descending order of the similarity value S_bj, and selecting the top K patents as the neighborhood set.
CN201710082677.8A 2017-02-16 2017-02-16 A description-based patent classification method Active CN107122382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 A description-based patent classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 A description-based patent classification method

Publications (2)

Publication Number Publication Date
CN107122382A CN107122382A (en) 2017-09-01
CN107122382B true CN107122382B (en) 2021-03-23

Family

ID=59717475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082677.8A Active CN107122382B (en) 2017-02-16 2017-02-16 A description-based patent classification method

Country Status (1)

Country Link
CN (1) CN107122382B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107844553B (en) * 2017-10-31 2021-07-27 浪潮通用软件有限公司 Text classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564B (en) * 2017-12-12 2020-07-21 深圳和而泰数据资源与云技术有限公司 Information processing method, terminal and computer readable medium
CN108804512B (en) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 Text classification model generation device and method and computer readable storage medium
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263B (en) * 2018-10-10 2021-01-05 上海观安信息技术股份有限公司 Text classification method and electronic equipment
CN110019822B (en) * 2019-04-16 2021-07-06 中国科学技术大学 A few-sample relation classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113919436A (en) * 2021-10-22 2022-01-11 厦门理工学院 Patent rapid processing method based on mobile phone terminal
CN113849655B (en) * 2021-12-02 2022-02-18 江西师范大学 Patent text multi-label classification method
CN114297387A (en) * 2021-12-31 2022-04-08 智慧芽信息科技(苏州)有限公司 Training sample labeling method and device and classification model training method and device
CN116701633B (en) * 2023-06-14 2024-06-18 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Also Published As

Publication number Publication date
CN107122382A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN107122382B (en) A description-based patent classification method
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN109101477B (en) A method for enterprise field classification and enterprise keyword screening
CN110825877A (en) A Semantic Similarity Analysis Method Based on Text Clustering
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
WO2022156328A1 (en) Restful-type web service clustering method fusing service cooperation relationships
CN108763213A (en) Theme feature text key word extracting method
CN110543564B (en) Domain Label Acquisition Method Based on Topic Model
CN109902289B (en) News video theme segmentation method oriented to fuzzy text mining
WO2022121163A1 (en) User behavior tendency identification method, apparatus, and device, and storage medium
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN111144106B (en) Two-stage text feature selection method under unbalanced data set
CN102495892A (en) Webpage information extraction method
CN112861984B (en) A speech emotion classification method based on feature fusion and ensemble learning
CN108804595B (en) A short text representation method based on word2vec
CN110097096B (en) A Text Classification Method Based on TF-IDF Matrix and Capsule Network
CN104391852B (en) A kind of method and apparatus for establishing keyword dictionary
CN104536830A (en) KNN text classification method based on MapReduce
CN111259110A (en) College patent personalized recommendation system
CN107908624A (en) A kind of K medoids Text Clustering Methods based on all standing Granule Computing
CN118278365A (en) Automatic generation method and device for scientific literature review
CN106503153A (en) Computer text classification system, system and text classification method thereof
CN104866606A (en) MapReduce parallel big data text classification method
CN101408893A (en) Method for rapidly clustering documents
CN113157857B (en) News-oriented hot topic detection method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant