CN105808524A

CN105808524A - Patent document abstract-based automatic patent classification method

Info

Publication number: CN105808524A
Application number: CN201610139838.8A
Authority: CN
Inventors: 彭彦; 朱玉全; 李竞; 何峰; 余飞
Original assignee: Jiangsu Changyuan Information Technology Co ltd
Current assignee: Jiangsu Changyuan Information Technology Co ltd
Priority date: 2016-03-11
Filing date: 2016-03-11
Publication date: 2016-07-27

Abstract

The invention discloses an automatic patent classification method based on patent document abstracts. The method comprises the following steps: dictionary construction, generation of category feature vectors for each level of the IPC, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified. In patent text feature selection, the information contained in patent titles and abstracts is fully used, which markedly reduces the computation required for feature selection. In patent text vectorization, the TF-IDF calculation is improved by adding part-of-speech weights and position weights for words, which further improves classification. In patent classification, a hierarchical method is adopted that combines the strengths of the SVM classifier and the KNN classifier, overcoming the drawbacks that SVM requires training too many classifiers and that KNN is computationally expensive.

Description

An automatic patent classification method based on patent document abstracts

Technical field

The invention belongs to the application of computer analysis technology to patent documents, and specifically relates to a patent classification method that uses patent abstracts.

Background art

Patent documents are the collective term for the various specifications published by national patent offices or international patent organizations, and they form the main body of the patent literature. According to statistics from the World Intellectual Property Organization (WIPO), 70% to 90% of the world's inventions first appear in patent documents rather than in other media such as journals and papers. In addition, to protect their own interests, enterprises file patent applications as early as possible, so patents tend to concentrate the most active and advanced technology and contain 90% to 95% of the world's technical information. At the same time, to facilitate examination, patent documents are usually written in considerable detail. Compared with other kinds of material, patent documents provide more information; they are one of the most common forms of technical innovation output and record the complete course of patenting activity. They reflect not only the current state of technical activity in each technical field but also the historical development of technical activity in a particular field. Patent documents contain the concrete technical solution of each invention for which a patent is sought, which is very important for enterprise innovation: they let enterprises follow the latest research developments, avoid duplicated research, and save research time and funding. They can also inspire enterprise researchers, raise the starting point of innovation, draw on earlier inventions, and greatly shorten research cycles.

With the continuous emergence of new research results and inventions in China, the number of patents is growing rapidly, with close to one million patent applications per year. As of 29 December 2015, China had published more than 5 million invention patents, of which about 1.866 million had been granted. If the average size of a patent is 2 MB, the volume of patent data reaches hundreds of TB. To manage these patent documents scientifically and to retrieve them quickly and conveniently, classifying patent documents is particularly important. At present, most countries in the world use the International Patent Classification (IPC) to classify patent documents. The IPC has five hierarchical levels: section, class, subclass, main group, and subgroup. The section is the highest level of the classification scheme; according to technical field there are eight sections, each marked with one English letter, A to H. Each section contains multiple classes, each class is denoted by a two-digit number, and each class contains a varying number of subclasses. For example, G06F21/00 denotes: Physics; Electric digital data processing; Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity.
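The five-level structure of an IPC symbol can be illustrated with a short sketch. The following Python code is only an illustration and is not part of the disclosed method; the names `IpcCode` and `parse_ipc` and the regular expression are assumptions made for this example. It relies on the IPC convention that a main group symbol ends in /00.

```python
import re
from typing import NamedTuple


class IpcCode(NamedTuple):
    section: str     # one letter A-H, e.g. "G"
    ipc_class: str   # section + two digits, e.g. "G06"
    subclass: str    # class + one letter, e.g. "G06F"
    main_group: str  # subclass + group number + "/00", e.g. "G06F21/00"
    symbol: str      # normalized full symbol (main group or subgroup)


# Hypothetical pattern for a full IPC symbol such as "G06F21/00".
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IpcCode:
    """Split a full IPC symbol into its hierarchical levels."""
    normalized = symbol.replace(" ", "").upper()
    match = _IPC_RE.match(normalized)
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, cls, sub, group, _subgroup = match.groups()
    subclass = section + cls + sub
    return IpcCode(section, section + cls, subclass, f"{subclass}{group}/00", normalized)
```

For G06F21/00 this yields section G, class G06, subclass G06F, and main group G06F21/00.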

As can be seen, every published invention patent has a corresponding classification symbol, such as G06F21/00. For a patent application, the classification symbol is unknown and must be determined. The usual current practice is to determine it from the technical field of the subject matter described or from the content of the patent. Such an approach is rather limited: it relies on experts in the field reading the application manually. The sharp increase in patent documents raises the labor and material costs of manual analysis, and the limits of an expert's own knowledge make it hard to guarantee consistent and accurate classification. The invention therefore proposes an automatic patent classification method based on patent document abstracts. The method uses the information in the abstracts of published invention patents to construct a classifier or classification function, uses it to determine the category of a patent application, and thereby classifies patents automatically.

Summary of the invention

The object of the invention is to address the fact that existing patent classification methods cannot fully and effectively use the information in the abstracts of published invention patents. It proposes an automatic patent classification method based on patent document abstracts. The method makes full use of the abstract information of published invention patents and their corresponding categories to construct a classifier or classification function, uses it to determine the category of filed patent applications, and proposes corresponding optimizations for abstract feature extraction and selection and for the choice of classifier.

The technical solution of the invention is an automatic patent classification method based on patent document abstracts, comprising dictionary construction, generation of category feature vectors for each IPC level, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified, characterized in that these steps include:

Step 1: preprocess the patent data, including extraction of each sample's IPC symbol, title, and abstract, Chinese word segmentation, part-of-speech tagging, stop-word removal, and word-frequency counting;

Step 2: use regular-expression matching to filter out words that are not nouns, verbs, or adjectives, and generate an inverted index file that records word positions and part-of-speech weights;

Step 3: build the dictionary. All words are assembled into a dictionary whose entries take the form "label term", for example: 1 China;

Step 4: generate the category feature vectors for each IPC level, which includes:

Step 4.1: merge the classification description of each subgroup into the classification description of its main group, then perform word segmentation and stop-word removal;

Step 4.2: perform feature selection on the merged description of each main group and construct the category feature vectors for the IPC subclass level, represented as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;

Step 4.3: merge all descriptions under the same subclass, perform feature selection, and construct the category feature vectors for the IPC class level, represented as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;

Step 4.4: merge all descriptions under the same class, perform feature selection, and construct the category feature vectors for the IPC section level, represented as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last class in the IPC;

Step 5: patent text feature selection. Use the chi-square feature selection method to compute the feature value of every word, sort the words, and select K feature words (K is given by the user) to represent the patent text;

Step 6: compute the feature value of each word, using the inverted index file to compute each statistic;

Step 7: vectorize the patent text, which includes:

Step 7.1: compute the weight of each feature word with the improved TF-IDF formula (II), in which the term frequency is the frequency with which feature word t occurs in the text, N is the total number of texts in the corpus, n is the number of texts in the corpus in which term t occurs, C_t is the part-of-speech weight coefficient for the feature word's part of speech, and P_t is the position weight coefficient of the feature word;

Step 7.2: sort the weights in descending order and construct the vector space model vector of the patent text, V_i(w_i1, w_i2, ..., w_in), which represents the content of each patent text;
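The weighting in step 7 can be sketched as follows. Formula (II) itself is not reproduced in this text, so the multiplicative form used here, tf × log(N/n) × C_t × P_t, is an assumption that is merely consistent with the symbols defined in step 7.1; the function names `improved_tfidf` and `vectorize` are hypothetical.

```python
import math


def improved_tfidf(tf, n_texts, texts_with_term, pos_weight, position_weight):
    """Assumed form of the improved weight: tf * log(N / n) * C_t * P_t.

    The exact formula (II) is not given here; this product is only a sketch.
    """
    idf = math.log(n_texts / texts_with_term)
    return tf * idf * pos_weight * position_weight


def vectorize(features, n_texts):
    """features maps term -> (tf, n, C_t, P_t).

    Returns (term, weight) pairs sorted by descending weight, as in step 7.2.
    """
    weights = {
        term: improved_tfidf(tf, n_texts, n, c_t, p_t)
        for term, (tf, n, c_t, p_t) in features.items()
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

Under this assumed form, a noun that appears in the title outweighs a verb with the same frequency that appears only in the body of the abstract, which is the intended effect of the part-of-speech and position coefficients.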

Step 8: build the section-level SVM training set, which includes:

Step 8.1: group the patent text feature vectors by IPC section;

Step 8.2: add the IPC section-level category feature vectors to the vector set from step 8.1;

Step 8.3: convert the vector set into a training-set format in which the first column is the IPC section label, lable is the label of the term, and w is the feature-word weight;
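The detailed description states that the libsvm library is used, whose training files hold one sample per line in the form `<label> <index>:<value> ...` with indices in ascending order. The sketch below writes one such line; the mapping of sections A-H to the numbers 1-8 and the function name `to_libsvm_line` are assumptions made for illustration.

```python
# Assumed numeric labels for the eight IPC sections (A=1, ..., H=8).
SECTION_LABELS = {section: i for i, section in enumerate("ABCDEFGH", start=1)}


def to_libsvm_line(section, weights):
    """Format one patent vector as a libsvm training line.

    weights maps the dictionary label of a term (an int) to its weight w.
    Indices are emitted in ascending order, as libsvm expects.
    """
    features = " ".join(f"{label}:{w:.6g}" for label, w in sorted(weights.items()))
    return f"{SECTION_LABELS[section]} {features}"
```

For example, a patent in section G whose feature words have dictionary labels 3 and 12 becomes a line starting with 7 followed by the pairs 3:w and 12:w.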

Step 9: build the SVM-based classification model for sections A-H. Train on the training set with a one-versus-one multi-class SVM method to generate the section-level classifier model;

Step 10: build the class-level SVM training sets, which includes:

Step 10.1: group the patent text feature vectors by IPC class;

Step 10.2: add the IPC class-level category feature vectors to the vector set from step 10.1;

Step 10.3: convert the vector set into a training-set format in which the first column is the IPC class label, lable is the label of the term, and w is the feature-word weight;

Step 11: build the SVM-based classification models for the classes. Split the class-level training set by sections A-H and train on the class training set of each section separately, again with the one-versus-one multi-class SVM method, to generate the class-level classifier models;

Step 12: decide the section and class of the patent to be classified, which includes:

Step 12.1: for the patent to be classified, extract the title and abstract and perform Chinese word segmentation, part-of-speech tagging, and stop-word removal;

Step 12.2: perform feature selection and vectorization on the patent;

Step 12.3: use the tree-structured hierarchical classifier model obtained above to make the category decision and determine the class to which the patent to be classified belongs;
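The two-stage decision of step 12 can be sketched as a small tree walk: the root model predicts the section, and the model attached to that section predicts the class. In this sketch the trained SVM models are represented by plain callables; `classify_hierarchically` and the stand-in models are hypothetical and only illustrate the control flow.

```python
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]
Classifier = Callable[[Vector], str]  # stands in for a trained SVM model


def classify_hierarchically(
    vector: Vector,
    section_model: Classifier,
    class_models: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Predict the section with the root model, then the class with that section's model."""
    section = section_model(vector)
    ipc_class = class_models[section](vector)
    return section, ipc_class
```

With this structure only two models are evaluated per patent, regardless of how many class-level models exist.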

Step 13: classify the patent at the subclass level, using a KNN algorithm based on IPC classification knowledge to identify the category, which includes:

Step 13.1: compute the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and the feature vector of each subclass I in the class it was assigned to;

Step 13.2: compute the cosine similarity S_bj between the patent to be classified, B_j, and each training patent in the class it was assigned to;

Step 13.3: sort the training patents from step 13.2 by similarity. If the highest similarity is below the threshold, the patent was assigned to the wrong class and is treated as misclassified; otherwise go to step 13.4;

Step 13.4: select the K training patents with the highest similarity and, according to their subclasses, combine their similarities with the subclass similarities from step 13.1 in a weighted sum, computed as:

$$S_{i} = S_{ai} + p \times \sum_{B_{j} \in I} S_{bj} \qquad \text{(III)}$$

where p is a weight coefficient and S_i is the final similarity between the patent to be classified and each subclass;

Step 13.5: assign the patent to be classified to the subclass with the largest similarity S_i.
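Step 13 can be sketched in a few lines. The code below is an illustrative reading of steps 13.1 to 13.5 and formula (III), not the authors' implementation: vectors are sparse dictionaries, the names `cosine` and `knn_subclass` are hypothetical, the default values of `p` and `threshold` are placeholders, and the normalization of similarities mentioned in the embodiment is omitted.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_subclass(query, subclass_vectors, training, k=30, p=0.5, threshold=0.1):
    """Return the best subclass, or None if the patent looks misclassified.

    subclass_vectors: {subclass: category feature vector}   (step 13.1)
    training: list of (subclass, patent vector) in the class (step 13.2)
    """
    s_a = {sub: cosine(query, vec) for sub, vec in subclass_vectors.items()}
    ranked = sorted(((cosine(query, vec), sub) for sub, vec in training), reverse=True)
    if not ranked or ranked[0][0] < threshold:       # step 13.3
        return None
    neighbour_sum = defaultdict(float)
    for sim, sub in ranked[:k]:                      # step 13.4: top-K neighbours
        neighbour_sum[sub] += sim
    # Formula (III): S_i = S_ai + p * sum of S_bj over neighbours in subclass i.
    scores = {sub: s_a[sub] + p * neighbour_sum.get(sub, 0.0) for sub in s_a}
    return max(scores, key=scores.get)               # step 13.5
```

The subclass vector term S_ai acts as prior knowledge from the IPC descriptions, while the neighbour sum adds evidence from individual training patents; p controls the balance between the two.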

The invention makes full use of the abstract information of published invention patents and their corresponding categories to construct a classifier or classification function, uses it to determine the category of filed patent applications, and proposes corresponding optimizations for abstract feature extraction and selection and for the choice of classifier. The main benefits are:

(1) Patent text feature selection

The patent title and abstract state the name and technical field of the patent and clearly reflect the technical problem it solves, its technical solution, and its main uses. The invention makes full use of this information, which markedly reduces the computation required for patent feature selection while classifying well. In addition, in computing the patent text vectors, the TF-IDF calculation is improved by adding part-of-speech weights and position weights for words, which further improves classification.

(2) Design of the patent classification method

Because the number of patent categories is large, many SVM classifiers would have to be trained, and categories deep in the hierarchy are highly similar, so it is hard to find suitable SVM support vectors at the subclass level. KNN is an instance-based classification algorithm that needs no trained model, but because a section or class contains too many instances, the similarity computation during classification is very large. The invention therefore proposes a hierarchical classification method that uses the respective strengths of the SVM and KNN classifiers, overcoming the drawbacks that SVM requires training too many classifiers and that KNN is computationally expensive.

Brief description of the drawings

Fig. 1 is a structural block diagram of the embodiment of the invention.

Fig. 2 is the construction flow of the patent vector space in the embodiment of the invention.

Fig. 3 is the classification flowchart based on improved KNN in the embodiment of the invention.

Detailed description of the embodiments

In this example, according to the IPC table, 1,000 patent samples are drawn from each of sections A-H, with samples drawn evenly across the categories at each level. The steps are as follows:

(1) Obtain the IPC category descriptions and perform word segmentation, part-of-speech tagging, and stop-word removal on them. The ICTCLAS word segmentation tool from the Chinese Academy of Sciences is used here; after the segmentation results are manually corrected, a user dictionary is built.

(2) Convert the format of each patent sample, extract the title and abstract, and perform Chinese word segmentation and part-of-speech tagging on them. Use regular expressions to remove stop words, function words, conjunctions, and other words of little use for patent classification, keeping only nouns, adjectives, and verbs.

(3) Build the inverted index file. The index file structure consists of a vocabulary and posting lists: each word corresponds to one posting list, which stores the numbers of the patents in which the word occurs. The position weight is computed from n and l_i, where n is the total number of times the word occurs in the title and abstract and l_i is the weight of the position at which the word occurs for the i-th time. In this example the title weight is 1, the first sentence of the abstract is 0.8, and other positions are 0.5. The part-of-speech weights are set to 2.5 for nouns and 1 for verbs and adjectives.
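The inverted index of item (3) can be sketched as follows. The position weight formula is not reproduced in this text; the sketch assumes it is the average of the position weights l_i over the n occurrences, which is only one reading consistent with the definitions of n and l_i. The tag names "n", "v", "a", the position names, and the function names are hypothetical.

```python
from collections import defaultdict

# Weights taken from the example: title 1, first abstract sentence 0.8, elsewhere 0.5.
POSITION_WEIGHTS = {"title": 1.0, "first_sentence": 0.8, "other": 0.5}
# Noun 2.5, verb and adjective 1 (simplified, hypothetical tag names).
POS_WEIGHTS = {"n": 2.5, "v": 1.0, "a": 1.0}


def position_weight(positions):
    """Assumed form: mean of l_i over the n occurrences of the word."""
    return sum(POSITION_WEIGHTS[p] for p in positions) / len(positions)


def build_inverted_index(tokens):
    """tokens: iterable of (patent_id, word, pos_tag, position).

    Returns {word: {patent_id: {"tf", "position_weight", "pos_weight"}}}.
    Words that are not nouns, verbs, or adjectives are dropped.
    """
    occurrences = defaultdict(lambda: defaultdict(list))
    tags = {}
    for patent_id, word, tag, position in tokens:
        if tag not in POS_WEIGHTS:
            continue
        occurrences[word][patent_id].append(position)
        tags[word] = tag
    return {
        word: {
            patent_id: {
                "tf": len(positions),
                "position_weight": position_weight(positions),
                "pos_weight": POS_WEIGHTS[tags[word]],
            }
            for patent_id, positions in postings.items()
        }
        for word, postings in occurrences.items()
    }
```

Storing the term frequency and both weights in the posting list means the later weighting step only needs the document counts in addition to the index.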

(4) Using the inverted index file, build the dictionary from the words that remain after segmentation and filtering. Each word corresponds to one label; the dictionary is built to prepare for the later SVM training. The results are shown in Table 1.

Table 1: Merged user dictionary and inverted index

(5) On the basis of (1), work bottom-up from the group level and compute each word's weight for the category at the corresponding level. The weight is computed in a way similar to TF-IDF, treating each category description as one text, and the category feature vectors of each level are then built. For example, the vocabulary of all groups under subclass A01B is merged into an A01B vocabulary set, and the same is done for the other subclasses under class A01; the weight of each word in the A01B vocabulary set is then computed, and finally the feature vector of subclass A01B is constructed.
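The bottom-up merge in item (5) can be sketched as below. Each child description is folded into its parent category, each merged category is treated as one text, and a TF-IDF-like weight is computed. The exact weighting used in the example is not specified beyond "similar to TF-IDF", so the form tf × (log(N/df) + 1) is an assumption, as is the function name `category_vectors`.

```python
import math
from collections import Counter


def category_vectors(child_terms, parent_of):
    """Merge child descriptions into parents and weight the merged terms.

    child_terms: {child_code: list of terms from its description}
    parent_of:   function mapping a child code to its parent code
    Returns {parent_code: {term: weight}}.
    """
    merged = {}
    for code, terms in child_terms.items():
        merged.setdefault(parent_of(code), Counter()).update(terms)
    n_categories = len(merged)
    # Number of merged categories in which each term occurs.
    doc_freq = Counter(term for bag in merged.values() for term in bag)
    return {
        code: {
            # Assumed TF-IDF variant; the +1 keeps terms shared by all categories nonzero.
            term: tf * (math.log(n_categories / doc_freq[term]) + 1.0)
            for term, tf in bag.items()
        }
        for code, bag in merged.items()
    }
```

Calling it with main-group descriptions and `parent_of=lambda code: code[:4]` yields subclass vectors such as the one for A01B; applying the same function again with `code[:3]` would move one level further up.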

(6) Patent titles and abstracts are short texts that summarize the content and field of the patent at a high level, and the words that best discriminate between categories do not necessarily occur most frequently, so this example uses the chi-square feature selection method. The chi-square computation itself is not repeated here. The inverted index file is used to compute the four statistics the method needs, A_ij, B_ij, C_ij, and D_ij: A_ij is the number of documents that contain feature word t_i and belong to category c_j; B_ij is the number of documents that contain t_i and do not belong to c_j; C_ij is the number of documents that do not contain t_i and belong to c_j; D_ij is the number of documents that do not contain t_i and do not belong to c_j. These four statistics are used to compute the chi-square value between each word in a patent and each category; the larger the value, the stronger the correlation between the word and the category. This example sorts the chi-square values computed for the words in each patent and selects the 10 words with the largest chi-square values as feature words.
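For reference, the standard chi-square statistic for a word and a category can be computed from the four counts as N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)), with N = A + B + C + D. The sketch below applies it and keeps the top-ranked words; the function names are hypothetical, and the default k = 10 follows the example above.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic from the 2x2 contingency counts A, B, C, D."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def select_features(stats, k=10):
    """stats maps word -> (A, B, C, D) for one category; returns the k best words."""
    ranked = sorted(stats, key=lambda word: chi_square(*stats[word]), reverse=True)
    return ranked[:k]
```

A word that occurs in every document of a category and nowhere else gets the maximum score, while a word spread evenly over all categories scores zero.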

(7) Use the inverted index file to compute the weight of each patent feature word. The inverted index file already records the frequency, position weight, and part-of-speech weight of each feature word, so only the number of texts in which the feature word occurs needs to be counted; the total number of texts is also known. The weights are computed with the improved TF-IDF formula (II) and then normalized to reduce the computation in SVM training. Finally, the text feature vector of the patent is constructed.

(8) Build the SVM training sets. The invention uses the libsvm library, so the training sets must be constructed in the prescribed format, which is described in the Summary of the invention. The patent text feature vectors are saved by level in separate files. The formatted patent vectors of sections A-H are saved in the file svm.train for section-level SVM training. The formatted patent vectors under section A are saved in the file svma.train, and the patent vectors under the other sections are likewise formatted and saved in their own files, for class-level SVM training. The results are shown in Table 2.

Table 2: SVM training sets for sections A-H

(9) Train SVMs on each of the training-set files built above to construct the classifiers. The training data is scaled to the range 0-1 with the scaling tool bundled with libsvm; in this example some of its entry parameters were modified so that the scaled result can be saved to a file. From the scaled training sets, libsvm generates the models, producing 9 SVM classifier models in total. These classifiers are organized as a tree: the root node is the section-level SVM classifier model, and the other 8 SVM models hang directly beneath it as child nodes. The results are shown in Table 3.
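The scaling in item (9) is a per-feature min-max scaling to the range 0-1. The sketch below shows the idea on sparse vectors; it is not the libsvm tool itself. It assumes that absent features count as 0 (so that zeros stay zero and sparsity is preserved), and the names `fit_ranges` and `scale_row` are hypothetical.

```python
def fit_ranges(rows):
    """Collect per-feature minimum and maximum over sparse rows {index: value}.

    Absent features are treated as 0, so every range includes 0.
    """
    lo, hi = {}, {}
    for row in rows:
        for index, value in row.items():
            lo[index] = min(lo.get(index, 0.0), value)
            hi[index] = max(hi.get(index, 0.0), value)
    return lo, hi


def scale_row(row, lo, hi):
    """Scale one sparse row to [0, 1] using ranges learned from the training set."""
    scaled = {}
    for index, value in row.items():
        low = lo.get(index, 0.0)
        span = hi.get(index, 0.0) - low
        # Features never seen in training (span 0) are mapped to 0.
        scaled[index] = (value - low) / span if span > 0 else 0.0
    return scaled
```

The ranges should be fitted on the training set only and then reused unchanged for the patents to be classified, so that both are scaled consistently.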

Table 3: Trained models for sections A-H

(10) Preprocess the patent to be classified and perform feature selection, vectorization, and data format conversion. First, feed the converted data of the patent into the section-level SVM classifier model to predict its section. Then feed the data into the SVM classifier model of that section to predict its class.

(11) The category feature vector of each IPC subclass was computed in (5). According to the class assigned to the patent in (10), compute the cosine similarity between the vector of the patent to be classified and the category feature vectors of all subclasses under that class (the subclass feature similarity). Then compute the cosine similarity between the vector of the patent to be classified and the vectors of all training patent texts under that class. Normalize these similarity values and sort them. When the maximum similarity is below the threshold, the patent is considered misclassified and classification ends; the threshold is set by the user. If it is above the threshold, select the 30 training patents most similar to the patent to be classified and, according to the subclasses of these training patents, compute the weighted sum with the subclass feature similarity using formula (III). Sort the results and assign the patent to be classified to the category with the highest summed similarity.

Claims

1. An automatic patent classification method based on patent document abstracts, comprising dictionary construction, generation of category feature vectors for each IPC level, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified, characterized in that these steps include:

Step 4: generate the category feature vectors for each IPC level;

Step 5: patent text feature selection. Use the chi-square feature selection method to compute the feature value of every word, sort the words, and select K feature words (K is given by the user) to represent the patent text;

Step 7: vectorize the patent text, which includes:

Step 8: build the section-level SVM training set, which includes:

Step 8.3: convert the vector set into the following training-set format:

Step 10: build the class-level SVM training sets, which includes:

Step 10.3: convert the vector set into the following training-set format:

Step 12.2: perform feature selection and vectorization on the patent;

Step 12.3: use the tree-structured hierarchical classifier model obtained above to make the category decision and determine the class to which the patent to be classified belongs;

Step 13: classify the patent at the subclass level, using a KNN algorithm based on IPC classification knowledge to identify the category.

The automatic patent classification method based on patent document abstracts according to claim 1, characterized in that the specific steps of step 4 include:

Step 4.3: merge all descriptions under the same subclass, perform feature selection, and construct the category feature vectors for the IPC class level, represented as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;

The automatic patent classification method based on patent document abstracts according to claim 1, characterized in that the specific steps of step 13 include:

Step 13.1: compute the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and the feature vector of each subclass I in the class it was assigned to;

Step 13.4: select the K training patents with the highest similarity and, according to their subclasses, combine their similarities with the subclass similarities from step 13.1 in a weighted sum, computed as:

$$S_{i} = S_{ai} + p \times \sum_{B_{j} \in I} S_{bj} \qquad \text{(III)}$$

Step 13.5: assign the patent to be classified to the subclass with the largest similarity S_i.