CN105808524A - Patent document abstract-based automatic patent classification method - Google Patents

Patent document abstract-based automatic patent classification method

Info

Publication number
CN105808524A
Authority
CN
China
Prior art keywords
ipc
classification
vector
big class
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610139838.8A
Other languages
Chinese (zh)
Inventor
彭彦
朱玉全
李竞
何峰
余飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Changyuan Information Technology Co ltd
Original Assignee
Jiangsu Changyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Changyuan Information Technology Co ltd
Priority to CN201610139838.8A
Publication of CN105808524A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic patent classification method based on patent document abstracts. The method comprises the following steps: dictionary construction, generation of category feature vectors for each level of the IPC, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified. In patent text feature selection, the information contained in patent titles and abstracts is fully used, so the computation required for feature selection is markedly reduced. In patent text vectorization, the TF-IDF calculation is improved by adding part-of-speech weights and position weights for words, which further improves classification. In patent classification, a hierarchical method is adopted that exploits the strengths of both the SVM classifier and the KNN classifier, overcoming the drawbacks that the SVM approach requires training too many classifiers and that KNN requires a large amount of computation.

Description

An automatic patent classification method based on patent document abstracts
Technical Field
The invention belongs to the application of computer analysis technology to patent documents, and specifically relates to a patent classification method that uses patent abstracts.
Background
Patent documents are the collective term for the various specifications published by national patent offices and international patent organizations, and patent specifications form their main body. According to statistics from the World Intellectual Property Organization (WIPO), 70% to 90% of the world's inventions first appear in patent documents rather than in other media such as journals and papers. In addition, to protect their own interests, enterprises file patent applications as early as possible, so patents tend to concentrate the most active and advanced technology and contain 90% to 95% of the world's technical information. To facilitate examination, patent documents are usually written in considerable detail and provide more information than other kinds of material. They are one of the most common forms of technical innovation output and record the complete course of patenting activity, reflecting both the current state of technical activity in each field and its historical development within a particular field. Patent documents contain the concrete technical solution of each invention for which a patent is sought, which makes them very important for enterprise innovation: they let enterprises follow the latest research, avoid duplicated work, and save research time and funding, and they can also inspire researchers, raise the starting point of innovation, draw on earlier inventions, and greatly shorten research schedules.
With the continuous emergence of new research results and inventions in China, the number of patents is growing rapidly, with close to one million applications per year. As of December 29, 2015, China had published more than 5 million invention patents, of which 1,866,100 had been granted. If the average size of each patent is 2 MB, the patent data amounts to hundreds of TB. To manage these patent documents scientifically and to retrieve them quickly and conveniently, classifying them is particularly important. At present, most countries classify patent documents with the International Patent Classification (IPC). The IPC has five levels: section, class, subclass, main group, and subgroup. The section is the highest level of the classification scheme. There are eight sections, divided by field and each marked with one English letter from A to H. Each section contains multiple classes, whose symbols consist of two digits, and each class contains a varying number of subclasses. For example, G06F21/00 denotes: Physics; Electric digital data processing; Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity.
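The five-level structure of an IPC symbol described above can be illustrated with a short parser. This is a minimal sketch: the names `IPCSymbol` and `parse_ipc` and the regular expression are illustrative assumptions, not part of the patent, and the pattern only handles full symbols of the form shown in the example (such as G06F21/00).

```python
import re
from typing import NamedTuple


class IPCSymbol(NamedTuple):
    section: str     # one letter A-H, e.g. "G"
    ipc_class: str   # section plus two digits, e.g. "G06"
    subclass: str    # class plus one letter, e.g. "G06F"
    main_group: str  # subclass plus group number and "/00", e.g. "G06F21/00"
    subgroup: str    # the full symbol as given, e.g. "G06F21/31"


# Hypothetical pattern: section letter, two-digit class, subclass letter,
# main-group number, "/", subgroup number.
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IPCSymbol:
    """Split a full IPC symbol into its five hierarchical levels."""
    m = _IPC_RE.match(symbol.strip().upper())
    if m is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, digits, letter, group, rest = m.groups()
    subclass = f"{section}{digits}{letter}"
    return IPCSymbol(
        section=section,
        ipc_class=section + digits,
        subclass=subclass,
        main_group=f"{subclass}{group}/00",
        subgroup=f"{subclass}{group}/{rest}",
    )
```

A symbol that is itself a main group, such as G06F21/00, yields the same string for both `main_group` and `subgroup`.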
Thus every published invention patent has a corresponding classification symbol, such as G06F21/00. For a patent application, the classification symbol is unknown and must be determined. The usual practice today is to determine it from the technical field of the subject matter or from the content of the patent. Such classification is typically very fine-grained and relies on experts reading the application manually. The sharp increase in patent documents raises the labor and material cost of manual analysis, and the limits of an expert's own knowledge make it hard to guarantee consistent and accurate classification. The present invention therefore proposes an automatic patent classification method based on patent document abstracts. The method uses the information in the abstracts of published invention patents to construct a classifier or classification function, uses it to determine the category of a patent to be classified, and thereby classifies patents automatically.
Summary of the Invention
The object of the present invention is to address the failure of existing patent classification methods to make full and effective use of the information in the abstracts of published invention patents, by proposing an automatic patent classification method based on patent document abstracts. The method makes full use of the abstract information of published invention patents and their corresponding categories to construct a classifier or classification function, uses it to determine the category of a filed patent application, and proposes corresponding optimizations for abstract feature extraction and selection and for the choice of classifier during construction.
The technical solution of the invention is an automatic patent classification method based on patent document abstracts, comprising dictionary construction, generation of category feature vectors for each IPC level, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified, characterized in that these stages comprise the following steps:
Step 1: preprocess the patent data, including extraction of the IPC symbol, title, and abstract of each sample, Chinese word segmentation, part-of-speech tagging, stop-word removal, and word-frequency counting;
Step 2: using regular-expression matching, filter out words that are not nouns, verbs, or adjectives, and generate an inverted index file that records word positions and part-of-speech weights;
Step 3: build the dictionary. All words are assembled into one dictionary whose entries take the form "label item", for example: 1 China;
Step 4: generate the category feature vectors for each IPC level, comprising:
Step 4.1: merge the category description of each subgroup into the category description of its main group, then perform word segmentation and stop-word removal;
Step 4.2: perform feature selection on the merged description of each main group and construct the category feature vectors of the IPC subclass level, denoted {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
Step 4.3: merge all descriptions under the same subclass, perform feature selection, and construct the category feature vectors of the IPC class level, denoted {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
Step 4.4: merge all descriptions under the same class, perform feature selection, and construct the category feature vectors of the IPC section level, denoted {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last class in the IPC;
Step 5: patent text feature selection. Use the chi-square feature selection method to compute the feature value of each extracted word, sort the words, and select K feature words (K is given by the user) to represent the patent text;
Step 6: compute the feature value of each word, using the inverted index file to compute the required statistics;
Step 7: vectorize the patent text, comprising:
Step 7.1: compute the weight of each feature word with formula (II) (the formula itself is not reproduced in this text), in which the term-frequency factor is the frequency with which feature word t appears in the text, N is the total number of texts in the whole text collection, n is the number of texts in the collection in which term t appears, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word;
Step 7.2: sort the weights in descending order and construct the vector space model vector of the patent text, V_i(w_i1, w_i2, ..., w_in), which represents the content of each patent text;
Step 8: build the section-level SVM training set, comprising:
Step 8.1: group the patent text feature vectors by IPC section;
Step 8.2: add the IPC section-level category feature vectors to the vector set from step 8.1;
Step 8.3: convert the vector set into the training-set format (the format listing is not reproduced in this text), in which the first column is the IPC section symbol, lable is the term, and w is the feature-word weight;
Step 9: build the SVM-based classification model for sections A-H. Train on the training set with a multi-class SVM using the one-versus-one strategy to generate the section-level classifier model;
Step 10: build the class-level SVM training sets, comprising:
Step 10.1: group the patent text feature vectors by IPC class;
Step 10.2: add the IPC class-level category feature vectors to the vector set from step 10.1;
Step 10.3: convert the vector set into the training-set format (the format listing is not reproduced in this text), in which the first column is the IPC class symbol, lable is the term, and w is the feature-word weight;
Step 11: build the SVM-based classification models for the classes. Divide the class-level training sets by section A-H and train on the class training set of each section separately, again with a multi-class SVM using the one-versus-one strategy, to generate the class-level classifier models;
Step 12: decide the section and class of the patent to be classified, comprising:
Step 12.1: for the patent to be classified, extract the title and abstract and perform Chinese word segmentation, part-of-speech tagging, and stop-word removal;
Step 12.2: perform feature selection and vectorization on the patent;
Step 12.3: use the previously obtained tree-structured hierarchy of classifier models to make the classification decision and determine the class to which the patent to be classified belongs;
Step 13: classify the patent at the subclass level, using a KNN algorithm based on IPC category knowledge, comprising:
Step 13.1: compute the cosine similarity S_ai between the feature vector of the patent to be classified B_j and the feature vector of each subclass I in its class;
Step 13.2: compute the cosine similarity S_bj between the patent to be classified B_j and each training patent in its class;
Step 13.3: sort the training patents from step 13.2 by similarity. If the highest similarity is below the threshold, the patent was assigned to the wrong class and is treated as misclassified; otherwise proceed to step 13.4;
Step 13.4: select the K training patents with the highest similarity and, according to their subclasses, combine them in a weighted sum with the subclass similarities from step 13.1:

$$S_i = S_{ai} + p \times \sum_{B_j \in I} S_{bj} \qquad \text{(III)}$$

where p is a weight coefficient and S_i is the final similarity between the patent to be classified and each subclass;
Step 13.5: assign the patent to be classified to the subclass with the largest similarity S_i.
The invention makes full use of the abstract information of published invention patents and their corresponding categories to construct a classifier or classification function, uses it to determine the category of a filed patent application, and proposes corresponding optimizations for abstract feature extraction and selection and for the choice of classifier during construction. These are mainly reflected in the following:
(1) Patent text feature selection
The title and abstract state the name and technical field of a patent and clearly reflect the technical problem it solves, its technical solution, and its main uses. The invention makes full use of this information, which markedly reduces the computation needed for patent feature selection while giving good classification results. In addition, when vectorizing the patent text, the TF-IDF calculation is improved by adding the part-of-speech weight and position weight of each word, which further improves classification.
(2) Design of the patent classification method
Because the number of patent categories is large, a large number of SVM classifiers would have to be trained, and categories deep in the hierarchy are highly similar to one another, so it is hard to find suitable SVM support vectors at the subclass level. KNN, by contrast, is an instance-based classification algorithm that needs no trained model, but because a section or class contains so many instances, the similarity computation during classification is very expensive. The invention therefore proposes a hierarchical classification method that exploits the respective strengths of the SVM and KNN classifiers and overcomes the drawbacks of training too many SVM classifiers and of the heavy computation of the KNN algorithm.
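The two SVM levels of the hierarchical design can be pictured as a small tree of models: one root model chooses the section, and the model attached to that section chooses the class. The sketch below is only an illustration under stated assumptions. `HierarchicalClassifier` is a hypothetical name, and the models are passed in as plain callables so that any trained classifier (for example, a wrapper around a libsvm model) could be plugged in.

```python
from typing import Callable, Dict, Mapping, Tuple

Vector = Mapping[int, float]          # sparse vector: term label -> weight
Predictor = Callable[[Vector], str]   # any trained model wrapped as a function


class HierarchicalClassifier:
    """Root model predicts the section; a per-section model predicts the class."""

    def __init__(self, section_model: Predictor,
                 class_models: Dict[str, Predictor]) -> None:
        self.section_model = section_model
        self.class_models = class_models  # keyed by section letter "A".."H"

    def predict(self, x: Vector) -> Tuple[str, str]:
        section = self.section_model(x)
        # Only the one model under the predicted section is evaluated,
        # so a prediction costs two model calls, not one per class.
        ipc_class = self.class_models[section](x)
        return section, ipc_class
```

With eight sections this arrangement needs nine models in total, and the subclass level is left to the KNN step rather than to further SVM training.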
Brief Description of the Drawings
Fig. 1 is the structural block diagram of the embodiment of the invention.
Fig. 2 is the flow for constructing the patent vector space in the embodiment of the invention.
Fig. 3 is the flowchart of classification based on the improved KNN in the embodiment of the invention.
Detailed Description
In this example, 1000 patent samples are drawn from each of sections A-H according to the IPC category table, sampled uniformly over the specific categories within each level. The steps are as follows:
(1) Obtain the IPC category descriptions and perform word segmentation, part-of-speech tagging, and stop-word removal on them. The ICTCLAS word segmentation tool from the Chinese Academy of Sciences is used here; after the segmentation results are manually corrected, a user dictionary is built.
(2) For each patent sample, convert the format, extract the title and abstract, and perform Chinese word segmentation and part-of-speech tagging on them. Using regular expressions, remove stop words, function words, conjunctions, and other words of little use for patent classification, retaining only nouns, adjectives, and verbs.
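The regular-expression filtering in (2) can be sketched as follows. The sketch assumes the tagger output is a string of `word/tag` tokens separated by whitespace, with noun, verb, and adjective tags beginning with `n`, `v`, and `a`. That input format, the names `filter_tokens` and `STOP_WORDS`, the sample stop-word list, and the reduction of each tag to its first letter are all illustrative assumptions, not details given in the patent.

```python
import re

# A token is "word/tag"; the tag is letters optionally followed by digits.
TOKEN_RE = re.compile(r"(\S+?)/([A-Za-z][A-Za-z0-9]*)(?=\s|$)")
# Keep only tags that start with n (noun), v (verb) or a (adjective).
KEEP_RE = re.compile(r"^[nva]")
# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"是"}


def filter_tokens(tagged_text, stop_words=STOP_WORDS):
    """Return (word, coarse_tag) pairs for nouns, verbs and adjectives only."""
    kept = []
    for word, tag in TOKEN_RE.findall(tagged_text):
        if word in stop_words:
            continue
        if KEEP_RE.match(tag):
            kept.append((word, tag[0]))  # collapse e.g. "vn" to "v"
    return kept
```

Collapsing tags to one letter keeps the later part-of-speech weighting simple, at the cost of treating subtypes such as noun-verbs like ordinary verbs.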
(3) Build the inverted index file. The index structure comprises a vocabulary and posting lists: each word corresponds to one posting list, which stores the numbers of the patents in which the word occurs. The position weight is computed by a formula (not reproduced in this text) in which n is the total number of times the word occurs in the title and abstract and l_i is the weight of the position of its i-th occurrence. In this example the title has weight 1, the first sentence of the abstract 0.8, and all other positions 0.5. The part-of-speech weight is set to 2.5 for nouns and 1 for verbs and adjectives.
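The inverted index and the two weights in (3) might be organized as in the sketch below. The position-weight formula is missing from this text, so the sketch assumes the position weight is the mean of the l_i values over the n occurrences. That averaging, the function names, and the in-memory layout are assumptions made for illustration; only the numeric weights (1, 0.8, 0.5 and 2.5, 1, 1) come from the embodiment.

```python
from collections import defaultdict

# Weights stated in the embodiment.
POSITION_WEIGHT = {"title": 1.0, "first_sentence": 0.8, "other": 0.5}
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}


def build_index(patents):
    """Build word -> {patent_id -> posting} from tagged, located tokens.

    patents: {patent_id: [(word, coarse_tag, location), ...]} where
    location is "title", "first_sentence" or "other".
    """
    index = defaultdict(dict)
    for patent_id, tokens in patents.items():
        for word, tag, location in tokens:
            posting = index[word].setdefault(
                patent_id,
                {"tf": 0, "loc_sum": 0.0, "pos_weight": POS_WEIGHT[tag]},
            )
            posting["tf"] += 1                              # n: occurrences
            posting["loc_sum"] += POSITION_WEIGHT[location]  # sum of l_i
    return index


def position_weight(posting):
    """ASSUMPTION: position weight = (sum of l_i) / n."""
    return posting["loc_sum"] / posting["tf"]
```

Storing the running sum and the count in each posting lets the later weighting step read term frequency, position weight, and part-of-speech weight from the index without rescanning the text.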
(4) Using the inverted index file, build a dictionary from the words that remain after segmentation and filtering. Each word corresponds to one lable. The dictionary is built in preparation for the SVM training below; the result is shown in Table 1.
Table 1: Merged user dictionary and inverted index
(5) Building on (1), work bottom-up from the subgroup level and compute the weight of each word in the category at each level. The weights are computed in a manner similar to TF-IDF, treating each category description as one text, and the category feature vectors of each level are then constructed. For example, the words of all subgroups under subclass A01B are merged into one A01B vocabulary, and likewise for the other subclasses under class A01; the weight of each word in the A01B vocabulary is then computed, and finally the feature vector of subclass A01B is constructed.
(6) Patent titles and abstracts are short texts that summarize the content and field of the patent at a high level, and the words that best discriminate between categories do not necessarily occur most frequently, so this example uses chi-square feature selection. The chi-square computation itself is not repeated here. The inverted index file is used to compute the four statistics the method needs, A_ij, B_ij, C_ij, and D_ij, where A_ij is the number of documents that contain feature word t_i and belong to category c_j, B_ij is the number of documents that contain t_i but do not belong to c_j, C_ij is the number of documents that do not contain t_i but belong to c_j, and D_ij is the number of documents that neither contain t_i nor belong to c_j. From these four statistics the chi-square value between each word in a patent and each category is computed; the larger the value, the stronger the association between word and category. This example sorts the chi-square values computed for the words of each patent and selects the 10 words with the largest values as feature words.
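The text above omits the chi-square formula. The sketch below uses the standard chi-square statistic for a 2x2 contingency table built from A, B, C, and D, on the assumption that this is the form the embodiment intends. The function names and the input layout are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard 2x2 chi-square statistic for one (word, category) pair.

    a: documents containing the word and in the category
    b: documents containing the word, not in the category
    c: documents without the word, in the category
    d: documents without the word, not in the category
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the word or category never varies
    return n * (a * d - b * c) ** 2 / denominator


def top_features(counts, k=10):
    """Rank words by chi-square value and keep the k largest.

    counts: {word: (A, B, C, D)} for a single category.
    Ties are broken alphabetically so the result is deterministic.
    """
    scored = {word: chi_square(*abcd) for word, abcd in counts.items()}
    return sorted(scored, key=lambda w: (-scored[w], w))[:k]
```

The default `k=10` mirrors the 10 feature words chosen in this example; in the general method K is supplied by the user.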
(7) Using the inverted index file, compute the weight of each patent feature word. The inverted index file already records the term frequency, position weight, and part-of-speech weight of each feature word, so it only remains to count the number of texts in which the feature word occurs; the total number of texts is also known. The weight is computed with the improved TF-IDF formula (II) and then normalized to reduce the computation in SVM training. Finally, the text feature vector of the patent is constructed.
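Formula (II) is not reproduced in this text, so the following is only one plausible reading of the improved TF-IDF weight: the classic tf × log(N/n) product multiplied by the part-of-speech coefficient C_t and the position coefficient P_t, followed by L2 normalization. Both the exact form of the product and the choice of L2 normalization are assumptions, and the function names are hypothetical.

```python
import math


def improved_tfidf(tf, n_texts, texts_with_term, pos_weight, position_weight):
    """ASSUMED form of formula (II): tf * log(N / n) * C_t * P_t.

    tf: frequency of the feature word in the text
    n_texts: N, number of texts in the collection
    texts_with_term: n, number of texts containing the word (must be > 0)
    pos_weight: C_t, part-of-speech weight coefficient
    position_weight: P_t, position weight coefficient
    """
    return tf * math.log(n_texts / texts_with_term) * pos_weight * position_weight


def normalize(weights):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}
```

Under this reading a noun in the title receives 2.5 times the weight of a verb with the same counts, which is how the part-of-speech and position information enters the vector.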
(8) Build the SVM training sets. The invention uses the libsvm library, so the training sets must be constructed in the prescribed format, which is described in the Summary of the Invention. The patent text feature vectors are saved in separate files by level. The formatted patent vectors of sections A-H are saved in the file svm.train for section-level SVM training. The formatted patent vectors under section A are saved in the file svma.train, and the patent vectors under the other sections are likewise formatted and saved in their own files for class-level SVM training. The result is shown in Table 2.
Table 2: SVM training sets for sections A-H
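The training files described in (8) follow the sparse text format that libsvm reads, in which each line is a numeric class label followed by `index:value` pairs in ascending index order. The sketch below writes one such line. Mapping sections A-H to the integers 1-8 and using the dictionary lable as the feature index are assumptions made for illustration, as are the names `SECTION_LABEL` and `to_libsvm_line`.

```python
# ASSUMPTION: sections A-H are encoded as the integer labels 1-8.
SECTION_LABEL = {letter: i for i, letter in enumerate("ABCDEFGH", start=1)}


def to_libsvm_line(label, weights, dictionary):
    """Format one training example as a libsvm sparse line.

    label: integer class label
    weights: {word: weight} for the patent's feature words
    dictionary: {word: lable} with positive integer lables
    Words missing from the dictionary and zero weights are skipped.
    """
    pairs = sorted(
        (dictionary[word], value)
        for word, value in weights.items()
        if word in dictionary and value != 0
    )
    fields = [str(label)] + [f"{index}:{value:.6g}" for index, value in pairs]
    return " ".join(fields)
```

One file would hold such lines for all sections (the section-level set), and one further file per section would hold lines labelled by class.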
(9) Train an SVM on each of the training-set files built above to construct the classifiers. The training data are scaled to the range 0-1 with the scaling tool shipped with libsvm; in this example some of its entry parameters were changed so that the scaled result can be saved to a file. Models are then generated with libsvm from the scaled training sets, giving 9 SVM classifier models in total. The classifiers are organized as a tree: the root node is the section-level SVM classifier model, and the other 8 SVM models hang directly beneath it as child nodes. The result is shown in Table 3.
Table 3: Trained models for sections A-H
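The scaling to the range 0-1 mentioned in (9) is a per-feature min-max transformation. The sketch below is a plain-Python stand-in for that step, not the libsvm tool itself. It treats a feature absent from a sparse vector as 0 and reuses the ranges learned on the training set when scaling new data; the function names are hypothetical.

```python
def fit_ranges(vectors):
    """Learn the per-feature minimum and maximum from sparse training vectors.

    vectors: list of {feature_index: value}; absent features count as 0.
    """
    lo, hi = {}, {}
    features = {index for vector in vectors for index in vector}
    for index in features:
        values = [vector.get(index, 0.0) for vector in vectors]
        lo[index], hi[index] = min(values), max(values)
    return lo, hi


def scale(vector, lo, hi):
    """Map each feature to (x - min) / (max - min) using the learned ranges."""
    scaled = {}
    for index, value in vector.items():
        if index not in lo or hi[index] == lo[index]:
            continue  # unseen or constant feature carries no information
        scaled[index] = (value - lo[index]) / (hi[index] - lo[index])
    return scaled
```

The same ranges must be applied to the patent to be classified, otherwise its features would not be comparable with those the models were trained on.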
(10) Preprocess the patent to be classified and perform feature selection, vectorization, and data format conversion. First, feed the converted patent data into the section-level SVM classifier model to predict its section. Then feed the data into the SVM classifier model of that section to predict its class.
(11) The category feature vectors of the IPC subclasses were computed in (5). Based on the class assigned to the patent to be classified in (10), compute the cosine similarity between the vector of the patent to be classified and the category feature vector of every subclass under that class (the subclass feature similarity). Then compute the cosine similarity between the vector of the patent to be classified and the vector of every training patent text under that class. Normalize these similarity values and sort them. If the maximum similarity is below the threshold, the patent is considered misclassified and classification ends; the threshold is set by the user. If it is above the threshold, select the 30 training patents most similar to the patent to be classified and, according to the subclasses of these training patents, compute the weighted sum with the subclass feature similarity using formula (III). Sort the results and assign the patent to be classified to the category with the highest summed similarity.
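The subclass decision in (11) can be sketched as a KNN vote combined with the subclass feature similarity, following formula (III): S_i = S_ai + p × (sum of S_bj over the selected neighbours in subclass i). The default values of `p` and `threshold` below are placeholders, since the patent leaves both to the user, and the normalization of the similarity values mentioned in (11) is omitted for brevity. The function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors {index: weight}."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_subclass(x, subclass_vectors, training, k=30, p=0.5, threshold=0.1):
    """Return the best subclass for x, or None if x looks misclassified.

    x: sparse vector of the patent to be classified
    subclass_vectors: {subclass: category feature vector} within the class
    training: list of (subclass, vector) training patents within the class
    k: number of nearest training patents used (30 in the embodiment)
    p, threshold: user-set values; the defaults here are placeholders
    """
    s_a = {c: cosine(x, v) for c, v in subclass_vectors.items()}
    neighbours = sorted(((cosine(x, v), c) for c, v in training), reverse=True)
    if not neighbours or neighbours[0][0] < threshold:
        return None  # probably assigned to the wrong class upstream
    s_b = defaultdict(float)
    for similarity, subclass in neighbours[:k]:
        s_b[subclass] += similarity
    scores = {c: s_a[c] + p * s_b[c] for c in s_a}
    return max(scores, key=scores.get)
```

Because the neighbours are searched only inside the class already chosen by the SVM levels, the number of cosine computations stays far below that of a KNN search over the whole collection.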

Claims (3)

1. An automatic patent classification method based on patent document abstracts, comprising dictionary construction, generation of category feature vectors for each IPC level, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified, characterized in that these stages comprise the following steps:
Step 1: preprocess the patent data, including extraction of the IPC symbol, title, and abstract of each sample, Chinese word segmentation, part-of-speech tagging, stop-word removal, and word-frequency counting;
Step 2: using regular-expression matching, filter out words that are not nouns, verbs, or adjectives, and generate an inverted index file that records word positions and part-of-speech weights;
Step 3: build the dictionary. All words are assembled into one dictionary whose entries take the form "label item", for example: 1 China;
Step 4: generate the category feature vectors for each IPC level;
Step 5: patent text feature selection. Use the chi-square feature selection method to compute the feature value of each extracted word, sort the words, and select K feature words (K is given by the user) to represent the patent text;
Step 6: compute the feature value of each word, using the inverted index file to compute the required statistics;
Step 7: vectorize the patent text, comprising:
Step 7.1: compute the weight of each feature word with formula (II) (the formula itself is not reproduced in this text), in which the term-frequency factor is the frequency with which feature word t appears in the text, N is the total number of texts in the whole text collection, n is the number of texts in the collection in which term t appears, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word;
Step 7.2: sort the weights in descending order and construct the vector space model vector of the patent text, V_i(w_i1, w_i2, ..., w_in), which represents the content of each patent text;
Step 8: build the section-level SVM training set, comprising:
Step 8.1: group the patent text feature vectors by IPC section;
Step 8.2: add the IPC section-level category feature vectors to the vector set from step 8.1;
Step 8.3: convert the vector set into the training-set format (the format listing is not reproduced in this text), in which the first column is the IPC section symbol, lable is the term, and w is the feature-word weight;
Step 9: build the SVM-based classification model for sections A-H. Train on the training set with a multi-class SVM using the one-versus-one strategy to generate the section-level classifier model;
Step 10: build the class-level SVM training sets, comprising:
Step 10.1: group the patent text feature vectors by IPC class;
Step 10.2: add the IPC class-level category feature vectors to the vector set from step 10.1;
Step 10.3: convert the vector set into the training-set format (the format listing is not reproduced in this text), in which the first column is the IPC class symbol, lable is the term, and w is the feature-word weight;
Step 11: build the SVM-based classification models for the classes. Divide the class-level training sets by section A-H and train on the class training set of each section separately, again with a multi-class SVM using the one-versus-one strategy, to generate the class-level classifier models;
Step 12: decide the section and class of the patent to be classified, comprising:
Step 12.1: for the patent to be classified, extract the title and abstract and perform Chinese word segmentation, part-of-speech tagging, and stop-word removal;
Step 12.2: perform feature selection and vectorization on the patent;
Step 12.3: use the previously obtained tree-structured hierarchy of classifier models to make the classification decision and determine the class to which the patent to be classified belongs;
Step 13: classify the patent at the subclass level, using a KNN algorithm based on IPC category knowledge.
2. The automatic patent classification method based on patent document abstracts according to claim 1, characterized in that step 4 specifically comprises:
Step 4.1: merge the category description of each subgroup into the category description of its main group, then perform word segmentation and stop-word removal;
Step 4.2: perform feature selection on the merged description of each main group and construct the category feature vectors of the IPC subclass level, denoted {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
Step 4.3: merge all descriptions under the same subclass, perform feature selection, and construct the category feature vectors of the IPC class level, denoted {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
Step 4.4: merge all descriptions under the same class, perform feature selection, and construct the category feature vectors of the IPC section level, denoted {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last class in the IPC.
The automatic patent classification method based on patent document abstracts according to claim 1, characterized in that step 13 specifically comprises:
Step 13.1: compute the cosine similarity $S_{ai}$ between the feature vector of the patent to be classified, $B_j$, and the feature vector of each subclass $I$ in the class to which it belongs;
Step 13.2: compute the cosine similarity $S_{bj}$ between the patent to be classified, $B_j$, and each training patent in the class to which it belongs;
Step 13.3: sort the training patents from step 13.2 by similarity; if the highest similarity is below a threshold, the patent was assigned to the wrong class and is labelled as misclassified; otherwise proceed to step 13.4;
Step 13.4: select the K training patents with the highest similarity and, according to their subclasses, add their similarities as a weighted sum to the subclass similarities from step 13.1, computed as:
$$S_i = S_{ai} + p \times \sum_{B_j \in I} S_{bj} \qquad \text{(III)}$$
where $p$ is a weight coefficient and $S_i$ is the final similarity between the patent to be classified and each subclass;
Step 13.5: assign the patent to be classified to the subclass with the largest similarity $S_i$.
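Steps 13.1 to 13.5 can be sketched as follows. This is a minimal sketch under stated assumptions: the sum in formula III is read as running over the K selected training patents that belong to subclass I, the values of `k`, `p`, and `threshold` are arbitrary, ties are broken alphabetically, and the function names and toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_subclass(patent_vec, subclass_vecs, training, k=3, p=0.5, threshold=0.2):
    """Return the chosen subclass, or None if the patent looks misrouted.

    subclass_vecs: dict subclass -> category feature vector (IPC knowledge).
    training: list of (feature_vector, subclass) for patents in the same class.
    k, p, threshold: illustrative values, not taken from the source.
    """
    # Step 13.1: similarity to each subclass's category feature vector (S_ai).
    s_a = {sub: cosine(patent_vec, vec) for sub, vec in subclass_vecs.items()}

    # Steps 13.2 and 13.3: similarity to each training patent (S_bj), sorted.
    scored = sorted(((cosine(patent_vec, vec), sub) for vec, sub in training), reverse=True)
    if not scored or scored[0][0] < threshold:
        return None  # treated as assigned to the wrong class upstream

    # Step 13.4: sum the similarities of the K nearest patents per subclass.
    neighbour_sum = defaultdict(float)
    for sim, sub in scored[:k]:
        neighbour_sum[sub] += sim

    # Formula III: S_i = S_ai + p * (sum of S_bj over neighbours in subclass i).
    s = {sub: s_a[sub] + p * neighbour_sum[sub] for sub in s_a}

    # Step 13.5: pick the subclass with the largest final similarity.
    return max(sorted(s), key=s.get)


subclass_vecs = {"A01B": [1.0, 0.0, 0.0], "A01C": [0.0, 1.0, 0.0]}
training = [
    ([0.9, 0.1, 0.0], "A01B"),
    ([0.8, 0.3, 0.0], "A01B"),
    ([0.1, 0.9, 0.0], "A01C"),
    ([0.2, 0.8, 0.1], "A01C"),
]
print(classify_subclass([0.7, 0.2, 0.0], subclass_vecs, training))
print(classify_subclass([0.0, 0.0, 1.0], subclass_vecs, training))
```

The threshold check in step 13.3 acts as a rejection option: a patent whose nearest training neighbour is still dissimilar is flagged rather than forced into a subclass of a class it probably does not belong to.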
CN201610139838.8A 2016-03-11 2016-03-11 Patent document abstract-based automatic patent classification method Pending CN105808524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610139838.8A CN105808524A (en) 2016-03-11 2016-03-11 Patent document abstract-based automatic patent classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610139838.8A CN105808524A (en) 2016-03-11 2016-03-11 Patent document abstract-based automatic patent classification method

Publications (1)

Publication Number Publication Date
CN105808524A true CN105808524A (en) 2016-07-27

Family

ID=56467234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610139838.8A Pending CN105808524A (en) 2016-03-11 2016-03-11 Patent document abstract-based automatic patent classification method

Country Status (1)

Country Link
CN (1) CN105808524A (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446230A (en) * 2016-10-08 2017-02-22 国云科技股份有限公司 Method for optimizing word classification in machine learning text
CN106708926A (en) * 2016-11-14 2017-05-24 北京赛思信安技术股份有限公司 Realization method for analysis model supporting massive long text data classification
CN106951466A (en) * 2017-03-01 2017-07-14 常州大学怀德学院 Field text feature and system based on KNN SVM
CN107122350A (en) * 2017-04-27 2017-09-01 北京易麦克科技有限公司 A kind of feature extraction system and method for many paragraph texts
CN107122382A (en) * 2017-02-16 2017-09-01 江苏大学 A kind of patent classification method based on specification
CN107145476A (en) * 2017-05-23 2017-09-08 福建师范大学 One kind is based on improvement TF IDF keyword extraction algorithms
CN107577794A (en) * 2017-09-19 2018-01-12 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN108228648A (en) * 2016-12-21 2018-06-29 伊姆西Ip控股有限责任公司 The method and apparatus for creating index
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method
CN109190001A (en) * 2018-09-19 2019-01-11 广东电网有限责任公司 office document management method
CN109271523A (en) * 2018-11-23 2019-01-25 中电科大数据研究院有限公司 A kind of government document subject classification method based on information retrieval
CN109377104A (en) * 2018-12-04 2019-02-22 泰康保险集团股份有限公司 Data processing method, device, medium and electronic equipment
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN110020420A (en) * 2018-01-10 2019-07-16 腾讯科技(深圳)有限公司 Text handling method, device, computer equipment and storage medium
CN110069627A (en) * 2017-11-20 2019-07-30 中国移动通信集团上海有限公司 Classification method, device, electronic equipment and the storage medium of short text
CN110362673A (en) * 2019-07-17 2019-10-22 福州大学 Computer vision class papers contents method of discrimination and system based on abstract semantic analysis
CN111126062A (en) * 2019-12-25 2020-05-08 北京中技华软科技服务有限公司 Automatic industry classification method and system
CN111210023A (en) * 2020-01-13 2020-05-29 哈尔滨工业大学 Automatic selection system and method for data set classification learning algorithm
WO2020207431A1 (en) * 2019-04-12 2020-10-15 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus and device, and storage medium
CN112016830A (en) * 2020-08-27 2020-12-01 广东电网有限责任公司 Patent file evaluation task allocation method and device
CN112733549A (en) * 2020-12-31 2021-04-30 厦门智融合科技有限公司 Patent value information analysis method and device based on multiple semantic fusion
CN113378950A (en) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts
CN114579749A (en) * 2022-03-11 2022-06-03 北京中知智慧科技有限公司 CPC classification method and device for patent data
CN114706974A (en) * 2021-09-18 2022-07-05 北京墨丘科技有限公司 Technical problem information mining method and device and storage medium
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996178A (en) * 2009-08-11 2011-03-30 上海汉光知识产权数据科技有限公司 IPC (International Patent Classification) analysis method of patent documentations
CN104199809A (en) * 2014-04-24 2014-12-10 江苏大学 Semantic representation method for patent text vectors

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
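NEIDE DE OLIVEIRA GOMES et al.: "Text Categorization Study Case: Patents’ Application Documents", 《2011 6TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS》 *
Liu Yuqin (刘玉琴) et al.: "Automatic patent classification method based on IPC knowledge structure" (基于 IPC 知识结构的专利自动分类方法), Computer Engineering (《计算机工程》) *
Cai Yunlei (蔡云雷) et al.: "Research on patent text classification technology based on latent semantic analysis" (基于潜在语义分析的专利文本分类技术研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》) *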

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446230A (en) * 2016-10-08 2017-02-22 国云科技股份有限公司 Method for optimizing word classification in machine learning text
CN106708926A (en) * 2016-11-14 2017-05-24 北京赛思信安技术股份有限公司 Realization method for analysis model supporting massive long text data classification
CN106708926B (en) * 2016-11-14 2020-10-30 北京赛思信安技术股份有限公司 Implementation method of analysis model supporting massive long text data classification
CN108228648A (en) * 2016-12-21 2018-06-29 伊姆西Ip控股有限责任公司 The method and apparatus for creating index
US11429648B2 (en) 2016-12-21 2022-08-30 EMC IP Holding Company LLC Method and device for creating an index
CN107122382A (en) * 2017-02-16 2017-09-01 江苏大学 A kind of patent classification method based on specification
CN107122382B (en) * 2017-02-16 2021-03-23 江苏大学 Patent classification method based on specification
CN106951466A (en) * 2017-03-01 2017-07-14 常州大学怀德学院 Field text feature and system based on KNN SVM
CN107122350A (en) * 2017-04-27 2017-09-01 北京易麦克科技有限公司 A kind of feature extraction system and method for many paragraph texts
CN107122350B (en) * 2017-04-27 2021-02-05 北京易麦克科技有限公司 Method of multi-paragraph text feature extraction system
CN107145476A (en) * 2017-05-23 2017-09-08 福建师范大学 One kind is based on improvement TF IDF keyword extraction algorithms
CN107577794A (en) * 2017-09-19 2018-01-12 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN107577794B (en) * 2017-09-19 2019-07-05 北京神州泰岳软件股份有限公司 A kind of news category method and device
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN110069627A (en) * 2017-11-20 2019-07-30 中国移动通信集团上海有限公司 Classification method, device, electronic equipment and the storage medium of short text
CN110020420A (en) * 2018-01-10 2019-07-16 腾讯科技(深圳)有限公司 Text handling method, device, computer equipment and storage medium
CN110020420B (en) * 2018-01-10 2023-07-21 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method
CN109190001A (en) * 2018-09-19 2019-01-11 广东电网有限责任公司 office document management method
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN109271523A (en) * 2018-11-23 2019-01-25 中电科大数据研究院有限公司 A kind of government document subject classification method based on information retrieval
CN109377104A (en) * 2018-12-04 2019-02-22 泰康保险集团股份有限公司 Data processing method, device, medium and electronic equipment
WO2020207431A1 (en) * 2019-04-12 2020-10-15 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus and device, and storage medium
CN110362673A (en) * 2019-07-17 2019-10-22 福州大学 Computer vision class papers contents method of discrimination and system based on abstract semantic analysis
CN110362673B (en) * 2019-07-17 2022-07-08 福州大学 Computer vision type thesis content distinguishing method and system based on abstract semantic analysis
WO2021128521A1 (en) 2019-12-25 2021-07-01 北京中技华软科技服务有限公司 Automatic industry classification method and system
CN111126062B (en) * 2019-12-25 2021-09-24 北京中技华软科技服务有限公司 Automatic industry classification method and system
CN111126062A (en) * 2019-12-25 2020-05-08 北京中技华软科技服务有限公司 Automatic industry classification method and system
CN111210023B (en) * 2020-01-13 2023-04-11 哈尔滨工业大学 Automatic selection system and method for data set classification learning algorithm
CN111210023A (en) * 2020-01-13 2020-05-29 哈尔滨工业大学 Automatic selection system and method for data set classification learning algorithm
CN112016830A (en) * 2020-08-27 2020-12-01 广东电网有限责任公司 Patent file evaluation task allocation method and device
CN112733549A (en) * 2020-12-31 2021-04-30 厦门智融合科技有限公司 Patent value information analysis method and device based on multiple semantic fusion
CN112733549B (en) * 2020-12-31 2024-03-01 厦门智融合科技有限公司 Patent value information analysis method and device based on multiple semantic fusion
CN113378950A (en) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts
CN114706974A (en) * 2021-09-18 2022-07-05 北京墨丘科技有限公司 Technical problem information mining method and device and storage medium
CN114579749A (en) * 2022-03-11 2022-06-03 北京中知智慧科技有限公司 CPC classification method and device for patent data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Similar Documents

Publication Publication Date Title
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN111177374B (en) Question-answer corpus emotion classification method and system based on active learning
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
Syiam et al. An intelligent system for Arabic text categorization
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
Noaman et al. Naive Bayes classifier based Arabic document categorization
Odeh et al. Arabic text categorization algorithm using vector evaluation method
CN108038099A (en) Low frequency keyword recognition method based on term clustering
CN112417153A (en) Text classification method and device, terminal equipment and readable storage medium
Basha et al. A novel summarization-based approach for feature reduction enhancing text classification accuracy
CN114997288A (en) Design resource association method
Ghalehtaki et al. A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization
CN114579695A (en) Event extraction method, device, equipment and storage medium
Rezaei et al. Event detection in twitter by deep learning classification and multi label clustering virtual backbone formation
CN109635123A (en) A kind of Chinese medicine text concept recognition methods of increment type
CN110674293B (en) Text classification method based on semantic migration
Probierz et al. Clustering of scientific articles using natural language processing
Izquierdo et al. An empirical study on class-based word sense disambiguation
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN112765353B (en) Scientific research text-based biomedical subject classification method and device
CN112541080B (en) New media account label intelligent verification method based on deep learning
CN114707615A (en) Ancient character similarity quantization method based on duration Chinese character knowledge graph
Thielmann et al. Human in the loop: How to effectively create coherent topics by manually labeling only a few documents per class
Fan et al. Mining images on semantics via statistical learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160727