CN105808524A - Patent document abstract-based automatic patent classification method - Google Patents
Patent document abstract-based automatic patent classification method
- Publication number
- CN105808524A
- Authority
- CN
- China
- Prior art keywords
- ipc
- classification
- vector
- big class
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic patent classification method based on patent document abstracts. The method comprises the following steps: dictionary construction, generation of category feature vectors for each level of the IPC, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified. In patent text feature selection, the information contained in patent titles and abstracts is fully used, which markedly reduces the computation needed for feature selection. In patent text vectorization, the TF-IDF calculation is improved by adding part-of-speech weights and position weights for words, which further improves classification. In patent classification, a hierarchical method is adopted that combines the strengths of the SVM classifier and the KNN classifier, overcoming the drawbacks that SVM requires training too many classifiers and that KNN is computationally expensive.
Description
Technical field
The invention belongs to the field of computer analysis of patent documents, and specifically relates to a patent classification method that uses patent abstracts.
Background art
Patent documents are the collective term for the various specifications published by national patent offices and international patent organizations, and they form the main body of patent literature. According to statistics from the World Intellectual Property Organization (WIPO), 70% to 90% of the world's inventions first appear in patent documents rather than in other media such as journals and papers. In addition, to protect their own interests, enterprises file patent applications as early as possible, so patents tend to concentrate the most active and advanced technology and contain 90% to 95% of the world's technical information. Because they must support examination, patent documents are usually written in considerable detail and provide more information than other kinds of material. They are among the most common records of technical innovation and document the complete course of patenting activity. They reflect not only the current state of technical activity in each technical field but also the development history of technical activity in a particular field. Patent documents contain the concrete technical solution of every invention for which a patent is sought, which makes them very valuable for enterprise innovation: they let an enterprise follow the latest research trends, avoid duplicated research, and save research time and funding, and they can also inspire researchers, raise the starting point of innovation, draw on earlier inventions, and greatly shorten research schedules.
As new research results and inventions keep emerging in China, the number of patents is growing rapidly, with close to 1,000,000 patent applications each year. As of 29 December 2015, China had published more than 5 million invention patents, of which 1,866,100 had been granted. If the average size of each patent is 2 MB, the volume of patent data reaches hundreds of TB. To manage these patent documents scientifically, and to retrieve them quickly and conveniently, classifying patent documents is particularly important. At present, most countries classify patent documents with the International Patent Classification (IPC). The IPC has five levels: section (Section), class (Class), subclass (Subclass), main group (Main Group), and subgroup (Group). The section is the highest level in the classification table. By field, the IPC is divided into eight sections, labelled with the capital letters A to H. Each section contains several classes, a class is denoted by a two-digit number, and each class contains a varying number of subclasses. For example, G06F21/00 denotes Physics - Electric digital data processing - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity.
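The five-level structure described above can be made concrete with a small parser. This is only an illustrative sketch: the `IPCSymbol` type, the `parse_ipc` helper, and the regular expression are hypothetical names and choices introduced here, not part of the patent, and the pattern only covers full symbols of the form shown in the example (such as G06F21/00).

```python
import re
from typing import NamedTuple


class IPCSymbol(NamedTuple):
    section: str     # e.g. "G"
    ipc_class: str   # e.g. "G06"
    subclass: str    # e.g. "G06F"
    main_group: str  # e.g. "G06F21/00"
    symbol: str      # normalized full symbol, e.g. "G06F21/00"


# Section letter A-H, two-digit class, subclass letter, group number, "/", subgroup number.
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IPCSymbol:
    """Split a full IPC symbol into its hierarchy levels."""
    m = _IPC_RE.match(symbol.strip().upper())
    if m is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, cls, sub, group, subgroup = m.groups()
    subclass = f"{section}{cls}{sub}"
    return IPCSymbol(
        section=section,
        ipc_class=section + cls,
        subclass=subclass,
        main_group=f"{subclass}{group}/00",  # main groups end in /00
        symbol=f"{subclass}{group}/{subgroup}",
    )
```

For instance, `parse_ipc("G06F21/00")` yields section `G`, class `G06`, and subclass `G06F`, which are the three levels the method later predicts one after another.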
As can be seen, every published invention patent has a corresponding classification symbol, such as G06F21/00. For a patent application, the classification symbol is unknown and must be determined. The usual practice at present is to determine it from the technical field of the subject matter described or from the content of the patent. In general, this approach is narrow and depends on domain experts reading the application by hand. The sharp increase in patent documents raises the manpower and material resources that manual analysis requires, and the limits of any expert's knowledge make it hard to guarantee consistent and accurate classification. The present invention therefore proposes an automatic patent classification method based on patent document abstracts. The method uses the information in the abstracts of published invention patents to construct a classifier or classification function and uses it to determine the class of a patent to be classified, thereby classifying patents automatically.
Summary of the invention
The object of the present invention is to address the inability of existing patent classification methods to make full and effective use of the abstract information in published invention patents, by proposing an automatic patent classification method based on patent document abstracts. The method makes full use of the abstract information contained in published invention patents and their corresponding classes to construct a classifier or classification function, uses it to determine the class of a filed patent application, and proposes corresponding optimizations for the extraction and selection of abstract features and for the choice of classifier during construction.
The technical solution of the present invention is an automatic patent classification method based on patent document abstracts, comprising dictionary construction, generation of category feature vectors for each IPC level, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified, characterized in that these steps comprise:
Step 1: Preprocess the patent data, including extraction of each sample's IPC symbol, title, and abstract, Chinese word segmentation, part-of-speech tagging, stop-word removal, and word-frequency counting;
Step 2: Use regular-expression matching to filter out words that are not nouns, verbs, or adjectives, and generate an inverted index file that records each word's position and part-of-speech weight;
Step 3: Build the dictionary: all words are compiled into a single dictionary whose entries take the form "label term", for example: 1 China;
Step 4: Generate the category feature vectors for each IPC level, comprising:
Step 4.1: Merge the classification description of each subgroup into the classification description of its main group, then perform word segmentation and stop-word removal;
Step 4.2: Merge the descriptions of each main group, perform feature selection, and construct the category feature vectors of the IPC subclass level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
Step 4.3: Merge all descriptions under the same subclass, perform feature selection, and construct the category feature vectors of the IPC class level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
Step 4.4: Merge all descriptions under the same class, perform feature selection, and construct the category feature vectors of the IPC section level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last class in the IPC;
Step 5: Select patent text features: use chi-square feature selection to compute the feature value of every word, sort the words, and select K feature words (K is given by the user) to characterize the patent text;
Step 6: Compute the feature value of each word, using the inverted index file to compute each statistic;
Step 7: Vectorize the patent text, comprising:
Step 7.1: Compute the weight of each feature word; the computing formula is:
where the term-frequency factor denotes the frequency with which feature word t occurs in the text, N denotes the total number of texts in the full text collection, n denotes the number of texts in the collection in which term t occurs, C_t denotes the part-of-speech weight coefficient corresponding to the feature word's part of speech, and P_t denotes the position weight coefficient of the feature word;
Step 7.2: Sort the weights in descending order and construct the vector space model vector of the patent text, V_i(w_i1, w_i2, ..., w_in), which represents the content of each patent text;
Step 8: Build the section-level SVM training set, comprising:
Step 8.1: Merge the patent text feature vectors by IPC section;
Step 8.2: Add the IPC section-level category feature vectors to the vector set from Step 8.1;
Step 8.3: Convert the vector set into the following training set format:
where the first column is the IPC section symbol, lable is the term, and w is the feature word weight;
Step 9: Build the SVM-based classification model over sections A-H: train on the training set with the one-versus-one SVM multiclass method to generate the section-level classifier model;
Step 10: Build the class-level SVM training sets, comprising:
Step 10.1: Merge the patent text feature vectors by IPC class;
Step 10.2: Add the IPC class-level category feature vectors to the vector set from Step 10.1;
Step 10.3: Convert the vector set into the following training set format:
where the first column is the IPC class symbol, lable is the term, and w is the feature word weight;
Step 11: Build the SVM-based classification models within each section: divide the class-level training set by sections A-H and train on the class training set of each section separately, again with the one-versus-one SVM multiclass method, to generate the class-level classifier models;
Step 12: Decide the section and class of the patent to be classified, comprising:
Step 12.1: For the patent to be classified, extract the title and abstract, perform Chinese word segmentation and part-of-speech tagging, and remove stop words;
Step 12.2: Perform patent feature selection and vectorization;
Step 12.3: Use the previously obtained tree-structured hierarchical classifier model to make the classification decision and determine the class to which the patent to be classified belongs;
Step 13: Classify the patent at the subclass level, using a KNN algorithm based on IPC classification knowledge to identify the category, comprising:
Step 13.1: Compute the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and the feature vector of each subclass I in its class;
Step 13.2: Compute the cosine similarity S_bj between the patent to be classified, B_j, and each training patent in its class;
Step 13.3: Sort the training patents from Step 13.2 by similarity; if the highest similarity is below the threshold, the patent was assigned to the wrong class and is treated as misclassified; otherwise proceed to Step 13.4;
Step 13.4: Select the K training patents with the highest similarity and, according to their subclasses, compute a weighted sum with the subclass similarities from Step 13.1; the calculation is:
where p is the weight coefficient and S_i is the final similarity between the patent to be classified and each subclass;
Step 13.5: Assign the patent to be classified to the subclass with the largest similarity S_i.
The present invention makes full use of the abstract information contained in published invention patents and their corresponding classes to construct a classifier or classification function, uses it to determine the class of a filed patent application, and proposes corresponding optimizations for the extraction and selection of abstract features and for the choice of classifier during construction. This is reflected mainly in the following:
(1) Patent text feature selection
The patent title and abstract state the name and technical field of the patent and clearly reflect the technical problem it solves, its technical solution, and its main uses. The present invention makes full use of this information, which markedly reduces the computation needed for patent feature selection while maintaining good classification results. In addition, in patent text vectorization, the TF-IDF calculation is improved by adding the part-of-speech weight and position weight of each word, which further improves classification.
(2) Design of the patent classification method
Because the number of patent classes is large, many SVM classifiers must be trained, and patent classes at deep levels are highly similar to one another, so it is hard to find suitable SVM support vectors at the subclass level. KNN is an instance-based classification algorithm that needs no model training, but because a section or class contains too many instances, the similarity computation during classification is very expensive. The present invention therefore proposes a hierarchical classification method that exploits the respective strengths of the SVM and KNN classifiers and overcomes the drawbacks that SVM requires training too many classifiers and that KNN is computationally expensive.
Brief description of the drawings
Fig. 1 is the structural block diagram of the embodiment of the present invention.
Fig. 2 is the construction flow of the patent vector space in the embodiment of the present invention.
Fig. 3 is the flowchart of classification based on improved KNN in the embodiment of the present invention.
Detailed description of the invention
In this example, following the IPC category table, 1000 patent samples are drawn from each of sections A-H, sampled uniformly across the concrete classes at the same category level. The concrete steps are as follows:
(1) Obtain the IPC category descriptions, then perform word segmentation, part-of-speech tagging, and stop-word removal on them. The ICTCLAS word segmentation tool of the Chinese Academy of Sciences is used here. After the segmentation results are calibrated manually, a user dictionary is built.
(2) Convert the format of each patent sample, extract the title and abstract, and perform Chinese word segmentation and part-of-speech tagging on the title and abstract. Use regular expressions to remove stop words, function words, conjunctions, and other words of little use for patent classification, retaining only nouns, adjectives, and verbs.
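The regular-expression filtering in (2) might look like the sketch below. It assumes, purely for illustration, that the segmenter emits whitespace-separated `word/tag` tokens and that noun, verb, and adjective tags begin with `n`, `v`, and `a`; the token format, the `filter_tokens` helper, and the English placeholder words are assumptions, not details given in the patent.

```python
import re

# Assumed token format: "word/tag". Tags starting with n, v or a are treated as
# noun, verb and adjective; every other tag (particles, conjunctions, ...) is dropped.
_KEEP = re.compile(r"^(?P<word>[^/\s]+)/(?P<tag>[nva][a-z0-9]*)$")


def filter_tokens(tagged_text, stop_words=frozenset()):
    """Return (word, coarse_tag) pairs for nouns, verbs and adjectives only."""
    kept = []
    for token in tagged_text.split():
        m = _KEEP.match(token)
        if m and m.group("word") not in stop_words:
            # Keep only the first tag letter so sub-tags collapse to n / v / a.
            kept.append((m.group("word"), m.group("tag")[0]))
    return kept
```

Calling `filter_tokens("patent/n of/u classify/vn method/n and/c effective/a", stop_words={"method"})` keeps `patent`, `classify`, and `effective` and discards the particle, the conjunction, and the stop word.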
(3) Build the inverted index file. The index file structure comprises a vocabulary and event tables: each word corresponds to one event table, which stores the numbers of the patents in which the word occurs. The position weight uses the computing formula: where n denotes the total number of times the word occurs in the title and abstract, and l_i denotes the weight of the position of the word's i-th occurrence. In this example the title weight is 1, the first sentence of the abstract is 0.8, and other positions are 0.5. The part-of-speech weight is set to 2.5 for nouns and 1 for verbs and adjectives.
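A minimal in-memory version of the inverted index in (3) is sketched below. The weight tables use the values stated in the example, but the position-weight formula itself is not reproduced in this text, so the sketch assumes it is the mean of l_i over the n occurrences; that choice, the dictionary layout, and the function names are all assumptions.

```python
from collections import defaultdict

# Values taken from the example: title 1, first abstract sentence 0.8, elsewhere 0.5.
POSITION_WEIGHT = {"title": 1.0, "first_sentence": 0.8, "other": 0.5}
# Noun 2.5, verb and adjective 1.
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}


def build_inverted_index(docs):
    """docs maps patent number -> list of (word, pos_tag, position) tokens.

    Returns word -> {patent number -> {"tf", "position_sum", "pos_weight"}}.
    """
    index = defaultdict(dict)
    for patent_no, tokens in docs.items():
        for word, tag, position in tokens:
            entry = index[word].setdefault(
                patent_no,
                {"tf": 0, "position_sum": 0.0, "pos_weight": POS_WEIGHT[tag]},
            )
            entry["tf"] += 1
            entry["position_sum"] += POSITION_WEIGHT[position]
    return index


def position_weight(entry):
    """ASSUMED form: average of the per-occurrence position weights l_i."""
    return entry["position_sum"] / entry["tf"]
```

Storing the term frequency and both weights in each posting is what lets the later weighting step run from the index alone, without rereading the patent texts.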
(4) Using the inverted index file, build a dictionary from the words that remain after segmentation and filtering. Each word corresponds to one lable. The purpose is to prepare for the later SVM training. The result is shown in Table 1.
Table 1 User dictionary merged with the inverted index
(5) On the basis of (1), start from the group level and work upward, computing the weight of each word in the category at the corresponding level. The weight is computed in a way similar to TF-IDF, with each category description treated as one text, and the category feature vectors of each level are then built. For example, the vocabulary of all groups under subclass A01B is merged into one A01B word set, and the other subclasses under class A01 are handled likewise; the weight of each word in the A01B word set is then computed, and finally the feature vector of subclass A01B is constructed.
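The bottom-up merging in (5) can be sketched as follows. The patent only says the weight is "similar to TF-IDF", so the exact expression used here (relative frequency times a smoothed log inverse category frequency) is an assumption, as are the `category_vectors` helper and the toy descriptions.

```python
import math
from collections import Counter


def category_vectors(child_words, parent_of):
    """Merge child-category word lists into parent categories and weight the words.

    child_words: {child code: [word, ...]}, e.g. main groups with description words.
    parent_of:   callable mapping a child code to its parent code.
    """
    merged = {}
    for code, words in child_words.items():
        merged.setdefault(parent_of(code), Counter()).update(words)

    # Category frequency: in how many merged categories each word occurs.
    cf = Counter()
    for counts in merged.values():
        cf.update(counts.keys())

    n_categories = len(merged)
    vectors = {}
    for parent, counts in merged.items():
        total = sum(counts.values())
        vectors[parent] = {
            # ASSUMED TF-IDF-like weight; the +1 keeps weights of ubiquitous words positive.
            word: (count / total) * math.log((n_categories + 1) / cf[word])
            for word, count in counts.items()
        }
    return vectors
```

With main groups `A01B1/00`, `A01B3/00`, and `A01C1/00` and `parent_of = lambda code: code[:4]`, the function returns one vector for `A01B` and one for `A01C`; applying it again with a shorter prefix would move up one more level.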
(6) Because patent titles and abstracts are short texts that summarize the content and field of a patent at a high level, the words that best discriminate between classes do not necessarily occur most often, so this example uses chi-square feature selection. The chi-square calculation is not repeated here. The inverted index file is used to compute the statistics the method needs, A_ij, B_ij, C_ij, and D_ij, where A_ij is the number of documents that contain feature word t_i and belong to class c_j, B_ij is the number of documents that contain t_i and do not belong to c_j, C_ij is the number of documents that do not contain t_i and belong to c_j, and D_ij is the number of documents that neither contain t_i nor belong to c_j. These four statistics are used to compute the chi-square value between each word in a patent and the class; the larger the value, the stronger the correlation between word and class. This example sorts the chi-square values computed for the words in each patent and selects the 10 words with the largest chi-square values as feature words.
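Since the text leaves the chi-square calculation out, the sketch below uses the standard chi-square statistic for a 2x2 term/class contingency table, written in terms of the four counts A, B, C, and D defined in (6). Whether the patent uses exactly this variant is not stated; the function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/class contingency table.

    a: docs containing the word and in the class
    b: docs containing the word and not in the class
    c: docs without the word and in the class
    d: docs without the word and not in the class
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the word or the class never varies
    return n * (a * d - c * b) ** 2 / denom


def select_features(stats, k=10):
    """stats maps word -> (A, B, C, D); return the k words with the largest chi-square."""
    ranked = sorted(stats, key=lambda w: chi_square(*stats[w]), reverse=True)
    return ranked[:k]
```

A word that appears in every document of a class and nowhere else scores highest, while a word spread evenly across classes scores zero, which matches the stated goal of preferring discriminative words over merely frequent ones.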
(7) Using the inverted index file, compute the weight of each patent feature word. The inverted index file already records the word frequency, position weight, and part-of-speech weight of each feature word, so only the number of texts in which the feature word occurs needs to be counted. The total number of texts is also known, and the improved TF-IDF formula (II) is used to compute the weight. The weights are then normalized to reduce the computation of SVM training. Finally, the text feature vector of the patent is constructed.
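The weighting in (7) could be sketched as below. Formula (II) is not reproduced in this text, so the expression here is an assumed form: classic TF-IDF with a small additive constant inside the logarithm, multiplied by the part-of-speech coefficient C_t and the position coefficient P_t, followed by L2 normalization. The constant, the normalization choice, and the function names are all assumptions.

```python
import math


def improved_tfidf(tf, n_docs, doc_freq, pos_weight, position_weight):
    """ASSUMED form of the improved weight: tf * log(N/n + 0.01) * C_t * P_t."""
    return tf * math.log(n_docs / doc_freq + 0.01) * pos_weight * position_weight


def vectorize(features, n_docs):
    """features maps lable -> (tf, doc_freq, C_t, P_t); returns L2-normalized weights."""
    raw = {
        lable: improved_tfidf(tf, n_docs, df, c_t, p_t)
        for lable, (tf, df, c_t, p_t) in features.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {lable: 0.0 for lable in raw}
    return {lable: w / norm for lable, w in raw.items()}
```

Under this form, a rare noun in the title outweighs a common verb in the body of the abstract even at equal frequency, which is the effect the part-of-speech and position weights are meant to have.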
(8) Build the SVM training sets. The present invention uses the libsvm library, so the training set must be constructed in the prescribed format, which is described in detail in the Summary of the invention. The patent text feature vectors are saved by level in separate files. The formatted patent vectors of sections A-H are saved in the file svm.train for section-level SVM training. The formatted patent vectors under section A are saved in the file svma.train, and the patent vectors under the other sections are likewise formatted and saved in their own files, for class-level SVM training. The result is shown in Table 2.
Table 2 SVM training sets for sections A-H
(9) Run SVM training on each of the training set files built above to build the classifiers. The training data is scaled to the range 0-1 with the scaling tool bundled with libsvm; this example modifies some of its input parameters so that the scaled result can be saved to a file. libsvm is then run on the scaled training sets to generate models, producing 9 SVM classifier models in total. These classifiers are organized as a tree: the root node is the section-level SVM classifier model, and the other 8 SVM models hang directly under the section-level SVM model as child nodes. The result is shown in Table 3.
Table 3 Training models for sections A-H
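The 0-1 scaling in (9) amounts to per-feature min-max scaling. The sketch below is a plain-Python stand-in for the bundled scaling tool, not a reproduction of it: the function names are hypothetical, absent features are treated as 0, and clamping unseen values into [0, 1] is a choice made here.

```python
def fit_ranges(vectors):
    """Return {feature: (min, max)} over sparse training vectors (absent = 0)."""
    keys = {k for v in vectors for k in v}
    return {
        k: (min(v.get(k, 0.0) for v in vectors), max(v.get(k, 0.0) for v in vectors))
        for k in keys
    }


def scale(vector, ranges):
    """Scale one sparse vector to [0, 1] using ranges learned from the training set."""
    scaled = {}
    for k, (lo, hi) in ranges.items():
        if hi == lo:
            continue  # constant feature carries no information
        x = (vector.get(k, 0.0) - lo) / (hi - lo)
        x = min(1.0, max(0.0, x))  # clamp values outside the training range
        if x != 0.0:
            scaled[k] = x  # keep the result sparse
    return scaled
```

The ranges are fitted on the training set only and reused unchanged for the patent to be classified, so that training and prediction see features on the same scale.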
(10) Preprocess the patent to be classified, then perform feature selection, vectorization, and data format conversion. First, input the converted data of the patent to be classified into the section-level SVM classifier model to predict the section it belongs to. Then input the data into the SVM classifier model of that section to predict the class it belongs to.
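The two-stage prediction in (10), together with the one-versus-one strategy named in the method steps, can be sketched with plain callables standing in for trained models. The class name, the voting helper, and the toy decision rules are hypothetical; a real system would wrap trained SVM models instead.

```python
from collections import Counter


def one_vs_one_vote(pairwise, vector):
    """One-versus-one multiclass decision: each pairwise decider votes for one label.

    pairwise maps a pair of labels to a callable that returns one label of that pair.
    """
    votes = Counter(decide(vector) for decide in pairwise.values())
    return votes.most_common(1)[0][0]


class HierarchicalClassifier:
    """Root model picks the section; the child model for that section picks the class."""

    def __init__(self, section_model, class_models):
        self.section_model = section_model  # callable: vector -> section letter
        self.class_models = class_models    # {section letter: callable vector -> class}

    def predict(self, vector):
        section = self.section_model(vector)
        return section, self.class_models[section](vector)


# Toy stand-ins for trained models (hypothetical decision rules, three sections only).
pairwise = {
    ("A", "G"): lambda v: "G" if v.get(1, 0.0) > 0.5 else "A",
    ("A", "H"): lambda v: "A",
    ("G", "H"): lambda v: "G",
}
clf = HierarchicalClassifier(
    section_model=lambda v: one_vs_one_vote(pairwise, v),
    class_models={"A": lambda v: "A01", "G": lambda v: "G06", "H": lambda v: "H04"},
)
```

With all eight sections the one-versus-one scheme needs 8 x 7 / 2 = 28 pairwise deciders at the root, and only one of the eight child models is consulted per patent.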
(11) The category feature vector of each IPC subclass was already computed in (5). According to the class assigned to the patent to be classified in (10), compute the cosine similarity between the vector of the patent to be classified and the category feature vectors of all subclasses under that class (the subclass feature similarity). Then compute the cosine similarity between the vector of the patent to be classified and the text vectors of all training patents under that class. Normalize these similarity values and sort them. When the maximum similarity is below the threshold, the patent to be classified is considered misclassified and classification ends; the threshold is set by the user. If it is above the threshold, select the 30 training patents most similar to the patent to be classified and, according to the subclasses of these training patents, use formula (III) to compute a weighted sum with the subclass feature similarities. The results are sorted, and the patent to be classified is assigned to the class with the highest summed similarity.
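The subclass decision in (11) can be sketched as follows. Formula (III) is not reproduced in this text, so the combination used here, p times the subclass feature similarity plus (1 - p) times the summed similarity of the neighbours in that subclass, is an assumed form. The default values of `p` and `threshold`, the omission of the normalization step, and the function names are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def classify_subclass(query, subclass_vectors, training, k=30, p=0.5, threshold=0.1):
    """Return the best subclass code, or None if the patent looks misclassified.

    subclass_vectors: {subclass code: category feature vector}
    training:         list of (subclass code, patent vector) within the predicted class
    """
    s_a = {code: cosine(query, vec) for code, vec in subclass_vectors.items()}
    s_b = sorted(((cosine(query, vec), code) for code, vec in training), reverse=True)
    if not s_b or s_b[0][0] < threshold:
        return None  # highest similarity below threshold: wrong class upstream

    neighbour_sum = {}
    for sim, code in s_b[:k]:
        neighbour_sum[code] = neighbour_sum.get(code, 0.0) + sim

    # ASSUMED weighted sum standing in for formula (III).
    final = {
        code: p * s_a.get(code, 0.0) + (1 - p) * neighbour_sum.get(code, 0.0)
        for code in set(s_a) | set(neighbour_sum)
    }
    return max(final, key=final.get)
```

Because the neighbour search is restricted to training patents inside the class already chosen by the SVM stage, the similarity computation covers only a small part of the corpus, which is the saving the hierarchical design aims for.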
Claims (3)
1. An automatic patent classification method based on patent document abstracts, comprising dictionary construction, generation of category feature vectors for each IPC level, patent text feature selection, patent text vectorization, SVM-based classification model construction, and classification of the patent to be classified, characterized in that these steps comprise:
Step 1: Preprocess the patent data, including extraction of each sample's IPC symbol, title, and abstract, Chinese word segmentation, part-of-speech tagging, stop-word removal, and word-frequency counting;
Step 2: Use regular-expression matching to filter out words that are not nouns, verbs, or adjectives, and generate an inverted index file that records each word's position and part-of-speech weight;
Step 3: Build the dictionary: all words are compiled into a single dictionary whose entries take the form "label term", for example: 1 China;
Step 4: Generate the category feature vectors for each IPC level;
Step 5: Select patent text features: use chi-square feature selection to compute the feature value of every word, sort the words, and select K feature words (K is given by the user) to characterize the patent text;
Step 6: Compute the feature value of each word, using the inverted index file to compute each statistic;
Step 7: Vectorize the patent text, comprising:
Step 7.1: Compute the weight of each feature word; the computing formula is:
where the term-frequency factor denotes the frequency with which feature word t occurs in the text, N denotes the total number of texts in the full text collection, n denotes the number of texts in the collection in which term t occurs, C_t denotes the part-of-speech weight coefficient corresponding to the feature word's part of speech, and P_t denotes the position weight coefficient of the feature word;
Step 7.2: Sort the weights in descending order and construct the vector space model vector of the patent text, V_i(w_i1, w_i2, ..., w_in), which represents the content of each patent text;
Step 8: Build the section-level SVM training set, comprising:
Step 8.1: Merge the patent text feature vectors by IPC section;
Step 8.2: Add the IPC section-level category feature vectors to the vector set from Step 8.1;
Step 8.3: Convert the vector set into the following training set format:
where the first column is the IPC section symbol, lable is the term, and w is the feature word weight;
Step 9: Build the SVM-based classification model over sections A-H: train on the training set with the one-versus-one SVM multiclass method to generate the section-level classifier model;
Step 10: Build the class-level SVM training sets, comprising:
Step 10.1: Merge the patent text feature vectors by IPC class;
Step 10.2: Add the IPC class-level category feature vectors to the vector set from Step 10.1;
Step 10.3: Convert the vector set into the following training set format:
where the first column is the IPC class symbol, lable is the term, and w is the feature word weight;
Step 11: Build the SVM-based classification models within each section: divide the class-level training set by sections A-H and train on the class training set of each section separately, again with the one-versus-one SVM multiclass method, to generate the class-level classifier models;
Step 12: Decide the section and class of the patent to be classified, comprising:
Step 12.1: For the patent to be classified, extract the title and abstract, perform Chinese word segmentation and part-of-speech tagging, and remove stop words;
Step 12.2: Perform patent feature selection and vectorization;
Step 12.3: Use the previously obtained tree-structured hierarchical classifier model to make the classification decision and determine the class to which the patent to be classified belongs;
Step 13: Classify the patent at the subclass level, using a KNN algorithm based on IPC classification knowledge to identify the category.
2. The automatic patent classification method based on patent document abstracts according to claim 1, characterized in that Step 4 specifically comprises:
Step 4.1: Merge the classification description of each subgroup into the classification description of its main group, then perform word segmentation and stop-word removal;
Step 4.2: Merge the descriptions of each main group, perform feature selection, and construct the category feature vectors of the IPC subclass level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
Step 4.3: Merge all descriptions under the same subclass, perform feature selection, and construct the category feature vectors of the IPC class level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
Step 4.4: Merge all descriptions under the same class, perform feature selection, and construct the category feature vectors of the IPC section level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last class in the IPC.
The automatic patent classification method based on patent document abstracts according to claim 1, characterized in that step 13 specifically comprises:
Step 13.1: calculate the cosine similarity S_ai between the feature vector of the patent to be classified B_j and the feature vector of each subclass i in the class to which it has been assigned;
Step 13.2: calculate the cosine similarity S_bj between the patent to be classified B_j and each training patent in the class to which it has been assigned;
Step 13.3: sort the training patents from step 13.2 by similarity; if the highest similarity is below the threshold, the patent was assigned to the wrong class and is marked as misclassified; otherwise proceed to step 13.4;
Step 13.4: select the K training patents with the highest similarity and, according to their subclasses, compute a weighted sum with the subclass similarities from step 13.1, calculated as:
where p is the weight coefficient and S_i is the final similarity between the patent to be classified and each subclass;
Step 13.5: assign the patent to be classified to the subclass with the largest similarity S_i.
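Steps 13.1 to 13.5 can be sketched as follows. The weighted-sum formula of step 13.4 is not reproduced in this text, so the combination used here, S_i = p * S_ai + (1 - p) * (sum of the similarities of the top-K neighbours labelled with subclass i), is only an assumption chosen for illustration. The function names, the sparse dictionary vectors and the default values of k, p and threshold are likewise hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def classify_within_class(patent_vec, subclass_vecs, training,
                          k=3, p=0.5, threshold=0.1):
    """Return the chosen subclass, or None if the patent is flagged
    as misclassified at the class level.

    subclass_vecs: {subclass label: feature vector}
    training: list of (subclass label, feature vector) pairs
    """
    # Step 13.1: similarity to each subclass vector (S_ai).
    s_a = {label: cosine(patent_vec, vec)
           for label, vec in subclass_vecs.items()}
    # Steps 13.2 and 13.3: similarity to each training patent (S_bj),
    # sorted from most to least similar.
    s_b = sorted(((cosine(patent_vec, vec), label)
                  for label, vec in training), reverse=True)
    if not s_b or s_b[0][0] < threshold:
        return None  # best match too weak: treated as a wrong class
    # Step 13.4: accumulate the top-K neighbour similarities per subclass.
    neighbour = defaultdict(float)
    for sim, label in s_b[:k]:
        neighbour[label] += sim
    # ASSUMPTION: this linear combination is illustrative only; the
    # formula from the claim is missing in the source text.
    s_final = {label: p * s_a[label] + (1 - p) * neighbour[label]
               for label in s_a}
    # Step 13.5: pick the subclass with the largest final similarity.
    return max(s_final, key=s_final.get)
```

Combining the centroid similarity with nearest-neighbour evidence lets a few close training patents override a weak subclass vector. The threshold check in step 13.3 acts as a guard against errors made at the class level.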
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610139838.8A CN105808524A (en) | 2016-03-11 | 2016-03-11 | Patent document abstract-based automatic patent classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105808524A (en) | 2016-07-27 |
Family
ID=56467234
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610139838.8A Pending CN105808524A (en) | 2016-03-11 | 2016-03-11 | Patent document abstract-based automatic patent classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105808524A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446230A (en) * | 2016-10-08 | 2017-02-22 | 国云科技股份有限公司 | Method for optimizing word classification in machine learning text |
CN106708926A (en) * | 2016-11-14 | 2017-05-24 | 北京赛思信安技术股份有限公司 | Realization method for analysis model supporting massive long text data classification |
CN106951466A (en) * | 2017-03-01 | 2017-07-14 | 常州大学怀德学院 | Field text feature and system based on KNN SVM |
CN107122350A (en) * | 2017-04-27 | 2017-09-01 | 北京易麦克科技有限公司 | A kind of feature extraction system and method for many paragraph texts |
CN107122382A (en) * | 2017-02-16 | 2017-09-01 | 江苏大学 | A kind of patent classification method based on specification |
CN107145476A (en) * | 2017-05-23 | 2017-09-08 | 福建师范大学 | One kind is based on improvement TF IDF keyword extraction algorithms |
CN107577794A (en) * | 2017-09-19 | 2018-01-12 | 北京神州泰岳软件股份有限公司 | A kind of news category method and device |
CN107894986A (en) * | 2017-09-26 | 2018-04-10 | 北京纳人网络科技有限公司 | A kind of business connection division methods, server and client based on vectorization |
CN108228648A (en) * | 2016-12-21 | 2018-06-29 | 伊姆西Ip控股有限责任公司 | The method and apparatus for creating index |
CN108415959A (en) * | 2018-02-06 | 2018-08-17 | 北京捷通华声科技股份有限公司 | A kind of file classification method and device |
CN108536677A (en) * | 2018-04-09 | 2018-09-14 | 北京信息科技大学 | A kind of patent text similarity calculating method |
CN109190001A (en) * | 2018-09-19 | 2019-01-11 | 广东电网有限责任公司 | office document management method |
CN109271523A (en) * | 2018-11-23 | 2019-01-25 | 中电科大数据研究院有限公司 | A kind of government document subject classification method based on information retrieval |
CN109377104A (en) * | 2018-12-04 | 2019-02-22 | 泰康保险集团股份有限公司 | Data processing method, device, medium and electronic equipment |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | A kind of article classification method based on improved TF-IDF |
CN110020420A (en) * | 2018-01-10 | 2019-07-16 | 腾讯科技(深圳)有限公司 | Text handling method, device, computer equipment and storage medium |
CN110069627A (en) * | 2017-11-20 | 2019-07-30 | 中国移动通信集团上海有限公司 | Classification method, device, electronic equipment and the storage medium of short text |
CN110362673A (en) * | 2019-07-17 | 2019-10-22 | 福州大学 | Computer vision class papers contents method of discrimination and system based on abstract semantic analysis |
CN111126062A (en) * | 2019-12-25 | 2020-05-08 | 北京中技华软科技服务有限公司 | Automatic industry classification method and system |
CN111210023A (en) * | 2020-01-13 | 2020-05-29 | 哈尔滨工业大学 | Automatic selection system and method for data set classification learning algorithm |
WO2020207431A1 (en) * | 2019-04-12 | 2020-10-15 | 智慧芽信息科技(苏州)有限公司 | Document classification method, apparatus and device, and storage medium |
CN112016830A (en) * | 2020-08-27 | 2020-12-01 | 广东电网有限责任公司 | Patent file evaluation task allocation method and device |
CN112733549A (en) * | 2020-12-31 | 2021-04-30 | 厦门智融合科技有限公司 | Patent value information analysis method and device based on multiple semantic fusion |
CN113378950A (en) * | 2021-06-22 | 2021-09-10 | 深圳市查策网络信息技术有限公司 | Unsupervised classification method for long texts |
CN114579749A (en) * | 2022-03-11 | 2022-06-03 | 北京中知智慧科技有限公司 | CPC classification method and device for patent data |
CN114706974A (en) * | 2021-09-18 | 2022-07-05 | 北京墨丘科技有限公司 | Technical problem information mining method and device and storage medium |
CN116975068A (en) * | 2023-09-25 | 2023-10-31 | 中国标准化研究院 | Metadata-based patent document data storage method, device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996178A (en) * | 2009-08-11 | 2011-03-30 | 上海汉光知识产权数据科技有限公司 | IPC (International Patent Classification) analysis method of patent documentations |
CN104199809A (en) * | 2014-04-24 | 2014-12-10 | 江苏大学 | Semantic representation method for patent text vectors |
Non-Patent Citations (3)
Title |
---|
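NEIDE DE OLIVEIRA GOMES et al.: "Text Categorization Study Case: Patents’ Application Documents", 2011 6TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS * |
Liu Yuqin et al.: "Automatic patent classification method based on IPC knowledge structure", Computer Engineering * |
Cai Yunlei et al.: "Research on patent text classification technology based on latent semantic analysis", China Master's Theses Full-text Database, Information Science and Technology * |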
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446230A (en) * | 2016-10-08 | 2017-02-22 | 国云科技股份有限公司 | Method for optimizing word classification in machine learning text |
CN106708926A (en) * | 2016-11-14 | 2017-05-24 | 北京赛思信安技术股份有限公司 | Realization method for analysis model supporting massive long text data classification |
CN106708926B (en) * | 2016-11-14 | 2020-10-30 | 北京赛思信安技术股份有限公司 | Implementation method of analysis model supporting massive long text data classification |
CN108228648A (en) * | 2016-12-21 | 2018-06-29 | 伊姆西Ip控股有限责任公司 | The method and apparatus for creating index |
US11429648B2 (en) | 2016-12-21 | 2022-08-30 | EMC IP Holding Company LLC | Method and device for creating an index |
CN107122382A (en) * | 2017-02-16 | 2017-09-01 | 江苏大学 | A kind of patent classification method based on specification |
CN107122382B (en) * | 2017-02-16 | 2021-03-23 | 江苏大学 | Patent classification method based on specification |
CN106951466A (en) * | 2017-03-01 | 2017-07-14 | 常州大学怀德学院 | Field text feature and system based on KNN SVM |
CN107122350A (en) * | 2017-04-27 | 2017-09-01 | 北京易麦克科技有限公司 | A kind of feature extraction system and method for many paragraph texts |
CN107122350B (en) * | 2017-04-27 | 2021-02-05 | 北京易麦克科技有限公司 | Method of multi-paragraph text feature extraction system |
CN107145476A (en) * | 2017-05-23 | 2017-09-08 | 福建师范大学 | One kind is based on improvement TF IDF keyword extraction algorithms |
CN107577794A (en) * | 2017-09-19 | 2018-01-12 | 北京神州泰岳软件股份有限公司 | A kind of news category method and device |
CN107577794B (en) * | 2017-09-19 | 2019-07-05 | 北京神州泰岳软件股份有限公司 | A kind of news category method and device |
CN107894986A (en) * | 2017-09-26 | 2018-04-10 | 北京纳人网络科技有限公司 | A kind of business connection division methods, server and client based on vectorization |
CN107894986B (en) * | 2017-09-26 | 2021-03-30 | 北京纳人网络科技有限公司 | Enterprise relation division method based on vectorization, server and client |
CN110069627A (en) * | 2017-11-20 | 2019-07-30 | 中国移动通信集团上海有限公司 | Classification method, device, electronic equipment and the storage medium of short text |
CN110020420A (en) * | 2018-01-10 | 2019-07-16 | 腾讯科技(深圳)有限公司 | Text handling method, device, computer equipment and storage medium |
CN110020420B (en) * | 2018-01-10 | 2023-07-21 | 腾讯科技(深圳)有限公司 | Text processing method, device, computer equipment and storage medium |
CN108415959A (en) * | 2018-02-06 | 2018-08-17 | 北京捷通华声科技股份有限公司 | A kind of file classification method and device |
CN108536677A (en) * | 2018-04-09 | 2018-09-14 | 北京信息科技大学 | A kind of patent text similarity calculating method |
CN109190001A (en) * | 2018-09-19 | 2019-01-11 | 广东电网有限责任公司 | office document management method |
CN109543037A (en) * | 2018-11-21 | 2019-03-29 | 南京安讯科技有限责任公司 | A kind of article classification method based on improved TF-IDF |
CN109271523A (en) * | 2018-11-23 | 2019-01-25 | 中电科大数据研究院有限公司 | A kind of government document subject classification method based on information retrieval |
CN109377104A (en) * | 2018-12-04 | 2019-02-22 | 泰康保险集团股份有限公司 | Data processing method, device, medium and electronic equipment |
WO2020207431A1 (en) * | 2019-04-12 | 2020-10-15 | 智慧芽信息科技(苏州)有限公司 | Document classification method, apparatus and device, and storage medium |
CN110362673A (en) * | 2019-07-17 | 2019-10-22 | 福州大学 | Computer vision class papers contents method of discrimination and system based on abstract semantic analysis |
CN110362673B (en) * | 2019-07-17 | 2022-07-08 | 福州大学 | Computer vision type thesis content distinguishing method and system based on abstract semantic analysis |
WO2021128521A1 (en) | 2019-12-25 | 2021-07-01 | 北京中技华软科技服务有限公司 | Automatic industry classification method and system |
CN111126062B (en) * | 2019-12-25 | 2021-09-24 | 北京中技华软科技服务有限公司 | Automatic industry classification method and system |
CN111126062A (en) * | 2019-12-25 | 2020-05-08 | 北京中技华软科技服务有限公司 | Automatic industry classification method and system |
CN111210023B (en) * | 2020-01-13 | 2023-04-11 | 哈尔滨工业大学 | Automatic selection system and method for data set classification learning algorithm |
CN111210023A (en) * | 2020-01-13 | 2020-05-29 | 哈尔滨工业大学 | Automatic selection system and method for data set classification learning algorithm |
CN112016830A (en) * | 2020-08-27 | 2020-12-01 | 广东电网有限责任公司 | Patent file evaluation task allocation method and device |
CN112733549A (en) * | 2020-12-31 | 2021-04-30 | 厦门智融合科技有限公司 | Patent value information analysis method and device based on multiple semantic fusion |
CN112733549B (en) * | 2020-12-31 | 2024-03-01 | 厦门智融合科技有限公司 | Patent value information analysis method and device based on multiple semantic fusion |
CN113378950A (en) * | 2021-06-22 | 2021-09-10 | 深圳市查策网络信息技术有限公司 | Unsupervised classification method for long texts |
CN114706974A (en) * | 2021-09-18 | 2022-07-05 | 北京墨丘科技有限公司 | Technical problem information mining method and device and storage medium |
CN114579749A (en) * | 2022-03-11 | 2022-06-03 | 北京中知智慧科技有限公司 | CPC classification method and device for patent data |
CN116975068A (en) * | 2023-09-25 | 2023-10-31 | 中国标准化研究院 | Metadata-based patent document data storage method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
CN111177374B (en) | Question-answer corpus emotion classification method and system based on active learning | |
CN110321925B (en) | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints | |
Syiam et al. | An intelligent system for Arabic text categorization | |
CN113011533A (en) | Text classification method and device, computer equipment and storage medium | |
CN105389379A (en) | Rubbish article classification method based on distributed feature representation of text | |
Noaman et al. | Naive Bayes classifier based Arabic document categorization | |
Odeh et al. | Arabic text categorization algorithm using vector evaluation method | |
CN108038099A (en) | Low frequency keyword recognition method based on term clustering | |
CN112417153A (en) | Text classification method and device, terminal equipment and readable storage medium | |
Basha et al. | A novel summarization-based approach for feature reduction enhancing text classification accuracy | |
CN114997288A (en) | Design resource association method | |
Ghalehtaki et al. | A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization | |
CN114579695A (en) | Event extraction method, device, equipment and storage medium | |
Rezaei et al. | Event detection in twitter by deep learning classification and multi label clustering virtual backbone formation | |
CN109635123A (en) | A kind of Chinese medicine text concept recognition methods of increment type | |
CN110674293B (en) | Text classification method based on semantic migration | |
Probierz et al. | Clustering of scientific articles using natural language processing | |
Izquierdo et al. | An empirical study on class-based word sense disambiguation | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
CN112765353B (en) | Scientific research text-based biomedical subject classification method and device | |
CN112541080B (en) | New media account label intelligent verification method based on deep learning | |
CN114707615A (en) | Ancient character similarity quantization method based on duration Chinese character knowledge graph | |
Thielmann et al. | Human in the loop: How to effectively create coherent topics by manually labeling only a few documents per class | |
Fan et al. | Mining images on semantics via statistical learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 20160727 |