CN107122382B - Patent classification method based on specification - Google Patents
Patent classification method based on specification
- Publication number: CN107122382B
- Application number: CN201710082677.8A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING; G06F — ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30 — Information retrieval of unstructured textual data
- G06F16/35 — Clustering; Classification
- G06F18/00 — Pattern recognition
- G06F18/24 — Classification techniques
- G06F18/2413 — Classification techniques based on distances to training or reference patterns
- G06F18/24147 — Distances to closest patterns, e.g. nearest neighbour classification
- G06F40/00 — Handling natural language data
- G06F40/20 — Natural language analysis
- G06F40/279 — Recognition of textual entities
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a specification-based patent classification method belonging to the field of text processing and data mining. The method first preprocesses the patent specification; then constructs an inverted index file and selects feature words with a feature selection method combining information gain and word frequency; next calculates feature-word weights with an improved TF-IDF formula and constructs patent feature vectors; then constructs the neighborhood sets of the training patents; and finally classifies patents with an optimized KNN classifier. The research provides a new approach to classifying patent documents and lays a foundation for further work such as intelligent retrieval of patent documents.
Description
Technical Field
The invention belongs to the application of computer analysis technology in patent documents, and particularly relates to a patent classification method using a patent specification.
Background
A patent is a concrete expression of technical innovation and enterprise value, and one of the important carriers, achievements, and sources of knowledge development and innovation; many inventive results appear only in patent documents. According to statistics of the World Intellectual Property Organization (WIPO), 70%-90% of the world's inventive results first appear in patent documents rather than in other carriers such as magazines and papers. In addition, to protect their interests, enterprises apply for patents as early as possible, so the most active and advanced technologies are often concentrated in patents, which contain 90%-95% of the world's technical information. At the same time, to facilitate examination, patent documents are written in considerable detail; compared with other types of data they provide more information, are the most common record of technical innovation, and document the complete course of patent activity. Patent documents reflect not only the current state of technical activity in the various technical fields but also the development history of a specific field. Each one contains the specific technical solution of the applied-for invention, which plays an important role in enterprise innovation: an enterprise can learn the latest research trends, avoid duplicated research, and save research time and funds, while the documents can also inspire the innovative thinking of researchers, raise the starting point of innovation, and, by reference to existing inventions, greatly shorten research work.
With the continuous emergence of new research results and inventions in China, the number of patents has grown rapidly. By October 5, 2016, more than 5.98 million invention patents had been published in China, of which 2.2385 million had been granted. If each patent averages 2 MB, the patent data run to a very large volume. To manage these patent documents scientifically and to retrieve relevant documents quickly and conveniently, classification of patent documents is essential. At present, most countries classify patent documents with the International Patent Classification (IPC), which has five hierarchical levels: section, class, subclass, main group, and subgroup. The section is the highest level of the classification table; there are eight sections divided by field, each marked with a single letter, A through H. Each section contains a number of classes, each denoted by the section letter followed by two digits, and the classes are further subdivided into subclasses, main groups, and subgroups. For example, G06F21/00 denotes security arrangements for protecting computers, their components, programs, or data against unauthorized activity, under G06F (electric digital data processing).
It can be seen that every published (or to-be-published) invention patent must be assigned one or more corresponding classification numbers; for example, the classification number of the invention patent "A method for protecting private data in association rule mining" is G06F 21/00. For a patent application about to be submitted, however, the classification number is unknown and must be determined. Current practice is to determine it from the field or content of the described object, relying on experts who manually read the application. With the rapid growth of patent applications (nearly one million per year), this approach consumes a great deal of manpower and material resources, and the limits of any individual expert's knowledge make it difficult to guarantee the consistency and accuracy of the classification results. The invention therefore provides a patent classification method based on patent specifications, which uses the information in published specifications to construct a classifier or classification function and determine the class of an applied-for patent, thereby realizing automatic patent classification.
Disclosure of Invention
The invention aims to provide a patent classification method based on patent specifications, addressing the problem that existing patent classification methods cannot fully and effectively use the specification information in published invention patents. The method makes full use of the specifications and corresponding classes of published invention patents to construct classifiers or classification functions that determine the classes of submitted patent applications, and provides corresponding optimizations for specification feature extraction and selection, classifier determination, and other aspects of the construction process.
The technical scheme adopted by the invention is as follows: the patent classification method based on the patent document specification mainly comprises the following steps:
(1) Patent data preprocessing
Collect patent sample data, sample by IPC class, extract the specifications, and perform Chinese word segmentation and part-of-speech tagging. Remove symbols and numbers from the specification (specifications contain a large number of paragraph numbers). Use regular-expression matching to filter out stop words, function words, conjunctions, and other words that are useless for patent classification, keeping only content words such as nouns, adjectives, and verbs.
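As an illustration of the preprocessing step above, the sketch below filters a list of already-segmented, POS-tagged tokens. The tag set ("n"/"v"/"a"), the tiny stop-word list, and the regular expression are illustrative assumptions, not the patent's actual resources (the patent does not name its segmenter or dictionaries).

```python
import re

# Hedged sketch of step (1). Tokens are assumed to be already segmented and
# POS-tagged (jieba-style tags are an assumption): "n" noun, "v" verb,
# "a" adjective, others dropped.
KEEP_POS = {"n", "a", "v"}            # nouns, adjectives, verbs only
STOPWORDS = {"的", "了", "和", "在"}   # tiny illustrative stop-word list

def preprocess(tagged_tokens):
    """Drop symbols/numbers (e.g. paragraph numbers), stop words and
    non-content parts of speech, keeping only candidate feature words."""
    kept = []
    for word, pos in tagged_tokens:
        if re.fullmatch(r"[\d\W_]+", word):   # pure digits/punctuation
            continue
        if word in STOPWORDS or pos not in KEEP_POS:
            continue
        kept.append(word)
    return kept

tokens = [("0012", "m"), ("专利", "n"), ("的", "u"),
          ("分类", "v"), ("方法", "n"), ("，", "x")]
print(preprocess(tokens))   # content words only
```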
(2) Building inverted index files
Count the word frequency, position information, part-of-speech weight, and inter-class distribution information of each word, and construct an inverted index file from these statistics and the patent text information.
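A minimal sketch of the inverted index described in step (2), assuming a postings structure of word → patent id → term frequency and positions; the position and part-of-speech weights the patent also stores per posting are omitted here for brevity.

```python
from collections import defaultdict

# Hedged sketch of step (2): an inverted index mapping each word to a
# posting list keyed by patent id, recording term frequency and positions.
def build_inverted_index(docs):
    """docs: {patent_id: [token, ...]} ->
       {word: {patent_id: {"tf": int, "pos": [token positions]}}}"""
    index = defaultdict(dict)
    for pid, tokens in docs.items():
        for i, word in enumerate(tokens):
            entry = index[word].setdefault(pid, {"tf": 0, "pos": []})
            entry["tf"] += 1
            entry["pos"].append(i)
    return index

docs = {"CN1": ["patent", "classify", "patent"], "CN2": ["classify", "index"]}
idx = build_inverted_index(docs)
print(idx["patent"]["CN1"]["tf"])   # 2
```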
(3) Patent text feature selection
Calculate the feature values of the words from step (2) with a feature selection method combining information gain and word frequency, sort the feature values, and select a certain number of feature words to represent the patent text.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j. The feature value is calculated as shown in equation (1).
Here TF represents the degree to which the word frequency in a patent influences patent feature selection. Let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2).
IC in equation (1) represents the degree of dispersion of a feature word across the categories: the more dispersed the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and TF_avg(t_i) the average frequency of t_i over all classes; the calculation is shown in equation (3).
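Equations (1)-(3) are images that are not reproduced in this text. The sketch below therefore shows only the standard information-gain component that can be computed from the document counts A_ij, B_ij, C_ij, D_ij defined above; the patent's actual feature value additionally combines this with the TF and IC terms it describes.

```python
import math

# Standard information gain for one (term, class) pair from the four
# document counts. This is NOT the patent's full feature value, which
# also multiplies in the TF and IC terms whose exact formulas (images)
# are not reproduced here.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(a, b, c, d):
    """a: has term & in class, b: has term & not in class,
    c: no term & in class, d: no term & not in class."""
    n = a + b + c + d
    p_class = (a + c) / n                      # class prior
    h_prior = entropy([p_class, 1 - p_class])
    p_term = (a + b) / n                       # P(term present)
    h_term = entropy([a / (a + b), b / (a + b)]) if a + b else 0.0
    h_no_term = entropy([c / (c + d), d / (c + d)]) if c + d else 0.0
    return h_prior - (p_term * h_term + (1 - p_term) * h_no_term)

# A perfectly class-predictive term recovers the full class entropy:
print(info_gain(50, 0, 0, 50))   # 1.0 bit
```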
(4) Patent text vectorization
The method comprises the following steps:
① Weight calculation, as shown in equation (4).
Here TF_t(d) denotes the frequency of feature word t in text d, N the number of patents in the whole patent sample set, n the number of patents in the sample set that contain feature word t, C_t the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t the position weight coefficient of the feature word.
② Sorting: sort by weight in descending order and construct the space model vector V_i = (w_i1, w_i2, ..., w_in) of the patent text; the content of every patent text is represented in this way.
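Equation (4) is likewise an image. Based on the symbols defined above (term frequency, N, n, C_t, P_t), a plausible multiplicative TF-IDF variant is sketched below; the exact form of the patent's improved formula is an assumption here.

```python
import math

# Assumed form of the improved TF-IDF of equation (4): term frequency times
# a smoothed inverse document frequency, scaled by the part-of-speech
# coefficient C_t and the position coefficient P_t defined in the text.
def improved_tfidf(tf, n_total, n_with_t, c_t, p_t):
    return tf * math.log((n_total + 1) / (n_with_t + 1)) * c_t * p_t

# A noun (C_t = 2.5 per the embodiment) seen 4 times, appearing in the
# technical-field section (P_t = 1.0), in a 1000-patent sample set:
w = improved_tfidf(tf=4, n_total=1000, n_with_t=50, c_t=2.5, p_t=1.0)
print(w > 0)
```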
(5) Generating IPC hierarchy class feature vectors
The method comprises the following steps:
Firstly, merge the category description of each subgroup into the category description of its main group, then perform word segmentation and stop-word removal.
Secondly, merge the descriptions of each main group, perform feature selection, and construct the class feature vectors of the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
Thirdly, merge all basic descriptions under the same subclass, perform feature selection, and construct the class feature vectors of the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
Fourthly, merge all basic descriptions under the same class, perform feature selection, and construct the class feature vectors of the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
(6) Constructing a neighborhood of patent samples
The method comprises the following steps:
① Calculate the similarity between patents in the patent training set. The similarity is obtained by computing the cosine of the angle between the vectors. Let sim(d_i, d_j) denote the similarity of patent texts d_i and d_j; the calculation formula is shown in equation (5).
Here W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors, and n is the dimension of the vectors.
② Sort all other patent samples d_j in descending order of their similarity to d_i and select the first K samples to form a set D_i, called the neighborhood of patent d_i; the value of K depends on the case.
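The cosine similarity of equation (5) and the K-nearest neighborhood D_i described above can be sketched as:

```python
import math

# Sketch of step (6): cosine similarity between patent weight vectors
# (equation (5)) and the K-nearest neighborhood of a training patent.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def neighborhood(i, vectors, k):
    """Indices of the k training patents most similar to patent i."""
    sims = [(j, cosine(vectors[i], vectors[j]))
            for j in range(len(vectors)) if j != i]
    sims.sort(key=lambda x: x[1], reverse=True)
    return [j for j, _ in sims[:k]]

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(neighborhood(0, vecs, 1))   # the near-parallel vector ranks first
```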
(7) Similarity calculation of patents to be classified
The method comprises the following steps:
① Perform specification extraction, Chinese word segmentation, part-of-speech tagging, and stop-word removal on the patent to be classified.
② Perform patent feature selection and vectorization.
③ Calculate the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector.
④ Calculate the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set.
⑤ Sort the training patents in descending order of the similarity value S_bj and select the top K patents as the neighborhood set.
(8) Classification decision
The method comprises the following steps:
① Calculate the size L(B_j, d_i) of the shared neighborhood between the patent B_j to be classified and each sample patent d_i, i.e., the number of patents that appear in both neighborhood sets.
② Calculate the final weighted similarity between the patent to be classified and each IPC class; the calculation formula is shown in equation (6).
Here i denotes the category, and p, k, α, and β are adjustable parameters; by default p = 0.8, k = 0.95, α = 0.6, and β = 5.
③ Assign the patent to be classified to the class with the maximum similarity S(i).
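Equation (6) is an image not reproduced here, so the following is explicitly not the patented formula: it is a generic sketch of the described decision, combining the candidate's similarity to each IPC class vector with a shared-neighborhood-weighted vote of its nearest training patents and picking the class with the largest score. The mixing weight alpha = 0.6 reuses the default above; everything else is invented for illustration.

```python
# NOT the patent's equation (6): a generic illustration of the decision.
# class_sim: {ipc: S_ai} similarity to IPC class vectors;
# neighbors: [(ipc, S_bj)] the K nearest training patents and similarities;
# shared: [L(B_j, d_i)] shared-neighborhood sizes, aligned with neighbors.
def classify(class_sim, neighbors, shared, alpha=0.6):
    votes = {}
    for (ipc, s), l in zip(neighbors, shared):
        # neighbors with a larger shared neighborhood count for more
        votes[ipc] = votes.get(ipc, 0.0) + s * (1 + l)
    total = sum(votes.values()) or 1.0
    score = {ipc: alpha * class_sim.get(ipc, 0.0)
                  + (1 - alpha) * votes.get(ipc, 0.0) / total
             for ipc in set(class_sim) | set(votes)}
    return max(score, key=score.get)

result = classify({"G06F": 0.7, "A01B": 0.2},
                  [("G06F", 0.9), ("G06F", 0.8), ("A01B", 0.3)],
                  [5, 4, 0])
print(result)   # G06F
```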
The main beneficial effects of the invention are as follows:
(1) patent text feature selection aspect
Compared with the title and abstract of a patent, the specification is richer in content and carries far more information. At the same time, however, the specification contains a large amount of noisy data; in particular, at the IPC subclass level, different patents share a good deal of similar information, which hinders classification. The invention therefore improves the feature extraction and feature vectorization of the patent specification, reducing noise interference and improving classification precision.
(2) Design aspect of patent classification method
Because the volume of patent data is huge and the number of patent categories is extremely large, conventional classifiers suffer from problems such as excessively slow model training, which makes them clearly unsuitable for patent classification. The invention therefore proposes a new nearest-neighbor classification algorithm that incorporates IPC description information into the classification process, further improving the accuracy of patent classification while maintaining classification speed.
Drawings
FIG. 1 is a block diagram of the structure in the embodiment of the present invention
FIG. 2 is a flow chart of constructing a patent vector space according to an embodiment of the present invention
FIG. 3 is a classification flow chart based on improved KNN in the embodiment of the present invention
Detailed Description
Taking patent documents as an example, the patent classification method of the invention is described in detail below; the specific implementation process is as follows:
step 1: and acquiring data of a patent text, and performing text preprocessing on the patent specification, wherein the text preprocessing mainly comprises word segmentation and word stop.
Firstly, obtain the IPC class descriptions, perform word segmentation and part-of-speech tagging on them, remove stop words, manually correct the segmentation results, and then construct a user dictionary.
Secondly, perform format conversion and specification extraction on the collected patent samples, add the user dictionary constructed in the first step to the word segmentation program, and then perform Chinese word segmentation and part-of-speech tagging on the specifications.
Thirdly, use regular expressions to remove stop words, function words, conjunctions, and other words in the patent specification that are useless for patent classification, keeping only nouns, adjectives, and verbs.
Step 2: count the word frequency, position information, part-of-speech weight, and inter-class distribution information of each word, and construct an inverted index file from these statistics and the patent text information.
Construct the inverted index file from the words retained in step 1. The index file consists of a vocabulary and posting lists; each vocabulary entry corresponds to a posting list that stores, per patent number, the word frequency, position weight, and part-of-speech weight of the word in that patent document. The position weight of a word that occurs n times is (l_1 + l_2 + ... + l_n) / n, where l_i is the weight of the position of the i-th occurrence; in this example the technical-field weight is 1, the background weight is 0.8, and all other positions are 0.5. The part-of-speech weight is set to 2.5 for nouns and 1 for both verbs and adjectives. Specific results are shown in Table 1.
TABLE 1 user dictionary and inverted index merging
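The position and part-of-speech weights of step 2 can be sketched as follows, using the embodiment's values (technical field 1, background 0.8, other positions 0.5; noun 2.5, verb and adjective 1). Treating the position weight as the average over occurrences is an assumption, since the formula image is not reproduced.

```python
# Hedged sketch of the step-2 weights. Section names are illustrative.
SECTION_WEIGHT = {"technical_field": 1.0, "background": 0.8}  # else 0.5
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}  # noun / verb / adjective

def position_weight(occurrence_sections):
    """Average position weight over a word's occurrences (assumed form)."""
    ls = [SECTION_WEIGHT.get(sec, 0.5) for sec in occurrence_sections]
    return sum(ls) / len(ls)

# A noun appearing once in the technical field and once in the claims:
print(position_weight(["technical_field", "claims"]), POS_WEIGHT["n"])
```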
Step 3: calculate the feature values of the words with a feature selection method combining information gain and word frequency, sort the feature values, and select a certain number of feature words to represent the patent text.
Information gain handles low-frequency words poorly, while applicants usually repeat certain distinctive words to emphasize an innovation point, and such high-frequency words are beneficial to classification. The invention therefore adopts a feature selection method combining information gain and word frequency: first calculate the feature value of each word in each patent according to equation (1), then sort the words in descending order of feature value, and select the first 20 words as the feature words of the patent.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j. The feature value is calculated as shown in equation (1).
Here TF represents the degree to which the word frequency in a patent influences patent feature selection. Let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2).
IC in equation (1) represents the degree of dispersion of a feature word across the categories: the more dispersed the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and TF_avg(t_i) the average frequency of t_i over all classes; the calculation is shown in equation (3).
Step 4: calculate the weight of each patent feature word with the improved TF-IDF formula, using the statistics in the inverted index file, and finally construct the patent feature vectors.
The method specifically comprises the following steps:
① Weight calculation, as shown in equation (4).
Here TF_t(d) denotes the frequency of feature word t in text d, N the number of patents in the whole patent sample set, n the number of patents in the sample set that contain feature word t, C_t the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t the position weight coefficient of the feature word.
② Sorting: sort by weight in descending order and construct the space model vector V_i = (w_i1, w_i2, ..., w_in) of the patent text; the content of every patent text is represented in this way.
The word frequency, position weight, and part-of-speech weight of each feature word are already recorded in the inverted index file, so only the number of texts in which the feature word appears needs to be counted; the total number of texts is also known. Specific results are shown in Table 2.
TABLE 2 patent feature vectors
Step 5: generate the IPC class feature vectors of each level. On the basis of step 1, calculate the class weight of each word in the corresponding level, level by level from the lowest level upward, using TF-IDF for the weights and treating each class description as a text; then construct the class feature vectors of each level.
The method specifically comprises the following steps:
Firstly, merge the category description of each subgroup into the category description of its main group, then perform word segmentation and stop-word removal.
Secondly, merge the descriptions of each main group, perform feature selection, and construct the class feature vectors of the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
Thirdly, merge all basic descriptions under the same subclass, perform feature selection, and construct the class feature vectors of the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
Fourthly, merge all basic descriptions under the same class, perform feature selection, and construct the class feature vectors of the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
For example, merge the words of all groups under subclass A01B into a vocabulary A01B (the other subclasses under class A01 are handled in the same way), then calculate the weight of each word in the vocabulary A01B, and finally construct the feature vector of subclass A01B.
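The level-by-level merging of step 5 can be sketched as a roll-up over IPC symbols; the sample group descriptions are invented for illustration.

```python
# Sketch of step 5: merge class descriptions upward through the IPC
# hierarchy (main group -> subclass -> class), concatenating word lists so
# each level's vocabulary can then be weighted with TF-IDF.
def rollup(descriptions, parent_of):
    """descriptions: {symbol: [words]}; parent_of: symbol -> parent symbol."""
    merged = {}
    for symbol, words in descriptions.items():
        merged.setdefault(parent_of(symbol), []).extend(words)
    return merged

groups = {"A01B1/00": ["hand", "tool"], "A01B3/00": ["plough"],
          "A01C1/00": ["sowing"]}
subclasses = rollup(groups, lambda s: s.split("/")[0][:4])   # A01B, A01C
classes = rollup(subclasses, lambda s: s[:3])                # A01
print(sorted(subclasses), sorted(classes))
```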
Step 6: construct the neighborhood of each patent sample. Using the patent feature vectors of step 4, calculate the similarity between every patent and the other patents, sort the similarities, and select the 100 patents with the largest similarity to form the neighborhood set of the patent.
The method specifically comprises the following steps:
① Calculate the similarity between patents in the patent training set. The similarity is obtained by computing the cosine of the angle between the vectors. Let sim(d_i, d_j) denote the similarity of patent texts d_i and d_j; the calculation formula is shown in equation (5).
Here W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors, and n is the dimension of the vectors.
② Sort all other patent samples d_j in descending order of their similarity to d_i and select the first K samples to form a set D_i, called the neighborhood of patent d_i; the value of K depends on the case.
Specific results are shown in table 3.
TABLE 3 Patent neighborhood sets
Step 7: calculate the cosine similarity values between the vector of the patent to be classified, the IPC class feature vectors, and the patents in the training set, and also compute the neighborhood set of the patent to be classified.
The method comprises the following steps:
① Perform preprocessing, feature selection, vectorization, and data format conversion on the patent to be classified.
② Perform patent feature selection and vectorization.
③ Calculate the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector.
④ Calculate the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set.
⑤ Sort the training patents in descending order of the similarity value S_bj and select the top K patents as the neighborhood set.
Step 8: classification decision. Calculate the size of the shared neighborhood between the patent to be classified and each patent in the training set, i.e., the number of patents common to the two neighborhood sets; then calculate the weighted similarity sum between the patent to be classified and each patent category, sort the weighted sums, and assign the patent to the category with the largest value.
The method specifically comprises the following steps:
① Calculate the size L(B_j, d_i) of the shared neighborhood between the patent B_j to be classified and each sample patent d_i, i.e., the number of patents that appear in both neighborhood sets.
② Calculate the final weighted similarity between the patent to be classified and each IPC class; the calculation formula is shown in equation (6).
Here i denotes the category, and p, k, α, and β are adjustable parameters; by default p = 0.8, k = 0.95, α = 0.6, and β = 5.
③ Assign the patent to be classified to the class with the maximum similarity S(i).
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims (3)
1. A patent classification method based on the specification, characterized by comprising the following steps:
step 1, acquiring data of a patent text, and performing text preprocessing on a patent specification;
step 2, counting word frequency, position information, part-of-speech weight and inter-class distribution information of each word, and constructing an inverted index file by using the statistical values and the text information of the patent specification;
step 3, calculating the characteristic values of the words by using a characteristic selection method combining information gain and word frequency, sequencing the characteristic values, and selecting a certain number of characteristic words to represent the text of the patent specification;
the calculation process of the characteristic value in the step 3 is as follows:
let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j; the feature value is calculated as shown in equation (1):
wherein TF represents the degree to which word frequency within a patent influences feature selection; let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2):
IC in equation (1) represents the dispersion of the feature word across categories: the more evenly a word is spread over the classes, the less representative it is and the smaller the value; let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and avg(TF(t_i)) the average of its frequencies over all classes; IC is calculated as shown in equation (3):
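Equations (1)-(3) themselves are not reproduced in this text, but the quantities they combine can be sketched. The snippet below is a minimal illustration, assuming TF_j(t_i) is a raw per-class frequency count and that the inter-class dispersion behind IC is measured as the standard deviation of the per-class frequencies around their mean; the exact functional form used in the patent may differ.

```python
def per_class_frequencies(docs_by_class, term):
    # TF_j(t_i): total frequency of `term` in each class c_j.
    # docs_by_class maps a class label to a list of tokenized documents.
    return {c: sum(doc.count(term) for doc in docs)
            for c, docs in docs_by_class.items()}

def class_dispersion(docs_by_class, term):
    # A stand-in for the dispersion behind IC: standard deviation of the
    # per-class frequencies around their mean. A word spread evenly over
    # the classes scores low (less representative); a concentrated word
    # scores high. The exact measure in equation (3) is not reproduced.
    tfs = list(per_class_frequencies(docs_by_class, term).values())
    mean = sum(tfs) / len(tfs)
    return (sum((f - mean) ** 2 for f in tfs) / len(tfs)) ** 0.5
```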
step 4, looking up each patent feature word in the inverted index file, calculating the weight of each feature word with an improved TF-IDF formula, and finally constructing the patent feature vector;
the specific process of the step 4 is as follows:
step 4.1, weight calculation, wherein the calculation is shown as the formula (4):
wherein TF(t, d) denotes the frequency of the feature word t in text d, N represents the total number of patents in the patent sample set, n represents the number of patents in the sample set that contain t, C_t represents the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t represents its position weight coefficient;
step 4.2, sorting the feature words in descending order of weight and constructing the space vector model V_i = (w_i1, w_i2, ..., w_in) of the patent specification text; the content of each patent specification text is represented in this way;
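Formula (4) is likewise not reproduced here; the sketch below assumes the "improved TF-IDF" multiplies the classic TF-IDF weight by the part-of-speech coefficient C_t and position coefficient P_t named in the claim, with add-one smoothing in the IDF denominator. Both the multiplicative combination and the smoothing are assumptions for illustration.

```python
import math

def improved_tfidf(tf, n_total, n_with_term, pos_coef, loc_coef):
    # tf: frequency of feature word t in the text; n_total: patents in the
    # sample set (N); n_with_term: patents containing t (n).
    # pos_coef (C_t) and loc_coef (P_t) scale the classic weight.
    idf = math.log(n_total / (n_with_term + 1))
    return tf * idf * pos_coef * loc_coef

def build_vector(weights, dim):
    # Step 4.2: keep the `dim` highest-weighted feature words as the
    # space vector model V_i = (w_i1, ..., w_in).
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:dim])
```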
step 5, generating the IPC class feature vectors of each level: on the basis of step 1, calculating the class weight of each word level by level from the lower levels upward, computing the weights with TF-IDF and treating each class description as a text, and then constructing the class feature vector of each level;
the specific process of the step 5 is as follows:
step 5.1, merging the category description of each subgroup into the category description of the main group to which it belongs, and performing word segmentation and stop-word removal;
and 5.2, combining the descriptions of the main groups, then performing feature selection and constructing the class feature vectors of the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}; wherein A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
step 5.3, merging all main-group descriptions under the same subclass, then performing feature selection and constructing the class feature vectors of the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}; wherein A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
step 5.4, merging all subclass descriptions under the same class, then performing feature selection and constructing the class feature vectors of the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}; wherein A01 is the first class in the IPC and H99 is the last class in the IPC;
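Steps 5.1-5.4 roll descriptions up the IPC hierarchy by merging each level's texts into its parent. A minimal sketch, assuming a simple string truncation from an IPC code to its parent code (the real IPC symbol grammar has more cases than are handled here):

```python
def parent_code(code):
    # Assumed truncation: main group "A01B1/00" -> subclass "A01B";
    # subclass "A01B" -> class "A01". Illustrative only.
    if "/" in code:
        return code.split("/")[0].rstrip("0123456789")
    if len(code) == 4:
        return code[:3]
    raise ValueError(f"unhandled IPC code: {code}")

def merge_descriptions(children):
    # One roll-up step (as in 5.2-5.4): concatenate the child descriptions
    # under each parent code to form the parent's class description.
    merged = {}
    for code, text in children.items():
        merged.setdefault(parent_code(code), []).append(text)
    return {p: " ".join(ts) for p, ts in merged.items()}
```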
step 6, constructing the neighborhood of each patent sample: using the patent feature vectors from step 4, calculating the similarity between each patent and every other patent, sorting the similarities, and selecting the K patents with the greatest similarity to form the neighborhood set of that patent;
the specific process of the step 6 is as follows:
step 6.1, calculating the similarity between the patents in the patent training set; the similarity is obtained as the cosine of the angle between the vectors; let sim(d_i, d_j) denote the similarity between patent specification texts d_i and d_j, calculated as shown in equation (5):
wherein W_ik and W_jk represent the weights of the corresponding feature word in the two patent vectors, and n represents the dimension of the vectors;
step 6.2, sorting the similarities between d_i and all other patent samples d_j in descending order and selecting the first K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is chosen case by case;
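Steps 6.1-6.2 amount to a standard cosine-similarity K-nearest-neighbour construction; with the sparse weight vectors stored as dicts, it can be sketched as follows:

```python
import math

def cosine(u, v):
    # Formula (5): dot product over the product of the two vector norms.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def neighborhood(pid, vectors, k):
    # Step 6.2: D_i, the K training patents most similar to patent `pid`.
    sims = sorted(((other, cosine(vectors[pid], vec))
                   for other, vec in vectors.items() if other != pid),
                  key=lambda x: -x[1])
    return [other for other, _ in sims[:k]]
```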
step 7, calculating the cosine similarity between the vector of the patent to be classified and both the IPC class feature vectors and the patents in the training set, and constructing the neighborhood set of the patent to be classified;
step 8, firstly calculating the size of the shared neighborhood between the patent to be classified and each patent in the training set, namely the number of patents common to their neighborhood sets; then calculating the weighted sum of similarities between the patent to be classified and each patent category, and after sorting the weighted sums, assigning the patent to the category with the largest value;
the specific process of the step 8 is as follows:
step 8.1, calculating the size L(B_j, d_i) of the shared neighborhood between the patent to be classified B_j and the sample patent d_i, namely the number of identical patents in the two neighborhood sets;
and 8.2, calculating the final weighted similarity between the patents to be classified and each IPC class, wherein the calculation formula is shown as the formula (6):
wherein i represents the category and p, k, α and β are adjustable parameters; under the system defaults, p = 0.8, k = 0.95, α = 0.6 and β = 5;
and 8.3, classifying the patent to be classified into the class with the maximum similarity S(i).
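Formula (6) for the weighted similarity S(i) is not reproduced in this text, but the two ingredients of steps 8.1-8.3 — the shared-neighborhood size L(B_j, d_i) and the final argmax over classes — can be sketched as below; how L, the cosine similarities and the parameters p, k, α, β are combined is left as in the patent.

```python
def shared_size(nbhd_a, nbhd_b):
    # L(B_j, d_i): number of patents the two neighborhood sets share.
    return len(set(nbhd_a) & set(nbhd_b))

def assign_class(scores):
    # Step 8.3: pick the class with the largest weighted similarity S(i).
    return max(scores, key=scores.get)
```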
2. A specification-based patent classification method according to claim 1, characterized in that: the step 1 specifically comprises:
collecting patent sample data, sampling IPC (International patent Classification) numbers, extracting specifications, dividing Chinese words, labeling parts of speech, and removing symbols and numbers in the specifications; regular matching is used for filtering out words with little use for patent classification, such as stop words, dummy words and connecting words, and only nouns, adjectives and verb keywords are reserved.
3. A specification-based patent classification method according to claim 1, characterized in that: the specific process of the step 7 is as follows:
7.1, extracting the specification of the patent to be classified, performing Chinese word segmentation and part-of-speech tagging, and removing stop words;
step 7.2, selecting and vectorizing patent features;
step 7.3, calculating the cosine similarity S_ai between the feature vector of the patent to be classified B_j and each IPC class feature vector;
step 7.4, calculating the cosine similarity S_bj between the patent to be classified B_j and each patent in the patent training set;
step 7.5, sorting the above training patents in descending order of the similarity value S_bj, and selecting the top K patents as the neighborhood set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710082677.8A CN107122382B (en) | 2017-02-16 | 2017-02-16 | Patent classification method based on specification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107122382A CN107122382A (en) | 2017-09-01 |
CN107122382B true CN107122382B (en) | 2021-03-23 |
Family
ID=59717475
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710082677.8A Active CN107122382B (en) | 2017-02-16 | 2017-02-16 | Patent classification method based on specification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107122382B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679153A (en) * | 2017-09-27 | 2018-02-09 | 国家电网公司信息通信分公司 | A kind of patent classification method and device |
CN107862328A (en) * | 2017-10-31 | 2018-03-30 | 平安科技(深圳)有限公司 | The regular execution method of information word set generation method and rule-based engine |
CN107844553B (en) * | 2017-10-31 | 2021-07-27 | 浪潮通用软件有限公司 | Text classification method and device |
CN108170666A (en) * | 2017-11-29 | 2018-06-15 | 同济大学 | A kind of improved method based on TF-IDF keyword extractions |
CN108227564B (en) * | 2017-12-12 | 2020-07-21 | 深圳和而泰数据资源与云技术有限公司 | Information processing method, terminal and computer readable medium |
CN108804512B (en) * | 2018-04-20 | 2020-11-24 | 平安科技(深圳)有限公司 | Text classification model generation device and method and computer readable storage medium |
CN109213855A (en) * | 2018-09-12 | 2019-01-15 | 合肥汇众知识产权管理有限公司 | Document labeling method based on patent drafting |
CN109299263B (en) * | 2018-10-10 | 2021-01-05 | 上海观安信息技术股份有限公司 | Text classification method and electronic equipment |
CN110019822B (en) * | 2019-04-16 | 2021-07-06 | 中国科学技术大学 | Few-sample relation classification method and system |
CN111930946A (en) * | 2020-08-18 | 2020-11-13 | 哈尔滨工程大学 | Patent classification method based on similarity measurement |
CN113849655B (en) * | 2021-12-02 | 2022-02-18 | 江西师范大学 | Patent text multi-label classification method |
CN116701633B (en) * | 2023-06-14 | 2024-06-18 | 上交所技术有限责任公司 | Industry classification method based on patent big data |
CN116975068A (en) * | 2023-09-25 | 2023-10-31 | 中国标准化研究院 | Metadata-based patent document data storage method, device and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||