CN107122382B - Patent classification method based on specification - Google Patents

Patent classification method based on specification

Info

Publication number
CN107122382B
CN107122382B (application CN201710082677.8A)
Authority
CN
China
Prior art keywords
patents
class
ipc
calculating
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710082677.8A
Other languages
Chinese (zh)
Other versions
CN107122382A (en)
Inventor
朱玉全
金健
佘远程
石亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201710082677.8A priority Critical patent/CN107122382B/en
Publication of CN107122382A publication Critical patent/CN107122382A/en
Application granted granted Critical
Publication of CN107122382B publication Critical patent/CN107122382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a specification-based patent classification method, belonging to the field of text processing and data mining. The patent specification is first preprocessed; an inverted index file is then constructed, and feature words are selected with a feature selection method that combines information gain and word frequency; the weights of the feature words are calculated with an improved TF-IDF formula and patent feature vectors are constructed; a neighborhood set is then built for the training patents; and the patents are finally classified with an optimized KNN classifier. The method provides a new approach to classifying patent documents and lays a foundation for further research on intelligent retrieval of patent documents.

Description

Patent classification method based on specification
Technical Field
The invention relates to the application of computer analysis technology to patent documents, and in particular to a patent classification method that uses the patent specification.
Background
A patent is a concrete expression of technical innovation and enterprise value, and one of the important carriers, outcomes and sources of knowledge development and innovation; many inventions appear only in patent documents. According to statistics of the World Intellectual Property Organization (WIPO), 70-90% of the world's inventions are first disclosed in patent documents rather than in journals, papers or other media. In addition, to protect their interests, enterprises apply for patents as early as possible, so the most active and advanced technologies are often concentrated in patents, which contain 90-95% of the world's technical information. Patent documents are also written in considerable detail to facilitate examination; compared with other types of data they provide more information, are the most common record of technical innovation, and document the complete course of patent activity. They reflect not only the current state of technical activity in the various technical fields but also the development history of a specific field. Each patent document contains the specific technical solution of the filed invention, which plays an important role in enterprise innovation: an enterprise can learn about the latest research developments, avoid duplicated research, save research time and funding, draw inspiration for its own researchers, raise the starting point of innovation, and greatly shorten research schedules by referring to existing inventions.
With the continuous emergence of new research results and inventions in China, the number of patents has grown rapidly. By October 5, 2016, more than 5.98 million invention patents had been published in China, of which 2.2385 million had been granted. At an average size of 2 MB per patent, the total volume of patent data already exceeds ten terabytes. To manage these patent documents scientifically and to retrieve relevant documents quickly and conveniently, classification of patent documents is essential. At present most countries classify patent documents with the International Patent Classification (IPC), which has five hierarchical levels: section, class, subclass, main group and subgroup. The section is the highest level of the classification table; there are eight sections covering different technical fields, denoted by the single letters A to H, and each section contains a number of classes, each identified by the section letter followed by two digits, with further subdivision at the lower levels. For example, G06F21/00 denotes security arrangements for protecting computers, their components, programs or data against unauthorized activity, under Electric Digital Data Processing (G06F).
It follows that every invention patent that has been or will be published is assigned one or more classification numbers; for example, the classification number of the invention patent "A method for protecting private data in association rule mining" is G06F21/00. For an application that has not yet been filed, the classification number is unknown and must be determined. Current practice is to determine it from the field or content of the described subject matter and to rely on experts who read the application manually. With the rapid growth in patent applications (close to one million applications per year), this approach consumes large amounts of manpower and material resources, and the limits of an individual expert's knowledge make it difficult to guarantee consistent and accurate classification. The invention therefore provides a patent classification method based on the patent specification, which uses the information in published patent specifications to build a classifier or classification function and determine the class of a filed patent, thereby realizing automatic patent classification.
Disclosure of Invention
The invention aims to provide a patent classification method based on the patent specification, addressing the problem that existing patent classification methods cannot fully and effectively exploit the specification information in published invention patents. The method makes full use of the specification text and the corresponding classes of published invention patents to build classifiers or classification functions that determine the classes of newly filed patent applications, and provides corresponding optimizations for specification feature extraction and selection, classifier design, and other aspects of the construction process.
The technical scheme adopted by the invention is as follows: the patent classification method based on the patent document specification mainly comprises the following steps:
(1) Patent data preprocessing
Collect patent sample data, sample the IPC numbers, extract the specifications, and perform Chinese word segmentation and part-of-speech tagging. Symbols and numbers in the specification are removed (specifications contain a large number of paragraph numbers). Regular-expression matching is used to filter out stop words, function words, conjunctions and other words that are not useful for patent classification, and only keywords such as nouns, adjectives and verbs are retained.
(2) Building the inverted index file
Count the word frequency, position information, part-of-speech weight and inter-class distribution of each word, and construct an inverted index file from these statistics and the patent text information.
(3) Patent text feature selection
Calculate the feature values of the words from step (2) using a feature selection method that combines information gain and word frequency, sort the feature values, and select a certain number of feature words to represent the patent text.
Let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j. The feature value is calculated as shown in equation (1).
[Equation (1) appears as an image in the original]
Here TF expresses how strongly the word frequency within the patents influences feature selection. Let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2).
[Equation (2) appears as an image in the original]
IC in equation (1) expresses how dispersed a feature word is across the categories: the more dispersed the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of the feature word t_i in class c_j and TF(t_i) its total frequency; the mean of its frequency over all classes is also used, and the calculation is shown in equation (3).
[Equation (3) appears as an image in the original]
(4) Patent text vectorization
The method comprises the following steps:
① Weight calculation, as shown in equation (4).
[Equation (4) appears as an image in the original]
Here the first factor (shown as an image in the original) is the frequency of the feature word t in the given text; N denotes the number of patents in the whole patent sample set, n the number of patents in the sample set that contain the feature word t, C_t the part-of-speech weight coefficient corresponding to the part of speech of t, and P_t the position weight coefficient of t.
② Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text; the content of each patent text is represented in this way.
(5) Generating IPC hierarchy class feature vectors
The method comprises the following steps:
① Merge the category description of each subgroup into the category description of the main group to which it belongs, then perform word segmentation and stop-word removal.
② Combine the description of each main group, perform feature selection, and construct the class feature vectors at the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
③ Merge all the basic descriptions under the same subclass, perform feature selection, and construct the class feature vectors at the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
④ Merge all the basic descriptions under the same class, perform feature selection, and construct the class feature vectors at the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
(6) Constructing a neighborhood of patent samples
The method comprises the following steps:
① Calculate the similarity between the patents in the patent training set. The similarity is obtained as the cosine of the angle between the two vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is calculated as shown in equation (5).

sim(d_i, d_j) = Σ_{k=1}^{n} (W_ik × W_jk) / ( √(Σ_{k=1}^{n} W_ik²) × √(Σ_{k=1}^{n} W_jk²) )    (5)

where W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors and n is the dimension of the vectors.
② For patent d_i, sort all other patent samples d_j by similarity in descending order and select the first K samples to form the set D_i; D_i is called the neighborhood of patent d_i. The value of K is chosen according to the specific case.
(7) Similarity calculation of patents to be classified
The method comprises the following steps:
① Perform specification extraction, Chinese word segmentation, part-of-speech tagging and stop-word removal on the patent to be classified.
② Perform patent feature selection and vectorization.
③ Calculate the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and each IPC class feature vector.
④ Calculate the cosine similarity S_bj between the patent to be classified, B_j, and each patent in the patent training set.
⑤ Sort the training patents by the similarity value S_bj in descending order and select the top K patents as the neighborhood set of B_j.
(8) Classification decision
The method comprises the following steps:
① Calculate the size L(B_j, d_i) of the shared neighborhood between the patent to be classified, B_j, and each sample patent d_i, i.e. the number of patents common to the two neighborhood sets.
② Calculate the final weighted similarity between the patent to be classified and each IPC class, as shown in equation (6).
[Equation (6) appears as an image in the original]
where i denotes the category and p, k, α and β are adjustable parameters; under the system defaults p = 0.8, k = 0.95, α = 0.6 and β = 5.
③ Assign the patent to be classified to the class with the maximum similarity S(i).
The main beneficial effects of the invention are as follows:
(1) patent text feature selection aspect
Compared with the title and abstract of a patent, the content of the patent specification is richer and carries far more information. At the same time, the specification also contains a large amount of noise, and at the IPC subclass level different patents share much similar information, which hinders classification. The invention therefore improves the feature extraction and feature vectorization of the patent specification, reducing noise interference and improving the classification accuracy of patents.
(2) Design aspect of patent classification method
Because the volume of patent data is huge and the number of patent categories is extremely large, classification models whose training is slow are clearly unsuitable for patent classification. The invention therefore proposes a new nearest-neighbor classification algorithm and adds IPC description information to the classification process, further improving classification accuracy while maintaining classification speed.
Drawings
FIG. 1 is a block diagram of the structure in the embodiment of the present invention
FIG. 2 is a flow chart of constructing a patent vector space according to an embodiment of the present invention
FIG. 3 is a classification flow chart based on improved KNN in the embodiment of the present invention
Detailed Description
The patent classification method of the invention is described in detail below using patent documents as an example; the specific implementation process is as follows:
Step 1: Acquire the patent text data and perform text preprocessing on the patent specification; the preprocessing mainly consists of word segmentation and stop-word removal (a preprocessing sketch follows the sub-steps below).
① Obtain the IPC class descriptions, perform word segmentation and part-of-speech tagging on them, remove stop words, manually correct the segmentation results, and then build a user dictionary.
② Convert the format of the sampled patents and extract their specifications, add the user dictionary built in ① to the word-segmentation program, and then perform Chinese word segmentation and part-of-speech tagging on the specifications.
③ Use regular expressions to remove stop words, function words, conjunctions and other words in the patent specification that are not useful for patent classification, keeping only nouns, adjectives and verbs.
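The following Python sketch illustrates this preprocessing, assuming the jieba segmenter; the user-dictionary file name, the character filter and the set of retained part-of-speech tags are illustrative assumptions rather than part of the disclosed method.

```python
# Minimal preprocessing sketch for Step 1 (assumptions: jieba segmenter, illustrative file name).
import re
import jieba
import jieba.posseg as pseg

jieba.load_userdict("user_dict.txt")   # dictionary built from the IPC class descriptions (sub-step ①)

KEPT_POS = ("n", "v", "a")             # keep nouns, verbs and adjectives only

def preprocess_specification(text, stopwords):
    """Segment a Chinese specification and keep only content-bearing words."""
    # Remove symbols, digits and paragraph numbers such as [0012]: keep Chinese characters only.
    text = re.sub(r"[^\u4e00-\u9fa5]", " ", text)
    tokens = []
    for pair in pseg.cut(text):          # word segmentation with part-of-speech tagging
        word, flag = pair.word, pair.flag
        if word in stopwords:
            continue
        if flag and flag[0] in KEPT_POS: # discard function words, conjunctions, etc.
            tokens.append((word, flag))
    return tokens
```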
Step 2: Count the word frequency, position information, part-of-speech weight and inter-class distribution of each word, and construct an inverted index file from these statistics and the patent text information.
The inverted index file is built from the words retained in Step 1. The index consists of a vocabulary and posting lists: each vocabulary entry corresponds to one posting list, which records, for every patent in which the word occurs, the patent number together with the word frequency, the position weight and the part-of-speech weight. The position weight is calculated as follows:
[The position-weight formula appears as an image in the original]
where n is the total number of occurrences of the word in the specification and l_i is the weight of the position of the i-th occurrence; in this example the Technical Field section is weighted 1, the Background section 0.8 and all other positions 0.5. The part-of-speech weight is set to 2.5 for nouns and 1 for verbs and adjectives. The specific results are shown in Table 1, and a sketch of the index structure follows the table.
Table 1. User dictionary and inverted index (the table itself appears as an image in the original)
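A sketch of the inverted-index structure described above is given below; the posting-list field names, and the assumption that each token carries the section it came from, are illustrative.

```python
# Sketch of the Step 2 inverted index: word -> {patent_id: posting}, where each posting
# records term frequency, the position weights of the occurrences and the POS weight.
from collections import defaultdict

POSITION_WEIGHT = {"technical_field": 1.0, "background": 0.8, "other": 0.5}
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}     # noun 2.5, verb and adjective 1

def build_inverted_index(patents):
    """patents: {patent_id: [(word, pos_flag, section), ...]} as produced in Step 1."""
    index = defaultdict(dict)
    for pid, tokens in patents.items():
        for word, flag, section in tokens:
            posting = index[word].setdefault(
                pid, {"tf": 0, "pos_weights": [], "pos_tag_weight": POS_WEIGHT.get(flag[0], 1.0)})
            posting["tf"] += 1
            posting["pos_weights"].append(POSITION_WEIGHT.get(section, 0.5))
    return index
```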
Step 3: Calculate the feature value of each word using a feature selection method that combines information gain and word frequency, sort the feature values, and select a certain number of feature words to represent the patent text.
Information gain is known to favor low-frequency words, whereas applicants usually repeat particular terms to emphasize an innovation point, and such high-frequency words help classification. The invention therefore adopts a feature selection method that combines information gain with word frequency: the feature value of every word in each patent is first calculated according to equation (1), the words are then sorted by feature value in descending order, and the top 20 words are selected as the feature words of the patent (a sketch of the information-gain term follows the formulas below).
Let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j. The feature value is calculated as shown in equation (1).
[Equation (1) appears as an image in the original]
Here TF expresses how strongly the word frequency within the patents influences feature selection. Let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2).
[Equation (2) appears as an image in the original]
IC in equation (1) expresses how dispersed a feature word is across the categories: the more dispersed the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of the feature word t_i in class c_j and TF(t_i) its total frequency; the mean of its frequency over all classes is also used, and the calculation is shown in equation (3).
[Equation (3) appears as an image in the original]
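The information-gain part of this feature score can be computed from the contingency counts A_ij, B_ij, C_ij and D_ij defined above. Equation (1), which combines information gain with the TF and IC factors, is only available as an image in the original, so the sketch below shows the standard per-class information gain alone and does not reproduce the patented combination.

```python
# Standard binary information gain from the contingency counts (A, B, C, D):
# A = contains term and in class, B = contains term and not in class,
# C = lacks term and in class,  D = lacks term and not in class.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(A, B, C, D):
    N = A + B + C + D
    h_class = entropy([(A + C) / N, (B + D) / N])                      # H(c)
    h_term = entropy([A / (A + B), B / (A + B)]) if (A + B) else 0.0   # H(c | t present)
    h_noterm = entropy([C / (C + D), D / (C + D)]) if (C + D) else 0.0 # H(c | t absent)
    h_cond = (A + B) / N * h_term + (C + D) / N * h_noterm
    return h_class - h_cond                                            # IG = H(c) - H(c | t)
```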
Step 4: Calculate the weight of each patent feature word from the inverted index file using the improved TF-IDF formula, and construct the patent feature vectors.
The method specifically comprises the following steps:
① Weight calculation, as shown in equation (4).
[Equation (4) appears as an image in the original]
Here the first factor (shown as an image in the original) is the frequency of the feature word t in the given text; N denotes the number of patents in the whole patent sample set, n the number of patents in the sample set that contain the feature word t, C_t the part-of-speech weight coefficient corresponding to the part of speech of t, and P_t the position weight coefficient of t.
② Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text; the content of each patent text is represented in this way.
Because the word frequency, position weight and part-of-speech weight of each feature word are already recorded in the inverted index file, only the number of texts containing the feature word still needs to be counted, and the total number of texts is known. The specific results are shown in Table 2, and a weighting sketch follows the table.
Table 2. Patent feature vectors (the table itself appears as an image in the original)
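The sketch below illustrates one plausible reading of the improved TF-IDF weight of equation (4), multiplying term frequency, an IDF factor, the part-of-speech coefficient C_t and the position coefficient P_t; since equation (4) is only an image in the original, the exact combination and the IDF smoothing are assumptions.

```python
# Assumed "improved TF-IDF" weight: w = tf * idf * C_t * P_t, built on the index sketch above.
import math

def feature_weight(tf, N, n, c_t, p_t):
    """tf: frequency of word t in this patent; N: total patents; n: patents containing t."""
    idf = math.log((N + 1) / (n + 1))          # smoothed IDF (the smoothing is an assumption)
    return tf * idf * c_t * p_t

def patent_vector(feature_words, index, pid, N):
    """Weight vector (w_1, ..., w_n) of patent `pid` over its selected feature words."""
    weights = []
    for word in feature_words:
        posting = index[word].get(pid)
        if posting is None:
            weights.append(0.0)
            continue
        p_t = sum(posting["pos_weights"]) / len(posting["pos_weights"])  # assumed mean position weight
        weights.append(feature_weight(posting["tf"], N, len(index[word]),
                                      posting["pos_tag_weight"], p_t))
    return weights
```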
Step 5: Generate the IPC class feature vectors of each level. On the basis of Step 1, calculate the class weight of each vocabulary entry level by level, working upwards from the lower levels, using TF-IDF and treating each class description as a text, and then construct the class feature vectors of each level.
The method specifically comprises the following steps:
① Merge the category description of each subgroup into the category description of the main group to which it belongs, then perform word segmentation and stop-word removal.
② Combine the description of each main group, perform feature selection, and construct the class feature vectors at the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
③ Merge all the basic descriptions under the same subclass, perform feature selection, and construct the class feature vectors at the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
④ Merge all the basic descriptions under the same class, perform feature selection, and construct the class feature vectors at the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
For example, the words of all groups under the subclass A01B are combined into a vocabulary for A01B, and the same is done for the other subclasses under class A01; the weight of each word in the A01B vocabulary is then calculated and the feature vector of subclass A01B is finally constructed (a roll-up sketch follows).
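The level-by-level merging can be sketched as a roll-up over IPC symbol prefixes, as below; the symbol parsing and the dictionary layout are illustrative assumptions.

```python
# Roll up merged group descriptions from main group to subclass to class via IPC prefixes.
from collections import defaultdict

def roll_up(main_group_descriptions):
    """main_group_descriptions: {"A01B1/00": [tokens], "A01B3/00": [tokens], ...}"""
    subclasses, classes = defaultdict(list), defaultdict(list)
    for symbol, tokens in main_group_descriptions.items():
        subclass = symbol.split("/")[0].rstrip("0123456789")   # "A01B1/00" -> "A01B"
        subclasses[subclass].extend(tokens)
    for subclass, tokens in subclasses.items():
        classes[subclass[:3]].extend(tokens)                   # "A01B" -> "A01"
    # Each merged description is then treated as one text and weighted with TF-IDF.
    return main_group_descriptions, dict(subclasses), dict(classes)
```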
Step 6: Construct the neighborhood of each patent sample. Using the patent feature vectors from Step 4, calculate the similarity between each patent and every other patent, sort the similarities, and select the 100 most similar patents to form the neighborhood set of that patent.
The method specifically comprises the following steps:
① Calculate the similarity between the patents in the patent training set. The similarity is obtained as the cosine of the angle between the two vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is calculated as shown in equation (5).

sim(d_i, d_j) = Σ_{k=1}^{n} (W_ik × W_jk) / ( √(Σ_{k=1}^{n} W_ik²) × √(Σ_{k=1}^{n} W_jk²) )    (5)

where W_ik and W_jk are the weights of the corresponding feature words in the two patent vectors and n is the dimension of the vectors.
② For patent d_i, sort all other patent samples d_j by similarity in descending order and select the first K samples to form the set D_i; D_i is called the neighborhood of patent d_i. The value of K is chosen according to the specific case.
Specific results are shown in Table 3, and a similarity sketch follows the table.
Table 3. Patent neighborhood sets (the table itself appears as an image in the original)
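A sketch of equation (5) and of the top-K neighborhood construction is given below; K = 100 follows the example above, and the dense vector layout is an assumption.

```python
# Cosine similarity between patent weight vectors (equation (5)) and the K-nearest neighborhood.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighborhood(pid, vectors, K=100):
    """Return the IDs of the K patents most similar to patent `pid`."""
    sims = [(other, cosine(vectors[pid], vectors[other]))
            for other in vectors if other != pid]
    sims.sort(key=lambda item: item[1], reverse=True)
    return [other for other, _ in sims[:K]]
```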
Step 7: Calculate the cosine similarity between the vector of the patent to be classified and each IPC class feature vector, and between the patent to be classified and each patent in the training set, and also compute the neighborhood set of the patent to be classified.
The method comprises the following steps:
① Perform preprocessing, feature selection, vectorization and data-format conversion on the patent to be classified.
② Perform patent feature selection and vectorization.
③ Calculate the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and each IPC class feature vector.
④ Calculate the cosine similarity S_bj between the patent to be classified, B_j, and each patent in the patent training set.
⑤ Sort the training patents by the similarity value S_bj in descending order and select the top K patents as the neighborhood set of B_j.
Step 8: Classification decision. First calculate the size of the shared neighborhood between the patent to be classified and each patent in the training set, i.e. the number of patents common to the two neighborhood sets. Then calculate the weighted similarity between the patent to be classified and each patent category and assign the patent to the category with the largest value (a decision sketch follows the sub-steps).
The method specifically comprises the following steps:
① Calculate the size L(B_j, d_i) of the shared neighborhood between the patent to be classified, B_j, and each sample patent d_i, i.e. the number of patents common to the two neighborhood sets.
② Calculate the final weighted similarity between the patent to be classified and each IPC class, as shown in equation (6).
[Equation (6) appears as an image in the original]
where i denotes the category and p, k, α and β are adjustable parameters; under the system defaults p = 0.8, k = 0.95, α = 0.6 and β = 5.
③ Assign the patent to be classified to the class with the maximum similarity S(i).
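Since equation (6) is only an image in the original, the decision sketch below is an illustrative stand-in: it scores each IPC class by the training-patent similarities S_bj scaled by the shared-neighborhood size and blends in the class-description similarity S_ai; the blending and the α weight are assumptions, not the patented formula.

```python
# Illustrative classification decision: shared-neighborhood-weighted KNN vote plus
# IPC class-description similarity (a stand-in for equation (6), not the disclosed formula).
from collections import defaultdict

def classify(candidate_nbrs, train_nbrs, s_ai, s_bj, labels, alpha=0.6):
    """candidate_nbrs: neighborhood set of the patent to be classified;
    train_nbrs: {pid: neighborhood set}; s_ai: {ipc_class: similarity};
    s_bj: {pid: similarity}; labels: {pid: ipc_class}."""
    score = defaultdict(float)
    for pid in candidate_nbrs:                                    # K nearest training patents
        shared = len(set(candidate_nbrs) & set(train_nbrs[pid]))  # L(B_j, d_i)
        score[labels[pid]] += s_bj[pid] * (1 + shared / len(candidate_nbrs))
    for ipc_class, sim in s_ai.items():                           # blend in class-description similarity
        score[ipc_class] = alpha * score[ipc_class] + (1 - alpha) * sim
    return max(score, key=score.get)                              # class with the largest score S(i)
```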
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A patent classification method based on a specification, characterized by comprising the following steps:
step 1, acquiring data of a patent text, and performing text preprocessing on a patent specification;
step 2, counting word frequency, position information, part-of-speech weight and inter-class distribution information of each word, and constructing an inverted index file by using the statistical values and the text information of the patent specification;
step 3, calculating the feature values of the words by using a feature selection method combining information gain and word frequency, sorting the feature values, and selecting a certain number of feature words to represent the text of the patent specification;
the calculation process of the feature value in step 3 is as follows:
let A_ij be the number of documents that contain the feature word t_i and belong to class c_j, B_ij the number that contain t_i but do not belong to c_j, C_ij the number that do not contain t_i but belong to c_j, and D_ij the number that neither contain t_i nor belong to c_j; the feature value is calculated as shown in equation (1):
[Equation (1) appears as an image in the original]
wherein TF expresses how strongly the word frequency within the patents influences feature selection; let m be the total number of classes in the training patents, N_j the total number of patents in class c_j, and TF_jk the frequency of the feature word t_i in patent P_k of class c_j; TF is then calculated as shown in equation (2):
[Equation (2) appears as an image in the original]
IC in equation (1) expresses how dispersed a feature word is across the categories: the more dispersed the word, the less representative it is and the smaller the value; let TF_j(t_i) be the frequency of the feature word t_i in class c_j and TF(t_i) its total frequency; the mean of its frequency over all classes is also used, and the calculation is shown in equation (3):
[Equation (3) appears as an image in the original]
step 4, calculating the weight of each patent feature word from the inverted index file using an improved TF-IDF formula, and finally constructing a patent feature vector;
the specific process of the step 4 is as follows:
step 4.1, weight calculation, as shown in equation (4):
[Equation (4) appears as an image in the original]
wherein the first factor (shown as an image in the original) is the frequency of the feature word t in the given text, N denotes the number of patents in the whole patent sample set, n denotes the number of patents in the sample set containing the feature word t, C_t denotes the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t denotes the position weight coefficient of the feature word;
step 4.2, sorting by weight in descending order, and constructing the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent specification text; the content of each patent specification text is represented in this way;
step 5, generating the IPC class feature vectors of each level: on the basis of step 1, calculating the class weight of each vocabulary entry level by level, working upwards from the lower levels, using TF-IDF and treating each class description as a text, and then constructing the class feature vectors of each level;
the specific process of the step 5 is as follows:
step 5.1, merging the category description of each subgroup into the category description of the main group to which it belongs, and performing word segmentation and stop-word removal;
step 5.2, combining the description of each main group and then performing feature selection, constructing the class feature vectors at the IPC main-group level, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, wherein A01B1/00 is the first main group in the IPC and H99Z99/00 is the last main group in the IPC;
step 5.3, merging all basic descriptions under the same subclass and then performing feature selection, constructing the class feature vectors at the IPC subclass level, expressed as {V_A01B, V_A01C, ..., V_H99Z}, wherein A01B is the first subclass in the IPC and H99Z is the last subclass in the IPC;
step 5.4, merging all basic descriptions under the same class and then performing feature selection, constructing the class feature vectors at the IPC class level, expressed as {V_A01, V_A21, ..., V_H99}, wherein A01 is the first class in the IPC and H99 is the last class in the IPC;
step 6, constructing the neighborhood of each patent sample: calculating the similarity between each patent and the other patents using the patent feature vectors from step 4, sorting the patent similarities, and selecting the K most similar patents to form the neighborhood set of the patent;
the specific process of the step 6 is as follows:
step 6.1, calculating the similarity between the patents in the patent training set; the similarity is obtained as the cosine of the angle between the vectors; let sim(d_i, d_j) denote the similarity between patent specification texts d_i and d_j, calculated as shown in equation (5):

sim(d_i, d_j) = Σ_{k=1}^{n} (W_ik × W_jk) / ( √(Σ_{k=1}^{n} W_ik²) × √(Σ_{k=1}^{n} W_jk²) )    (5)

wherein W_ik and W_jk denote the weights of the corresponding feature words in the patent vectors and n denotes the dimension of the vectors;
step 6.2, for patent d_i, sorting all other patent samples d_j by similarity in descending order, and selecting the first K patent samples to form the set D_i; D_i is called the neighborhood of patent d_i, and the value of K is chosen according to the specific case;
step 7, calculating the cosine similarity values between the vector of the patent to be classified and the IPC class feature vectors and between the patent to be classified and the patents in the training set, and calculating the neighborhood set of the patent to be classified;
step 8, firstly calculating the size of the shared neighborhood between the patent to be classified and each patent in the training set, namely the number of identical patents in the two neighborhood sets; then calculating the weighted sum of similarities between the patent to be classified and the patent categories, and after sorting the weighted sums, classifying the patent to be classified into the category with the largest value;
the specific process of the step 8 is as follows:
step 8.1, calculating the size L(B_j, d_i) of the shared neighborhood between the patent to be classified B_j and each sample patent d_i, namely the number of identical patents in the two neighborhood sets;
step 8.2, calculating the final weighted similarity between the patent to be classified and each IPC class, as shown in equation (6):
[Equation (6) appears as an image in the original]
wherein i denotes the category, and p, k, α and β are adjustable parameters; under the system defaults p = 0.8, k = 0.95, α = 0.6 and β = 5;
step 8.3, classifying the patent to be classified into the class with the maximum similarity S(i).
2. A specification-based patent classification method according to claim 1, characterized in that: the step 1 specifically comprises:
collecting patent sample data, sampling the IPC (International Patent Classification) numbers, extracting the specifications, performing Chinese word segmentation and part-of-speech tagging, and removing symbols and numbers in the specifications; regular-expression matching is used to filter out stop words, function words, conjunctions and other words of little use for patent classification, and only noun, adjective and verb keywords are retained.
3. A specification-based patent classification method according to claim 1, characterized in that: the specific process of the step 7 is as follows:
step 7.1, performing specification extraction, Chinese word segmentation, part-of-speech tagging and stop-word removal on the patent to be classified;
step 7.2, performing patent feature selection and vectorization;
step 7.3, calculating the cosine similarity S_ai between the feature vector of the patent to be classified B_j and each IPC class feature vector;
step 7.4, calculating the cosine similarity S_bj between the patent to be classified B_j and each patent in the patent training set;
step 7.5, sorting the training patents by the similarity value S_bj in descending order, and selecting the top K patents as the neighborhood set.
CN201710082677.8A 2017-02-16 2017-02-16 Patent classification method based on specification Active CN107122382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Publications (2)

Publication Number Publication Date
CN107122382A CN107122382A (en) 2017-09-01
CN107122382B true CN107122382B (en) 2021-03-23

Family

ID=59717475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082677.8A Active CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Country Status (1)

Country Link
CN (1) CN107122382B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN107844553B (en) * 2017-10-31 2021-07-27 浪潮通用软件有限公司 Text classification method and device
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564B (en) * 2017-12-12 2020-07-21 深圳和而泰数据资源与云技术有限公司 Information processing method, terminal and computer readable medium
CN108804512B (en) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 Text classification model generation device and method and computer readable storage medium
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263B (en) * 2018-10-10 2021-01-05 上海观安信息技术股份有限公司 Text classification method and electronic equipment
CN110019822B (en) * 2019-04-16 2021-07-06 中国科学技术大学 Few-sample relation classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113849655B (en) * 2021-12-02 2022-02-18 江西师范大学 Patent text multi-label classification method
CN116701633B (en) * 2023-06-14 2024-06-18 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Also Published As

Publication number Publication date
CN107122382A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN107122382B (en) Patent classification method based on specification
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
CN107944480B (en) Enterprise industry classification method
Liu et al. Text features extraction based on TF-IDF associating semantic
CN105512311B (en) A kind of adaptive features select method based on chi-square statistics
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN107844559A (en) A kind of file classifying method, device and electronic equipment
CN106599054B (en) Method and system for classifying and pushing questions
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN111680225B (en) WeChat financial message analysis method and system based on machine learning
CN101625680A (en) Document retrieval method in patent field
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN110705247A (en) Based on x2-C text similarity calculation method
CN106776695A (en) The method for realizing the automatic identification of secretarial document value
CN113342984A (en) Garden enterprise classification method and system, intelligent terminal and storage medium
CN109993216A (en) A kind of file classification method and its equipment based on K arest neighbors KNN
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN105205163A (en) Incremental learning multi-level binary-classification method of scientific news
CN114707003A (en) Method, equipment and storage medium for dissimilarity of names of thesis authors
CN110413985B (en) Related text segment searching method and device
CN106708920A (en) Screening method for personalized scientific research literature
CN116204647A (en) Method and device for establishing target comparison learning model and text clustering
CN115687960A (en) Text clustering method for open source security information
CN114117215A (en) Government affair data personalized recommendation system based on mixed mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant