CN114357086A - Patent IPC classification number recommendation method and device based on knowledge graph - Google Patents

Publication number
CN114357086A
Authority
CN
China
Prior art keywords
patents
knowledge graph
query
entities
ipc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111009919.3A
Other languages
Chinese (zh)
Inventor
石振锋
王嘉瑜
孙赟星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heilongjiang Yangguang Huiyuan Information Technology Co ltd
Original Assignee
Heilongjiang Yangguang Huiyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Heilongjiang Yangguang Huiyuan Information Technology Co ltd filed Critical Heilongjiang Yangguang Huiyuan Information Technology Co ltd
Priority to CN202111009919.3A priority Critical patent/CN114357086A/en
Publication of CN114357086A publication Critical patent/CN114357086A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A knowledge-graph-based method and device for recommending patent IPC classification numbers, relating to the field of data analysis. It addresses the problem that existing methods for determining the technical field of a patent rely on manual analysis, which is time-consuming and inefficient and cannot meet the needs of enterprises and users. The method comprises the following steps: constructing a patent knowledge graph and vectorizing the entities in the graph with a TransE model to obtain a vectorized representation of each invention name; using the vectorized representations of the invention names to calculate the similarity between the query patent and each patent in the database, and taking the M patents most similar to the query patent as recommended similar patents; and taking the N IPC classification numbers that occur most frequently in the similar patents as the recommended IPC classification numbers. The device comprises a patent knowledge graph construction module, an entity vectorization module, a similarity calculation module and an IPC classification number recommendation module.

Description

Patent IPC classification number recommendation method and device based on knowledge graph
Technical Field
The application relates to the field of data analysis, and in particular to a technique for predicting the technical field to which a patent belongs.
Background
Faced with such a huge amount of patent data, enterprises urgently need to know how to extract information on different fields from the data effectively, how to grasp accurately the current state of scientific and technological development in the fields of different industries, and how to master the more advanced technologies of an industry. As global competition in science and technology intensifies, patent analysis of various kinds has become a popular field.
During the patent application process, patents must be assigned to technical fields according to their basic information, which is complicated and tedious work. How to recommend the technical field of a patent effectively is therefore worth studying for enterprises and users.
Generally, the technical field of a patent is determined by manually analyzing the information in the patent text and comparing it with the prior art, and then fixing the technical scope under the guidance of professional technicians. However, as patent data grow rapidly, manual analysis takes longer and costs more, and sometimes cannot meet the needs of enterprises and users. How to determine the technical field of a patent efficiently and accurately has therefore become a research direction.
Disclosure of Invention
The patent IPC classification number recommendation method and device based on the knowledge graph are provided for solving the problems that the existing method for determining the technical field of patents depends on manual analysis, consumes long time, is low in efficiency and cannot meet the requirements of enterprises and users.
The patent IPC classification number recommendation method based on the knowledge graph comprises the following steps:
constructing a patent knowledge graph, wherein the patent knowledge graph comprises entities of a query patent and a plurality of patents having the same technical field as the query patent and the relationship among the entities, and the entities comprise applicants, inventors, IPC classification numbers, invention names and keywords;
vectorizing and expressing the entities in the patent knowledge graph by using a TransE model to obtain vectorized expression of the invention name of each patent in the patent knowledge graph;
calculating the similarity between the query patent and each patent in a database by using vectorization expression of the name of the invention, and taking the M patents with the highest similarity with the query patent as recommended similar patents;
and counting the occurrence times of the IPC codes of all the recommended similar patents, and taking the N IPC codes with the highest occurrence times as the recommended IPC codes.
Optionally, the constructing the patent knowledge graph includes:
searching a plurality of patents in the same technical field as the query patent from a patent retrieval database, and merging the plurality of patents and the query patent into a patent field database;
extracting the applicant, inventor, IPC classification number, invention name and keyword of each patent in the patent domain database as an entity;
and storing the entities of each patent and the relationship among the entities into a Neo4j database to form a patent knowledge graph.
Optionally, the similarity is obtained by calculating the Euclidean distance between the query patent and each patent in the patent knowledge graph, using the vectorized representations of the invention names.
Optionally, M ≥ 10.
Optionally, N has a value of 3.
The patent IPC classification number recommendation device based on the knowledge graph comprises:
a patent knowledge graph construction module configured to construct a patent knowledge graph containing entities of a query patent and several patents having the same technical field as the query patent and relationships between the entities, the entities including an applicant, an inventor, an IPC classification number, an invention name, and keywords;
the entity vectorization module is configured to perform vectorization representation on the entity in the patent knowledge graph by using a TransE model to obtain vectorization representation of the invention name of each patent in the patent knowledge graph;
a similarity calculation module configured to calculate similarities between the query patent and patents in a database using vectorized representation of the title, and take the M patents with the highest similarity to the query patent as recommended similar patents; and
and the IPC classification number recommending module is configured to count the occurrence times of the IPC classification numbers of all the recommended similar patents, and take the N IPC classification numbers with the highest occurrence times as the recommended IPC classification numbers.
Optionally, the patent knowledge graph building module includes:
a patent domain database construction sub-module configured to retrieve a number of patents having the same technical field as the query patent from a patent retrieval database, and merge the number of patents and the query patent into a patent domain database;
an entity extraction sub-module configured to extract an applicant, an inventor, an IPC classification number, an invention name, and a keyword of each patent in the patent domain database as an entity; and
a patent knowledge map construction sub-module configured to save the entities of each patent and the relationships between the entities into a Neo4j database to form a patent knowledge map.
Optionally, the similarity is obtained by calculating the Euclidean distance between the query patent and each patent in the patent knowledge graph, using the vectorized representations of the invention names.
Optionally, M ≥ 10.
Optionally, N has a value of 3.
The knowledge-graph-based patent IPC classification number recommendation method and device link the applicant, inventor, IPC classification number, invention name and keywords of each patent by constructing a patent knowledge graph. The entities are then vectorized with a TransE model to obtain a vectorized representation of each invention name. Because this representation encodes the relationships among the entities, using the Euclidean distance between two such representations as the similarity reflects the similarity of two patents more accurately. The patents most similar to the query patent are recommended, and the IPC classification numbers that occur most often among them are selected as the recommended IPC classification numbers. The accuracy of the method and device is far higher than that of the existing content-based patent recommendation algorithm.
Drawings
FIG. 1 is a schematic flowchart of a patent IPC classification number recommendation method based on knowledge-graph according to a first embodiment of the present application;
FIG. 2 is a patent knowledge graph used in a method for recommending patent IPC classification numbers based on knowledge graphs according to a first embodiment of the present application;
FIG. 3 is a flow chart of negative sampling in the first embodiment of the present application;
FIG. 4 is a graph illustrating the comparison of the prediction accuracy of two methods according to one embodiment of the present application;
fig. 5 is a schematic structural diagram of a patent IPC classification number recommendation device based on a knowledge graph according to the second embodiment of the present application.
Detailed Description
The first embodiment is as follows: in this embodiment, the technical field to which the patent belongs is represented by an IPC classification number. As shown in fig. 1, the method for recommending patent IPC classification numbers based on knowledge-graphs according to this embodiment may generally include the following steps S1 to S4.
Step S1, constructing a patent knowledge map
For a query patent, the technical field to which it belongs must first be determined. Here the technical field means a field that can be determined directly. It generally covers a broad range, such as physics, chemistry or biology, but may also be a subdivision of such a field, such as optics, mechanics or electromagnetism within physics. Once the technical field of the query patent is determined, patents belonging to that field are searched for in the patent retrieval database, and several patents are selected from the search results.
The query patent and the selected patents are combined into a patent domain database, and the applicant, inventor, IPC classification number, invention name and keywords of each patent in the patent domain database are extracted as entities.
A patent may have several IPC classification numbers. The IPC classification number used as an entity may be the patent's main classification number or all of its IPC classification numbers; when all of a patent's IPC classification numbers are used as entities, the IPC classification numbers recommended for the query patent are more accurate.
A patent may have several inventors, so the inventor data need to be processed into one-to-one records before use.
The main purpose of this embodiment is to recommend the technical field of a patent. Keywords extracted from the invention name and abstract play an important role in this recommendation, so the keyword information in the invention name and abstract should be extracted as fully as possible. This embodiment combines the advantages of the TF-IDF algorithm and the TextRank algorithm: each algorithm is used to extract the top 10 keywords of each patent, the weights of the extracted keywords are then combined by weighted averaging, and the 5 words with the highest combined weights are taken as the keywords of the patent. For example, for the patents with publication numbers CN102058606B and CN102151264B, the top 10 keywords extracted by the TF-IDF algorithm are shown in Table 1, the top 10 keywords extracted by the TextRank algorithm are shown in Table 2, and the top 5 keywords obtained by weighted fusion of the TF-IDF and TextRank weights are shown in Table 3.
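TABLE 1 Top 10 keywords extracted using the TF-IDF algorithm
[Table image from the original publication; contents not reproduced in this text]
TABLE 2 Top 10 keywords extracted using the TextRank algorithm
[Table image from the original publication; contents not reproduced in this text]
TABLE 3 Top 5 keywords by fused weight
[Table image from the original publication; contents not reproduced in this text]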
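The weighted fusion of the two keyword rankings can be illustrated with a minimal sketch. It assumes the TF-IDF and TextRank weights of the top 10 candidates have already been computed by some other tool; the max-normalization of each score set, the equal default weights and the function name `fuse_keywords` are assumptions made for illustration, since the text does not state the weights or any normalization.

```python
def fuse_keywords(tfidf_scores, textrank_scores, w_tfidf=0.5, top_k=5):
    """Fuse two keyword->weight dicts by weighted average and return the top_k words.

    Assumptions (not specified in the source): each score set is max-normalized,
    a word missing from one list scores 0 there, and the default weights are equal.
    """
    def normalize(scores):
        peak = max(scores.values(), default=0.0)
        return {w: (s / peak if peak > 0 else 0.0) for w, s in scores.items()}

    a, b = normalize(tfidf_scores), normalize(textrank_scores)
    fused = {
        w: w_tfidf * a.get(w, 0.0) + (1.0 - w_tfidf) * b.get(w, 0.0)
        for w in set(a) | set(b)
    }
    # Sort by fused weight (descending); ties are broken alphabetically.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]
```

Normalizing before averaging keeps one algorithm from dominating simply because its raw weights are on a larger scale; other weightings would be equally compatible with the description.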
After the entities are extracted, their attributes need to be defined. The attributes comprise object attributes and data attributes, as shown in Table 4: object attributes describe relationships between objects, and data attributes describe inherent properties of an entity. Next, the relationships between entities need to be defined. This embodiment defines four relationships, "application", "invention", "technical field" and "inclusion": an applicant and a patent are linked by the application relationship, an inventor and a patent by the invention relationship, a patent and a keyword by the inclusion relationship, and a patent and an IPC classification number by the technical field relationship.
TABLE 4 Entity attributes
[Table image from the original publication; contents not reproduced in this text]
After the entities of each patent are extracted and the relationships among them are defined, the construction of the patent domain ontology library is complete. The entities of each patent in the ontology library and the relationship data between them are then stored in a Neo4j database, which completes the construction of the patent knowledge graph. Fig. 2 shows part of a patent knowledge graph covering 14 patents. Nodes in five colors represent the publication numbers, keywords, inventors, applicants and IPC classification numbers of the patents. Each publication number in the graph stands for one patent and could be replaced by the invention name or the patent number. The patent knowledge graph displays the relationships among all entities visually.
Following this approach to constructing the patent domain ontology library, the semantic information contained in the patent text is extracted so that the entity and relationship information is presented comprehensively in the patent knowledge graph. The resulting graph allows the required patent information to be retrieved quickly and comprehensively according to different user needs. The patent knowledge graph built on the Neo4j graph database therefore contains and visually displays the entity, relationship and attribute information of the patents.
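As an illustration of step S1, the sketch below turns one patent record into (head, relation, tail) triples of the kind that could then be loaded into a graph database. The record layout, the use of the publication number as the patent node and the function name `patent_to_triples` are assumptions for illustration; only the four relation names come from the text.

```python
def patent_to_triples(patent):
    """Turn one patent record (a dict) into (head, relation, tail) triples.

    The field names of the record are hypothetical. The relation names follow
    the four relationships defined in the text.
    """
    pid = patent["publication_number"]
    triples = []
    for applicant in patent.get("applicants", []):
        triples.append((applicant, "application", pid))
    # A multi-inventor patent is flattened into one triple per inventor.
    for inventor in patent.get("inventors", []):
        triples.append((inventor, "invention", pid))
    for code in patent.get("ipc_codes", []):
        triples.append((pid, "technical field", code))
    for keyword in patent.get("keywords", []):
        triples.append((pid, "inclusion", keyword))
    return triples


# Hypothetical record, for illustration only.
example = {
    "publication_number": "PATENT-1",
    "applicants": ["Applicant A"],
    "inventors": ["Inventor X", "Inventor Y"],
    "ipc_codes": ["A61K9/19"],
    "keywords": ["freeze-dried", "injection"],
}
for triple in patent_to_triples(example):
    print(triple)
```

Emitting one triple per inventor is one way to obtain the one-to-one inventor data mentioned above.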
Step S2, entity vectorization representation
Constructing the patent knowledge graph links the semantic information of different patents, but the information in the graph cannot be used directly for recommendation. To recommend IPC classification numbers, the entities in the patent knowledge graph must first be vectorized. Vectorizing the patent knowledge graph means converting each node (i.e., an entity) and each edge (i.e., a line representing the relationship between two entities) into a vector while retaining the original semantic information. In this embodiment, a TransE model is used to vectorize the entities in the patent knowledge graph, and the invention name of the patent numbered i is mapped to a d-dimensional vector $I_i = (E_{1i}, E_{2i}, \dots, E_{di})^T$.
Training the TransE model requires optimizing an objective function. Training on the entity and relationship data needs not only correct (positive) triples but also negative triples. In negative sampling an entity is usually replaced at random, which often produces erroneous samples. To address this, the present embodiment optimizes the negative sampling algorithm so that the final patent IPC classification number recommendation is more accurate.
All positive triples (the original triples) used by the TransE model already exist in the constructed patent knowledge graph. When negative sampling meets complex one-to-many and many-to-many relations, random replacement can generate many wrong negative samples, which degrades the training of the model. For example, with one-to-many data there may be both a triple (h, r, t) and a triple (h, r, t'). If a negative sample for (h, r, t) is generated by replacing t with t', the result is (h, r, t'); but because (h, r, t') exists in the set of positive triples, it cannot be regarded as a negative sample. To make sampling more reasonable, this embodiment introduces a Bernoulli sampling algorithm, which, for triples whose relation is not one-to-one, chooses the entity to replace with a certain probability. For each relation in the patent knowledge graph, the existing triple data are used to compute $N_{tp}$, the average number of tail entities per head entity under that relation, and $N_{hp}$, the average number of head entities per tail entity under that relation. The formula for calculating the probability p of replacing an entity is:
[Equation image from the original publication; not reproduced in this text]
The choice of which entity to replace can then be regarded as following a Bernoulli distribution with parameter p. Let X denote the replaced entity; the distribution law of X is:
$P\{X = x\} = p^{x}(1-p)^{1-x}$
where x ∈ {0, 1}; x = 1 means the head entity is replaced and x = 0 means the tail entity is replaced.
By improving the negative sampling algorithm, entity data are not replaced randomly any more, and excessive wrong negative samples generated in the negative sampling process can be avoided to a great extent, so that the relatively complex semantic correlation among the original correct triples can be kept, the TransE model is further more practical in the vectorization process, and the improved negative sampling process is shown in FIG. 3.
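A minimal sketch of this kind of Bernoulli negative sampling follows. The formula used for p, namely p = N_tp / (N_tp + N_hp), is the form commonly used for Bernoulli sampling in knowledge-graph embedding and is an assumption here, because the patent's own formula is not reproduced in this text. The function names and the rejection of candidates found in the positive set are likewise illustrative choices.

```python
import random
from collections import defaultdict


def head_replace_probability(triples):
    """Per relation, return p = N_tp / (N_tp + N_hp).

    N_tp: average number of tail entities per head entity.
    N_hp: average number of head entities per tail entity.
    The formula is an assumption (the usual Bernoulli-sampling form).
    """
    tails_of = defaultdict(set)  # (relation, head) -> set of tails
    heads_of = defaultdict(set)  # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    probs = {}
    for r in {rel for _, rel, _ in triples}:
        tail_counts = [len(ts) for (rel, _), ts in tails_of.items() if rel == r]
        head_counts = [len(hs) for (rel, _), hs in heads_of.items() if rel == r]
        n_tp = sum(tail_counts) / len(tail_counts)
        n_hp = sum(head_counts) / len(head_counts)
        probs[r] = n_tp / (n_tp + n_hp)
    return probs


def corrupt(triple, entities, positives, probs, rng=random, max_tries=100):
    """Draw one negative triple, skipping candidates that are known positives."""
    h, r, t = triple
    for _ in range(max_tries):
        e = rng.choice(entities)
        # X = 1 (probability p): replace the head; X = 0: replace the tail.
        candidate = (e, r, t) if rng.random() < probs[r] else (h, r, e)
        if candidate not in positives:
            return candidate
    return None  # no valid negative found within max_tries
```

With this choice of p, a relation in which one head has many tails tends to have its head replaced, which makes it less likely that the corrupted triple is actually a true one.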
Vectorizing the patent knowledge graph with the TransE model yields the vectorized representation of the invention name of each patent.
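For readers unfamiliar with TransE, the sketch below shows its core idea: a relation is modelled as a translation, so that head + relation is close to tail for a true triple, and training pushes positive triples to score lower than negative ones by a margin. The L2 norm and the margin of 1.0 are assumptions for illustration; the text does not state which norm or margin is used.

```python
import math


def transe_score(h, r, t):
    """Distance ||h + r - t||_2; a small value means the triple is plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(pos, neg, margin=1.0):
    """Margin ranking loss for one (positive, negative) pair of embedded triples.

    pos and neg are (h, r, t) tuples of equal-length vectors.
    """
    return max(0.0, margin + transe_score(*pos) - transe_score(*neg))


h, r = [1.0, 0.0], [0.0, 1.0]
pos = (h, r, [1.0, 1.0])   # h + r equals the tail exactly
neg = (h, r, [4.0, 5.0])   # corrupted tail, far from h + r
print(transe_score(*pos), transe_score(*neg), margin_loss(pos, neg))
```

A full training loop would minimize this loss over all positive triples and their sampled negatives with gradient descent; that part is omitted here.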
Step S3, similarity calculation
After the vectorized representation of each patent's invention name is obtained in step S2, these representations are used to calculate the Euclidean distance $d(I_i, I_j)$ between the invention-name entity of the query patent and the invention-name entity of every other patent in the patent domain database:
$d(I_i, I_j) = \sqrt{\sum_{k=1}^{d} (E_{ki} - E_{kj})^2}$
The Euclidean distance obtained is a number greater than 0, and it is normalized to the interval (0, 1] to obtain the similarity $sim(I_i, I_j)_{KG}$, which is calculated as follows:
[Equation image from the original publication; not reproduced in this text]
according to the formula, the closer the calculated numerical value is to 1, the closer the semantics of the two patent entities are, the higher the similarity is.
After the similarity calculation is completed, all similarities are sorted in descending order, and the top M (for example, 10, 20 or 30) patents most similar to the invention-name entity of the query patent are taken as the recommended similar patents.
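Step S3 can be sketched as follows. The mapping similarity = 1 / (1 + distance) is an assumption: it is one simple function that sends a non-negative distance into (0, 1] and gives values near 1 for near-identical vectors, but the patent's actual normalization formula is not reproduced in this text. The function names and the dictionary of embeddings are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_m_similar(query_id, embeddings, m=10):
    """Return [(patent_id, similarity)] for the m patents most similar to the query.

    embeddings maps a patent id to the vector of its invention name.
    similarity = 1 / (1 + distance) is an assumed normalization onto (0, 1].
    """
    q = embeddings[query_id]
    scored = [
        (pid, 1.0 / (1.0 + euclidean(q, vec)))
        for pid, vec in embeddings.items()
        if pid != query_id
    ]
    # Highest similarity first; ties are broken by patent id.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:m]
```

Because the mapping is monotonically decreasing in the distance, the ranking is the same as sorting by ascending Euclidean distance, whatever normalization is actually used.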
Step S4, IPC classification number recommendation
The IPC classification numbers of the M recommended similar patents are collected, the number of times each IPC classification number occurs among them is counted, and the N most frequent IPC classification numbers are taken as the recommended IPC classification numbers.
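The counting in step S4 amounts to a frequency vote, sketched below. Counting a classification number at most once per similar patent and breaking ties alphabetically are assumptions for illustration, and `recommend_ipc` is a hypothetical name.

```python
from collections import Counter


def recommend_ipc(similar_patents_ipc, n=3):
    """Return the n IPC classification numbers that occur most often.

    similar_patents_ipc: one list of IPC classification numbers per similar patent.
    Each number is counted once per patent (an assumption).
    """
    counts = Counter()
    for codes in similar_patents_ipc:
        counts.update(set(codes))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [code for code, _ in ranked[:n]]


# Hypothetical input: IPC numbers of M = 4 similar patents.
similar = [["A61K9/19", "A61K47/26"], ["A61K9/19"], ["A61K9/08"], ["A61K9/19", "A61K9/08"]]
print(recommend_ipc(similar, n=2))
```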
Take as an example the invention patent with main classification number A61K9/19, entitled "an esomeprazole sodium freeze-dried preparation for injection and a preparation method thereof". This patent is used as the query patent, and recommendations are made with the existing Content-Based Patent Recommendation algorithm (CB-PR) and with the knowledge-graph-based patent IPC classification number recommendation method of this embodiment.
The 10 patents most similar in content to the query patent, obtained with the content-based patent recommendation algorithm, are shown in Table 5.
TABLE 5 Top-10 similar patents obtained by the content-based patent recommendation algorithm
Rank | Title | Main classification number
1 | Azacitidine freeze-dried preparation for injection and preparation method thereof | A61K9/19
2 | Novel application of TRPML1 specific small molecule inhibitor ML-SI3 | A61K31/495
3 | Active oxygen responsive gel storage and preparation method and application thereof | A61K9/06
4 | Fludarabine phosphate freeze-drying agent and preparation method thereof | A61K9/19
5 | Lansoprazole freeze-dried preparation for injection and preparation method thereof | A61K9/19
6 | Injection of forsythin and forsythiaside and derivatives thereof for children | A61K9/08
7 | Tadalafil enteric-coated tablet and preparation method thereof | A61K9/36
8 | Freeze-drying process of bortezomib freeze-dried powder injection for injection | A61K9/19
9 | Azacitidine freeze-dried powder injection and preparation method thereof | A61K9/19
10 | Somatostatin freeze-dried powder injection pharmaceutical composition and preparation method thereof | A61K9/19
As can be seen from Table 5, in the top 10 patents similar to the query patent contents, the number of patents with main classification numbers A61K9/19 is only 6, and the number of patents with main classification numbers A61K31/495, A61K9/06, A61K9/08 and A61K9/36 is 1 each.
The 10 patents (M = 10) most similar to the query patent, obtained with the knowledge-graph-based patent IPC classification number recommendation method of this embodiment, are shown in Table 6.
TABLE 6 Top-10 similar patents obtained by the method of this embodiment
[Table image from the original publication; contents not reproduced in this text]
As can be seen from Table 6, among the top 10 patents recommended by the knowledge-graph-based patent IPC classification number recommendation method of this embodiment, IPC classification number A61K9/19 appears 9 times and IPC classification number A61K31/56 appears once. When N is 1, A61K9/19 is taken as the recommended main classification number.
As the main classification numbers of the top-10 ranked patents show, the accuracy of the knowledge-graph-based patent IPC classification number recommendation method of this embodiment is clearly higher than that of the content-based patent recommendation algorithm.
The knowledge-graph-based patent IPC classification number recommendation method of this embodiment was used to recommend the main classification number of the invention entitled "an esomeprazole sodium freeze-dried preparation for injection and a preparation method thereof". With N = 1 and M = 10, 20, 30, 50 and 100, the recommendation results are shown in Table 7.
TABLE 7 IPC classification number recommendation results for N = 1 and different values of M
M | Recommended IPC classification number
10 | A61K9/19
20 | A61K9/19
30 | A61K9/19
50 | A61K9/19
100 | A61K9/19
100 published patents were selected as query patents to verify the accuracy of the content-based patent recommendation algorithm and of the knowledge-graph-based patent IPC classification number recommendation method of this embodiment. The patents ranked in the top 10, top 20, top 30, top 50 and top 100 by similarity were selected in turn as the similar-patent recommendation results, with the IPC classification number representing the technical field. The recommendations given by the two methods were compared with the actual technical fields and the proportion of correct results was calculated. The prediction accuracy of the two methods for the technical fields of the 100 query patents is shown in Fig. 4. As Fig. 4 shows, the prediction accuracy of the Knowledge-Graph-based Patent IPC classification number Recommendation method (KG-PR) is 20% higher than that of the Content-Based Patent Recommendation algorithm (CB-PR). KG-PR is therefore the more practical of the two for recommending the technical field of a patent, which further shows that the patent knowledge graph constructed in this embodiment is very effective for this task.
A patent usually relates to more than one technical field. Providing several IPC classification numbers as the prediction result saves the user time in determining the technical field of a patent, determines the technical field of each patent more accurately, and lets the user analyze the patent from several angles. In this embodiment, in line with users' actual needs, for each query patent the three (N = 3) IPC classification numbers that occur most frequently among the 30 (M = 30) recommended similar patents are recommended to the user as the prediction result. The prediction results were evaluated as follows: when the recommended IPC classification numbers are compared with the actual main IPC classification number at the level of the IPC group, the technical field prediction accuracy for the 100 query patents is 78%; when they are compared at the level of the IPC main group (large group), the accuracy is 98%.
The knowledge-graph-based patent IPC classification number recommendation method was used to predict the technical field of the query patents. 100 patents were selected as query patents, the candidate patents were sorted in descending order of semantic similarity, and the 30 most similar patents were selected as recommended similar patents. The number of occurrences of each IPC classification number among the 30 similar patents was counted to predict the technical field of each of the 100 patents, and the predictions were compared with the actual technical fields. The accuracy obtained when recommending one main IPC classification number and when recommending several IPC classification numbers is shown in Table 8.
TABLE 8 Recommendation accuracy for the technical field of 100 patents
[Table image from the original publication; contents not reproduced in this text]
As can be seen from Table 8, three IPC classification numbers are recommended for each query patent, which can greatly improve the accuracy of technical field prediction.
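The two ways of scoring a prediction, by the full classification number and by the main group only, can be illustrated with the sketch below. It relies on the IPC notation in which the part before the slash (for example A61K9 in A61K9/19) identifies the main group. Reading the "large group" of the text as the IPC main group is an interpretation, and the function names and sample data are hypothetical.

```python
def main_group(ipc_code):
    """'A61K9/19' -> 'A61K9': the part before the slash identifies the main group."""
    return ipc_code.split("/")[0].strip()


def hit_rate(recommended, actual_main, level="full"):
    """Fraction of query patents whose actual main IPC number is among the recommendations.

    recommended: one list of recommended IPC numbers per query patent.
    actual_main: the actual main IPC numbers, aligned with `recommended`.
    level: "full" compares whole numbers, "main_group" compares main groups only.
    """
    key = main_group if level == "main_group" else (lambda code: code.strip())
    hits = sum(
        key(actual) in {key(code) for code in recs}
        for recs, actual in zip(recommended, actual_main)
    )
    return hits / len(actual_main)


# Hypothetical recommendations (N up to 3) for three query patents.
recs = [["A61K9/19", "A61K31/56"], ["A61K9/08"], ["A61K47/26"]]
actual = ["A61K9/19", "A61K9/19", "A61K31/495"]
print(hit_rate(recs, actual, "full"), hit_rate(recs, actual, "main_group"))
```

A coarser comparison level can only keep or raise the hit rate, which is in line with the higher accuracy the text reports for the main-group comparison.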
The second embodiment is as follows: as shown in fig. 5, the present embodiment provides a knowledge-graph-based patent IPC classification number recommendation apparatus, including:
a patent knowledge graph construction module 1 configured to construct a patent knowledge graph including entities of a query patent and several patents having the same technical field as the query patent and relationships between the entities, the entities including an applicant, an inventor, an IPC classification number, an invention name, and keywords;
the entity vectorization module 2 is configured to perform vectorization representation on the entities in the patent knowledge graph by using a TransE model to obtain vectorization representation of the invention name of each patent in the patent knowledge graph;
a similarity calculation module 3 configured to calculate similarities between the query patent and patents in the database by using vectorization expression of the invention name, and take the M patents with the highest similarity to the query patent as recommended similar patents; and
and the IPC classification number recommending module 4 is configured to count the occurrence times of the IPC classification numbers of all the recommended similar patents, and take the N IPC classification numbers with the highest occurrence times as the recommended IPC classification numbers.
As a preferred embodiment of the present application, the patent knowledge graph building module 1 includes:
a patent domain database construction sub-module 11 configured to retrieve a plurality of patents having the same technical field as the query patent from a patent retrieval database, and merge the plurality of patents and the query patent into a patent domain database;
an entity extraction sub-module 12 configured to extract an applicant, an inventor, an IPC classification number, an invention name, and a keyword of each patent in the patent domain database as an entity; and
and a patent knowledge map construction submodule 13 configured to save the entities of each patent and the relationship between the entities into a Neo4j database to form a patent knowledge map.
As a preferred embodiment of the present application, the similarity is expressed as: and calculating the Euclidean distance between the inquired patent and each patent in the patent knowledge graph by using vectorization expression of the invention name.
As a preferred embodiment of the present application, M ≥ 10.
As a preferred embodiment of the present application, N has a value of 3.
The principle and effect of the apparatus for recommending patent IPC based on knowledge map in this embodiment are the same as those of the method for recommending patent IPC based on knowledge map in the first embodiment, and are not described herein again.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed by a computer, the processes or functions described in the embodiments of the application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., Solid State Disks (SSDs)), among others.
Those skilled in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A patent IPC classification number recommendation method based on knowledge graph is characterized by comprising the following steps:
constructing a patent knowledge graph, wherein the patent knowledge graph comprises entities of a query patent and a plurality of patents having the same technical field as the query patent and the relationship among the entities, and the entities comprise applicants, inventors, IPC classification numbers, invention names and keywords;
vectorizing the entities in the patent knowledge graph by using a TransE model to obtain a vectorized representation of the invention name of each patent in the patent knowledge graph;
calculating the similarity between the query patent and each patent in a database by using the vectorized representations of the invention names, and taking the M patents with the highest similarity to the query patent as recommended similar patents;
and counting the number of occurrences of the IPC classification numbers of all the recommended similar patents, and taking the N IPC classification numbers with the most occurrences as the recommended IPC classification numbers.
2. The method of claim 1, wherein the constructing a patent knowledge graph comprises:
searching a plurality of patents in the same technical field as the query patent from a patent retrieval database, and merging the plurality of patents and the query patent into a patent field database;
extracting the applicant, inventor, IPC classification number, invention name and keyword of each patent in the patent domain database as an entity;
and storing the entities of each patent and the relationship among the entities into a Neo4j database to form a patent knowledge graph.
3. The method according to claim 1 or 2, wherein the similarity is computed as the Euclidean distance between the query patent and each patent in the patent knowledge graph, using the vectorized representations of the invention names.
4. The method of claim 1, wherein M ≥ 10.
5. The method of claim 1, wherein N has a value of 3.
6. A patent IPC classification number recommendation device based on knowledge graph is characterized by comprising:
a patent knowledge graph construction module configured to construct a patent knowledge graph including entities of a query patent and several patents having the same technical field as the query patent and relationships between the entities, the entities including an applicant, an inventor, an IPC classification number, an invention name, and keywords;
an entity vectorization module configured to vectorize the entities in the patent knowledge graph by using a TransE model to obtain a vectorized representation of the invention name of each patent in the patent knowledge graph;
a similarity calculation module configured to calculate the similarity between the query patent and each patent in a database by using the vectorized representations of the invention names, and to take the M patents with the highest similarity to the query patent as recommended similar patents; and
an IPC classification number recommendation module configured to count the number of occurrences of the IPC classification numbers of all the recommended similar patents, and to take the N IPC classification numbers with the most occurrences as the recommended IPC classification numbers.
7. The apparatus of claim 6, wherein the patent knowledge graph construction module comprises:
a patent domain database construction sub-module configured to retrieve a number of patents having the same technical field as the query patent from a patent retrieval database, and merge the number of patents and the query patent into a patent domain database;
an entity extraction sub-module configured to extract an applicant, an inventor, an IPC classification number, an invention name, and a keyword of each patent in the patent domain database as an entity; and
a patent knowledge graph construction sub-module configured to store the entities of each patent and the relationships among the entities in a Neo4j database to form the patent knowledge graph.
8. The apparatus according to claim 6 or 7, wherein the similarity is computed as the Euclidean distance between the query patent and each patent in the patent knowledge graph, using the vectorized representations of the invention names.
9. The apparatus of claim 6, wherein M ≥ 10.
10. The apparatus of claim 6, wherein N has a value of 3.
CN202111009919.3A 2021-08-31 2021-08-31 Patent IPC classification number recommendation method and device based on knowledge graph Pending CN114357086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111009919.3A CN114357086A (en) 2021-08-31 2021-08-31 Patent IPC classification number recommendation method and device based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111009919.3A CN114357086A (en) 2021-08-31 2021-08-31 Patent IPC classification number recommendation method and device based on knowledge graph

Publications (1)

Publication Number Publication Date
CN114357086A true CN114357086A (en) 2022-04-15

Family

ID=81095604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111009919.3A Pending CN114357086A (en) 2021-08-31 2021-08-31 Patent IPC classification number recommendation method and device based on knowledge graph

Country Status (1)

Country Link
CN (1) CN114357086A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595197A (en) * 2023-07-10 2023-08-15 清华大学深圳国际研究生院 Link prediction method and system for patent classification number associated knowledge graph
CN116595197B (en) * 2023-07-10 2023-11-07 清华大学深圳国际研究生院 Link prediction method and system for patent classification number associated knowledge graph

Similar Documents

Publication Publication Date Title
US9418144B2 (en) Similar document detection and electronic discovery
US6665661B1 (en) System and method for use in text analysis of documents and records
Schwartz et al. A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
WO2022142027A1 (en) Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
US8533203B2 (en) Identifying synonyms of entities using a document collection
JP6056610B2 (en) Text information processing apparatus, text information processing method, and text information processing program
CN102789452A (en) Similar content extraction method
Adamu et al. A survey on big data indexing strategies
CN103761286B (en) A kind of Service Source search method based on user interest
KR20180129001A (en) Method and System for Entity summarization based on multilingual projected entity space
CN110727769A (en) Corpus generation method and device, and man-machine interaction processing method and device
CN110795613A (en) Commodity searching method, device and system and electronic equipment
CN114357086A (en) Patent IPC classification number recommendation method and device based on knowledge graph
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114491232B (en) Information query method and device, electronic equipment and storage medium
JP6260678B2 (en) Information processing apparatus, information processing method, and information processing program
US20150169583A1 (en) Trending analysis for streams of documents
CN114943285A (en) Intelligent auditing system for internet news content data
CN117056392A (en) Big data retrieval service system and method based on dynamic hypergraph technology
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
Bochkaryov et al. The use of clustering algorithms ensemble with variable distance metrics in solving problems of web mining
CN116414939B (en) Article generation method based on multidimensional data
CN115408491B (en) Text retrieval method and system for historical data
CN112966126B (en) High-reliability knowledge base construction method capable of inquiring and tracing mass unstructured data content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination