CN109903198B - Patent comparative analysis method - Google Patents


Info

Publication number
CN109903198B
Authority
CN
China
Prior art keywords
phrase
target
important
candidate
difference
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910067706.2A
Other languages
Chinese (zh)
Other versions
CN109903198A (en)
Inventor
汪云霄
覃婷婷
刘峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910067706.2A
Publication of CN109903198A
Application granted
Publication of CN109903198B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a patent comparative analysis method. The method uses web crawler technology to establish a patent database, establishes a candidate phrase set for a patent document set based on word segmentation, extracts an important phrase set based on an optimization method, calculates similarity scores and difference scores between the important phrases and the target patent and the comparison patent, and extracts the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization method, thereby realizing patent comparative analysis quickly and effectively.

Description

Patent comparative analysis method
Technical Field
The invention relates to a patent comparative analysis method, and belongs to the field of natural language processing and patent analysis.
Background
Patent comparative analysis is a type of patent analysis. An effective comparative analysis method can rapidly identify the similarities and differences between patent documents. In a certain sense, the patent level of an enterprise represents its overall innovation level, so the core personnel of an enterprise can use comparative analysis to identify the core technologies of other enterprises and thereby formulate an effective technical strategy.
Many patent retrieval and analysis systems exist today, such as incoPat, SooPAT and PatSnap, but they mainly provide patent retrieval and simple statistical analysis, and such basic analysis cannot meet the needs of in-depth patent mining. In addition, the annual number of patent applications is rising rapidly and the workload of manually examining patents keeps increasing, so developing an automatic patent comparative analysis system is of great significance.
In view of the above, it is necessary to provide a patent comparative analysis method to solve the above problems.
Disclosure of Invention
The invention aims to provide a patent comparative analysis method that mines the similarities and differences among patent documents more deeply, so that the patent value of a target patent can be found more accurately and quickly.
In order to achieve the above object, the present invention provides a patent comparative analysis method, which comprises the following steps:
S1, establishing a patent database based on a web crawler method;
S2, extracting a patent document set D of the target subject from the patent database, and establishing a candidate phrase set of the patent document set D, wherein the patent document set D comprises at least one target patent and at least one comparison patent;
S3, extracting, from the candidate phrase set and based on an optimization selection model, the important phrase set of the target patent and the comparison patent, wherein the important phrase set comprises a target patent important phrase set and a comparison patent important phrase set;
S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity score and difference score between each important phrase in the important phrase set and the target patent, and between each important phrase and the comparison patent;
S5, extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent, respectively, based on an optimization target method.
As a further improvement of the present invention, the step S1 specifically includes: selecting a plurality of target patent websites, constructing a plurality of crawler modules by using a distributed crawler framework, starting a plurality of crawler threads to crawl the target patent websites simultaneously, establishing a database table to store the crawled patent information according to the composition of the crawled patent information, and constructing a patent database.
As a further improvement of the present invention, the step S2 specifically includes:
S21, extracting a patent document set D of the target subject from the patent database;
S22, performing word segmentation on the patent documents in the patent document set D to obtain a segmented-word set of the patent document set D, wherein the segmented-word set comprises a plurality of segmented words;
S23, establishing a stop word list, and filtering the segmented words in the segmented-word set according to the stop word list to obtain an effective segmented-word set of the patent document set D;
S24, calculating the mutual information value MI of the segmented words in the effective segmented-word set, so as to extract the candidate phrase set of the patent document set D from the effective segmented-word set.
As a further improvement of the present invention, the step S24 specifically includes: defining a word frequency threshold F and a mutual information threshold I, and calculating the mutual information value MI of each candidate segmented word from the joint distribution and the marginal distributions of the candidate segmented words in the effective segmented-word set; if the frequency of the candidate segmented word is greater than the frequency threshold F, adding it to the candidate phrase set; if its frequency is less than the frequency threshold F, considering its mutual information value MI: if MI is greater than the mutual information threshold I, adding it to the candidate phrase set, and otherwise discarding it.
As a further improvement of the present invention, the step S3 specifically includes:
S31, calculating the significance score of each candidate phrase in the candidate phrase set with respect to the patent document in which it is located, so as to characterize the significance of the candidate phrase in that patent document;
S32, calculating the uniqueness score of each candidate phrase in the candidate phrase set with respect to the patent document in which it is located, so as to characterize the uniqueness of the candidate phrase in that patent document;
S33, extracting, based on an optimization selection method and by combining the significance score and the uniqueness score of each candidate phrase in the candidate phrase set, the important phrase set S of the target patent and the comparison patent, wherein the important phrase set S comprises a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.
As a further improvement of the present invention, the step S33 specifically includes: defining the threshold on the number of important phrases in an important phrase set as K, taking the significance score and the uniqueness score of the candidate phrases in the candidate phrase set as extraction criteria, establishing an optimization objective, and obtaining through the optimization objective the important phrase set of the target patent and the comparison patent, wherein the important phrase set comprises the target patent important phrase set and the comparison patent important phrase set; the target patent important phrase set comprises K important phrases related to the target patent, and the comparison patent important phrase set comprises K important phrases related to the comparison patent.
As a further improvement of the present invention, the step S4 specifically includes:
S41, constructing an important phrase-patent document bipartite graph;
S42, calculating, in the important phrase-patent document bipartite graph, the correlation between the important phrase and the target patent and the correlation between the important phrase and the comparison patent;
S43, calculating, in the important phrase-patent document bipartite graph, the similarity scores between the important phrases and the target patent and the comparison patent;
S44, calculating, in the important phrase-patent document bipartite graph, the difference scores between the important phrases and the target patent and the comparison patent.
As a further improvement of the present invention, the step S5 specifically includes:
S51, obtaining a similar phrase set C between the target patent and the comparison patent based on an optimization target method, by combining the similarity scores between the important phrases in the important phrase set S and the target patent and the comparison patent;
S52, obtaining a target patent difference phrase set and a comparison patent difference phrase set based on an optimization target method, by combining the difference scores between the important phrases in the important phrase set S and the target patent and the comparison patent.
As a further improvement of the present invention, the step S51 specifically includes: defining an optimization objective and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, the similarity constraints ensuring that the similarity score of each extracted similar phrase is greater than both the average similarity score of the target patent important phrase set and the average similarity score of the comparison patent important phrase set.
As a further improvement of the present invention, the step S52 specifically includes: defining an optimization objective and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, the difference constraints ensuring that the difference score of each extracted difference phrase is greater than both the average difference score of the target patent important phrase set and the average difference score of the comparison patent important phrase set, and that the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set and the comparison patent difference phrase set have no intersection with one another.
The invention has the following beneficial effects: the patent comparative analysis method establishes a patent database using web crawler technology, establishes a candidate phrase set of the patent document set D based on word segmentation, extracts an important phrase set S based on an optimization method, calculates the similarity scores and difference scores between the important phrases and the target patent and the comparison patent, and extracts the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization method, thereby realizing patent comparative analysis quickly and effectively.
Drawings
FIG. 1 is a structural function diagram of the comparative analysis method of the present invention.
FIG. 2 is a flow chart of the comparative analysis method of the present invention.
FIG. 3 is a structural diagram of the important phrase-patent document bipartite graph in FIG. 2.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1 in combination with FIG. 2, the present invention discloses a patent comparative analysis method, which includes the following steps:
S1, establishing a patent database based on a web crawler method;
S2, extracting a patent document set D of the target subject from the patent database, and establishing a candidate phrase set P of the patent document set D, wherein the patent document set D comprises at least one target patent $d_c$ and at least one comparison patent $d'_c$;
S3, extracting, from the candidate phrase set P and based on an optimization selection model, the important phrase set S of the target patent $d_c$ and the comparison patent $d'_c$, wherein the important phrase set S comprises the target patent important phrase set $S_{d_c}$ and the comparison patent important phrase set $S_{d'_c}$, that is, $S = S_{d_c} \cup S_{d'_c}$;
S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity score and difference score between each important phrase in the important phrase set S and the target patent $d_c$, and between each important phrase and the comparison patent $d'_c$;
S5, extracting the similar phrase set and the difference phrase sets of the target patent $d_c$ and the comparison patent $d'_c$, respectively, based on an optimization target method.
Steps S1 to S5 are described in detail below.
Step S1 specifically includes establishing a patent database by a web crawler method. A web crawler is an efficient information acquisition tool that can obtain various data resources quickly and accurately. However, prior-art web crawlers are easily blocked when a website applies an anti-crawling strategy, which severely limits the number of requests allowed from the same IP address and the same account within a period of time. For this reason, the patent comparative analysis method builds a crawler disguise module by maintaining a proxy IP pool and a Cookies pool, builds a plurality of crawler modules using a distributed crawler framework, starts a plurality of crawler threads to crawl the target patent websites simultaneously, acquires patent information using the requests library and the bs4 web page parsing package, and creates suitable database tables according to the composition of the acquired patent information to store the crawled patent information.
Further, the patent information crawled by the web crawler method is stored in the patent database according to the table structure, so as to ensure that the content of the patent database is comprehensive and its operation is stable.
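The disguise rotation and the storage side of step S1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the table name `patent`, its columns, the pool contents, the sample record and the helper names are all hypothetical, SQLite stands in for whatever database is actually used, and no HTTP request is made.

```python
import itertools
import sqlite3

# Hypothetical pools; a real crawler would refresh these continuously.
PROXY_POOL = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
COOKIE_POOL = [{"session": "a"}, {"session": "b"}, {"session": "c"}]


def disguise_settings(proxies, cookies):
    """Yield (proxy, cookie) pairs round-robin so that consecutive requests
    do not keep reusing the same IP address and account."""
    proxy_cycle = itertools.cycle(proxies)
    cookie_cycle = itertools.cycle(cookies)
    while True:
        yield next(proxy_cycle), next(cookie_cycle)


def create_patent_table(conn):
    """Create a table whose columns mirror the crawled patent fields (assumed)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS patent (
               application_number TEXT PRIMARY KEY,
               title TEXT,
               applicant TEXT,
               ipc_class TEXT,
               content TEXT
           )"""
    )


def store_patent(conn, record):
    """Insert or overwrite one crawled patent record given as a dict."""
    conn.execute(
        "INSERT OR REPLACE INTO patent VALUES "
        "(:application_number, :title, :applicant, :ipc_class, :content)",
        record,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_patent_table(conn)
settings = disguise_settings(PROXY_POOL, COOKIE_POOL)
proxy, cookie = next(settings)  # would be handed to the HTTP client per request

# Made-up record used only to exercise the table.
store_patent(conn, {
    "application_number": "EXAMPLE-0001",
    "title": "Example patent",
    "applicant": "Example applicant",
    "ipc_class": "EXAMPLE",
    "content": "Example text",
})
```

In a real crawler each worker thread would draw its own `(proxy, cookie)` pair from the generator before every request, so a blocked IP or account affects only one request rather than the whole crawl.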
Step S2 specifically includes:
S21, extracting a patent document set D of the target subject from the patent database;
S22, performing word segmentation on the patent documents in the patent document set D to obtain a segmented-word set of the patent document set D, wherein the segmented-word set comprises a plurality of segmented words;
S23, establishing a stop word list, and filtering the segmented words in the segmented-word set according to the stop word list to obtain an effective segmented-word set of the patent document set D;
S24, calculating the mutual information value MI of the segmented words in the effective segmented-word set, so as to extract the candidate phrase set P of the patent document set D from the effective segmented-word set.
In step S21, the patent document set D of the target subject is extracted from the patent database mainly by filtering on IPC classification numbers or by setting keywords. In the present invention, the patent document set is $D = \{d_1, d_2, \ldots, d_n\}$, where n is the number of patent documents in the patent document set D. Any patent document d mainly includes the application number, application date, applicant, address, inventor, patent agency, IPC classification number, content of the invention and the like. The target patent $d_c$ and the comparison patent $d'_c$ are defined such that $d_c, d'_c \in D$ and $d_c \neq d'_c$.
Because of the drafting format required of patent documents, a patent document d generally has long text, complex language and interfering words, so analyzing it directly would introduce large errors into the comparative analysis result. Therefore, in steps S22 to S24 of the present invention, the patent documents d in the patent document set D are processed by natural language processing to establish the candidate phrase set P of the patent document set D of the target subject. The following description takes a patent document d in Chinese text as an example.
In step S22, because Chinese sentences are unstructured in many ways and the word sequence of a sentence has no obvious rules or boundaries, the Chinese text of the patent document d needs to be segmented into words before natural language processing. Preferably, in this embodiment, a general-purpose Chinese word segmentation system may be used to segment the patent documents d to obtain the segmented-word set of the patent document set D, wherein the segmented-word set comprises a plurality of segmented words.
In step S23, stop words are defined as words without actual meaning, including function words, connectives and the like, such as "is" and "and". A stop word list is established, and the segmented words in the segmented-word set are filtered according to the stop word list to obtain the effective segmented-word set of the patent document set D.
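The stop-word filtering of step S23 amounts to a set-membership test on each segmented word. The sketch below assumes the text has already been segmented; the three-entry stop word list and the sample tokens are hypothetical.

```python
# Hypothetical stop word list: "is", "and", and a possessive particle.
STOP_WORDS = {"是", "和", "的"}


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only non-empty segmented words that are not in the stop word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Output of a word segmenter for one sentence (assumed).
segmented = ["专利", "和", "对比", "分析", "是", "方法"]
effective = filter_stop_words(segmented)
```

Using a `set` for the list keeps each lookup constant-time, which matters when the document set D contains many long patent texts.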
Conventional phrase selection methods consider only the frequency of the segmented words and therefore ignore segmented words that occur rarely but are semantically rich. To prevent this problem, in step S24 the candidate phrase set P of the patent document set D is extracted from the effective segmented-word set by calculating the mutual information value MI of the candidate segmented words in the effective segmented-word set, where $P = \{p_1, p_2, \ldots, p_m\}$, p is a candidate phrase, and m is the number of candidate phrases p in the candidate phrase set P.
Specifically, in step S24, the word frequency threshold is defined as F and the mutual information threshold of the segmented words is defined as I, and the mutual information value MI is calculated as follows:
Figure BDA0001956266100000071
where X and Y are two candidate segmented words in the effective segmented-word set; p(X, Y) is the joint distribution of the two candidate segmented words X and Y; p(X) is the marginal distribution of the candidate segmented word X; and p(Y) is the marginal distribution of the candidate segmented word Y. If the frequency of the candidate segmented word is greater than the set frequency threshold F, it is added to the candidate phrase set P. If its frequency is less than F, its mutual information value MI in the corresponding patent document d is considered: if MI is greater than the set mutual information threshold I, it is added to the candidate phrase set P; otherwise it is discarded.
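The frequency/MI rule of step S24 can be sketched as follows. The sketch assumes the common pointwise form MI(X, Y) = log( p(X, Y) / (p(X) p(Y)) ) with the natural logarithm, which matches the quantities named above but is an assumption about the formula; it also assumes that candidate phrases are adjacent word pairs and that probabilities are estimated from raw counts. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def extract_candidate_phrases(docs, freq_threshold, mi_threshold):
    """docs: list of token lists with stop words already removed.

    Returns {(x, y): MI} for adjacent word pairs that pass the rule:
    keep if frequency > F, otherwise keep only if MI > I.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    candidates = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total_bi           # joint distribution estimate
        p_x = unigrams[x] / total_uni     # marginal distribution of X
        p_y = unigrams[y] / total_uni     # marginal distribution of Y
        mi = math.log(p_xy / (p_x * p_y))
        if count > freq_threshold or mi > mi_threshold:
            candidates[(x, y)] = mi
    return candidates


docs = [
    ["patent", "database", "crawler"],
    ["patent", "database", "phrase"],
    ["phrase", "set", "patent", "database"],
]
candidates = extract_candidate_phrases(docs, freq_threshold=2, mi_threshold=1.0)
```

Here ("patent", "database") is kept because it is frequent, while the rare pair ("phrase", "set") is kept only because its MI exceeds the threshold, which is exactly the case the frequency-only methods miss.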
Step S3 specifically includes:
S31, calculating the significance score of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the significance of the candidate phrase p in that patent document d;
S32, calculating the uniqueness score of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the uniqueness of the candidate phrase p in that patent document d;
S33, extracting, based on the optimization selection method and by combining the significance score and the uniqueness score of each candidate phrase p in the candidate phrase set P, the important phrase set S of the target patent $d_c$ and the comparison patent $d'_c$, wherein the important phrase set S comprises the target patent important phrase set $S_{d_c}$ and the comparison patent important phrase set $S_{d'_c}$.
After the candidate phrase set P has been extracted from the entire patent document set D, each patent document d may be regarded as being composed of several candidate phrases p. In fact, most candidate phrases p cannot represent the patent document d, so the candidate phrase set P of the patent document set D needs further processing to better characterize each patent document d in the patent document set D.
Specifically, step S31 is mainly used to calculate the significance score $r_{p,d}$ of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the significance of p in d. If a candidate phrase p appears frequently in the patent document d in which it is located and rarely in the other patent documents of the patent document set D, the candidate phrase p is strongly significant for that patent document d. The significance of a single candidate phrase p for its patent document d can therefore be represented by the significance score $r_{p,d}$, which is expressed as:
Figure BDA0001956266100000083
where $P_d$ denotes the set of all candidate phrases p of the patent document d, n(p, d) denotes the frequency of occurrence of the candidate phrase p in the patent document d in which it is located, and n(p, D) denotes the frequency of occurrence of the candidate phrase p in the patent document set D.
Step S32 is mainly used to calculate the uniqueness score of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the uniqueness of the candidate phrase p in that patent document d. Specifically, an important candidate phrase p needs to differ from the other candidate phrases in the candidate phrase set P, that is, it needs to be strongly unique, so the uniqueness of a single candidate phrase p can be calculated from the semantic similarity between candidate phrases.
In step S32, the uniqueness of a single candidate phrase p is obtained from a semantic-tree-based semantic similarity calculation: semantic similarity is measured using information content, and a semantic dictionary is used to construct a semantic tree, so that the semantic similarity $Sim(p_i, p_j)$ between the i-th candidate phrase $p_i$ and the j-th candidate phrase $p_j$ is calculated from the path length between the candidate phrases and used to characterize the uniqueness of the candidate phrase p.
Further, in step S33, based on the optimization selection method, the significance score $r_{p,d}$ and the uniqueness score of each candidate phrase p in the candidate phrase set P are combined to extract the important phrase set S of the target patent $d_c$ and the comparison patent $d'_c$. In the present invention, the threshold on the number of important phrases p′ in the important phrase set S is defined as K, and the significance and uniqueness of the candidate phrases p in the candidate phrase set P are used as extraction criteria to establish the optimization objective:
Figure BDA0001956266100000091
where the objective combines the sum of the significance scores of all important phrases p′ in the important phrase set, each significance score carrying a weight, with the sum of the comprehensive similarity scores of all important phrases p′ in the important phrase set; the latter is a penalty term in the optimization objective, because the more similar a candidate phrase p is to the other candidate phrases, the less unique it is; and μ and λ are the weights of the respective score terms. In this way, the important phrase set of the target patent $d_c$ and the important phrase set of the comparison patent $d'_c$ can be extracted.
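A selection of this kind, rewarding significance and penalizing similarity to phrases already chosen, can be approximated greedily. The sketch below is an illustration under stated assumptions rather than the patent's solver: it assumes an objective of the form μ · (sum of significance scores) − λ · (sum of pairwise similarities) over at most K phrases, uses a greedy heuristic instead of an exact optimizer, and substitutes a hypothetical word-overlap similarity for the semantic-tree similarity $Sim(p_i, p_j)$. All names and sample scores are made up.

```python
def word_overlap(p, q):
    """Hypothetical stand-in for Sim(p_i, p_j): Jaccard overlap of the words."""
    a, b = set(p.split()), set(q.split())
    return len(a & b) / len(a | b)


def select_important_phrases(scores, sim, k, mu=1.0, lam=1.0):
    """Greedily pick up to k phrases, each time taking the phrase with the best
    marginal gain: mu * significance - lam * similarity to phrases already picked."""
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:
        def gain(p):
            penalty = sum(sim(p, q) for q in selected)
            return mu * scores[p] - lam * penalty

        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up significance scores for one patent document.
significance = {"web crawler": 0.9, "crawler module": 0.8, "bipartite graph": 0.6}
important = select_important_phrases(significance, word_overlap, k=2)
```

With these numbers "crawler module" has the second-highest significance but loses to "bipartite graph" because it overlaps with the already selected "web crawler", which is the effect the penalty term is meant to produce.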
Step S4 specifically includes:
S41, constructing an important phrase-patent document bipartite graph;
S42, calculating, in the important phrase-patent document bipartite graph, the degree of correlation between the important phrase p′ and the target patent $d_c$ and the comparison patent $d'_c$;
S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between the important phrase p′ and the target patent $d_c$ and the comparison patent $d'_c$;
S44, calculating, in the important phrase-patent document bipartite graph, the difference score between the important phrase p′ and the target patent $d_c$ and the comparison patent $d'_c$.
In step S41, the important phrase-patent document bipartite graph can be used to characterize the correlation between the important phrase set S and the patent document set D (see FIG. 3). Each important phrase p′ node has a connecting edge to the patent document d, and the weight of the connecting edge can be obtained by calculating the BM25 relevance.
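The edge weights of such a bipartite graph can be sketched with the standard Okapi BM25 formula. The sketch assumes that each patent document is represented as a list of its phrases, that the non-negative IDF variant ln((N − n + 0.5)/(n + 0.5) + 1) is used, and that k1 = 1.5 and b = 0.75; these parameter values, the function name and the sample documents are assumptions, not taken from the patent.

```python
import math


def bm25_weight(phrase, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 relevance of one phrase to one document.

    doc is a list of phrases; corpus is the list of all such documents.
    """
    n_docs = len(corpus)
    containing = sum(1 for other in corpus if phrase in other)
    idf = math.log((n_docs - containing + 0.5) / (containing + 0.5) + 1.0)
    tf = doc.count(phrase)
    avg_len = sum(len(other) for other in corpus) / n_docs
    length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


# Made-up phrase lists for a target patent and a comparison patent.
documents = {
    "d_c": ["web crawler", "phrase set", "web crawler"],
    "d_c_prime": ["phrase set", "bipartite graph"],
}
important_phrases = ["web crawler", "phrase set", "bipartite graph"]
corpus = list(documents.values())

# Bipartite graph stored as {(phrase node, document node): edge weight}.
edges = {
    (phrase, doc_id): bm25_weight(phrase, doc, corpus)
    for phrase in important_phrases
    for doc_id, doc in documents.items()
}
```

A phrase that never occurs in a document gets weight 0, so the edge is present in the dictionary but carries no relevance; a graph library could drop such edges instead.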
Further, in step S42, the random-walk-based SimRank algorithm can be used to calculate, in the important phrase-patent document bipartite graph, the degree of correlation $f(p', d_c)$ between the important phrase p′ and the target patent $d_c$ and the degree of correlation $f(p', d'_c)$ between the important phrase p′ and the comparison patent $d'_c$.
Step S43 is mainly used for calculating the similarity score $\Phi(p', d_c, d'_c)$ between the important phrase p′ and the target patent $d_c$ and the comparison patent $d'_c$:
$$\Phi(p', d_c, d'_c) = \ln\bigl(1 + f(p', d_c) \cdot f(p', d'_c)\bigr)$$
where $f(p', d_c)$ is the degree of correlation between the important phrase p′ and the target patent $d_c$, and $f(p', d'_c)$ is the degree of correlation between the important phrase p′ and the comparison patent $d'_c$.
In fact, when an important phrase p′ has a high degree of correlation with both the target patent $d_c$ and the comparison patent $d'_c$, it has strong importance in the important phrase set S. Thus, for a given important phrase p′, the larger its degree of correlation $f(p', d_c)$ with the target patent $d_c$ and its degree of correlation $f(p', d'_c)$ with the comparison patent $d'_c$, the higher its similarity score $\Phi(p', d_c, d'_c)$. In the present invention, the similarity score is calculated by taking the logarithm of one plus the product of the two degrees of correlation, so that $f(p', d_c)$ and $f(p', d'_c)$ are considered together, which better characterizes the similarity of the important phrase p′ to the target patent $d_c$ and the comparison patent $d'_c$.
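The similarity score is a direct transcription of the formula ln(1 + f(p′, d_c) · f(p′, d′_c)). In the sketch below the function name and the sample correlation values are hypothetical; the correlations are assumed to be non-negative, as SimRank-style scores are.

```python
import math


def similarity_score(f_target, f_comparison):
    """Phi = ln(1 + f(p', d_c) * f(p', d'_c)) for one important phrase."""
    return math.log(1.0 + f_target * f_comparison)


# Made-up correlations: high with both patents versus high with only one.
both_high = similarity_score(0.8, 0.7)
one_low = similarity_score(0.8, 0.1)
```

Because the two correlations are multiplied, a phrase scores high only when it is related to both patents; a correlation of zero with either patent gives a score of exactly zero.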
Step S44 is mainly used for calculating the difference score $\Omega(p', d_c \mid d'_c)$ of the important phrase p′ between the target patent $d_c$ and the comparison patent $d'_c$:
Figure BDA0001956266100000101
where γ is a smoothing parameter that prevents the degree of correlation $f(p', d_c)$ between the important phrase p′ and the target patent $d_c$ and the degree of correlation $f(p', d'_c)$ between the important phrase p′ and the comparison patent $d'_c$ from tending towards 0.
Specifically, when calculating the difference score $\Omega(p', d_c \mid d'_c)$ between the target patent $d_c$ and the comparison patent $d'_c$, the degree of correlation of the important phrase p′ with one of the two patents should be very high and with the other very low, and the important phrase p′ should have high importance in the important phrase set S. For the target patent $d_c$, a high difference score $\Omega(p', d_c \mid d'_c)$ of the important phrase p′ therefore arises in two cases. First, if the correlation of the important phrase p′ with the target patent $d_c$ is very high and its correlation with the comparison patent $d'_c$ is relatively low, the difference score is high; likewise, if its correlation with the target patent $d_c$ is relatively high and its correlation with the comparison patent $d'_c$ is very low, the difference score is also high.
Second, if the important phrase p′ is not significantly similar to the target patent $d_c$ but is significantly different from the comparison patent $d'_c$, it can also serve as a difference phrase reflecting the difference between the target patent $d_c$ and the comparison patent $d'_c$. Likewise, if the important phrase p′ is not significantly similar to the comparison patent $d'_c$ but is significantly different from the target patent $d_c$, it can also serve as a difference phrase reflecting the difference between the two patents.
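The behaviour described for the difference score can be illustrated with a smoothed ratio. The form used below, ln(1 + f(p′, d_c) / (γ + f(p′, d′_c))), is purely an assumption chosen to reproduce the described behaviour (high when the phrase is strongly related to the target patent and weakly related to the comparison patent); it is not the patent's formula, and the function name, the value of γ and the sample correlations are hypothetical.

```python
import math


def difference_score(f_target, f_comparison, gamma=0.1):
    """Assumed form of Omega(p', d_c | d'_c): a log of a smoothed ratio.

    gamma keeps the ratio finite when the correlation with the comparison
    patent approaches zero.
    """
    return math.log(1.0 + f_target / (gamma + f_comparison))


# Made-up correlations for two important phrases.
distinctive = difference_score(0.9, 0.05)  # strong for d_c, weak for d'_c
shared = difference_score(0.9, 0.8)        # strong for both patents
```

Swapping the two arguments gives the score from the comparison patent's point of view, which is how the two difference phrase sets can be built from the same correlations.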
Step S5 specifically includes:
S51, obtaining, based on the optimization target method and by combining the similarity scores $\Phi(p', d_c, d'_c)$ between the important phrases p′ in the important phrase set S and the target patent $d_c$ and the comparison patent $d'_c$, the similar phrase set C between the target patent $d_c$ and the comparison patent $d'_c$;
S52, obtaining, based on the optimization target method and by combining the difference scores $\Omega(p', d_c \mid d'_c)$ between the important phrases p′ in the important phrase set S and the target patent $d_c$ and the comparison patent $d'_c$, the target patent difference phrase set Q and the comparison patent difference phrase set Q′.
Step S51 specifically includes defining an optimization objective and at least two similarity constraints. The optimization objective and the constraints in step S51 are:
Figure BDA0001956266100000111
Figure BDA0001956266100000112
Figure BDA0001956266100000113
where p_i is the i-th similar phrase in the similar phrase set C; the set shown in Figure BDA0001956266100000121 is the target patent important phrase set; the set shown in Figure BDA0001956266100000122 is the comparison patent important phrase set; and the quantity shown in Figure BDA0001956266100000123 is the decision variable: x_i ∈ {0, 1} indicates whether the i-th candidate phrase is a similar phrase, where x_i = 1 means it is a similar phrase and x_i = 0 means it is not.
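The formulas for step S51 survive here only as image references. The block below is a tentative reconstruction based solely on the surrounding description and claim 9, not the patent's own typeset formulas. The symbols S_{d_c} and S_{d'_c} for the two important phrase sets and the index bound n are assumed notation, since the original symbols are not recoverable from this text.

```latex
% Tentative reconstruction; S_{d_c}, S_{d'_c} and n are assumed symbols.
\begin{align}
\max_{x_1,\dots,x_n}\quad & \sum_{i=1}^{n} x_i \,\Phi(p_i, d_c, d'_c) \\
\text{s.t.}\quad
& \Phi(p_i, d_c, d'_c) > \frac{1}{|S_{d_c}|} \sum_{p' \in S_{d_c}} \Phi(p', d_c, d'_c)
  && \text{for every } i \text{ with } x_i = 1, \\
& \Phi(p_i, d_c, d'_c) > \frac{1}{|S_{d'_c}|} \sum_{p' \in S_{d'_c}} \Phi(p', d_c, d'_c)
  && \text{for every } i \text{ with } x_i = 1, \\
& x_i \in \{0, 1\}, \qquad i = 1, \dots, n .
\end{align}
```

Under this reading, the similar phrase set is C = { p_i : x_i = 1 }, and the two mean-value constraints are what keep C small.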
Further, the optimization objective is defined so that the sum of the similarity scores Φ(p_s, d_c, d′_c) of the similar phrases p_s in the similar phrase set C is maximized. The similarity constraints ensure that the similarity score Φ(p_s, d_c, d′_c) of each extracted similar phrase p_s is greater than both the average of the similarity scores Φ(p′, d_c, d′_c) over the target patent important phrase set (Figure BDA0001956266100000124) and the average of the similarity scores Φ(p′, d_c, d′_c) over the comparison patent important phrase set (Figure BDA0001956266100000125), thereby limiting the size of the similar phrase set C.
It should be noted that the use of two similarity constraints in the present invention is only an example for illustration; in other embodiments of the present invention, a different number of similarity constraints may be set.
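The selection in step S51 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the similarity scores are already computed and non-negative, and the function name, argument names, and sample scores are all hypothetical. Because each constraint applies to a single phrase, the sum is maximized simply by keeping every phrase whose score exceeds both averages.

```python
from statistics import mean


def select_similar_phrases(phi, target_phrases, comparison_phrases):
    """Return the similar phrase set C.

    phi maps each important phrase to its similarity score
    Phi(p', d_c, d'_c); scores are assumed to be non-negative.
    """
    # Average similarity score over each important phrase set.
    target_mean = mean(phi[p] for p in target_phrases)
    comparison_mean = mean(phi[p] for p in comparison_phrases)
    # A phrase must exceed both averages, i.e. the larger of the two.
    threshold = max(target_mean, comparison_mean)
    candidates = set(target_phrases) | set(comparison_phrases)
    return {p for p in candidates if phi[p] > threshold}


# Hypothetical scores for illustration only.
phi = {
    "neural network": 0.9,
    "crawler": 0.2,
    "word segmentation": 0.7,
    "bipartite graph": 0.4,
}
similar = select_similar_phrases(
    phi,
    target_phrases=["neural network", "crawler"],
    comparison_phrases=["word segmentation", "bipartite graph"],
)
print(sorted(similar))
```

With these sample scores both averages are 0.55, so only the two phrases scoring above that value are kept.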
Step S52 specifically includes defining an optimization objective and at least three difference constraints. The optimization objective and the constraints in step S52 are:
Figure BDA0001956266100000126
Figure BDA0001956266100000127
Figure BDA0001956266100000128
C ∩ Q = C ∩ Q′ = ∅
where Q is the target patent difference phrase set; Q′ is the comparison patent difference phrase set; and y_i and y_i′ are 0-1 decision variables: y_i indicates whether a candidate phrase in the target patent d_c is a difference phrase, and y_i′ indicates whether a candidate phrase in the comparison patent d′_c is a difference phrase.
Specifically, the optimization objective in step S52 maximizes the sum of the difference scores in the target patent difference phrase set Q and the comparison patent difference phrase set Q′. The difference constraints ensure that the difference score Ω(p_i, d_c | d′_c) of each extracted difference phrase p_i is greater than both the average of the difference scores Ω(p′, d_c | d′_c) over the target patent important phrase set (Figure BDA0001956266100000131) and the average of the difference scores Ω(p′, d_c | d′_c) over the comparison patent important phrase set (Figure BDA0001956266100000132). In addition, for the target patent d_c and the comparison patent (Figure BDA0001956266100000133), the constraints ensure that there is no intersection between the target patent difference phrase set Q and the comparison patent difference phrase set Q′.
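The selection in step S52 can be sketched in the same spirit. This is a hedged illustration under several assumptions that the text does not confirm: the comparison-side score is taken to be the mirrored score Ω(p′, d′_c | d_c), scores are non-negative, and a phrase that qualifies for both sets is assigned to the side with the larger score. All function and variable names and the sample scores are hypothetical.

```python
from statistics import mean


def select_difference_phrases(omega_t, omega_c, target_phrases,
                              comparison_phrases, similar_set):
    """Return (Q, Q') as two disjoint sets that also exclude C.

    omega_t[p] is assumed to hold Omega(p, d_c | d'_c) and
    omega_c[p] the mirrored score Omega(p, d'_c | d_c).
    """
    def above_both_means(scores, pool):
        # Threshold: the larger of the two averages over the
        # target and comparison important phrase sets.
        threshold = max(mean(scores[p] for p in target_phrases),
                        mean(scores[p] for p in comparison_phrases))
        return {p for p in pool
                if scores[p] > threshold and p not in similar_set}

    q = above_both_means(omega_t, target_phrases)
    q_prime = above_both_means(omega_c, comparison_phrases)
    # Keep Q and Q' disjoint: a phrase eligible for both stays on the
    # side where its difference score is larger (an assumed rule).
    for p in q & q_prime:
        if omega_t[p] >= omega_c[p]:
            q_prime.discard(p)
        else:
            q.discard(p)
    return q, q_prime


# Hypothetical scores for illustration only.
target = ["web crawler", "bm25", "phrase"]
comparison = ["ontology", "semantic", "phrase"]
omega_t = {"web crawler": 0.9, "bm25": 0.3, "phrase": 0.8,
           "ontology": 0.1, "semantic": 0.1}
omega_c = {"web crawler": 0.1, "bm25": 0.2, "phrase": 0.7,
           "ontology": 0.9, "semantic": 0.4}
q, q_prime = select_difference_phrases(
    omega_t, omega_c, target, comparison, similar_set={"phrase"})
print(sorted(q), sorted(q_prime))
```

Here "phrase" clears both thresholds but is dropped because it already belongs to the similar phrase set C, which mirrors the constraint C ∩ Q = C ∩ Q′ = ∅.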
In summary, the patent comparative analysis method of the invention establishes a patent database using web crawler technology, establishes the candidate phrase set P of the patent document set D based on word segmentation, extracts the important phrase set S based on an optimization method, calculates the similarity score Φ(p′, d_c, d′_c) and the difference score Ω(p′, d_c | d′_c) of each important phrase p′ with respect to the target patent d_c and the comparison patent d′_c, and extracts the similar phrase set and the difference phrase sets of the target patent d_c and the comparison patent d′_c based on the optimization method, thereby realizing patent comparative analysis quickly and effectively.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (10)

1. A patent comparative analysis method is characterized by comprising the following steps:
S1, establishing a patent database based on a web crawler method;
S2, extracting a patent document set D of a target subject from the patent database, and establishing a candidate phrase set of the patent document set D, wherein the patent document set D comprises at least one target patent and at least one comparison patent;
S3, extracting, from the candidate phrase set, an important phrase set of the target patent and the comparison patent based on an optimization selection model, wherein the important phrase set comprises a target patent important phrase set and a comparison patent important phrase set;
S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity score and the difference score between each important phrase in the important phrase set and the target patent, and the similarity score and the difference score between each important phrase and the comparison patent, wherein the important phrase-patent document bipartite graph can be used for representing the relevance between the important phrase set and the patent document set D, a connecting edge is provided between each important phrase node and a patent document, and the weight of the connecting edge can be obtained by calculating the BM25 relevance;
S5, respectively extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization target method.
2. The patent comparative analysis method according to claim 1, wherein the step S1 specifically comprises: selecting a plurality of target patent websites, constructing a plurality of crawler modules by using a distributed crawler framework, starting a plurality of crawler threads to crawl the target patent websites simultaneously, establishing a database table to store the crawled patent information according to the composition of the crawled patent information, and constructing a patent database.
3. The patent comparative analysis method according to claim 1, wherein the step S2 specifically includes:
S21, extracting a patent document set D of the target subject from the patent database;
S22, performing word segmentation on the patent documents in the patent document set D to obtain a segmented word set of the patent document set D, wherein the segmented word set comprises a plurality of segmented words;
S23, establishing a stop word list, and filtering the segmented words in the segmented word set according to the stop word list to obtain an effective segmented word set of the patent document set D;
S24, calculating mutual information values MI of the segmented words in the effective segmented word set, so as to extract a candidate phrase set of the patent document set D from the effective segmented word set.
4. The patent comparative analysis method according to claim 3, wherein the step S24 is specifically: defining a segmented word frequency threshold F and a mutual information threshold I, and obtaining the mutual information value MI of each candidate segmented word by calculating the joint distribution and the marginal distributions of the candidate segmented words in the effective segmented word set; if the frequency of a candidate segmented word is greater than the frequency threshold F, adding the candidate segmented word to the candidate phrase set; if the frequency of the candidate segmented word is less than the frequency threshold F, considering its mutual information value MI: if the mutual information value MI is greater than the mutual information threshold I, adding the candidate segmented word to the candidate phrase set, and otherwise discarding the candidate segmented word.
5. The patent comparative analysis method according to claim 1, wherein the step S3 specifically comprises:
S31, calculating the significance score of each candidate phrase in the candidate phrase set within the patent document in which it appears, so as to represent the significance of the candidate phrase in that patent document;
S32, calculating the uniqueness score of each candidate phrase in the candidate phrase set within the patent document in which it appears, so as to represent the uniqueness of the candidate phrase in that patent document;
S33, extracting an important phrase set S of the target patent and the comparison patent based on an optimization selection method and combining the significance score and the uniqueness score of each candidate phrase in the candidate phrase set, wherein the important phrase set S comprises a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.
6. The patent comparative analysis method according to claim 5, wherein the step S33 is specifically: defining a threshold K for the number of important phrases in an important phrase set, taking the significance scores and the uniqueness scores of the candidate phrases in the candidate phrase set as extraction criteria, establishing an optimization target, and obtaining the important phrase set of the target patent and the comparison patent through the optimization target, wherein the important phrase set comprises the target patent important phrase set and the comparison patent important phrase set; the target patent important phrase set comprises K important phrases related to the target patent; and the comparison patent important phrase set comprises K important phrases related to the comparison patent.
7. The patent comparative analysis method according to claim 1, wherein the step S4 specifically includes:
S41, constructing an important phrase-patent document bipartite graph;
S42, calculating, in the important phrase-patent document bipartite graph, the relevance between each important phrase and the target patent and the relevance between each important phrase and the comparison patent;
S43, calculating, in the important phrase-patent document bipartite graph, the similarity scores between the important phrases and the target patent and the comparison patent;
S44, calculating, in the important phrase-patent document bipartite graph, the difference scores between the important phrases and the target patent and the comparison patent.
8. The patent comparative analysis method according to claim 7, wherein the step S5 specifically includes:
S51, based on an optimization target method, and combining the similarity scores between the important phrases in the important phrase set S and the target patent and the comparison patent, obtaining a similar phrase set C between the target patent and the comparison patent;
S52, based on the optimization target method, and combining the difference scores between the important phrases in the important phrase set S and the target patent and the comparison patent, obtaining a target patent difference phrase set and a comparison patent difference phrase set.
9. The patent comparative analysis method according to claim 8, wherein the step S51 is specifically: defining an optimization target and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, and ensuring through the similarity constraints that the similarity score of each extracted similar phrase is greater than both the average of the similarity scores of the target patent important phrase set and the average of the similarity scores of the comparison patent important phrase set.
10. The patent comparative analysis method according to claim 8, wherein the step S52 is specifically: defining an optimization target and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, and ensuring through the difference constraints that the difference score of each extracted difference phrase is greater than both the average of the difference scores of the target patent important phrase set and the average of the difference scores of the comparison patent important phrase set, and that no intersection exists among the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set, and the comparison patent difference phrase set.
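As an informal illustration of the filtering rule recited in claim 4, the sketch below treats adjacent word pairs as the candidates and uses pointwise mutual information as the MI value. Both choices are assumptions, since the claim does not fix the candidate unit or the exact MI formula. The function name, thresholds, and sample tokens are hypothetical, and a frequency exactly equal to F (a case the claim leaves open) is handled here by the MI check.

```python
import math
from collections import Counter


def extract_candidate_phrases(tokens, freq_threshold, mi_threshold):
    """Filter adjacent word pairs by frequency F, then by MI threshold I."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    candidates = set()
    for (a, b), count in bigrams.items():
        if count > freq_threshold:
            # Frequent pairs are accepted without an MI check.
            candidates.add((a, b))
            continue
        # Pointwise mutual information from the joint and marginal
        # relative frequencies (an assumed concrete form of MI).
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        mi = math.log2(p_ab / (p_a * p_b))
        if mi > mi_threshold:
            candidates.add((a, b))
    return candidates


# Hypothetical token stream after segmentation and stop word removal.
tokens = "patent analysis method patent analysis system phrase set".split()
phrases = extract_candidate_phrases(tokens, freq_threshold=1, mi_threshold=2.5)
print(sorted(phrases))
```

In this toy input, the pair ("patent", "analysis") passes on frequency alone, while the two rare pairs made of words that occur only once pass on mutual information.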
CN201910067706.2A 2019-01-24 2019-01-24 Patent comparative analysis method Active CN109903198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910067706.2A CN109903198B (en) 2019-01-24 2019-01-24 Patent comparative analysis method

Publications (2)

Publication Number Publication Date
CN109903198A CN109903198A (en) 2019-06-18
CN109903198B true CN109903198B (en) 2022-08-30

Family

ID=66944149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910067706.2A Active CN109903198B (en) 2019-01-24 2019-01-24 Patent comparative analysis method

Country Status (1)

Country Link
CN (1) CN109903198B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291816B (en) * 2020-02-17 2021-08-06 支付宝(杭州)信息技术有限公司 Method and device for carrying out feature processing aiming at user classification model
CN111552783A (en) * 2020-04-30 2020-08-18 深圳前海微众银行股份有限公司 Content analysis query method, device, equipment and computer storage medium
CN117112735B (en) * 2023-10-19 2024-02-13 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547739B (en) * 2016-11-03 2019-04-02 同济大学 A kind of text semantic similarity analysis method
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body

Also Published As

Publication number Publication date
CN109903198A (en) 2019-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant