CN109903198B

CN109903198B - Patent comparative analysis method

Info

Publication number: CN109903198B
Application number: CN201910067706.2A
Authority: CN
Inventors: 汪云霄; 覃婷婷; 刘峥
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-01-24
Filing date: 2019-01-24
Publication date: 2022-08-30
Anticipated expiration: 2039-01-24
Also published as: CN109903198A

Abstract

The invention provides a patent comparative analysis method. The patent comparison analysis method utilizes a network technology to establish a patent database, establishes a candidate phrase set of a patent document set based on a word segmentation technology, extracts an important phrase set based on an optimization method, calculates similarity scores and difference scores of the important phrases and target patents and comparison patents, and extracts the similar phrase set and the difference phrase set of the target patents and the comparison patents based on the optimization method, thereby quickly and effectively realizing the patent comparison analysis.

Description

Patent comparative analysis method

Technical Field

The invention relates to a patent comparison and analysis method, and belongs to the field of natural language processing and patent analysis.

Background

Patent comparative analysis is a type of patent analysis. An effective comparative analysis method for patent documents can rapidly identify the similarities and differences between them. In a certain sense, the patent level of an enterprise represents its overall innovation level, and the core personnel of an enterprise can use comparative analysis to identify the core technologies of other enterprises and thereby formulate an effective technical strategy.

At present there are a number of patent retrieval and analysis systems, such as IncoPat, SooPAT and PatSnap. However, these systems mainly provide patent retrieval and simple statistical analysis, and such basic analysis cannot meet the requirements of in-depth patent mining. In addition, the annual number of patent applications is rising rapidly and the workload of manually examining patents keeps increasing, so developing an automatic patent comparative analysis system is of great significance.

In view of the above, it is necessary to provide a patent comparative analysis method to solve the above problems.

Disclosure of Invention

The invention aims to provide a patent comparative analysis method that mines the similarities and differences among patent documents more deeply, so that the patent value of a target patent can be found more accurately and quickly.

In order to achieve the above object, the present invention provides a patent comparative analysis method, which comprises the following steps:

S1, establishing a patent database based on a web crawler method;

S2, extracting a patent document set D of the target subject from the patent database, and establishing a candidate phrase set of the patent document set D, wherein the patent document set D comprises at least one target patent and at least one comparison patent;

S3, extracting, from the candidate phrase set and based on an optimization selection model, an important phrase set of the target patent and the comparison patent, wherein the important phrase set comprises a target patent important phrase set and a comparison patent important phrase set;

S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating, for each important phrase in the important phrase set, its similarity score and difference score with respect to the target patent and the comparison patent;

and S5, extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent, respectively, based on an optimization target method.

As a further improvement of the present invention, the step S1 specifically includes: selecting a plurality of target patent websites, constructing a plurality of crawler modules by using a distributed crawler framework, starting a plurality of crawler threads to crawl the target patent websites simultaneously, establishing a database table to store the crawled patent information according to the composition of the crawled patent information, and constructing a patent database.

As a further improvement of the present invention, the step S2 specifically includes:

S21, extracting a patent document set D of the target subject from the patent database;

S22, performing word segmentation on the patent documents in the patent document set D to obtain a word segmentation set of the patent document set D, wherein the word segmentation set comprises a plurality of segmented words;

S23, establishing a stop word list, and filtering the segmented words in the word segmentation set according to the stop word list to obtain an effective word segmentation set of the patent document set D;

and S24, calculating mutual information values MI of the segmented words in the effective word segmentation set, so as to extract a candidate phrase set of the patent document set D from the effective word segmentation set.

As a further improvement of the present invention, the step S24 specifically includes: defining a word segmentation frequency threshold F and a mutual information threshold I; calculating the mutual information value MI of each candidate segmented word from the joint distribution and the marginal distributions of the candidate segmented words in the effective word segmentation set; if the frequency of a candidate segmented word is greater than the threshold F, adding it to the candidate phrase set; and if its frequency is less than the threshold F, considering its mutual information value MI, adding it to the candidate phrase set if MI is greater than the threshold I, and otherwise discarding it.

As a further improvement of the present invention, the step S3 specifically includes:

S31, calculating the significance score of each candidate phrase in the candidate phrase set with respect to the patent document in which it is located, so as to characterize the significance of the candidate phrase in that patent document;

S32, calculating the uniqueness score of each candidate phrase in the candidate phrase set with respect to the patent document in which it is located, so as to characterize the uniqueness of the candidate phrase in that patent document;

and S33, extracting an important phrase set S of the target patent and the comparison patent based on an optimization selection method and combining the significance score and the uniqueness score of each candidate phrase in the candidate phrase set, wherein the important phrase set S comprises a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.

As a further improvement of the present invention, the step S33 specifically includes: defining a threshold K on the number of important phrases in an important phrase set, taking the significance score and the uniqueness score of the candidate phrases in the candidate phrase set as extraction criteria, establishing an optimization target, and obtaining the important phrase set of the target patent and the comparison patent through the optimization target, wherein the important phrase set comprises the target patent important phrase set and the comparison patent important phrase set; the target patent important phrase set comprises K important phrases related to the target patent, and the comparison patent important phrase set comprises K important phrases related to the comparison patent.

As a further improvement of the present invention, the step S4 specifically includes:

S41, constructing an important phrase-patent document bipartite graph;

S42, calculating, in the important phrase-patent document bipartite graph, the correlation between the important phrase and the target patent and the correlation between the important phrase and the comparison patent;

S43, calculating, in the important phrase-patent document bipartite graph, the similarity scores between the important phrases and the target patent and the comparison patent;

and S44, calculating, in the important phrase-patent document bipartite graph, the difference scores between the important phrases and the target patent and the comparison patent.

As a further improvement of the present invention, the step S5 specifically includes:

S51, obtaining a similar phrase set C between the target patent and the comparison patent based on an optimization target method and by combining the similarity scores between the important phrases in the important phrase set S and the target patent and the comparison patent;

and S52, obtaining a target patent difference phrase set and a comparison patent difference phrase set based on an optimization target method and by combining the difference scores between the important phrases in the important phrase set S and the target patent and the comparison patent.

As a further improvement of the present invention, the step S51 specifically includes: defining an optimization target and at least two similarity constraint conditions, such that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, the similarity constraint conditions ensuring that the similarity score of each extracted similar phrase is greater than both the average similarity score of the target patent important phrase set and the average similarity score of the comparison patent important phrase set.

As a further improvement of the present invention, the step S52 specifically includes: defining an optimization target and at least three difference constraint conditions, such that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, the difference constraint conditions ensuring that the difference scores of the extracted difference phrases are greater than the average difference score of the target patent important phrase set and the average difference score of the comparison patent important phrase set, respectively, and that there is no intersection among the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set and the comparison patent difference phrase set.

The invention has the beneficial effects that: the patent comparative analysis method of the invention realizes the patent comparative analysis quickly and effectively by utilizing the web crawler technology to establish a patent database, establishing a candidate phrase set of a patent document set D based on a word segmentation technology, extracting an important phrase set S based on an optimization method, calculating similarity scores and difference scores of the important phrases and target patents and comparative patents, and extracting the similar phrase sets and the difference phrase sets of the target patents and the comparative patents based on the optimization method.

Drawings

FIG. 1 is a structural function diagram of the comparative analysis method of the present invention.

FIG. 2 is a flow chart of the comparative analysis method of the present invention.

FIG. 3 is a schematic structural diagram of the important phrase-patent document bipartite graph in FIG. 2.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

Referring to fig. 1 in combination with fig. 2, the present invention discloses a patent comparison analysis method, which includes the following steps:

S1, establishing a patent database based on a web crawler method;

S2, extracting a patent document set D of the target subject from the patent database, and establishing a candidate phrase set P of the patent document set D, wherein the patent document set D comprises at least one target patent $d_c$ and at least one comparison patent $d'_c$;

S3, extracting, from the candidate phrase set P and based on an optimization selection model, the important phrase set S of the target patent $d_c$ and the comparison patent $d'_c$, wherein the important phrase set S comprises a target patent important phrase set and a comparison patent important phrase set;

S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating, for each important phrase in the important phrase set S, its similarity score and difference score with respect to the target patent $d_c$ and the comparison patent $d'_c$;

S5, extracting the similar phrase set and the difference phrase sets of the target patent $d_c$ and the comparison patent $d'_c$, respectively, based on an optimization target method.

The following description will be made in detail only with respect to steps S1 to S5.

Step S1 specifically includes: establishing a patent database by a web crawler method. A web crawler is an efficient information acquisition tool that can obtain various data resources quickly and accurately. However, when a website applies an anti-crawling strategy, a prior-art web crawler is easily blocked, because the number of requests allowed from the same IP address and the same account within a given period is strictly limited. For this reason, the patent comparative analysis method builds a crawler disguise module by maintaining a proxy IP pool and a Cookies pool, constructs a plurality of crawler modules with a distributed crawler framework, starts a plurality of crawler threads to crawl the target patent websites simultaneously, acquires the patent information with the requests library and the bs4 web page parsing package, and establishes a suitable database table, according to the composition of the acquired patent information, to store the crawled patent information.

Further, the patent information crawled by the web crawler method is stored in the patent database according to a table structure, so as to ensure that the content of the patent database is comprehensive and its operation is stable.
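The storage part of step S1 can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module; the table name, the column set and the helper names `open_patent_db` and `store_patent` are assumptions made for the example, since the description does not specify the table layout or the database engine.

```python
import sqlite3

# Hypothetical table layout; the columns loosely follow the bibliographic
# fields that the description lists for a patent document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS patents (
    application_number TEXT PRIMARY KEY,
    application_date   TEXT,
    applicant          TEXT,
    inventor           TEXT,
    ipc_class          TEXT,
    content            TEXT
)
"""

def open_patent_db(path=":memory:"):
    """Open (or create) the patent database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def store_patent(conn, record):
    """Insert one crawled record; a re-crawled patent overwrites its old row."""
    conn.execute(
        "INSERT OR REPLACE INTO patents VALUES "
        "(:application_number, :application_date, :applicant, "
        ":inventor, :ipc_class, :content)",
        record,
    )
    conn.commit()

conn = open_patent_db()
store_patent(conn, {
    "application_number": "APP-0001",   # placeholder identifier
    "application_date": "2019-01-24",
    "applicant": "Example University",
    "inventor": "A; B",
    "ipc_class": "G06F16/35",
    "content": "A patent comparative analysis method ...",
})
```

Keying the table on the application number means that several crawler threads visiting the same patent do not create duplicate rows.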

Step S2 specifically includes:

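S21, extracting a patent document set D of the target subject from the patent database;

S22, performing word segmentation on the patent documents in the patent document set D to obtain a word segmentation set of the patent document set D, wherein the word segmentation set comprises a plurality of segmented words;

S23, establishing a stop word list, and filtering the segmented words in the word segmentation set according to the stop word list to obtain an effective word segmentation set of the patent document set D;

S24, calculating mutual information values MI of the segmented words in the effective word segmentation set, so as to extract the candidate phrase set P of the patent document set D from the effective word segmentation set.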

In step S21, the patent document set D of the target subject is extracted from the patent database mainly by screening IPC classification numbers or by setting keywords. In the present invention, the patent document set is $D = \{d_1, d_2, \ldots, d_n\}$, where n is the number of patent documents in D. Any patent document d mainly includes the application number, application date, applicant, address, inventor, patent agency, IPC classification number, content of the invention, and the like. The target patent $d_c$ and the comparison patent $d'_c$ are defined such that $d_c, d'_c \in D$ and $d_c \neq d'_c$.
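The screening in step S21 can be sketched as a simple filter. The record layout (dictionaries with `ipc_class` and `content` keys) and the function name `select_document_set` are assumptions for illustration only.

```python
def select_document_set(patents, ipc_prefix=None, keywords=()):
    """Return the patents that match an IPC prefix or contain any keyword."""
    selected = []
    for patent in patents:
        by_ipc = ipc_prefix is not None and patent["ipc_class"].startswith(ipc_prefix)
        by_keyword = any(k in patent["content"] for k in keywords)
        if by_ipc or by_keyword:
            selected.append(patent)
    return selected

patents = [
    {"id": "d1", "ipc_class": "G06F16/35", "content": "phrase extraction for patents"},
    {"id": "d2", "ipc_class": "H04L29/08", "content": "distributed web crawler"},
    {"id": "d3", "ipc_class": "A01B1/00", "content": "hand tool"},
]
D = select_document_set(patents, ipc_prefix="G06F", keywords=("crawler",))
# The target and comparison patents must be two different members of D.
d_c, d_c_prime = D[0], D[1]
```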

Because of the drafting requirements of patent documents, a patent document d is generally long, linguistically complex and noisy in its wording. Analyzing the patent document d directly would therefore introduce large errors into the comparative analysis result. For this reason, in steps S22 to S24 of the present invention, the patent documents d in the patent document set D are processed by natural language processing to establish the candidate phrase set P of the patent document set D of the target subject. The following description takes a patent document d in Chinese text as an example.

In step S22, because Chinese text has rich unstructured sentence forms and the word sequence of a sentence has no obvious rules or boundaries, the Chinese text of the patent document d must be segmented into words during natural language processing. Preferably, in this embodiment, a general-purpose Chinese word segmentation system is used to segment the patent documents d to obtain a word segmentation set of the patent document set D, wherein the word segmentation set comprises a plurality of segmented words.

In step S23, stop words are defined and a stop word list is established. Stop words are words without actual meaning, including function words, connectives and the like, such as "yes" and "and". The segmented words in the word segmentation set are then filtered according to the stop word list to obtain the effective word segmentation set of the patent document set D.
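The filtering in step S23 can be sketched as below. The sketch assumes the text has already been segmented into a list of tokens by some word segmentation system; the stop word list shown and the function name `filter_stop_words` are hypothetical.

```python
# Hypothetical stop word list; a real list would be far longer.
STOP_WORDS = {"yes", "and", "the", "of"}

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are non-empty and not in the stop word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]

segmented = ["the", "crawler", "and", "the", "database", " ", "of", "patents"]
effective = filter_stop_words(segmented)
```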

Conventional phrase selection methods consider only the frequency of the segmented words and therefore ignore segmented words that occur rarely but are semantically rich. To avoid this problem, in step S24 the candidate phrase set P of the patent document set D is extracted from the effective word segmentation set by calculating the mutual information value MI of the candidate segmented words, where $P = \{p_1, p_2, \ldots, p_m\}$, p is a candidate phrase, and m is the number of candidate phrases in P.

Specifically, in step S24, the word segmentation frequency threshold is defined as F and the mutual information threshold of the segmented words is defined as I, and the mutual information value MI is calculated as follows:

wherein X and Y are two candidate segmented words in the effective word segmentation set; p(X, Y) is the joint distribution of the two candidate segmented words X and Y; p(X) is the marginal distribution of the candidate segmented word X; and p(Y) is the marginal distribution of the candidate segmented word Y. If the frequency of a candidate segmented word is greater than the threshold F, it is added to the candidate phrase set P. If its frequency is less than the threshold F, its mutual information value MI in the corresponding patent document d is considered: if MI is greater than the threshold I, it is added to the candidate phrase set P; otherwise it is discarded.
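The selection rule of step S24 can be sketched in Python. The sketch makes several assumptions that are not stated in the text: a candidate is taken to be a pair of adjacent segmented words, the mutual information is computed in the common pointwise form $\log_2 \frac{p(X,Y)}{p(X)\,p(Y)}$ with probabilities estimated from counts, and a candidate whose frequency equals F falls through to the MI check. The function name `candidate_phrases` is hypothetical.

```python
import math
from collections import Counter

def candidate_phrases(tokens, freq_threshold, mi_threshold):
    """Select adjacent word pairs by frequency first, then by mutual information."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigram.values())
    n_bi = sum(bigram.values())
    selected = []
    for (x, y), count in bigram.items():
        if count > freq_threshold:          # frequent enough: accept directly
            selected.append((x, y))
            continue
        # Rare pair: fall back to the (assumed pointwise) mutual information.
        p_xy = count / n_bi
        p_x = unigram[x] / n_uni
        p_y = unigram[y] / n_uni
        if math.log2(p_xy / (p_x * p_y)) > mi_threshold:
            selected.append((x, y))
    return selected

tokens = ["a", "b", "a", "b", "a", "b", "c", "d"]
P = candidate_phrases(tokens, freq_threshold=2, mi_threshold=3.0)
```

In this toy input the pair ("a", "b") passes on frequency alone, while ("c", "d") occurs only once but is kept because its two words never appear apart.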

Step S3 specifically includes:

S31, calculating the significance score of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the significance of the candidate phrase p in that patent document d;

S32, calculating the uniqueness score of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the uniqueness of the candidate phrase p in that patent document d;

S33, extracting, based on an optimization selection method and combining the significance score and the uniqueness score of each candidate phrase p in the candidate phrase set P, the important phrase set S of the target patent $d_c$ and the comparison patent $d'_c$, wherein the important phrase set S comprises a target patent important phrase set and a comparison patent important phrase set.

After the candidate phrase set P is extracted from the entire patent document set D, each patent document d may be regarded as being composed of several candidate phrases p. In fact, most of the candidate phrases p cannot represent the patent document d, so the candidate phrase set P of the patent document set D needs to be processed further to better characterize each patent document d in the patent document set D.

Specifically, step S31 is mainly used to calculate the significance score $r_{p,d}$ of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the significance of p in d. If a candidate phrase p appears frequently in the patent document d in which it is located but rarely in the other patent documents of the patent document set D, then p is strongly significant for d. The significance of a single candidate phrase p with respect to its patent document d can therefore be represented by the significance score $r_{p,d}$, which is expressed as:

wherein $P_d$ denotes the set of all candidate phrases p of the patent document d, $n(p, d)$ denotes the frequency of the candidate phrase p in the patent document d in which it is located, and $n(p, D)$ denotes the frequency of the candidate phrase p in the patent document set D.
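To make the roles of $P_d$, $n(p, d)$ and $n(p, D)$ concrete, the following sketch computes one possible significance score. The exact expression used by the method is not reproduced in this text, so the form below, the ratio $n(p,d)/n(p,D)$ normalized over all phrases in $P_d$, is purely an assumption chosen to match the stated intuition (frequent in d, rare elsewhere). The function name `significance_scores` is hypothetical.

```python
def significance_scores(doc_counts, corpus_counts):
    """Assumed score: (n(p,d) / n(p,D)) normalized over all phrases of d.

    doc_counts:    {phrase: n(p, d)} for the phrases P_d of one document d
    corpus_counts: {phrase: n(p, D)} over the whole document set D
    """
    ratios = {p: n / corpus_counts[p] for p, n in doc_counts.items()}
    total = sum(ratios.values())
    return {p: r / total for p, r in ratios.items()}

doc_counts = {"bipartite graph": 4, "method": 2}
corpus_counts = {"bipartite graph": 4, "method": 20}
r = significance_scores(doc_counts, corpus_counts)
# "bipartite graph" occurs only in this document, so it scores far higher
# than the generic word "method".
```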

Step S32 is mainly used to calculate the uniqueness score of each candidate phrase p in the candidate phrase set P with respect to the patent document d in which it is located, so as to characterize the uniqueness of p in d. Specifically, an important candidate phrase p needs to differ from the other candidate phrases in the candidate phrase set P, that is, it needs to be strongly unique, so the uniqueness of a single candidate phrase p can be calculated from the semantic similarity between candidate phrases.

In step S32, the uniqueness of a single candidate phrase p is obtained from a semantic-tree-based semantic similarity calculation: semantic similarity is measured using information content, and a semantic dictionary is used to construct a semantic tree, so that the semantic similarity $Sim(p_i, p_j)$ between the i-th candidate phrase $p_i$ and the j-th candidate phrase $p_j$ is calculated from the path length between the candidate phrases and used to characterize the uniqueness of the candidate phrase p.
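A path-length similarity on a semantic tree can be sketched as follows. The toy tree, the mapping $Sim = 1 / (1 + \text{path length})$ and the function names are assumptions for illustration; the sketch uses only the path length and does not model the information-content component or any particular semantic dictionary.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def path_similarity(a, b, parent):
    """Assumed similarity: 1 / (1 + number of edges between a and b)."""
    depth_from_b = {n: i for i, n in enumerate(path_to_root(b, parent))}
    for steps_from_a, n in enumerate(path_to_root(a, parent)):
        if n in depth_from_b:               # lowest common ancestor found
            return 1.0 / (1 + steps_from_a + depth_from_b[n])
    return 0.0                              # no common ancestor

# Toy semantic tree given as child -> parent links (hypothetical concepts).
parent = {
    "crawler": "software", "parser": "software", "software": "entity",
    "sensor": "hardware", "hardware": "entity",
}
sim_close = path_similarity("crawler", "parser", parent)   # 2 edges apart
sim_far = path_similarity("crawler", "sensor", parent)     # 4 edges apart
```

A phrase whose similarity to every other phrase is low would be treated as highly unique.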

Further, in step S33, based on the optimization selection method, the significance score $r_{p,d}$ and the uniqueness score of each candidate phrase p in the candidate phrase set P are combined to extract the important phrase set S of the target patent $d_c$ and the comparison patent $d'_c$. In the present invention, the threshold on the number of important phrases p' in the important phrase set S is defined as K, and an optimization target is established with the significance and the uniqueness of the candidate phrases p in the candidate phrase set P as extraction criteria.

The terms of the optimization target are as follows: the sum of the significance scores of all important phrases p' in an important phrase set; a weight applied to the significance score of each important phrase p'; and the sum of the comprehensive similarity scores of all important phrases p' in the important phrase set. The last term is a penalty term in the optimization target, because the more similar a candidate phrase p is to the other candidate phrases, the less unique it is. μ and λ are the weights of the score terms. With this arrangement, the target patent important phrase set of the target patent $d_c$ and the comparison patent important phrase set of the comparison patent $d'_c$ can be extracted.
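The selection in step S33 can be approximated with a greedy sketch. Everything specific here is an assumption: the objective is taken to be μ times the sum of significance scores minus λ times the sum of pairwise similarities, at most K phrases are chosen, and a greedy heuristic stands in for whatever solver the method actually uses. The function name `select_important_phrases` is hypothetical.

```python
def select_important_phrases(sig, sim, k, mu=1.0, lam=1.0):
    """Greedily pick up to k phrases with high significance and low mutual similarity.

    sig: {phrase: significance score}
    sim: function (phrase, phrase) -> semantic similarity
    """
    chosen = []
    remaining = set(sig)
    while remaining and len(chosen) < k:
        def gain(p):
            # Reward significance, penalize similarity to phrases already chosen.
            return mu * sig[p] - lam * sum(sim(p, q) for q in chosen)
        best = max(sorted(remaining), key=gain)   # sorted() makes ties deterministic
        chosen.append(best)
        remaining.remove(best)
    return chosen

sig = {"web crawler": 0.9, "web spider": 0.85, "bipartite graph": 0.5}

def toy_sim(p, q):
    near_synonyms = {"web crawler", "web spider"}
    return 0.8 if {p, q} == near_synonyms else 0.0

S_target = select_important_phrases(sig, toy_sim, k=2)
```

With these toy numbers the second pick is "bipartite graph" rather than "web spider", because the penalty term cancels most of the near-synonym's significance.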

step S4 specifically includes:

S41, constructing an important phrase-patent document bipartite graph;

S42, calculating, in the important phrase-patent document bipartite graph, the correlation degrees between the important phrase p' and the target patent $d_c$ and the comparison patent $d'_c$;

S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between the important phrase p' and the target patent $d_c$ and the comparison patent $d'_c$;

S44, calculating, in the important phrase-patent document bipartite graph, the difference score between the important phrase p' and the target patent $d_c$ and the comparison patent $d'_c$.

In step S41, the important phrase-patent document bipartite graph is used to characterize the correlation between the important phrase set S and the patent document set D (see FIG. 3). A connecting edge exists between each important phrase node p' and a patent document d, and the weight of the connecting edge can be obtained by calculating the BM25 relevance.
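The edge weights of the bipartite graph can be sketched with the widely used Okapi BM25 formula, treating each important phrase as a single query term. The parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, the input layout and the function names are assumptions; the text only states that BM25 relevance is used.

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Okapi BM25 score of one phrase in one document."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm

def build_bipartite_edges(phrase_counts, doc_lengths):
    """Return {(phrase, doc_id): weight} for every phrase occurring in a document.

    phrase_counts: {doc_id: {phrase: occurrences of the phrase in the document}}
    doc_lengths:   {doc_id: document length in tokens}
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    doc_freq = {}
    for counts in phrase_counts.values():
        for phrase in counts:
            doc_freq[phrase] = doc_freq.get(phrase, 0) + 1
    edges = {}
    for doc_id, counts in phrase_counts.items():
        for phrase, tf in counts.items():
            edges[(phrase, doc_id)] = bm25_weight(
                tf, doc_lengths[doc_id], avg_len, n_docs, doc_freq[phrase])
    return edges

edges = build_bipartite_edges(
    {"d_c": {"crawler": 3, "graph": 1}, "d_c_prime": {"graph": 2}},
    {"d_c": 100, "d_c_prime": 100},
)
```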

Further, in step S42, the random-walk SimRank algorithm can be used to calculate, in the important phrase-patent document bipartite graph, the correlation degree $f(p', d_c)$ between the important phrase p' and the target patent $d_c$ and the correlation degree $f(p', d'_c)$ between the important phrase p' and the comparison patent $d'_c$.

Step S43 is mainly used to calculate the similarity score $\Phi(p', d_c, d'_c)$ between the important phrase p' and the target patent $d_c$ and the comparison patent $d'_c$:

$$\Phi(p', d_c, d'_c) = \ln\bigl(1 + f(p', d_c) \cdot f(p', d'_c)\bigr)$$

wherein $f(p', d_c)$ is the correlation degree between the important phrase p' and the target patent $d_c$, and $f(p', d'_c)$ is the correlation degree between the important phrase p' and the comparison patent $d'_c$.

In fact, when an important phrase p' is highly correlated with both the target patent $d_c$ and the comparison patent $d'_c$, it is strongly important in the important phrase set S. Thus, for a given important phrase p', the larger its correlation degree $f(p', d_c)$ with the target patent and its correlation degree $f(p', d'_c)$ with the comparison patent, the higher its similarity score $\Phi(p', d_c, d'_c)$. In the present invention, the similarity score is calculated by taking the logarithm of one plus the product of the two correlation degrees, so that both $f(p', d_c)$ and $f(p', d'_c)$ are taken into account and the similarity between the target patent $d_c$ and the comparison patent $d'_c$ with respect to p' is better characterized.
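The similarity score of step S43 translates directly into code. The function and argument names are hypothetical; the expression itself is the one given above, $\ln(1 + f(p', d_c) \cdot f(p', d'_c))$.

```python
import math

def similarity_score(f_target, f_comparison):
    """Phi(p', d_c, d'_c) = ln(1 + f(p', d_c) * f(p', d'_c))."""
    return math.log(1.0 + f_target * f_comparison)

both_high = similarity_score(0.8, 0.9)   # relevant to both patents
one_sided = similarity_score(0.8, 0.1)   # relevant to the target only
unrelated = similarity_score(0.0, 0.9)   # zero product, so the score is 0
```

Because the two correlation degrees are multiplied, a phrase scores highly only when it is relevant to both patents at once.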

Step S44 is mainly used to calculate the difference score $\Omega(p', d_c \mid d'_c)$ of the important phrase p' between the target patent $d_c$ and the comparison patent $d'_c$:

wherein γ is a smoothing parameter that guards against the case where the correlation degree $f(p', d_c)$ between the important phrase p' and the target patent $d_c$, or the correlation degree $f(p', d'_c)$ between the important phrase p' and the comparison patent $d'_c$, tends to 0.

Specifically, when the difference score $\Omega(p', d_c \mid d'_c)$ between the target patent $d_c$ and the comparison patent $d'_c$ is calculated, the important phrase p' should be highly correlated with one of the two patents and weakly correlated with the other, and should be of high importance in the important phrase set S. For the target patent $d_c$, a high difference score $\Omega(p', d_c \mid d'_c)$ therefore arises in two cases. First, if the correlation between the important phrase p' and the target patent $d_c$ is very high and its correlation with the comparison patent $d'_c$ is relatively low, the difference score is high; likewise, if its correlation with $d_c$ is relatively high and its correlation with $d'_c$ is very low, the difference score is also high.

Second, if the important phrase p' is not significantly similar to the target patent $d_c$ but is significantly different from the comparison patent $d'_c$, it can also serve as a difference phrase that reflects the difference between $d_c$ and $d'_c$. Likewise, if the important phrase p' is not significantly similar to the comparison patent $d'_c$ but is significantly different from the target patent $d_c$, it can also serve as a difference phrase that reflects the difference between $d_c$ and $d'_c$.
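The behaviour described for the difference score can be illustrated with a smoothed log-ratio. This form is an assumption, not the expression used by the method, which is not reproduced in this text; it is chosen only because it is high when one correlation degree is large and the other small, and because γ keeps the ratio finite when a correlation degree approaches 0. The function name and the default `gamma=0.1` are hypothetical.

```python
import math

def difference_score(f_this, f_other, gamma=0.1):
    """Assumed form: Omega = ln((gamma + f_this) / (gamma + f_other))."""
    return math.log((gamma + f_this) / (gamma + f_other))

# Score of a phrase for the target patent, given the comparison patent.
strong = difference_score(0.9, 0.0)    # relevant to the target only
weak = difference_score(0.5, 0.4)      # similar relevance to both
balanced = difference_score(0.3, 0.3)  # identical relevance gives 0
```

Swapping the two arguments gives the corresponding score for the comparison patent, given the target patent.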

Step S5 specifically includes:

S51, obtaining the similar phrase set C between the target patent $d_c$ and the comparison patent $d'_c$ based on the optimization target method and by combining the similarity scores $\Phi(p', d_c, d'_c)$ between the important phrases p' in the important phrase set S and the target patent $d_c$ and the comparison patent $d'_c$;

S52, obtaining the target patent difference phrase set Q and the comparison patent difference phrase set Q' based on the optimization target method and by combining the difference scores $\Omega(p', d_c \mid d'_c)$ between the important phrases p' in the important phrase set S and the target patent $d_c$ and the comparison patent $d'_c$.

Step S51 specifically includes: defining an optimization target and at least two similarity constraint conditions. The optimization target in step S51 is:

wherein $p_i$ is the i-th similar phrase in the similar phrase set C; the optimization target also refers to the target patent important phrase set and the comparison patent important phrase set; and $x_i$ is a 0-1 decision variable indicating whether the i-th candidate phrase is a similar phrase: $x_i = 1$ means that it is a similar phrase, and $x_i = 0$ means that it is not.

Further, the optimization target is defined so that the sum of the similarity scores $\Phi(p_s, d_c, d'_c)$ of the similar phrases $p_s$ in the similar phrase set C is maximized, while the similarity constraint conditions ensure that the similarity score $\Phi(p_s, d_c, d'_c)$ of each extracted similar phrase $p_s$ is greater than both the average of the similarity scores $\Phi(p', d_c, d'_c)$ over the target patent important phrase set and the average of the similarity scores $\Phi(p', d_c, d'_c)$ over the comparison patent important phrase set, which limits the size of the similar phrase set C.

It should be noted that, only two similarity constraints are set in the present invention as an example for illustration, and of course, in other embodiments of the present invention, the similarity constraints may also be set in other numbers.
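With only the two average-based constraints, the selection in step S51 can be sketched without a general solver: every phrase that satisfies both constraints has a positive similarity score, so setting its $x_i$ to 1 can only increase the sum. The sketch below relies on that observation and on the similarity scores being non-negative; the input layout and the function names are assumptions.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def similar_phrase_set(phi, target_phrases, comparison_phrases):
    """Select phrases whose similarity score exceeds both set averages.

    phi: {phrase: Phi(p', d_c, d'_c)} for every important phrase
    """
    threshold = max(mean(phi[p] for p in target_phrases),
                    mean(phi[p] for p in comparison_phrases))
    candidates = set(target_phrases) | set(comparison_phrases)
    return {p for p in candidates if phi[p] > threshold}

phi = {"crawler": 0.6, "proxy pool": 0.1, "bipartite graph": 0.5, "thread": 0.0}
C = similar_phrase_set(phi,
                       target_phrases={"crawler", "proxy pool"},
                       comparison_phrases={"bipartite graph", "thread"})
```

If further constraints were added, for example a cap on the size of C, a 0-1 integer programming solver would be the more general tool.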

Step S52 specifically includes: defining an optimization target and at least three difference constraint conditions. The optimization target in step S52 includes the constraint:

$$C \cap Q = C \cap Q' = \varnothing$$

wherein Q is the target patent difference phrase set; Q' is the comparison patent difference phrase set; and $y_i$ and $y'_i$ are 0-1 decision variables: $y_i$ indicates whether a candidate phrase of the target patent $d_c$ is a difference phrase, and $y'_i$ indicates whether a candidate phrase of the comparison patent $d'_c$ is a difference phrase.

Specifically, the optimization objective in step S52 maximizes the sum of the difference scores of the phrases in the target patent difference phrase set Q and the comparison patent difference phrase set Q′. The difference constraints ensure that the difference score Ω(p_i, d_c | d′_c) of each extracted difference phrase p_i is greater than the average of the difference scores Ω(p′, d_c | d′_c) over the target patent important phrase set and the average of the difference scores Ω(p′, d_c | d′_c) over the comparison patent important phrase set, respectively. In addition, the constraints ensure that there is no intersection among the similar phrase set C of the target patent d_c and the comparison patent d′_c, the target patent difference phrase set Q, and the comparison patent difference phrase set Q′.
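The selection in step S52 can be sketched in Python in the same spirit. This is an illustration under assumptions rather than the implementation of the invention: the function name `select_difference_phrases` and all inputs are hypothetical; difference scores are assumed to be positive; each side is assumed to be compared with the average of its own important phrase set; and a phrase that qualifies on both sides is assumed to be kept on the side where its difference score is higher, which preserves the larger contribution to the objective.

```python
from statistics import mean


def select_difference_phrases(target_scores, comparison_scores,
                              target_important_scores,
                              comparison_important_scores,
                              similar_phrases):
    """Return (Q, Q_prime), the two difference phrase sets.

    target_scores: phrase -> difference score of a target patent candidate
    comparison_scores: phrase -> difference score of a comparison patent candidate
    similar_phrases: the similar phrase set C produced by step S51
    """
    target_avg = mean(target_important_scores)
    comparison_avg = mean(comparison_important_scores)

    # Average-score constraints, plus C ∩ Q = C ∩ Q' = ∅.
    q = {p for p, s in target_scores.items()
         if s > target_avg and p not in similar_phrases}
    q_prime = {p for p, s in comparison_scores.items()
               if s > comparison_avg and p not in similar_phrases}

    # Assumed tie-breaking rule so that Q and Q' do not intersect:
    # keep an overlapping phrase only on the side with the higher score.
    for p in q & q_prime:
        if target_scores[p] >= comparison_scores[p]:
            q_prime.discard(p)
        else:
            q.discard(p)
    return q, q_prime


# Hypothetical scores, for illustration only.
Q, Q_prime = select_difference_phrases(
    {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7},
    {"c": 0.6, "e": 0.9, "a": 0.1},
    [0.4, 0.6], [0.5, 0.5],
    similar_phrases={"d"},
)
```

Because the objective is a plain sum and the disjointness constraints only couple a phrase with itself across the three sets, resolving each overlapping phrase independently is sufficient in this simplified setting.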

In summary, the patent comparative analysis method of the invention establishes a patent database using web crawler technology, establishes the candidate phrase set P of the patent document set D based on word segmentation technology, extracts the important phrase set S based on an optimization method, calculates the similarity score Φ(p′, d_c, d′_c) and the difference score Ω(p′, d_c | d′_c) of each important phrase p′ with respect to the target patent d_c and the comparison patent d′_c, and extracts the similar phrase set and the difference phrase sets of the target patent d_c and the comparison patent d′_c based on the optimization method, thereby realizing patent comparative analysis quickly and effectively.
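The method weights each connecting edge of the important phrase-patent document bipartite graph by BM25 relevance. A minimal Python sketch of such edge weights is given below. It assumes the commonly used Okapi BM25 form with a smoothed inverse document frequency and the parameter values k1 = 1.5 and b = 0.75; the variant and parameter values actually used are not stated here, and the function name `bm25_edge_weights` and the token-list document representation are hypothetical.

```python
import math
from collections import Counter


def bm25_edge_weights(phrases, documents, k1=1.5, b=0.75):
    """Return {(phrase, doc_index): weight} for phrase-document edges.

    documents: list of token lists; each token is assumed to already be a
    phrase produced by the segmentation step (a simplifying assumption).
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    counts = [Counter(doc) for doc in documents]

    weights = {}
    for phrase in phrases:
        # Document frequency: number of documents containing the phrase.
        df = sum(1 for c in counts if phrase in c)
        # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for j, (doc, c) in enumerate(zip(documents, counts)):
            tf = c[phrase]
            if tf == 0:
                continue  # BM25 weight would be 0; the pair is omitted here
            # Length normalization relative to the average document length.
            norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            weights[(phrase, j)] = idf * tf * (k1 + 1.0) / (tf + norm)
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [["a", "b", "a"], ["b", "c"]]
edges = bm25_edge_weights(["a", "c"], docs)
```

Omitting zero-weight pairs is a storage choice in this sketch; a dense representation with explicit zero weights would describe the same graph.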

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims

1. A patent comparative analysis method is characterized by comprising the following steps:

S1, establishing a patent database based on a web crawler method;

S4, establishing a relevance measure for an important phrase-patent document bipartite graph, and calculating the similarity score and the difference score between each important phrase in the important phrase set and the target patent, and the similarity score and the difference score between each important phrase and the comparison patent, wherein the important phrase-patent document bipartite graph can be used to represent the relevance between the important phrase set and the patent document set D, a connecting edge is provided between each important phrase node and a patent document, and the weight of the connecting edge can be obtained by calculating the BM25 relevance;

2. The patent comparative analysis method according to claim 1, wherein the step S1 specifically comprises: selecting a plurality of target patent websites, constructing a plurality of crawler modules by using a distributed crawler framework, starting a plurality of crawler threads to crawl the target patent websites simultaneously, establishing a database table to store the crawled patent information according to the composition of the crawled patent information, and constructing a patent database.

3. The patent comparative analysis method according to claim 1, wherein the step S2 specifically includes:

4. The patent comparative analysis method according to claim 3, wherein the step S24 is specifically: defining a word segment frequency threshold F and a word segment mutual information threshold I, and obtaining the mutual information value MI of each candidate word segment by calculating the joint distribution and the marginal distributions of the candidate word segment in the effective word segment set; if the frequency of the candidate word segment is greater than the set frequency threshold F, adding the candidate word segment to the candidate phrase set; and if the frequency of the candidate word segment is less than the set frequency threshold F, considering the mutual information value MI of the candidate word segment: if the mutual information value MI is greater than the set mutual information threshold I, adding the candidate word segment to the candidate phrase set; otherwise, discarding the candidate word segment.

5. The patent comparative analysis method according to claim 1, wherein the step S3 specifically comprises:

6. The patent comparative analysis method according to claim 5, wherein the step S33 is specifically: defining a threshold K on the number of important phrases in an important phrase set, taking the significance score and the uniqueness score of the candidate phrases in the candidate phrase set as extraction criteria, establishing an optimization objective, and obtaining the important phrase sets of the target patent and the comparison patent through the optimization objective, wherein the important phrase set comprises the target patent important phrase set and the comparison patent important phrase set, the target patent important phrase set comprises K important phrases related to the target patent, and the comparison patent important phrase set comprises K important phrases related to the comparison patent.

7. The patent comparative analysis method according to claim 1, wherein the step S4 specifically includes:

S41, constructing an important phrase-patent document bipartite graph;

8. The patent comparative analysis method according to claim 7, wherein the step S5 specifically includes:

9. The patent comparative analysis method according to claim 8, wherein the step S51 is specifically: defining an optimization objective and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, and ensuring through the similarity constraints that the similarity score of each extracted similar phrase is greater than both the average of the similarity scores of the target patent important phrase set and the average of the similarity scores of the comparison patent important phrase set.

10. The patent comparative analysis method according to claim 8, wherein the step S52 is specifically: defining an optimization objective and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, and ensuring through the difference constraints that the difference scores of the extracted difference phrases are respectively greater than the average of the difference scores of the target patent important phrase set and the average of the difference scores of the comparison patent important phrase set, and that no intersection exists among the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set, and the comparison patent difference phrase set.
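The filtering rule recited in claim 4 can be illustrated with a short Python sketch. The sketch rests on assumptions that the claim does not state: a candidate word segment is taken to be a pair of adjacent words, the mutual information value MI is taken to be pointwise mutual information estimated from relative frequencies, and a candidate whose frequency equals the threshold F is treated like a low-frequency candidate. The function name `filter_candidates` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def filter_candidates(tokens, freq_threshold, mi_threshold):
    """Return the set of adjacent word pairs kept as candidate phrases."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigram.values())
    n_bi = sum(bigram.values())

    kept = set()
    for (w1, w2), freq in bigram.items():
        # High-frequency candidates are accepted directly.
        if freq > freq_threshold:
            kept.add((w1, w2))
            continue
        # Otherwise fall back to mutual information, estimated from the
        # joint distribution (pair) and marginal distributions (single words).
        p_joint = freq / n_bi
        p1 = unigram[w1] / n_uni
        p2 = unigram[w2] / n_uni
        mi = math.log(p_joint / (p1 * p2))
        if mi > mi_threshold:
            kept.add((w1, w2))
        # Candidates failing both tests are discarded.
    return kept


# Hypothetical token stream, for illustration only.
tokens = ["patent", "analysis", "patent", "analysis",
          "patent", "analysis", "web", "crawler"]
candidates = filter_candidates(tokens, freq_threshold=2, mi_threshold=2.0)
```

The two-stage test keeps frequent pairs regardless of their association strength while still admitting rare pairs whose words occur together far more often than chance would predict.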