CN109903198A

CN109903198A - Patent comparative analysis method

Info

Publication number: CN109903198A
Application number: CN201910067706.2A
Authority: CN
Inventors: 汪云霄; 覃婷婷; 刘峥
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-01-24
Filing date: 2019-01-24
Publication date: 2019-06-18
Anticipated expiration: 2039-01-24
Also published as: CN109903198B

Abstract

The present invention provides a patent comparative analysis method. The method establishes a patent database using web crawler technology, establishes a candidate phrase set for a patent document set based on word segmentation, extracts an important phrase set based on an optimization method, calculates the similarity scores and difference scores between the important phrases and the target patent and the comparison patent, and extracts the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization method, thereby realizing patent comparative analysis quickly and effectively.

Description

Patent comparative analysis method

Technical field

The present invention relates to a patent comparative analysis method and belongs to the fields of natural language processing and patent analysis.

Background art

Patent comparative analysis is one type of patent analysis. An effective method for comparing patent documents can quickly identify the similarities and differences between them. In a sense, the patent level of an enterprise represents its overall innovation level. Key personnel of an enterprise can use comparative analysis to identify the core technologies of other enterprises and thereby formulate an effective technology strategy.

Many patent retrieval and analysis systems already exist, such as IncoPat, SooPat, and Patsnap, but these systems mainly provide patent retrieval and simple statistical analysis, and such basic analysis cannot meet the demand for in-depth patent mining. In addition, the annual number of patent applications is rising rapidly and the workload of manually examining patents keeps growing, so developing an automated patent comparative analysis system is of great significance.

In view of this, it is necessary to provide a patent comparative analysis method to solve the above problems.

Summary of the invention

The purpose of the present invention is to provide a patent comparative analysis method that mines the similarities and differences between patent documents more deeply, so that the value of a target patent can be identified more accurately and quickly.

To achieve the above purpose, the present invention provides a patent comparative analysis method comprising the following steps:

S1, establishing a patent database based on a web crawler method;

S2, extracting a patent document set D on a target topic from the patent database, and establishing a candidate phrase set of the patent document set D, wherein the patent document set D includes at least one target patent and at least one comparison patent;

S3, based on an optimization selection model, extracting from the candidate phrase set an important phrase set of the target patent and the comparison patent, the important phrase set including a target patent important phrase set and a comparison patent important phrase set;

S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity score and difference score between each important phrase in the important phrase set and the target patent, and between each important phrase and the comparison patent;

S5, extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization objective method.

As a further improvement of the present invention, step S1 is specifically: selecting multiple target patent websites, building multiple crawler modules using a distributed crawler framework, starting multiple crawler threads to crawl the target patent websites simultaneously, and, according to the composition of the crawled patent information, creating database tables to store the crawled patent information, thereby building the patent database.

As a further improvement of the present invention, step S2 specifically includes:

S21, extracting the patent document set D on the target topic from the patent database;

S22, performing word segmentation on the patent documents in the patent document set D to obtain a segmented word set of the patent document set D, the segmented word set including a number of segmented words;

S23, establishing a stop-word list, and screening and filtering the segmented words in the segmented word set according to the stop-word list to obtain a valid word set of the patent document set D;

S24, calculating the mutual information value MI of the words in the valid word set, so as to extract the candidate phrase set of the patent document set D from the valid word set.

As a further improvement of the present invention, step S24 is specifically: defining a word frequency threshold F and a mutual information threshold I, and calculating the mutual information value MI of each candidate word from the joint distribution and the marginal distributions of the candidate words in the valid word set; if the frequency of a candidate word is greater than the set frequency threshold F, the candidate word is added to the candidate phrase set; if the frequency of a candidate word is less than the set frequency threshold F, its mutual information value MI is examined, and the candidate word is added to the candidate phrase set if its MI is greater than the set mutual information threshold I and is discarded otherwise.

As a further improvement of the present invention, step S3 is specifically:

S31, calculating the saliency score of each candidate phrase in the candidate phrase set within the patent document in which it appears, to characterize the saliency of the candidate phrase in that patent document;

S32, calculating the uniqueness score of each candidate phrase in the candidate phrase set within the patent document in which it appears, to characterize the uniqueness of the candidate phrase in that patent document;

S33, based on an optimization selection method and the saliency score and uniqueness score of each candidate phrase in the candidate phrase set, extracting an important phrase set S of the target patent and the comparison patent, the important phrase set S including a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.

As a further improvement of the present invention, step S33 is specifically: defining K as the threshold on the number of important phrases in an important phrase set, establishing an optimization objective that uses the saliency scores and uniqueness scores of the candidate phrases in the candidate phrase set as the extraction criteria, and obtaining the important phrase set of the target patent and the comparison patent through the optimization objective; the important phrase set includes a target patent important phrase set and a comparison patent important phrase set, the target patent important phrase set includes K important phrases related to the target patent, and the comparison patent important phrase set includes K important phrases related to the comparison patent.

As a further improvement of the present invention, step S4 specifically includes:

S41, constructing the important phrase-patent document bipartite graph;

S42, calculating, in the important phrase-patent document bipartite graph, the relevance between each important phrase and the target patent and the relevance between each important phrase and the comparison patent;

S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between each important phrase and the target patent and the comparison patent;

S44, calculating, in the important phrase-patent document bipartite graph, the difference score between each important phrase and the target patent and the comparison patent.

As a further improvement of the present invention, step S5 specifically includes:

S51, based on the optimization objective method and the similarity scores between the important phrases in the important phrase set S and the target patent and the comparison patent, obtaining the similar phrase set C between the target patent and the comparison patent;

S52, based on the optimization objective method and the difference scores between the important phrases in the important phrase set S and the target patent and the comparison patent, obtaining the target patent difference phrase set and the comparison patent difference phrase set.

As a further improvement of the present invention, step S51 is specifically: defining an optimization objective and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, the similarity constraints guaranteeing that the similarity score of each extracted similar phrase is greater than both the average similarity score of the target patent important phrase set and the average similarity score of the comparison patent important phrase set.

As a further improvement of the present invention, step S52 is specifically: defining an optimization objective and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, the difference constraints guaranteeing that the difference score of each extracted difference phrase is greater than both the average difference score of the target patent important phrase set and the average difference score of the comparison patent important phrase set, and that the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set, and the comparison patent difference phrase set do not intersect.

The beneficial effects of the present invention are as follows: the patent comparative analysis method of the present invention establishes a patent database using web crawler technology, establishes the candidate phrase set of the patent document set D based on word segmentation, extracts the important phrase set S based on an optimization method, calculates the similarity scores and difference scores between the important phrases and the target patent and the comparison patent, and extracts the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization method, thereby realizing patent comparative analysis quickly and effectively.

Brief description of the drawings

Fig. 1 is a structural and functional diagram of the patent comparative analysis method of the present invention.

Fig. 2 is a flow chart of the patent comparative analysis method of the present invention.

Fig. 3 is a schematic structural diagram of the important phrase-patent document bipartite graph in Fig. 2.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.

Referring to Fig. 1 in conjunction with Fig. 2, the present invention discloses a patent comparative analysis method comprising the following steps:

S1, establishing a patent database based on a web crawler method;

S2, extracting a patent document set D on a target topic from the patent database, and establishing a candidate phrase set P of the patent document set D, wherein the patent document set D includes at least one target patent d_c and at least one comparison patent d'_c;

S3, based on an optimization selection model, extracting from the candidate phrase set P an important phrase set S of the target patent d_c and the comparison patent d'_c, the important phrase set S including a target patent important phrase set and a comparison patent important phrase set, that is, S is made up of these two sets;

S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity score and difference score between each important phrase in the important phrase set S and the target patent d_c, and between each important phrase and the comparison patent d'_c;

S5, extracting the similar phrase set and the difference phrase sets of the target patent d_c and the comparison patent d'_c based on an optimization objective method.

The following description details steps S1 to S5 only.

Step S1 is specifically: establishing a patent database using a web crawler method. A web crawler is an efficient information-collection tool that can acquire various data resources quickly and accurately. However, when a website applies an "anti-crawling" strategy, prior-art web crawlers are easily blocked, so the number of requests that the same IP address and the same account can make within a period of time is severely restricted. For this reason, the patent comparative analysis method of the present invention builds a crawler disguise module by maintaining a proxy IP pool and a Cookies pool, builds multiple crawler modules using a distributed crawler framework, starts multiple crawler threads to crawl the target patent websites simultaneously, and obtains patent information using the request library and the bs4 web page parsing package, so that reasonable database tables can be built according to the composition of the obtained patent information to store the crawled patent information.

Further, the patent information crawled by the web crawler method includes fields such as: patent name, application number, filing date, publication number, publication date, applicant, inventor, applicant address, IPC code, abstract, keywords, CPC classification number, applicant postcode, agency, agent, claims, description, drawings, PDF text, legal status effective date, legal status meaning, related application number, related patent publication number, and related patent title. The patent information is stored in the patent database according to the table structure, to ensure that the content of the patent database is comprehensive and stable.
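The storage part of step S1 can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table name `patents`, the column names, and the choice of the publication number as primary key are assumptions made for the example and are not specified by the patent. Only a subset of the listed fields is shown.

```python
import sqlite3

# Hypothetical column names covering a subset of the fields listed above.
PATENT_COLUMNS = [
    "title", "application_number", "filing_date", "publication_number",
    "publication_date", "applicant", "inventor", "ipc_code",
    "abstract", "claims", "description", "legal_status",
]

def create_patent_table(conn):
    """Create one table keyed by publication number (an assumed key)."""
    cols = ", ".join(f"{c} TEXT" for c in PATENT_COLUMNS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS patents ({cols}, "
        "PRIMARY KEY (publication_number))"
    )

def store_patent(conn, record):
    """Insert or overwrite one crawled record; missing fields become NULL."""
    values = [record.get(c) for c in PATENT_COLUMNS]
    placeholders = ", ".join("?" for _ in PATENT_COLUMNS)
    conn.execute(
        f"INSERT OR REPLACE INTO patents ({', '.join(PATENT_COLUMNS)}) "
        f"VALUES ({placeholders})",
        values,
    )
```

Using `INSERT OR REPLACE` means that re-crawling the same publication overwrites the earlier row instead of duplicating it, which is one simple way to keep the stored content stable across repeated crawls.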

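Step S2 specifically includes:

S21, extracting the patent document set D on the target topic from the patent database;

S22, performing word segmentation on the patent documents in the patent document set D to obtain a segmented word set of the patent document set D, the segmented word set including a number of segmented words;

S23, establishing a stop-word list, and screening and filtering the segmented words in the segmented word set according to the stop-word list to obtain a valid word set of the patent document set D;

S24, calculating the mutual information value MI of the words in the valid word set, so as to extract the candidate phrase set P of the patent document set D from the valid word set.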

In step S21, the patent document set D on the target topic is extracted from the patent database mainly by filtering on IPC codes or by setting keywords. In the present invention, D = {d_1, d_2, ..., d_n}, where n is the number of patent documents in the patent document set D. Any patent document d mainly includes the application number, filing date, applicant, address, inventor, patent agency, IPC code, summary of the invention, and so on. A target patent d_c and a comparison patent d'_c are defined, where d_c, d'_c ∈ D and d_c ≠ d'_c.

Owing to the professional drafting requirements for patent documents, a patent document d is generally lengthy, linguistically complex, and full of interfering words. Analyzing the patent document d directly would therefore introduce large errors into the result of the patent comparative analysis. For this reason, in steps S22 to S24 of the present invention, the patent documents d in the patent document set D are processed using natural language processing to establish the candidate phrase set P of the patent document set D on the target topic. The following description takes the case in which the patent document d is a Chinese text as an example.

In step S22, when natural language processing is performed, Chinese text has rich unstructured sentence forms and the word sequence of a sentence has no obvious rules or boundaries, so the Chinese text of the patent document d needs to be segmented into words. Preferably, in this embodiment, a general-purpose Chinese word segmentation tool can be used to segment the patent document d, so as to obtain the segmented word set of the patent document set D, the segmented word set including a number of segmented words.

In step S23, stop words are defined. Stop words are words with no practical meaning, including function words, conjunctions, and the like, such as "is" and "and". A stop-word list is established, and the segmented words in the segmented word set are screened and filtered according to the stop-word list to obtain the valid word set of the patent document set D.
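The filtering in step S23 can be illustrated with a short sketch. The three-entry stop-word list and the function name `filter_stop_words` are hypothetical; a practical stop-word list would be far larger and is not given in the source.

```python
# Illustrative stop-word list only; the patent does not enumerate one.
STOP_WORDS = {"的", "是", "和"}

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are non-blank and not in the stop-word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]
```

Applied to the output of the word segmentation in step S22, this yields the valid word set used in step S24.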

Traditional phrase selection methods consider only the frequency of the segmented words, so words that occur rarely but carry rich semantic features are ignored. To prevent this, in step S24 the mutual information value MI of the candidate words in the valid word set is calculated, so as to extract the candidate phrase set P of the patent document set D from the valid word set, where P = {p_1, p_2, ..., p_m}, p is a candidate phrase, and m is the number of candidate phrases p in the candidate phrase set P.

Specifically, in step S24, a word frequency threshold F and a mutual information threshold I are defined, and the mutual information value MI is calculated from the following quantities: X and Y are two candidate words in the valid word set; p(x, y) is the joint distribution of the two candidate words X and Y; p(x) is the marginal distribution of candidate word X; and p(y) is the marginal distribution of candidate word Y.

If the frequency of a candidate word is greater than the set frequency threshold F, the candidate word is added to the candidate phrase set P. If the frequency of a candidate word is less than the set frequency threshold F, the mutual information value MI of the candidate word in the corresponding patent document d is examined: if its MI is greater than the set mutual information threshold I, the candidate word is added to the candidate phrase set P; otherwise it is discarded.
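The two-threshold rule of step S24 can be sketched as below. Several points are assumptions made for illustration, since the source does not reproduce the formula: a candidate is taken to be a pair of adjacent valid words, MI is taken to be the pointwise mutual information log2(p(x, y) / (p(x) p(y))), and the probabilities are estimated from relative frequencies. The function and parameter names are hypothetical.

```python
import math
from collections import Counter

def select_candidate_phrases(docs, freq_threshold, mi_threshold):
    """docs: list of token lists (stop words already removed).

    Assumption: a candidate is a pair of adjacent tokens, and MI is the
    pointwise mutual information log2(p(x, y) / (p(x) * p(y))).
    """
    unigrams = Counter(t for doc in docs for t in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    selected = set()
    for (x, y), count in bigrams.items():
        if count > freq_threshold:          # frequent enough: keep directly
            selected.add((x, y))
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        mi = math.log2(p_xy / (p_x * p_y))
        if mi > mi_threshold:               # rare but strongly associated
            selected.add((x, y))
    return selected
```

The second branch is what lets a low-frequency pair survive: a pair seen only once can still be kept when its two words almost never occur apart.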

Step S3 is specifically:

S31, calculating the saliency score of each candidate phrase p in the candidate phrase set P within the patent document d in which it appears, to characterize the saliency of the candidate phrase p in that patent document d;

S32, calculating the uniqueness score of each candidate phrase p in the candidate phrase set P within the patent document d in which it appears, to characterize the uniqueness of the candidate phrase p in that patent document d;

S33, based on an optimization selection method and the saliency score and uniqueness score of each candidate phrase p in the candidate phrase set P, extracting the important phrase set S of the target patent d_c and the comparison patent d'_c, the important phrase set S including a target patent important phrase set and a comparison patent important phrase set.

After the candidate phrase set P has been extracted for the entire patent document set D, each patent document d can be regarded as being composed of a number of candidate phrases p. In practice, most candidate phrases p cannot represent the patent document d, so the candidate phrase set P of the patent document set D needs further processing in order to better characterize each patent document d in the patent document set D.

Specifically, step S31 calculates the saliency score r_{p,d} of each candidate phrase p in the candidate phrase set P within the patent document d in which it appears, to characterize the saliency of the candidate phrase p in that patent document d. If a candidate phrase p occurs frequently in the patent document d in which it appears and rarely in the other patent documents of the patent document set D, the candidate phrase p is strongly salient with respect to that patent document d. The saliency of a single candidate phrase p with respect to the patent document d in which it appears can therefore be expressed by the saliency score r_{p,d}, which in the present invention is defined in terms of the following quantities:

P_d denotes the set of all candidate phrases p of the patent document d, n(p, d) denotes the frequency with which the candidate phrase p occurs in the patent document d in which it appears, and n(p, D) denotes the frequency with which the candidate phrase p occurs in the patent document set D.

Step S32 calculates the uniqueness score of each candidate phrase p in the candidate phrase set P within the patent document d in which it appears, to characterize the uniqueness of the candidate phrase p in that patent document d. Specifically, an important candidate phrase p needs to be distinct from the other candidate phrases p in the candidate phrase set P, that is, it needs to be strongly unique. The uniqueness of a single candidate phrase p can therefore be calculated from the semantic similarity between candidate phrases p.

In step S32, the uniqueness of a single candidate phrase p is obtained from a semantic similarity calculation based on a semantic tree. That is, information content is used to measure semantic similarity, a semantic tree is constructed using a semantic dictionary, and the semantic similarity Sim(p_i, p_j) between the i-th candidate phrase p_i and the j-th candidate phrase p_j is calculated based on the path length between the candidate phrases, so as to characterize the uniqueness of the candidate phrase p.
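A path-length similarity on a semantic tree can be sketched as follows. The source does not give the exact formula for Sim(p_i, p_j), so the form 1 / (1 + path length) used here is only one common choice and is an assumption, as are the child-to-parent dictionary representation and the function names.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def path_similarity(a, b, parent):
    """Similarity 1 / (1 + shortest path length) between two tree nodes.

    The 1 / (1 + length) form is an assumption, not the patent's formula.
    parent maps each non-root node to its parent in the semantic tree.
    """
    up_a = ancestors(a, parent)
    depth_in_a = {n: i for i, n in enumerate(up_a)}
    for j, n in enumerate(ancestors(b, parent)):
        if n in depth_in_a:                 # first common ancestor
            return 1.0 / (1 + depth_in_a[n] + j)
    return 0.0                              # nodes lie in different trees
```

A phrase whose similarity to every other candidate is low under such a measure would count as highly unique.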

Further, in step S33, based on the optimization selection method and the saliency score r_{p,d} and uniqueness score of each candidate phrase p in the candidate phrase set P, the important phrase set S of the target patent d_c and the comparison patent d'_c is extracted. In the present invention, K is defined as the threshold on the number of important phrases p' in the important phrase set S, and an optimization objective is established that uses the saliency and uniqueness of the candidate phrases p in the candidate phrase set P as the extraction criteria.

The optimization objective combines two terms. The first is the sum of the saliency scores of all important phrases p' in an important phrase set, each saliency score carrying a weight. The second is the sum of the comprehensive similarity scores of all important phrases p' in the important phrase set. Because a candidate phrase p that is more similar to the other candidate phrases p is less unique, the sum of the comprehensive similarity scores acts as a penalty term in the optimization objective. μ and λ are the score weights of the terms. With this setup, the important phrase sets of the target patent d_c and the comparison patent d'_c can be extracted.
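One way to approximate such a selection is a greedy procedure that repeatedly adds the phrase with the best trade-off between saliency and redundancy. This is a sketch under stated assumptions, not the patent's solver: the source does not reproduce the objective, so the gain `mu * saliency - lam * (similarity to phrases already chosen)`, the assignment of μ to the saliency term and λ to the penalty term, and the greedy strategy itself are all illustrative choices.

```python
def select_important_phrases(saliency, similarity, k, mu=1.0, lam=1.0):
    """Greedily pick up to k phrases.

    saliency: dict mapping phrase -> saliency score.
    similarity: function (p, q) -> semantic similarity of two phrases.
    Gain of adding p: mu * saliency[p] - lam * sum of similarity to chosen.
    """
    chosen = []
    remaining = set(saliency)
    while remaining and len(chosen) < k:
        def gain(p):
            penalty = sum(similarity(p, q) for q in chosen)
            return mu * saliency[p] - lam * penalty
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

In the example used for testing, phrase "b" is nearly as salient as "a" but almost synonymous with it, so the less salient but distinct phrase "c" is picked second.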

Step S4 specifically includes:

S41, constructing the important phrase-patent document bipartite graph;

S42, calculating, in the important phrase-patent document bipartite graph, the relevance between each important phrase p' and the target patent d_c and the comparison patent d'_c;

S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between each important phrase p' and the target patent d_c and the comparison patent d'_c;

S44, calculating, in the important phrase-patent document bipartite graph, the difference score between each important phrase p' and the target patent d_c and the comparison patent d'_c.

In step S41, the important phrase-patent document bipartite graph characterizes the correlation between the important phrase set S and the patent document set D (see Fig. 3). There is a connecting edge between each important phrase p' node and the patent documents d, and the weight of each connecting edge can be obtained by BM25 relevance calculation.

Further, in step S42, the random-walk-based SimRank algorithm can be used to calculate, in the important phrase-patent document bipartite graph, the relevance f(p', d_c) between an important phrase p' and the target patent d_c and the relevance f(p', d'_c) between the important phrase p' and the comparison patent d'_c.
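The edge weights of step S41 can be sketched with a plain implementation of the widely used Okapi BM25 formula. The parameter values k1 = 1.5 and b = 0.75 are conventional defaults rather than values from the patent, the IDF uses a non-negative variant, and the sketch simply omits an edge when a phrase does not occur in a document; all of these are assumptions, as is the function name.

```python
import math

def bm25_edge_weights(phrases, docs, k1=1.5, b=0.75):
    """Return {(phrase, doc_id): weight} for a phrase-document bipartite graph.

    docs: dict mapping doc_id -> list of phrases occurring in that document.
    k1 and b are conventional defaults, not values from the patent.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    edges = {}
    for p in phrases:
        df = sum(1 for d in docs.values() if p in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tokens in docs.items():
            tf = tokens.count(p)
            if tf == 0:
                continue                     # no edge when the phrase is absent
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            edges[(p, doc_id)] = idf * tf * (k1 + 1) / (tf + norm)
    return edges
```

The resulting dictionary is one possible weighted edge list on which a relevance measure such as the one in step S42 could then operate.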

Step S43 calculates the similarity score Φ(p', d_c, d'_c) between an important phrase p' and the target patent d_c and the comparison patent d'_c:

Φ(p', d_c, d'_c) = ln(1 + f(p', d_c) · f(p', d'_c))

where f(p', d_c) is the relevance between the important phrase p' and the target patent d_c, and f(p', d'_c) is the relevance between the important phrase p' and the comparison patent d'_c.

In fact, when an important phrase p' is highly relevant to both the target patent d_c and the comparison patent d'_c, the important phrase p' is of high importance in the important phrase set S. Therefore, for a given important phrase p', the larger its relevance f(p', d_c) to the target patent d_c and its relevance f(p', d'_c) to the comparison patent d'_c, the higher the similarity score Φ(p', d_c, d'_c) between the important phrase p' and the target patent d_c and the comparison patent d'_c. In calculating the similarity score Φ(p', d_c, d'_c), the present invention takes the logarithm of the product of the two relevance values, so that both the relevance f(p', d_c) between the important phrase p' and the target patent d_c and the relevance f(p', d'_c) between the important phrase p' and the comparison patent d'_c are taken into account, which better characterizes the relationship between the important phrase p' and the target patent d_c and the comparison patent d'_c.
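As a small worked illustration of the similarity score, the function below evaluates Φ = ln(1 + f(p', d_c) · f(p', d'_c)) for given relevance values. The function name and the sample relevance values in the note are hypothetical.

```python
import math

def similarity_score(f_target, f_comparison):
    """Phi(p', d_c, d'_c) = ln(1 + f(p', d_c) * f(p', d'_c))."""
    return math.log(1.0 + f_target * f_comparison)
```

Because the two relevance values are multiplied, a phrase that is relevant to only one of the two patents scores close to zero: `similarity_score(0.8, 0.8)` exceeds `similarity_score(0.8, 0.1)`, and a relevance of 0 on either side gives exactly 0.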

Step S44 calculates the difference score Ω(p', d_c | d'_c) between an important phrase p' and the target patent d_c and the comparison patent d'_c. The calculation uses a smoothing parameter γ, which guards against the case in which the relevance f(p', d_c) between the important phrase p' and the target patent d_c or the relevance f(p', d'_c) between the important phrase p' and the comparison patent d'_c tends to 0.

Specifically, target patent d is being calculated_cWith comparison patent d '_cOtherness score Ω (p ', d_c|d′_c) when, it is important Phrase p ' should be with target patent d_cWith comparison patent d '_cIn a degree of correlation it is very high, and it is very low with another degree of correlation, And important phrases p ' should in important phrases collection S importance with higher, therefore for target patent d_cIn it is a certain Important phrases p ', otherness score Ω (p ', d_c|d′_c) there are following two situations: if one, important phrases p ' and target patent d_cThe degree of correlation is very high, and with comparison patent d '_cThe degree of correlation it is relatively low, then the otherness score Ω of important phrases p ' (p ', d_c|d′_c) higher；If important phrases p ' and target patent d_cThe degree of correlation is relatively high, and with comparison patent d '_cThe degree of correlation It is very low, then otherness score Ω (p ', the d of important phrases p '_c|d′_c) also higher.

Two, important phrases p ' and target patent d_cIt is non-significant similar, but its significant is different from compares patent d '_c, then this is important Phrase p ' can also be used as difference phrase Ω (p ', d_c|d′_c), to embody target patent d_cWith comparison patent d '_cBetween difference Property.And as important phrases p ' and comparison patent d '_cIt is non-significant similar, but its significant is different from target patent d_c, then the important phrases P ' can also be used as difference phrase Ω (p ', d_c|d′_c), to embody target patent d_cWith comparison patent d '_cBetween otherness.

Step S5 specifically includes:

S51, based on the optimization objective method and the similarity scores Φ(p', d_c, d'_c) between the important phrases p' in the important phrase set S and the target patent d_c and the comparison patent d'_c, obtaining the similar phrase set C between the target patent d_c and the comparison patent d'_c;

S52, based on the optimization objective method and the difference scores Ω(p', d_c | d'_c) between the important phrases p' in the important phrase set S and the target patent d_c and the comparison patent d'_c, obtaining the target patent difference phrase set Q and the comparison patent difference phrase set Q'.

Step S51 is specifically: defining an optimization objective and at least two similarity constraints. In the optimization objective of step S51, p_i is the i-th similar phrase in the similar phrase set C; the target patent important phrase set and the comparison patent important phrase set are those obtained in step S3; and x_i is a 0-1 decision variable indicating whether the i-th candidate phrase is a similar phrase, where x_i = 1 means that it is a similar phrase and x_i = 0 means that it is not.

Further, the purpose of defining the optimization objective is to maximize the sum of the similarity scores Φ(p_s, d_c, d'_c) of the similar phrases p_s in the similar phrase set C, while the similarity constraints guarantee that the similarity score Φ(p_s, d_c, d'_c) of each extracted similar phrase p_s is greater than both the average similarity score Φ(p', d_c, d'_c) of the target patent important phrase set and the average similarity score Φ(p', d_c, d'_c) of the comparison patent important phrase set, so as to limit the size of the similar phrase set C.

It should be noted that the present invention is described using two similarity constraints only as an example; in other embodiments of the present invention, a different number of similarity constraints may of course be set.
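Under the two similarity constraints described above, the selection in step S51 can be sketched as a simple threshold filter. This sketch assumes that all similarity scores are non-negative (true for ln(1 + x) with x ≥ 0), that both important phrase sets are non-empty, and that no further constraints apply; under those assumptions, keeping every phrase that passes both thresholds maximizes the sum. The function and argument names are hypothetical.

```python
def select_similar_phrases(phi, target_set, comparison_set):
    """phi: dict mapping every important phrase -> similarity score.

    Assumes both sets are non-empty and all scores are non-negative, so
    taking every phrase that passes both thresholds maximizes the sum.
    """
    mean_target = sum(phi[p] for p in target_set) / len(target_set)
    mean_comparison = sum(phi[p] for p in comparison_set) / len(comparison_set)
    candidates = set(target_set) | set(comparison_set)
    return {p for p in candidates
            if phi[p] > mean_target and phi[p] > mean_comparison}
```

If additional constraints were introduced, for example a cap on the size of C, a general 0-1 integer programming solver would be needed instead of this filter.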

Step S52 is specifically: defining an optimization objective and at least three difference constraints, one of which requires that the phrase sets do not intersect:

C ∩ Q = C ∩ Q' = ∅

where Q is the target patent difference phrase set; Q' is the comparison patent difference phrase set; and y_i and y_i' are 0-1 decision variables, y_i indicating whether a candidate phrase of the target patent d_c is a difference phrase and y_i' indicating whether a candidate phrase of the comparison patent d'_c is a difference phrase.

Specifically, in step S52 optimum target establish meaning be: so that target patent difference phrase book Q and Compare the sum of the otherness score in patent difference phrase book Q ' maximum；On the one hand otherness constraint condition is used to guarantee to extract Difference phrase p_iOtherness score Ω (p_i, d_c|d_c') it is respectively greater than target patent important phrases collectionOtherness score Ω (p ', d_c|d_c) average value and comparison patent important phrases collectionOtherness score Ω (p ', d_c|d_c) average value；Separately On the one hand, so that target patent d_cWith comparison patentSimilar phrase book C, target patent difference phrase book Q and comparison it is special Without intersection between the different phrase book Q ' of profit.

In conclusion Patent Reference's analysis method of the invention, by using web crawlers technology establish patent database, The candidate phrase collection P of patent file collection D is established based on participle technique, important phrases collection S is extracted based on optimal method, calculates weight Want phrase p ' and target patent d_cWith comparison patent d '_cSimilarity scores Φ (p ', d_c|d′_c) and otherness score Ω (p ', d_c| d_c) and based on optimal method extraction target patent d_cWith comparison patent d '_cSimilar phrase book and difference phrase book, quickly, Have effectively achieved the comparative analysis of patent.

The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the invention may be modified or equivalently replaced without departing from its spirit and scope.

Claims

1. A patent comparative analysis method, characterized by comprising the following steps:

S1, establishing a patent database based on a web crawler method;

S2, extracting a patent document set D on a target topic from the patent database, and establishing a candidate phrase set of the patent document set D, wherein the patent document set D includes at least one target patent and at least one comparison patent;

S3, based on an optimization selection model, extracting from the candidate phrase set an important phrase set of the target patent and the comparison patent, the important phrase set including a target patent important phrase set and a comparison patent important phrase set;

S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity score and difference score between each important phrase in the important phrase set and the target patent, and the similarity score and difference score between each important phrase and the comparison patent;

S5, extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent, respectively, based on an optimization objective method.

2. The patent comparative analysis method according to claim 1, characterized in that step S1 is specifically: selecting a plurality of target patent websites, constructing a plurality of crawler modules using a distributed crawler framework, starting a plurality of crawler threads to crawl the target patent websites simultaneously, and, according to the structure of the crawled patent information, creating database tables to store the crawled patent information, thereby constructing the patent database.

3. The patent comparative analysis method according to claim 1, characterized in that step S2 specifically includes:

S22, performing word segmentation on the patent documents in the patent document set D to obtain a token set of the patent document set D, the token set including a number of tokens;

S23, establishing a stop word list, and screening and filtering the tokens in the token set according to the stop word list to obtain an effective token set of the patent document set D;

S24, calculating the mutual information value MI of the tokens in the effective token set, so as to extract the candidate phrase set of the patent document set D from the effective token set.

4. The patent comparative analysis method according to claim 3, characterized in that step S24 is specifically: defining a token frequency threshold F and a token mutual information threshold I; calculating the joint distribution and the marginal distributions of the candidate tokens in the effective token set to obtain the mutual information value MI of each candidate token; if the frequency of a candidate token is greater than the set frequency threshold F, adding the candidate token to the candidate phrase set; if the frequency of a candidate token is less than the set frequency threshold F, examining its mutual information value MI, and if the mutual information value MI of the candidate token is greater than the set mutual information threshold I, adding it to the candidate phrase set, otherwise discarding the candidate token.

5. The patent comparative analysis method according to claim 1, characterized in that step S3 is specifically:

S31, calculating the significance score of each candidate phrase in the candidate phrase set within the patent document in which it appears, to characterize the significance of the candidate phrase in that patent document;

S32, calculating the uniqueness score of each candidate phrase in the candidate phrase set within the patent document in which it appears, to characterize the uniqueness of the candidate phrase in that patent document;

S33, based on an optimization selection method, and combining the significance score and uniqueness score of each candidate phrase in the candidate phrase set, extracting the important phrase set S of the target patent and the comparison patent, the important phrase set S including a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.

6. The patent comparative analysis method according to claim 5, characterized in that step S33 is specifically: defining the threshold on the number of important phrases in the important phrase set as K, establishing an optimization objective that uses the significance scores and uniqueness scores of the candidate phrases in the candidate phrase set as the extraction criterion, and obtaining the important phrase set of the target patent and the comparison patent through the optimization objective, the important phrase set including a target patent important phrase set and a comparison patent important phrase set; the target patent important phrase set includes K important phrases related to the target patent, and the comparison patent important phrase set includes K important phrases related to the comparison patent.

7. The patent comparative analysis method according to claim 1, characterized in that step S4 specifically includes:

S41, constructing an important phrase-patent document bipartite graph;

S42, calculating, in the important phrase-patent document bipartite graph, the relevance between each important phrase and the target patent and the relevance between each important phrase and the comparison patent;

S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between each important phrase and the target patent and the comparison patent;

S44, calculating, in the important phrase-patent document bipartite graph, the difference score between each important phrase and the target patent and the comparison patent.

8. The patent comparative analysis method according to claim 7, characterized in that step S5 specifically includes:

S51, based on an optimization objective method, and combining the similarity scores between the important phrases in the important phrase set S and the target patent and the comparison patent, obtaining the similar phrase set C between the target patent and the comparison patent;

S52, based on an optimization objective method, and combining the difference scores between the important phrases in the important phrase set S and the target patent and the comparison patent, obtaining the target patent difference phrase set and the comparison patent difference phrase set.

9. The patent comparative analysis method according to claim 8, characterized in that step S51 is specifically: defining an optimization objective and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, the similarity constraints guaranteeing that the similarity score of each extracted similar phrase is greater than both the average of the similarity scores of the target patent important phrase set and the average of the similarity scores of the comparison patent important phrase set.

10. The patent comparative analysis method according to claim 8, characterized in that step S52 is specifically: defining an optimization objective and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, the difference constraints guaranteeing that the difference score of each extracted difference phrase is greater than both the average of the difference scores of the target patent important phrase set and the average of the difference scores of the comparison patent important phrase set, and that the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set, and the comparison patent difference phrase set have no intersection.
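The frequency and mutual information thresholds of claim 4 can be illustrated with a minimal sketch. Several points are assumptions made only for illustration: candidates are taken to be pairs of adjacent word segments, pointwise mutual information computed from the joint and marginal frequencies stands in for the MI value, and a candidate whose frequency equals F (a case the claim does not address) falls through to the MI test. The function name `candidate_phrases` is hypothetical.

```python
import math
from collections import Counter


def candidate_phrases(tokens, freq_threshold, mi_threshold):
    """Select adjacent word pairs by frequency threshold F, then by MI threshold I."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    kept = []
    for (w1, w2), count in bigrams.items():
        # Joint probability of the pair and marginal probabilities of each word.
        p_joint = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi = math.log(p_joint / (p1 * p2))
        # Frequent pairs are kept directly; other pairs must pass the MI test.
        if count > freq_threshold or pmi > mi_threshold:
            kept.append((w1, w2))
    return kept


words = ["patent", "analysis", "method", "patent", "analysis",
         "system", "data", "mining"]
phrases = candidate_phrases(words, freq_threshold=1, mi_threshold=2.0)
```

Here the pair ("patent", "analysis") passes on frequency alone, while rarer pairs are kept only if their two words co-occur far more often than their individual frequencies would suggest.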