CN109903198A - Patent comparative analysis method - Google Patents

Patent comparative analysis method

Info

Publication number
CN109903198A
Authority
CN
China
Prior art keywords
collection
target
important phrases
phrase
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910067706.2A
Other languages
Chinese (zh)
Other versions
CN109903198B (en)
Inventor
汪云霄
覃婷婷
刘峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201910067706.2A
Publication of CN109903198A
Application granted
Publication of CN109903198B
Legal status: Active
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a patent comparative analysis method. The method establishes a patent database using web crawler technology, builds a candidate phrase set for a patent document set based on word segmentation, extracts an important phrase set based on an optimization method, calculates the similarity scores and difference scores between the important phrases and the target patent and comparison patent, and extracts the similar phrase set and difference phrase sets of the target patent and the comparison patent based on an optimization method, thereby realizing patent comparative analysis quickly and effectively.

Description

Patent comparative analysis method
Technical field
The present invention relates to a patent comparative analysis method and belongs to the fields of natural language processing and patent analysis.
Background
Patent comparative analysis is one type of patent analysis. An effective method for comparing patent documents can quickly identify the similarities and differences between them. In a sense, the level of an enterprise's patents represents the enterprise's overall level of innovation. Key personnel of an enterprise can use comparative analysis to identify the core technologies of other enterprises and thereby formulate an effective technology strategy.
Many patent retrieval and analysis systems already exist, such as IncoPat, SooPat, and Patsnap, but these systems mainly provide patent retrieval and simple statistical analysis of patents, and such basic analysis cannot meet the demand for in-depth patent mining. In addition, the number of patent applications rises rapidly every year and the workload of manually examining patents keeps growing, so developing an automated patent comparative analysis system is of great significance.
In view of this, it is necessary to provide a patent comparative analysis method to solve the above problems.
Summary of the invention
The object of the present invention is to provide a patent comparative analysis method that mines the similarities and differences between patent documents more deeply, so that the value of a target patent can be found more accurately and quickly.
To achieve the above object, the present invention provides a patent comparative analysis method comprising the following steps:
S1, establishing a patent database based on a web crawler method;
S2, extracting a patent document set D on a target topic from the patent database and establishing a candidate phrase set of the patent document set D, wherein the patent document set D includes at least one target patent and at least one comparison patent;
S3, based on an optimization selection model, extracting from the candidate phrase set the important phrase set of the target patent and the comparison patent, the important phrase set including a target patent important phrase set and a comparison patent important phrase set;
S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity scores and difference scores of the important phrases in the important phrase set with respect to the target patent and with respect to the comparison patent;
S5, extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent based on an optimization objective method.
As a further improvement of the present invention, step S1 specifically comprises: selecting multiple target patent websites, building multiple crawler modules with a distributed crawler framework, opening multiple crawler threads to crawl the target patent websites simultaneously, and, according to the composition of the crawled patent information, establishing database tables to store it, thereby constructing the patent database.
As a further improvement of the present invention, step S2 specifically comprises:
S21, extracting the patent document set D on the target topic from the patent database;
S22, performing word segmentation on the patent documents in the patent document set D to obtain the token set of D, the token set including several tokens;
S23, establishing a stop-word list, and screening and filtering the tokens in the token set according to the stop-word list to obtain the effective token set of the patent document set D;
S24, calculating the mutual information value MI of the tokens in the effective token set, so as to extract the candidate phrase set of the patent document set D from the effective token set.
As a further improvement of the present invention, step S24 specifically comprises: defining a token frequency threshold F and a token mutual information threshold I, and obtaining the mutual information value MI of each candidate token by calculating the joint distribution and marginal distributions of the candidate tokens in the effective token set. If the frequency of a candidate token is greater than the set frequency threshold F, the candidate token is added to the candidate phrase set. If the frequency of a candidate token is less than the set frequency threshold F, its mutual information value MI is examined: if MI is greater than the set mutual information threshold I, the candidate token is added to the candidate phrase set; otherwise it is discarded.
As a further improvement of the present invention, step S3 specifically comprises:
S31, calculating the saliency score of each candidate phrase in the candidate phrase set within the patent document that contains it, to characterize the saliency of the candidate phrase in that patent document;
S32, calculating the uniqueness score of each candidate phrase in the candidate phrase set within the patent document that contains it, to characterize the uniqueness of the candidate phrase in that patent document;
S33, based on an optimization selection method and the saliency score and uniqueness score of each candidate phrase in the candidate phrase set, extracting the important phrase set S of the target patent and the comparison patent, the important phrase set S including a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.
As a further improvement of the present invention, step S33 specifically comprises: defining K as the threshold on the number of important phrases in an important phrase set, establishing an optimization objective that uses the saliency scores and uniqueness scores of the candidate phrases in the candidate phrase set as the extraction criteria, and obtaining the important phrase set of the target patent and the comparison patent from the optimization objective. The important phrase set includes the target patent important phrase set, which contains K important phrases related to the target patent, and the comparison patent important phrase set, which contains K important phrases related to the comparison patent.
As a further improvement of the present invention, step S4 specifically comprises:
S41, constructing the important phrase-patent document bipartite graph;
S42, calculating, in the important phrase-patent document bipartite graph, the relevance between each important phrase and the target patent and the relevance between each important phrase and the comparison patent;
S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between each important phrase and the target patent and comparison patent;
S44, calculating, in the important phrase-patent document bipartite graph, the difference score between each important phrase and the target patent and comparison patent.
As a further improvement of the present invention, step S5 specifically comprises:
S51, based on the optimization objective method and the similarity scores between the important phrases in the important phrase set S and the target patent and comparison patent, obtaining the similar phrase set C of the target patent and the comparison patent;
S52, based on the optimization objective method and the difference scores between the important phrases in the important phrase set S and the target patent and comparison patent, obtaining the target patent difference phrase set and the comparison patent difference phrase set.
As a further improvement of the present invention, step S51 specifically comprises: defining an optimization objective and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, while the similarity constraints ensure that the similarity score of each extracted similar phrase is greater than both the average similarity score of the target patent important phrase set and the average similarity score of the comparison patent important phrase set.
As a further improvement of the present invention, step S52 specifically comprises: defining an optimization objective and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, while the difference constraints ensure that the difference score of each extracted difference phrase is greater than both the average difference score of the target patent important phrase set and the average difference score of the comparison patent important phrase set, and that the similar phrase set C of the target patent and the comparison patent has no intersection with the target patent difference phrase set or the comparison patent difference phrase set.
The beneficial effects of the present invention are as follows: the patent comparative analysis method of the present invention establishes a patent database using web crawler technology, builds the candidate phrase set of the patent document set D based on word segmentation, extracts the important phrase set S based on an optimization method, calculates the similarity scores and difference scores between the important phrases and the target patent and comparison patent, and extracts the similar phrase set and difference phrase sets of the target patent and the comparison patent based on an optimization method, thereby realizing patent comparative analysis quickly and effectively.
Brief description of the drawings
Fig. 1 is a structural and functional diagram of the patent comparative analysis method of the present invention.
Fig. 2 is a flowchart of the patent comparative analysis method of the present invention.
Fig. 3 is a schematic structural diagram of the important phrase-patent document bipartite graph in Fig. 2.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1 in combination with Fig. 2, the present invention discloses a patent comparative analysis method comprising the following steps:
S1, establishing a patent database based on a web crawler method;
S2, extracting a patent document set D on a target topic from the patent database and establishing a candidate phrase set P of the patent document set D, wherein the patent document set D includes at least one target patent d_c and at least one comparison patent d'_c;
S3, based on an optimization selection model, extracting from the candidate phrase set P the important phrase set S of the target patent d_c and the comparison patent d'_c, where S includes the target patent important phrase set and the comparison patent important phrase set, that is, S is the union of the two;
S4, establishing a relevance measure on an important phrase-patent document bipartite graph, and calculating the similarity scores and difference scores of the important phrases in the important phrase set S with respect to the target patent d_c and with respect to the comparison patent d'_c;
S5, extracting the similar phrase set and the difference phrase sets of the target patent d_c and the comparison patent d'_c based on an optimization objective method.
The following description elaborates only on steps S1 to S5.
Step S1 specifically: a patent database is established using a web crawler method. A web crawler is an efficient information collection tool that can acquire data resources of many kinds quickly and accurately. Prior-art crawlers, however, are easily blocked when a website applies an "anti-crawling" policy, which severely limits the number of requests that one IP address or one account can make within a period of time. For this reason, the patent comparative analysis method of the present invention builds a crawler disguise module by maintaining a proxy IP pool and a Cookies pool, uses a distributed crawler framework to build multiple crawler modules, opens multiple crawler threads to crawl the target patent websites simultaneously, and obtains patent information with the request library and the bs4 web page parsing package, so that reasonable database tables can be constructed according to the composition of the patent information obtained and used to store the crawled patent information.
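The disguise module described above can be pictured with a minimal sketch. It only illustrates rotating through a proxy pool and a cookie pool while several threads fetch pages; the pool contents, the function names, and the stand-in `fake_fetch` (used in place of a real HTTP call so the sketch needs no network) are all assumptions, not details from the patent.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools; real entries would come from a maintained proxy/cookie service.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
COOKIES = [{"session": "a"}, {"session": "b"}, {"session": "c"}]

def make_identities(proxies, cookies):
    """Return an endless iterator of (proxy, cookie) pairs, rotating both pools."""
    return zip(itertools.cycle(proxies), itertools.cycle(cookies))

def fake_fetch(url, proxy, cookie):
    """Stand-in for a real HTTP request made through the proxy with the cookie."""
    return {"url": url, "proxy": proxy, "session": cookie["session"]}

def crawl(urls, fetch=fake_fetch, workers=4):
    """Fetch all URLs concurrently, giving each request its own identity."""
    identities = make_identities(PROXIES, COOKIES)
    # Assign identities up front so worker threads never share the iterator.
    jobs = [(url, *next(identities)) for url in urls]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: fetch(*job), jobs))
```

In a real crawler the `fetch` argument would wrap an HTTP client and an HTML parser; keeping it injectable makes the rotation logic testable on its own.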
Further, the patent information crawled by the web crawler method includes the following fields: patent name, application number, application date, publication number, publication date, applicant, inventor, applicant address, IPC code, abstract, keywords, CPC classification number, applicant postcode, agency, agent, claims, description, drawings, PDF full text, legal status effective date, legal status meaning, related application number, related patent publication number, related patent title, and so on. The patent information is stored in the patent database according to the table structure, so that the content of the patent database is comprehensive and stable.
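As an illustration of storing such records, the following sketch uses SQLite with a small subset of the listed fields. The table name, column names, and helper functions are hypothetical; the patent does not specify a database engine or schema.

```python
import sqlite3

# Column names are illustrative English renderings of a few of the listed fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS patent (
    application_number TEXT PRIMARY KEY,
    title              TEXT NOT NULL,
    application_date   TEXT,
    publication_number TEXT,
    applicant          TEXT,
    inventor           TEXT,
    ipc_code           TEXT,
    abstract           TEXT,
    claims             TEXT,
    description        TEXT,
    legal_status       TEXT
)
"""

def store_patents(conn, records):
    """Insert crawled records; a re-crawled patent overwrites its old row."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO patent "
        "(application_number, title, ipc_code, abstract) "
        "VALUES (:application_number, :title, :ipc_code, :abstract)",
        records,
    )
    conn.commit()

def select_by_ipc(conn, prefix):
    """Return application numbers whose IPC code starts with the given prefix."""
    rows = conn.execute(
        "SELECT application_number FROM patent "
        "WHERE ipc_code LIKE ? ORDER BY application_number",
        (prefix + "%",),
    )
    return [row[0] for row in rows]
```

A query such as `select_by_ipc(conn, "G06F")` also hints at how a topic-specific document set could later be drawn from the database by IPC code.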
Step S2 specifically comprises:
S21, extracting the patent document set D on the target topic from the patent database;
S22, performing word segmentation on the patent documents in the patent document set D to obtain the token set of D, the token set including several tokens;
S23, establishing a stop-word list, and screening and filtering the tokens in the token set according to the stop-word list to obtain the effective token set of the patent document set D;
S24, calculating the mutual information value MI of the tokens in the effective token set, so as to extract the candidate phrase set P of the patent document set D from the effective token set.
In step S21, the patent document set D on the target topic is extracted from the patent database mainly by screening IPC codes or setting keywords. In the present invention, D = {d_1, d_2, ..., d_n}, where n is the number of patent documents in D. Any patent document d mainly includes the application number, application date, applicant, address, inventor, patent agency, IPC code, summary of the invention, and so on. A target patent d_c and a comparison patent d'_c are defined, where d_c, d'_c ∈ D and d_c ≠ d'_c.
Owing to the professional drafting requirements of patents, a patent document d is generally lengthy, linguistically complex, and full of interfering words, so analyzing patent documents directly would introduce large errors into the comparative analysis results. In steps S22 to S24 of the present invention, the patent documents d in the patent document set D are therefore processed with natural language processing to establish the candidate phrase set P of the patent document set D on the target topic. The following description takes a patent document d written in Chinese as an example.
In step S22, because Chinese text has rich unstructured sentence forms and the word sequence of a sentence has no obvious rules or boundaries, the Chinese text of the patent document d must be segmented into words. Preferably, in this embodiment, a general-purpose Chinese word segmentation tool can be used to segment the patent documents d and obtain the token set of the patent document set D, the token set including several tokens.
In step S23, stop words are defined. Stop words are words with no practical meaning, including function words and conjunctions, for example the Chinese equivalents of "is" and "and". A stop-word list is established, and the tokens in the token set are screened and filtered according to the stop-word list to obtain the effective token set of the patent document set D.
Traditional phrase selection methods consider only the frequency of a token, so some tokens that occur infrequently but carry rich semantic features are ignored. To prevent this, step S24 calculates the mutual information value MI of the candidate tokens in the effective token set and uses it to extract the candidate phrase set P of the patent document set D from the effective token set, where P = {p_1, p_2, ..., p_m}, p is a candidate phrase, and m is the number of candidate phrases p in P.
Specifically, in step S24 the token frequency threshold is defined as F and the token mutual information threshold as I. The mutual information value MI is calculated from the joint and marginal distributions of the candidate tokens, where X and Y are two candidate tokens in the effective token set, p(x, y) is the joint distribution of X and Y, p(x) is the marginal distribution of X, and p(y) is the marginal distribution of Y.
If the frequency of a candidate token is greater than the set frequency threshold F, the candidate token is added to the candidate phrase set P. If the frequency of a candidate token is less than the set frequency threshold F, its mutual information value MI in the corresponding patent document d is examined: if MI is greater than the set mutual information threshold I, the candidate token is added to the candidate phrase set P; otherwise it is discarded.
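The two-stage test of step S24 can be sketched as follows. The sketch makes several assumptions that the text does not confirm: candidates are adjacent token pairs, probabilities are simple relative frequencies, and the mutual information is taken in its pointwise form log(p(x, y) / (p(x) p(y))). The function and parameter names are hypothetical, and a frequency exactly equal to F, which the text leaves unspecified, falls through to the MI test here.

```python
import math
from collections import Counter

def candidate_phrases(tokens, freq_threshold, mi_threshold):
    """Select adjacent token pairs by frequency first, then by mutual information."""
    pairs = list(zip(tokens, tokens[1:]))
    unigram = Counter(tokens)
    bigram = Counter(pairs)
    n_tokens, n_pairs = len(tokens), len(pairs)
    selected = []
    for (x, y), count in bigram.items():
        if count > freq_threshold:
            # Frequent pairs are accepted without checking MI.
            selected.append((x, y))
            continue
        # Pointwise mutual information of the pair (assumed form).
        p_xy = count / n_pairs
        p_x = unigram[x] / n_tokens
        p_y = unigram[y] / n_tokens
        if math.log(p_xy / (p_x * p_y)) > mi_threshold:
            selected.append((x, y))
    return selected
```

The second branch is what rescues rare but strongly associated pairs, which is the stated motivation for using MI alongside frequency.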
Step S3 specifically comprises:
S31, calculating the saliency score of each candidate phrase p in the candidate phrase set P within the patent document d that contains it, to characterize the saliency of the candidate phrase p in that patent document;
S32, calculating the uniqueness score of each candidate phrase p in the candidate phrase set P within the patent document d that contains it, to characterize the uniqueness of the candidate phrase p in that patent document;
S33, based on an optimization selection method and the saliency score and uniqueness score of each candidate phrase p in the candidate phrase set P, extracting the important phrase set S of the target patent d_c and the comparison patent d'_c, the important phrase set S including the target patent important phrase set and the comparison patent important phrase set.
After the candidate phrase set P has been extracted for the entire patent document set D, each patent document d can be regarded as being composed of several candidate phrases p. In practice, however, most candidate phrases p cannot represent the patent document d, so the candidate phrase set P of D must be processed further in order to better characterize each patent document d in D.
Specifically, step S31 calculates the saliency score r_{p,d} of each candidate phrase p in the candidate phrase set P within the patent document d that contains it, to characterize the saliency of p in that document. If a candidate phrase p occurs frequently in the patent document d that contains it and infrequently in the other patent documents of the patent document set D, then p has strong saliency with respect to d. The saliency of a single candidate phrase p with respect to its patent document d can therefore be expressed by the saliency score r_{p,d}.
In the present invention, r_{p,d} is defined in terms of the following quantities: P_d denotes the set of all candidate phrases p of the patent document d, n(p, d) denotes the frequency with which the candidate phrase p occurs in the patent document d that contains it, and n(p, D) denotes the frequency with which p occurs in the patent document set D.
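To make the intuition concrete, here is a sketch of one possible saliency score built only from the quantities named above. The specific combination, the phrase's share of all phrase occurrences in the document multiplied by the fraction of its corpus occurrences that fall in this document, is an assumption chosen for illustration and should not be read as the patent's formula.

```python
def saliency(phrase, doc_counts, corpus_counts):
    """Assumed saliency: (n(p,d) / sum over P_d of n(q,d)) * (n(p,d) / n(p,D)).

    doc_counts maps each phrase in P_d to n(p, d);
    corpus_counts maps each phrase to n(p, D).
    """
    total_in_doc = sum(doc_counts.values())
    share_of_doc = doc_counts[phrase] / total_in_doc
    concentration = doc_counts[phrase] / corpus_counts[phrase]
    return share_of_doc * concentration
```

Under this assumed form, a phrase that is equally frequent inside the document but spread widely across the corpus scores lower than one concentrated in the document, which matches the behavior described in the text.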
Step S32 calculates the uniqueness score of each candidate phrase p in the candidate phrase set P within the patent document d that contains it, to characterize the uniqueness of p in that document. Specifically, an important candidate phrase p needs to be distinct from the other candidate phrases in P, that is, it must have strong uniqueness. The uniqueness of a single candidate phrase p can therefore be calculated from the semantic similarity between candidate phrases.
In step S32, the uniqueness of a single candidate phrase p is obtained from a semantic similarity calculation based on a semantic tree. Semantic similarity is measured using information content; a semantic tree is built from a semantic dictionary, and the semantic similarity Sim(p_i, p_j) between the i-th candidate phrase p_i and the j-th candidate phrase p_j is calculated from the path length between the candidate phrases, so as to characterize the uniqueness of the candidate phrase p.
Further, in step S33, the important phrase set S of the target patent d_c and the comparison patent d'_c is extracted based on an optimization selection method and the saliency score r_{p,d} and uniqueness score of each candidate phrase p in the candidate phrase set P. In the present invention, K is defined as the threshold on the number of important phrases p' in the important phrase set S, and an optimization objective is established that uses the saliency and uniqueness of the candidate phrases p in P as the extraction criteria.
The optimization objective contains two terms. The first is the sum of the saliency scores of all important phrases p' in an important phrase set. The second is the sum of the combined similarity scores of all important phrases p' in that set. Because a candidate phrase p that is more similar to other candidate phrases is less unique, the sum of the combined similarity scores acts as a penalty term in the objective. μ and λ are the weights of the score terms. With this arrangement, the target patent important phrase set and the comparison patent important phrase set of d_c and d'_c can be extracted.
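One simple way to approximate an objective of this shape is a greedy selection, sketched below. It assumes the objective is "μ times total saliency minus λ times total pairwise similarity" over a set of exactly K phrases; both that exact form and the greedy strategy are illustrative assumptions, and the names `select_important`, `mu`, and `lam` are hypothetical.

```python
def select_important(saliency, similarity, k, mu=1.0, lam=1.0):
    """Greedily pick k phrases with high saliency and low mutual similarity.

    saliency maps phrase -> saliency score;
    similarity(p, q) returns the semantic similarity of two phrases.
    """
    chosen = []
    remaining = set(saliency)
    while remaining and len(chosen) < k:
        def gain(p):
            # Marginal change in the assumed objective if p is added now.
            penalty = sum(similarity(p, q) for q in chosen)
            return mu * saliency[p] - lam * penalty
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Greedy selection does not guarantee the optimum of the objective, but it shows how the penalty term steers the choice away from near-duplicate phrases.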
Step S4 specifically comprises:
S41, constructing the important phrase-patent document bipartite graph;
S42, calculating, in the important phrase-patent document bipartite graph, the relevance between the important phrase p' and the target patent d_c and between p' and the comparison patent d'_c;
S43, calculating, in the important phrase-patent document bipartite graph, the similarity score between the important phrase p' and the target patent d_c and comparison patent d'_c;
S44, calculating, in the important phrase-patent document bipartite graph, the difference score between the important phrase p' and the target patent d_c and comparison patent d'_c.
In step S41, the important phrase-patent document bipartite graph characterizes the correlation between the important phrase set S and the patent document set D (see Fig. 3). Important phrase p' nodes are connected to patent document d nodes by edges, and the edge weights can be obtained by BM25 relevance calculation.
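The edge weights of such a bipartite graph could be computed along the following lines. The sketch uses the common Okapi BM25 form with a "+1" inside the IDF logarithm, treats each phrase as a single query term, and measures document length in phrase occurrences; the parameter values k1 = 1.5 and b = 0.75 and all names are assumptions, since the patent only states that BM25 is used.

```python
import math

def bm25_weights(doc_phrase_counts, k1=1.5, b=0.75):
    """Return {(phrase, doc_id): weight} for every phrase occurring in a document.

    doc_phrase_counts maps doc_id -> {phrase: occurrence count in that document}.
    """
    n_docs = len(doc_phrase_counts)
    lengths = {d: sum(c.values()) for d, c in doc_phrase_counts.items()}
    avg_len = sum(lengths.values()) / n_docs
    # Number of documents containing each phrase.
    doc_freq = {}
    for counts in doc_phrase_counts.values():
        for phrase in counts:
            doc_freq[phrase] = doc_freq.get(phrase, 0) + 1
    edges = {}
    for d, counts in doc_phrase_counts.items():
        norm = k1 * (1 - b + b * lengths[d] / avg_len)
        for phrase, tf in counts.items():
            df = doc_freq[phrase]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            edges[(phrase, d)] = idf * tf * (k1 + 1) / (tf + norm)
    return edges
```

Only phrase-document pairs with at least one occurrence receive an edge, so the returned dictionary is itself a sparse representation of the bipartite graph.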
Further, in step S42, the SimRank algorithm, which is based on random walks, can be used to calculate, in the important phrase-patent document bipartite graph, the relevance f(p', d_c) between the important phrase p' and the target patent d_c and the relevance f(p', d'_c) between p' and the comparison patent d'_c.
Step S43 calculates the similarity score Φ(p', d_c, d'_c) between the important phrase p' and the target patent d_c and comparison patent d'_c:
Φ(p', d_c, d'_c) = ln(1 + f(p', d_c) · f(p', d'_c))
where f(p', d_c) is the relevance between the important phrase p' and the target patent d_c, and f(p', d'_c) is the relevance between p' and the comparison patent d'_c.
In fact, when an important phrase p' is highly relevant to both the target patent d_c and the comparison patent d'_c, the phrase has strong importance in the important phrase set S. Therefore, for a given important phrase p', the larger its relevance f(p', d_c) to the target patent and its relevance f(p', d'_c) to the comparison patent, the higher the similarity score Φ(p', d_c, d'_c). In calculating Φ(p', d_c, d'_c), the present invention takes the logarithm of one plus the product of the two relevances, so that both f(p', d_c) and f(p', d'_c) are taken into account and the relationship of p' to d_c and d'_c is better characterized.
Step S44 calculates the difference score Ω(p', d_c|d'_c) between the important phrase p' and the target patent d_c and comparison patent d'_c.
The difference score includes a smoothing parameter γ, which guards against the case where the relevance f(p', d_c) between p' and the target patent or the relevance f(p', d'_c) between p' and the comparison patent tends to 0.
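The two scores can be sketched side by side. The similarity score follows the formula stated in the text. The difference score, whose formula is not available here, is written as a smoothed log-ratio of the two relevances; that form and the default γ = 1 are assumptions made purely for illustration.

```python
import math

def similarity_score(f_target, f_comparison):
    """Phi(p', d_c, d'_c) = ln(1 + f(p', d_c) * f(p', d'_c)), as stated in the text."""
    return math.log(1 + f_target * f_comparison)

def difference_score(f_own, f_other, gamma=1.0):
    """Assumed smoothed log-ratio ln((f_own + gamma) / (f_other + gamma)).

    gamma > 0 keeps the ratio finite when either relevance is near 0.
    This is an illustrative stand-in, not the patent's own formula.
    """
    return math.log((f_own + gamma) / (f_other + gamma))
```

With these definitions, a phrase relevant to only one patent gets a similarity score of zero and a clearly positive difference score for that patent, while a phrase equally relevant to both gets a difference score of zero.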
Specifically, when calculating the difference score Ω(p', d_c|d'_c), an important phrase p' should be highly relevant to one of the target patent d_c and the comparison patent d'_c and only weakly relevant to the other, and it should also have high importance in the important phrase set S. For an important phrase p' of the target patent d_c, two situations therefore arise.
One: if p' is very highly relevant to d_c and its relevance to d'_c is relatively low, the difference score Ω(p', d_c|d'_c) of p' is high; likewise, if p' is relatively highly relevant to d_c and its relevance to d'_c is very low, the difference score is also high.
Two: if p' is not significantly similar to the target patent d_c but differs significantly from the comparison patent d'_c, it can also serve as a difference phrase, reflecting the difference between d_c and d'_c. Likewise, when p' is not significantly similar to the comparison patent d'_c but differs significantly from the target patent d_c, it can also serve as a difference phrase, reflecting the difference between d_c and d'_c.
Step S5 specifically comprises:
S51, based on the optimization objective method and the similarity scores Φ(p', d_c, d'_c) between the important phrases p' in the important phrase set S and the target patent d_c and comparison patent d'_c, obtaining the similar phrase set C of d_c and d'_c;
S52, based on the optimization objective method and the difference scores Ω(p', d_c|d'_c) between the important phrases p' in the important phrase set S and the target patent d_c and comparison patent d'_c, obtaining the target patent difference phrase set Q and the comparison patent difference phrase set Q'.
Step S51 specifically: define optimum target and at least two similarity constraint conditions, and in step S51 most Optimization aim are as follows:
Wherein, piFor i-th of similar phrase in similar phrase book C;For target patent important phrases collection, andTo compare patent important phrases collection, andFor decision variable, xi=0 or 1 Indicate whether i-th of phrase to be selected is similar phrase, xi=1 indicates to be similar phrase, xi=0 indicates not being similar phrase.
Further, the purpose for defining optimum target is so that similar phrase p in similar phrase book CsSimilitude Fractions phi (ps, dc, d 'c) the sum of maximize, and the similar phrase p that extracts is guaranteed by similarity constraint conditionsSimilitude Fractions phi (ps, dc, d 'c) it is respectively greater than target patent important phrases collectionSimilarity scores Φ (p ', dc, d 'c) average value With comparison patent important phrases collectionSimilarity scores Φ (p ', dc, d 'c) average value, to limit similar phrase book C's Scale.
It should be noted that the present invention is described with two similarity constraints only as an example; in other embodiments of the invention, a different number of similarity constraints may of course be set.
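As a rough illustration of step S51 with exactly two similarity constraints, the Python sketch below selects every important phrase whose similarity score exceeds both set averages. It assumes that the scores Φ(p′, dc, d′c) are already computed and non-negative, in which case picking all feasible phrases maximizes the sum. The function name, argument names, and the dictionary layout are hypothetical and do not come from the patent.

```python
from statistics import mean


def select_similar_phrases(scores, target_phrases, comparison_phrases):
    """Return the similar phrase set C as a list, highest score first.

    scores: dict mapping phrase -> similarity score Phi (assumed >= 0).
    target_phrases / comparison_phrases: the two important phrase sets.
    """
    # Average similarity score over each important phrase set.
    target_avg = mean(scores[p] for p in target_phrases)
    comparison_avg = mean(scores[p] for p in comparison_phrases)

    # A phrase must beat BOTH averages, i.e. the larger of the two.
    threshold = max(target_avg, comparison_avg)

    # With non-negative scores, selecting every feasible phrase
    # maximizes the sum of selected scores.
    candidates = set(target_phrases) | set(comparison_phrases)
    selected = [p for p in candidates if scores[p] > threshold]
    return sorted(selected, key=lambda p: (-scores[p], p))


if __name__ == "__main__":
    demo_scores = {"neural network": 0.9, "gradient": 0.8,
                   "antenna": 0.1, "router": 0.2}
    print(select_similar_phrases(
        demo_scores,
        ["neural network", "gradient", "antenna"],
        ["neural network", "gradient", "router"],
    ))
```

If additional constraints were added, as the text allows, a closed-form threshold like this might no longer be enough and a 0-1 integer programming solver would be the more general tool.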
Step S52 specifically: define an optimization objective and at least three difference constraints. The constraints of step S52 include:
C ∩ Q = C ∩ Q′ = ∅
Here, Q is the target patent difference phrase set; Q′ is the comparison patent difference phrase set; yi and y′i are 0-1 decision variables, where yi indicates whether a candidate phrase in the target patent dc is a difference phrase, and y′i indicates whether a candidate phrase in the comparison patent d′c is a difference phrase.
Specifically, the optimization objective in step S52 is established so as to maximize the sum of the difference scores in the target patent difference phrase set Q and the comparison patent difference phrase set Q′. On the one hand, the difference constraints ensure that the difference score Ω(pi, dc|d′c) of each extracted difference phrase pi is greater than the average difference score Ω(p′, dc|d′c) over the target patent important phrase set and the average difference score Ω(p′, dc|d′c) over the comparison patent important phrase set, respectively. On the other hand, they ensure that the similar phrase set C of the target patent dc and the comparison patent d′c, the target patent difference phrase set Q, and the comparison patent difference phrase set Q′ have no intersection with one another.
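The selection in step S52 can be sketched in Python along similar lines. This is only one reading of the constraints: it assumes precomputed, non-negative difference scores, requires each selected phrase to exceed both set averages, and resolves a phrase that qualifies on both sides by keeping it where its score is higher. That tie-breaking rule, and all names used, are assumptions made for illustration rather than details given in the patent.

```python
from statistics import mean


def select_difference_phrases(target_scores, comparison_scores, similar_phrases):
    """Return (Q, Q_prime), two disjoint sets that also exclude C.

    target_scores: phrase -> difference score for target patent phrases.
    comparison_scores: phrase -> difference score for comparison patent phrases.
    similar_phrases: the similar phrase set C obtained in step S51.
    """
    # Threshold: a phrase must exceed the average of both phrase sets.
    threshold = max(mean(target_scores.values()),
                    mean(comparison_scores.values()))
    excluded = set(similar_phrases)

    q_target = {p for p, s in target_scores.items()
                if s > threshold and p not in excluded}
    q_comparison = {p for p, s in comparison_scores.items()
                    if s > threshold and p not in excluded}

    # Enforce Q and Q' being disjoint (assumed tie-break: higher score wins).
    for phrase in q_target & q_comparison:
        if target_scores[phrase] >= comparison_scores[phrase]:
            q_comparison.discard(phrase)
        else:
            q_target.discard(phrase)
    return q_target, q_comparison


if __name__ == "__main__":
    q, q_prime = select_difference_phrases(
        {"antenna": 0.9, "gradient": 0.2, "neural network": 0.7},
        {"router": 0.8, "gradient": 0.1, "neural network": 0.3},
        {"neural network"},
    )
    print(q, q_prime)
```

Excluding C up front directly enforces C ∩ Q = C ∩ Q′ = ∅, so a phrase already judged similar is never reported as a difference.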
In conclusion Patent Reference's analysis method of the invention, by using web crawlers technology establish patent database, The candidate phrase collection P of patent file collection D is established based on participle technique, important phrases collection S is extracted based on optimal method, calculates weight Want phrase p ' and target patent dcWith comparison patent d 'cSimilarity scores Φ (p ', dc|d′c) and otherness score Ω (p ', dc| dc) and based on optimal method extraction target patent dcWith comparison patent d 'cSimilar phrase book and difference phrase book, quickly, Have effectively achieved the comparative analysis of patent.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its spirit and scope.

Claims (10)

1. A patent comparative analysis method, characterized by comprising the following steps:
S1: establishing a patent database based on a web crawler method;
S2: extracting a patent document set D on a target topic from the patent database, and establishing a candidate phrase set of the patent document set D, wherein the patent document set D includes at least one target patent and at least one comparison patent;
S3: based on an optimization selection model, extracting from the candidate phrase set an important phrase set of the target patent and the comparison patent, the important phrase set including a target patent important phrase set and a comparison patent important phrase set;
S4: establishing an important phrase-patent document bipartite graph relevance measure, and calculating the similarity score and difference score between the important phrases in the important phrase set and the target patent, and the similarity score and difference score between the important phrases and the comparison patent;
S5: extracting the similar phrase set and the difference phrase sets of the target patent and the comparison patent, respectively, based on the optimization-objective method.
2. The patent comparative analysis method according to claim 1, characterized in that step S1 specifically comprises: selecting multiple target patent websites; constructing multiple crawler modules using a distributed crawler framework; starting multiple crawler threads to crawl the target patent websites simultaneously; and, according to the composition of the crawled patent information, creating database tables to store the crawled patent information, thereby building the patent database.
3. The patent comparative analysis method according to claim 1, characterized in that step S2 specifically includes:
S21: extracting the patent document set D on the target topic from the patent database;
S22: performing word segmentation on the patent documents in the patent document set D to obtain the token set of the patent document set D, the token set including a number of tokens;
S23: establishing a stop-word list, and screening and filtering the tokens in the token set according to the stop-word list to obtain the effective token set of the patent document set D;
S24: calculating the mutual information value MI of the tokens in the effective token set, so as to extract the candidate phrase set of the patent document set D from the effective token set.
4. The patent comparative analysis method according to claim 3, characterized in that step S24 specifically comprises: defining a token frequency threshold F and a token mutual information threshold I; calculating the joint distribution and the marginal distributions of the candidate tokens in the effective token set, so as to obtain the mutual information value MI of each candidate token; if the frequency of a candidate token is greater than the set frequency threshold F, adding the candidate token to the candidate phrase set; if the frequency of the candidate token is less than the set frequency threshold F, examining its mutual information value MI, and if the MI is greater than the set mutual information threshold I, adding the candidate token to the candidate phrase set; otherwise, discarding the candidate token.
5. The patent comparative analysis method according to claim 1, characterized in that step S3 specifically comprises:
S31: calculating the salience score of each candidate phrase in the candidate phrase set within the patent document in which it occurs, to characterize the salience of the candidate phrase in that patent document;
S32: calculating the uniqueness score of each candidate phrase in the candidate phrase set within the patent document in which it occurs, to characterize the uniqueness of the candidate phrase in that patent document;
S33: based on an optimization selection method, and using the salience score and uniqueness score of each candidate phrase in the candidate phrase set, extracting the important phrase set S of the target patent and the comparison patent, the important phrase set S including a target patent important phrase set related to the target patent and a comparison patent important phrase set related to the comparison patent.
6. The patent comparative analysis method according to claim 5, characterized in that step S33 specifically comprises: defining K as the threshold on the number of important phrases in an important phrase set; establishing an optimization objective with the salience score and uniqueness score of the candidate phrases in the candidate phrase set as the extraction criteria; and obtaining, through the optimization objective, the important phrase set of the target patent and the comparison patent, the important phrase set including a target patent important phrase set and a comparison patent important phrase set, wherein the target patent important phrase set includes K important phrases related to the target patent, and the comparison patent important phrase set includes K important phrases related to the comparison patent.
7. The patent comparative analysis method according to claim 1, characterized in that step S4 specifically includes:
S41: constructing an important phrase-patent document bipartite graph;
S42: calculating, in the important phrase-patent document bipartite graph, the relevance between the important phrases and the target patent and the relevance between the important phrases and the comparison patent;
S43: calculating, in the important phrase-patent document bipartite graph, the similarity score between the important phrases and the target patent and comparison patent;
S44: calculating, in the important phrase-patent document bipartite graph, the difference score between the important phrases and the target patent and comparison patent.
8. The patent comparative analysis method according to claim 7, characterized in that step S5 specifically includes:
S51: based on the optimization-objective method, and using the similarity scores between the important phrases in the important phrase set S and the target patent and comparison patent, obtaining the similar phrase set C of the target patent and the comparison patent;
S52: based on the optimization-objective method, and using the difference scores between the important phrases in the important phrase set S and the target patent and comparison patent, obtaining the target patent difference phrase set and the comparison patent difference phrase set.
9. The patent comparative analysis method according to claim 8, characterized in that step S51 specifically comprises: defining an optimization objective and at least two similarity constraints, so that the sum of the similarity scores of the similar phrases in the similar phrase set C is maximized, the similarity constraints ensuring that the similarity score of each extracted similar phrase is greater than both the average similarity score of the target patent important phrase set and the average similarity score of the comparison patent important phrase set.
10. The patent comparative analysis method according to claim 8, characterized in that step S52 specifically comprises: defining an optimization objective and at least three difference constraints, so that the sum of the difference scores of the difference phrases in the target patent difference phrase set and the comparison patent difference phrase set is maximized, the difference constraints ensuring that the difference score of each extracted difference phrase is greater than the average difference score of the target patent important phrase set and the average difference score of the comparison patent important phrase set, respectively, and that the similar phrase set C of the target patent and the comparison patent, the target patent difference phrase set, and the comparison patent difference phrase set have no intersection with one another.
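The frequency and mutual-information rule of claim 4 can be illustrated with a short Python sketch. It makes several assumptions the claims do not state: a candidate is taken to be a pair of adjacent tokens, the mutual information value is computed as pointwise mutual information from the pair's joint probability and the two tokens' marginal probabilities, and a candidate whose frequency equals F is handled by the mutual-information branch. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def extract_candidate_phrases(token_lists, freq_threshold, mi_threshold):
    """Return the set of adjacent-token pairs kept as candidate phrases.

    token_lists: one list of effective tokens per patent document.
    freq_threshold: frequency threshold F.
    mi_threshold: mutual information threshold I.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    candidates = set()
    for (first, second), count in bigrams.items():
        # Frequent candidates are accepted directly.
        if count > freq_threshold:
            candidates.add((first, second))
            continue
        # Otherwise fall back to the mutual information test.
        joint = count / total_bigrams
        marginal = (unigrams[first] / total_unigrams) * \
                   (unigrams[second] / total_unigrams)
        pmi = math.log(joint / marginal)
        if pmi > mi_threshold:
            candidates.add((first, second))
    return candidates


if __name__ == "__main__":
    docs = [
        ["neural", "network", "model"],
        ["neural", "network", "training"],
        ["neural", "network", "layer"],
        ["deep", "model"],
    ]
    print(extract_candidate_phrases(docs, freq_threshold=2, mi_threshold=2.0))
```

The two-stage test keeps frequent pairs regardless of association strength, while rare pairs survive only when their tokens co-occur far more often than chance would predict.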
CN201910067706.2A 2019-01-24 2019-01-24 Patent comparative analysis method Active CN109903198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910067706.2A CN109903198B (en) 2019-01-24 2019-01-24 Patent comparative analysis method

Publications (2)

Publication Number Publication Date
CN109903198A true CN109903198A (en) 2019-06-18
CN109903198B CN109903198B (en) 2022-08-30

Family

ID=66944149

Country Status (1)

Country Link
CN (1) CN109903198B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547739A (en) * 2016-11-03 2017-03-29 同济大学 A kind of text semantic similarity analysis method
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291816A (en) * 2020-02-17 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for carrying out feature processing aiming at user classification model
CN111552783A (en) * 2020-04-30 2020-08-18 深圳前海微众银行股份有限公司 Content analysis query method, device, equipment and computer storage medium
CN117112735A (en) * 2023-10-19 2023-11-24 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment
CN117112735B (en) * 2023-10-19 2024-02-13 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment

Also Published As

Publication number Publication date
CN109903198B (en) 2022-08-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant