CN105224520B

CN105224520B - A method for automatic term recognition in Chinese patent documents

Info

Publication number: CN105224520B
Application number: CN201510623936.4A
Authority: CN
Inventors: 吕学强; 董志安
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2015-09-28
Filing date: 2015-09-28
Publication date: 2018-03-13
Anticipated expiration: 2035-09-28
Also published as: CN105224520A

Abstract

The present invention relates to a method for automatic term recognition in Chinese patent documents, comprising the following steps: Step 1): automatically generating part-of-speech rules from patent titles; Step 2): manually building a stop-word list; Step 3): classifying the generated part-of-speech rules by the number of parts of speech they contain; Step 4): ranking the candidate terms with the TermRank ranking algorithm. The invention first uses a statistical method to learn automatically, from patent titles, the part-of-speech rules that make up terms, which overcomes the shortcomings of summarizing such rules by hand. Ranking the candidate terms with TermRank takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms better, is more reliable, and meets the needs of practical application well.

Description

A method for automatic term recognition in Chinese patent documents

Technical field

The invention belongs to the technical field of automatic recognition of Chinese terminology, and in particular relates to a method for automatic term recognition in Chinese patent documents.

Background art

Chinese patent documents contain a large number of domain terms, and recognizing these terms automatically is an important task in fields such as information extraction and text mining. Automatic term recognition (ATR) is an important component of information extraction research. It refers to the process of automatically identifying, from free text and with no manual intervention or as little as possible, the word strings that represent general concepts in a particular professional domain. A term bank built with automatic term recognition is a very important basic data resource: it provides indispensable data support for Chinese word segmentation, ontology construction, dictionary compilation and updating, automatic indexing, information retrieval, machine translation and similar tasks. In addition, with the rapid development of information technology, digital information resources grow by the day, and recognizing the terms in these resources automatically is of great significance for keeping up with the latest developments and future trends of a field.

Chinese patent documents are an important digital information resource. They record the latest research results of every discipline and contain a large number of technical terms. Observation and analysis of Chinese patent documents, together with earlier research, show that terms in patent documents have the following salient characteristics: (1) nested terms are relatively common; (2) terms are strongly domain-specific, i.e. a term that occurs with high frequency in one domain occurs with low frequency, or not at all, in other domains; (3) terms recur, i.e. a term occurs in many documents of the whole patent collection; (4) patent terms are relatively long, generally consisting of 2-5 words; (5) patent terms are mostly composed of nouns or compound nouns. These characteristics are an important basis for automatic term recognition in Chinese patent documents.

At present there are two main kinds of automatic term recognition method:

The first is the traditional approach that combines rules with statistics. To generate the candidate term set, the Chinese text is first segmented and part-of-speech tagged; a set of part-of-speech rules that form terms is summarized by observing the tagged corpus; and these rules are matched against the corpus to generate the candidate term set. Hand-written part-of-speech rules give relatively high recognition precision, but they depend too heavily on the author's linguistic knowledge, and the rules that different people write for the same corpus are not consistent. Some methods need no part-of-speech rules at the candidate-generation stage, but they depend too heavily on external resources when they coarsely segment sentences, and the quality of those resources often determines the quality of the resulting candidate term set. As for ranking the candidate term set, the ranking algorithms in common use perform poorly on terms that are short or that occur with low frequency;

The second is to use machine learning algorithms, which have become a research focus in information extraction in recent years. Their drawbacks are that they place high demands on the scale and quality of the training corpus, require a large amount of manually annotated data, and take a long time to train.

In addition, the mainstream candidate term ranking algorithms currently perform poorly on short terms.

Summary of the invention

In view of the above problems in the prior art, the object of the invention is to provide a method for automatic term recognition in Chinese patent documents that avoids the technical shortcomings described above.

To achieve this object, the invention adopts the following technical solution:

A method for automatic term recognition in Chinese patent documents, comprising the following steps:

Step 1): Part-of-speech rules are generated automatically from patent titles. Each patent title is segmented into substrings and stop words using a Chinese lexical analysis system; with the stop words as separators, the part-of-speech rule of each substring is extracted and used as a part-of-speech rule for generating candidate terms;

Step 2): A stop-word list is built manually, and the stop words are added to it;

Step 3): The generated part-of-speech rules are classified by the number of parts of speech they contain. The rules in each class are sorted in descending order of frequency of occurrence, and only the top 5 rules of each class are applied to the body text of the Chinese patent documents for part-of-speech matching, which generates the candidate term set. The extracted candidate terms are then classified by the number of words they contain;

Step 4): The candidate terms are ranked with the TermRank ranking algorithm, which is defined by formula (1):

$$\mathrm{TR}(T_i)=\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-|T_i|\times\mathrm{count}(T_i) \tag{1}$$

where T_i is a candidate term and TR(T_i) is the TermRank value of T_i; M is the number of patent documents that contain T_i; TF_{T_i}(d_j) is the frequency of T_i in a patent document d_j that contains T_i; C(d_j) is the number of candidate terms extracted from patent document d_j; |T_i| is the length of T_i; and count(T_i) is the number of stop words contained in T_i;

The TermRank value of every candidate term in the candidate term list is calculated according to formula (1); after sorting, the top N entries are taken as the final term list.
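As a minimal sketch of formula (1), the Python function below computes a TermRank value from per-document counts. The function name, the argument names and the dictionary layout are assumptions made for illustration, and |T_i| is assumed here to be the length of the term in words.

```python
def term_rank(term_freq_by_doc, candidates_per_doc, term_length, stop_word_count):
    """Formula (1): sum over documents of TF/C, minus length times stop-word count.

    term_freq_by_doc:   {doc_id: TF of the term in that document}; only the M
                        documents that contain the term are listed
    candidates_per_doc: {doc_id: C(d_j), candidate terms extracted from the document}
    """
    frequency_term = sum(
        tf / candidates_per_doc[doc_id] for doc_id, tf in term_freq_by_doc.items()
    )
    penalty_term = term_length * stop_word_count
    return frequency_term - penalty_term


# Hypothetical counts for one 2-word candidate that occurs in two documents.
tf = {"d1": 4, "d2": 1}
c = {"d1": 8, "d2": 4}
print(term_rank(tf, c, term_length=2, stop_word_count=0))  # 0.75
print(term_rank(tf, c, term_length=2, stop_word_count=1))  # -1.25
```

The second call shows the effect of the subtracted term: a candidate that contains a stop word is pushed down the ranking in proportion to its length.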

Further, step 2) selects the stop words and builds the stop-word list using the following three methods:

Method one: The patent titles are segmented and word frequencies are counted; stop words that occur more than 20 times are added to the stop-word list;

Method two: Parts of speech that clearly do not occur in terms are added to the stop-word list;

Method three: After the patent titles have been filtered with the stop-word list produced by Method one and Method two, the word strings remaining in the titles are inspected manually, and any new stop words that are found are also added to the stop-word list.
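Methods one and two lend themselves to a small helper that proposes stop-word candidates for a person to review. The sketch below is only an illustration under assumptions: the function name, the `(word, tag)` input layout and the example tag set are hypothetical, and the manual inspection of Method three is not automated.

```python
from collections import Counter


def stop_word_candidates(tagged_titles, min_freq=20, stop_pos=frozenset()):
    """Propose stop-word candidates from segmented, POS-tagged patent titles."""
    freq = Counter(word for title in tagged_titles for word, _ in title)
    # Method one: words that occur more than min_freq times in the titles.
    frequent = {word for word, n in freq.items() if n > min_freq}
    # Method two: words whose part of speech is assumed not to occur in terms.
    by_pos = {word for title in tagged_titles for word, pos in title if pos in stop_pos}
    return frequent | by_pos


titles = [
    [("a", "m"), ("kind", "q"), ("pump", "n")],
    [("a", "m"), ("kind", "q"), ("valve", "n"), ("of", "ude1"), ("seal", "n")],
]
# A threshold of 1 is used only because this toy corpus is tiny.
print(sorted(stop_word_candidates(titles, min_freq=1, stop_pos={"ude1"})))
# ['a', 'kind', 'of']
```

Because a frequent word can still be part of a term, the returned set is best treated as a shortlist for manual review and not as the finished stop-word list.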

Further, in step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules.

Further, in step 3), the candidate terms are divided into four classes: 2-word, 3-word, 4-word and 5-word candidate terms.

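Further, in step 4), when the value of M is relatively large or relatively small, the first and second terms of formula (1) are normalized using formula (2) and formula (3) respectively, where formula (2) and formula (3) are:

$$\frac{\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}}{\max\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}} \tag{2}$$

$$\frac{|T_i|\times\mathrm{count}(T_i)-\min\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)}{\max\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)-\min\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)} \tag{3}$$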
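Formulas (2) and (3) apply the same linear rescaling to each of the two terms of formula (1). The following sketch illustrates this under assumptions: the function names are hypothetical, the minimum and maximum are taken over all candidates in one candidate list, the normalized terms are recombined by subtraction as in formula (1), and the handling of a zero range is a choice made here that the text does not specify.

```python
def min_max(values):
    """Rescale values linearly to [0, 1]; an all-equal list maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_term_rank(frequency_terms, penalty_terms):
    """Normalize both terms of formula (1) separately, then subtract as in formula (1)."""
    first = min_max(frequency_terms)   # formula (2)
    second = min_max(penalty_terms)    # formula (3)
    return [f - p for f, p in zip(first, second)]


# Hypothetical values of the two terms for three candidates.
frequency_terms = [0.5, 2.5, 4.5]
penalty_terms = [4, 0, 2]
print(normalized_term_rank(frequency_terms, penalty_terms))  # [-1.0, 0.5, 0.5]
```

After rescaling, both terms lie in the same range, so neither dominates the score merely because of its order of magnitude.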

The method for automatic term recognition in Chinese patent documents provided by the invention first uses a statistical method to learn automatically, from patent titles, the part-of-speech rules that make up terms, which overcomes the shortcomings of summarizing such rules by hand. Ranking the candidate terms with the TermRank ranking method takes both the linguistic and the statistical features of patent documents into account, distinguishes terms from non-terms better, is more reliable, and meets the needs of practical application well.

Brief description of the drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a schematic diagram of the formal representation of a Chinese patent title.

Embodiment

To make the object, technical solution and advantages of the invention clearer, the invention is described further below with reference to the accompanying drawings and a specific embodiment. It should be understood that the specific embodiment described here serves only to explain the invention and is not intended to limit it.

As shown in Fig. 1, a method for automatic term recognition in Chinese patent documents comprises the following steps:

A patent document is usually a record of an invention, a utility model or a design. Its title is a highly condensed summary of the whole document and therefore often names the object being described directly. The title of a patent document contains at least one correct term. Given this property of Chinese patent titles, a title is represented formally as shown in Fig. 2, where w_i (i = 1, 2, …, n) denotes the words into which ICTCLAS segments the patent title; w_1…w_a, w_c…w_d and w_f…w_n are the terms in the title, denoted CT1, CT2 and CT3 respectively; and w_b and w_e are words that are not part of any term, called stop words here. How the stop words are obtained is described below.

With the stop words ST1 and ST2 as separators, the part-of-speech rules of the substrings CT1, CT2 and CT3 are extracted and serve as the part-of-speech rules for generating candidate terms in the next step. For example, the patent title "a/m kind/q electric/b vehicle/n of/ude1 electric/n display/n device/n" contains the terms "electric/b vehicle/n" and "electric/n display/n". Their part-of-speech rules, "b+n" and "n+n", are extracted and added to the part-of-speech rule set, to be used for generating candidate terms in the next step.
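The splitting step can be sketched in a few lines of Python. This is an illustration under assumptions, not the patented implementation: the function name is hypothetical, the title is given as English glosses of the segmented words, and the stop-word set used here (including "device") is assumed only so that the example reproduces the two rules named in the text.

```python
from collections import Counter


def extract_pos_rules(tagged_title, stop_words):
    """Split a tagged title at stop words; return one POS rule per remaining substring."""
    rules, current = [], []
    for word, pos in tagged_title:
        if word in stop_words:
            if current:                      # close the substring before the stop word
                rules.append("+".join(current))
            current = []
        else:
            current.append(pos)
    if current:                              # substring that runs to the end of the title
        rules.append("+".join(current))
    return rules


title = [("a", "m"), ("kind", "q"), ("electric", "b"), ("vehicle", "n"),
         ("of", "ude1"), ("electric", "n"), ("display", "n"), ("device", "n")]
stop_words = {"a", "kind", "of", "device"}   # assumed stop-word list for this example

rules = extract_pos_rules(title, stop_words)
print(rules)            # ['b+n', 'n+n']
print(Counter(rules))   # rule frequencies, accumulated over all titles in practice
```

Counting the rules over every title in the corpus gives the frequencies that the later selection of the most frequent rules relies on.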

Stop words are an important resource for generating part-of-speech rules automatically from patent titles. The invention builds the stop-word list manually instead of using a ready-made general-purpose stop-word list, because some of the stop words in a general-purpose list may well be part of a term in patent documents. For example, "row/v" appears in general-purpose stop-word lists, yet in "fully-automatic/b row/v paper/n machine/ng" it is a component of a term and therefore cannot be added to the stop-word list. Words like "row/v", which appear in general-purpose stop-word lists but are components of terms in Chinese patent documents, occur in large numbers in the corpus.

The stop words are selected, and the stop-word list is built, using the three methods (Method one to Method three) described above.

Step 3): The generated part-of-speech rules are classified by the number of parts of speech they contain. A large number of rules are generated automatically, and they cannot all be applied to term matching in the documents, so a subset has to be selected. In step 3) the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules. The rules in each class are sorted in descending order of frequency of occurrence, and only the top 5 rules of each class are applied to the body text of the Chinese patent documents for part-of-speech matching, which generates the candidate term set. The extracted candidate terms are then classified by the number of words they contain into four classes: 2-word, 3-word, 4-word and 5-word candidate terms. The candidate terms are classified in this way so that the terms of each length form their own candidate term table; when a table is ranked with the TermRank ranking algorithm of step 4), the ranking is not affected by terms of other lengths and is therefore fairer;
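Rule selection and part-of-speech matching can be sketched as follows. The function names, the `(word, tag)` sentence layout and the rule frequencies are assumptions made for illustration; the matching shown is a plain sliding window over the tag sequence, which is one straightforward reading of "part-of-speech matching".

```python
from collections import Counter, defaultdict


def top_rules_by_length(rule_counts, k=5, lengths=(2, 3, 4, 5)):
    """Group rules by number of POS tags and keep the k most frequent per group."""
    grouped = defaultdict(list)
    for rule, count in rule_counts.items():
        n = len(rule.split("+"))
        if n in lengths:
            grouped[n].append((rule, count))
    return {
        n: [rule for rule, _ in sorted(items, key=lambda rc: rc[1], reverse=True)[:k]]
        for n, items in grouped.items()
    }


def match_candidates(tagged_sentence, rules_by_length):
    """Slide a window over the tag sequence; collect word strings that match a rule."""
    tags = [pos for _, pos in tagged_sentence]
    candidates = defaultdict(list)
    for n, rules in rules_by_length.items():
        allowed = set(rules)
        for start in range(len(tags) - n + 1):
            if "+".join(tags[start:start + n]) in allowed:
                words = tuple(word for word, _ in tagged_sentence[start:start + n])
                candidates[n].append(words)
    return dict(candidates)


# Hypothetical rule frequencies; k=2 keeps the example short (the text uses the top 5).
rule_counts = Counter({"n+n": 40, "b+n": 12, "v+n": 9, "n+n+n": 7, "m": 3})
rules = top_rules_by_length(rule_counts, k=2)
print(rules)  # {2: ['n+n', 'b+n'], 3: ['n+n+n']}

sentence = [("electric", "b"), ("vehicle", "n"), ("display", "n"), ("works", "v")]
print(match_candidates(sentence, rules))
# {2: [('electric', 'vehicle'), ('vehicle', 'display')]}
```

The result is already keyed by the number of words, which corresponds to the separate candidate term tables for each length.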

Step 4): The purpose of ranking the candidate terms is to determine the final term list. A good ranking algorithm reorders the correct and incorrect terms scattered through the candidate term list, raising the weight of correct terms so that they rank as high as possible and lowering that of incorrect ones. The candidate terms are ranked with the TermRank ranking algorithm, which is defined by formula (1):

$$\mathrm{TR}(T_i)=\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-|T_i|\times\mathrm{count}(T_i) \tag{1}$$

The TermRank value of every candidate term in the candidate term list is calculated according to formula (1); after sorting, the top N entries are taken as the final term list. Here N is 5, i.e. the top 5 entries are taken as the final term list.

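The first and second terms of formula (1) are not necessarily of the same order of magnitude. When M > 1000 or M < 2, one of the two terms has little influence on the TermRank value of a candidate term, and the two terms then need to be normalized separately. The invention uses linear (min-max) normalization; the normalization formulas for the first and second terms are formula (2) and formula (3) respectively:

$$\frac{\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}}{\max\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}} \tag{2}$$

$$\frac{|T_i|\times\mathrm{count}(T_i)-\min\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)}{\max\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)-\min\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)} \tag{3}$$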

The proposed method was applied to 9725 patent documents. After the tables and pictures were removed and the documents were saved as plain text, the corpus was 123 MB in size. The patent documents were segmented and part-of-speech tagged with ICTCLAS. The tagging uses the two-level part-of-speech tag set of the Institute of Computing Technology, Chinese Academy of Sciences, i.e. the "ICTPOS3.0 Chinese part-of-speech tag set". The stop-word list was built with the method of step 2), and the final list contains 246 stop words in total. Table 1 lists some of them.

Table 1: Some of the stop words in the manually built stop-word list

The experimental results were judged manually. To limit the effect of human subjectivity and of gaps in domain knowledge, candidate terms that are obviously correct or obviously incorrect were labeled directly, while candidates whose correctness is hard to judge were checked with the Google search engine. A candidate term is labeled a correct term if it meets any one of the following conditions, and an incorrect term otherwise: 1) a corresponding entry exists on a knowledge website such as Wikipedia, Baidu Baike or Hudong Baike; 2) the entry exists in a patent search system; 3) the Google search engine returns results for it without filtering out any component of the candidate term or reordering its components.

Because the result set is too large for the whole ranked list to be evaluated, the P@N evaluation method is used, i.e. the precision of the top N entries of the final term list is judged. It is calculated as follows:

$$P@N=\frac{\text{number of correct terms among the top } N \text{ entries}}{N}$$
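For illustration, P@N can be computed as below. This is a sketch of the standard precision-at-N measure; the function name and the example labels are hypothetical, and the set of correct terms stands in for the manual judgments described above.

```python
def precision_at_n(ranked_terms, correct_terms, n):
    """Share of the top-n ranked terms that were judged correct."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for term in ranked_terms[:n] if term in correct_terms)
    return hits / n


ranked = ["electric vehicle", "refer to drawing", "displacement sensor", "said device"]
correct = {"electric vehicle", "displacement sensor"}   # hypothetical manual labels

print(precision_at_n(ranked, correct, 1))  # 1.0
print(precision_at_n(ranked, correct, 4))  # 0.5
```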

With the automatic rule generation of step 1), a total of 2832 distinct part-of-speech rules were generated from the titles of the patent documents. Table 2 lists the top 5 rules after sorting by frequency. This statistic confirms from experimental data that most terms are composed of nouns or compound nouns.

Table 2: Top 5 automatically generated part-of-speech rules

Table 3 gives, for the part-of-speech rules classified by length, the percentage of the total frequency (2832) accounted for by each class. Rules of length 4 and 5 together account for 71.5%, which confirms that terms in patent documents tend to be long.

Table 3: Proportion of part-of-speech rules of each length

Compared with traditional ways of generating part-of-speech rules, summarizing the rules automatically from the titles of patent documents has two advantages: 1) redundancy is greatly reduced: summarizing rules from titles yields far fewer redundant rules than summarizing them from the body text of patents; 2) dependence on the accuracy of the segmentation and part-of-speech tagging tools is reduced: whether a term in a title has been segmented and tagged correctly or not, its part-of-speech pattern is added to the rule set, so a candidate term that has been segmented and tagged incorrectly will still be extracted when the candidate terms are generated.

Because a large number of part-of-speech rules are generated automatically, there is no need to apply all of them when extracting candidate terms from the patent documents. For each length class, therefore, only the top 5 rules by frequency are taken. Table 4 lists the top 5 part-of-speech rules for each length.

Table 4: Top 5 part-of-speech rules for each length

The part-of-speech rules listed in Table 4 were then used to extract candidate terms from the body text of the patent documents. This produced 493286 2-word candidate terms, 152274 3-word candidate terms, 31809 4-word candidate terms and 3966 5-word candidate terms. Table 5 shows some of the extracted candidate terms and the part-of-speech rules they match.

The candidate terms extracted with the part-of-speech rules are of fairly high quality, but some noise remains. For example, the candidate term "refer-to/v accompanying-drawings/n" matches the part-of-speech rule "v+n" but is not itself a real term. In the candidate term "displacement/v sensor/n", the part of speech of "displacement" should be n; and for "voice/n decline/v type/k breast/n examination-instrument/n" the correct segmentation and tagging is "voice/n type/k miniature/a breast/n examination-instrument/n". Although these word strings were segmented or tagged incorrectly, they are still terms and were recognized correctly. This is precisely the advantage of generating part-of-speech rules automatically as the invention does: the method depends less on the accuracy of segmentation and part-of-speech tagging.

Table 5: Some candidate terms and the part-of-speech rules they match

The candidate terms are divided into separate candidate term tables by length in words. Because only terms of 2-5 words are recognized here, four candidate term tables are obtained. The candidate terms are ranked separately within each table, so that a length class with many more candidates does not make the overall ranking unfair. The invention uses the TF and C-Value methods for comparison. Table 6 gives the P@N statistics for the final candidate term rankings, with N set to 100, 200, 400, 800 and 1000 in turn.
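The per-length ranking can be sketched as follows, assuming each candidate already carries a score such as its TermRank value. The function name and the example scores are hypothetical.

```python
from collections import defaultdict


def rank_by_length(scored_candidates, top_n):
    """Build one table per term length and rank each table on its own."""
    tables = defaultdict(list)
    for words, score in scored_candidates:
        tables[len(words)].append((words, score))
    return {
        length: sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]
        for length, rows in tables.items()
    }


scored = [
    (("electric", "vehicle"), 0.9),
    (("display", "device"), 0.4),
    (("liquid", "crystal", "display"), 0.7),
    (("refer", "drawing"), -1.5),
]
for length, rows in rank_by_length(scored, top_n=2).items():
    print(length, rows)
# 2 [(('electric', 'vehicle'), 0.9), (('display', 'device'), 0.4)]
# 3 [(('liquid', 'crystal', 'display'), 0.7)]
```

Because each table is cut off independently, a length class with few candidates still contributes its own top entries.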

Table 6: P@N evaluation of the candidate term rankings

The experimental results in Table 6 show that the proposed TermRank method ranks candidate terms of every length clearly better than the other two ranking methods. At P@1000, TermRank reaches a recognition precision of more than 80% for 3-word terms. The steady decrease in precision from P@100 to P@1000 also confirms that TermRank is better able to distinguish terms from non-terms.

The embodiment described above expresses only some implementations of the invention. Although its description is relatively specific and detailed, it is not to be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and all of these fall within the scope of protection of the invention. The scope of protection of this patent is therefore determined by the appended claims.

Claims

1. A method for automatic term recognition in Chinese patent documents, characterized in that it comprises the following steps:

Step 1): Part-of-speech rules are generated automatically from patent titles: each patent title is segmented into substrings and stop words using a Chinese lexical analysis system; with the stop words as separators, the part-of-speech rule of each substring is extracted and used as a part-of-speech rule for generating candidate terms;

Step 2): A stop-word list is built manually, and the stop words are added to it;

Step 3): The generated part-of-speech rules are classified by the number of parts of speech they contain; the rules in each class are sorted in descending order of frequency of occurrence, and only the top 5 rules are applied to the body text of the Chinese patent documents for part-of-speech matching, which generates the candidate term set; the extracted candidate terms are then classified by the number of words they contain;

Step 4): The candidate terms are ranked with the TermRank ranking algorithm, which is defined by formula (1):

$$\mathrm{TR}(T_i)=\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-|T_i|\times\mathrm{count}(T_i) \tag{1}$$

where T_i is a candidate term and TR(T_i) is the TermRank value of T_i; M is the number of patent documents that contain T_i; TF_{T_i}(d_j) is the frequency of T_i in a patent document d_j that contains T_i; C(d_j) is the number of candidate terms extracted from patent document d_j; |T_i| is the length of T_i; and count(T_i) is the number of stop words contained in T_i;

The TermRank value of every candidate term in the candidate term list is calculated according to formula (1); after sorting, the top N entries are taken as the final term list.

2. The method for automatic term recognition in Chinese patent documents according to claim 1, characterized in that step 2) selects the stop words and builds the stop-word list using the following three methods:

Method one: The patent titles are segmented and word frequencies are counted; stop words that occur more than 20 times are added to the stop-word list;

Method two: Parts of speech that clearly do not occur in terms are added to the stop-word list;

Method three: After the patent titles have been filtered with the stop-word list produced by Method one and Method two, the word strings remaining in the titles are inspected manually, and any new stop words that are found are also added to the stop-word list.

3. The method for automatic term recognition in Chinese patent documents according to claim 1 or 2, characterized in that, in step 3), the part-of-speech rules are divided into four classes: 2-word, 3-word, 4-word and 5-word part-of-speech rules.

4. The method for automatic term recognition in Chinese patent documents according to claim 1, characterized in that, in step 3), the candidate terms are divided into four classes: 2-word, 3-word, 4-word and 5-word candidate terms.

5. The method for automatic term recognition in Chinese patent documents according to claim 1, characterized in that, in step 4), when the value of M is relatively large or relatively small, the first and second terms of formula (1) are normalized using formula (2) and formula (3) respectively, where formula (2) and formula (3) are:

$$\frac{\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}}{\max\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}-\min\sum_{j=1}^{M}\frac{\mathrm{TF}_{T_i}(d_j)}{C(d_j)}} \tag{2}$$

$$\frac{|T_i|\times\mathrm{count}(T_i)-\min\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)}{\max\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)-\min\bigl(|T_i|\times\mathrm{count}(T_i)\bigr)} \tag{3}$$